
graph,graphql: Query.hash to represent a normalized query #3840

Closed
wants to merge 1 commit from the kamil-query-normalization branch

Conversation

@kamilkisiela (Contributor) commented Aug 15, 2022

Uses Query.hash as the cache key of #3759

The idea here is to avoid misses and increase the hit rate of the GraphQL validation cache by normalizing GraphQL operations.

The normalization includes (an illustrative sketch follows this list):

  • sorting fields, fragments, arguments, fragment definitions, etc.
  • making operation names identical
  • transforming anonymous selection sets like `{ things }` into `query normalized { things }`
  • removing primitive values (String, Int and Float, but not Boolean, because a Boolean can change the body of the operation when used with `@include` or `@skip`)
  • removing aliases
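
As an illustration (editorial, not part of the PR), here is a test-style sketch of two operations that should end up with the same hash if the normalization works as described; it mirrors the helper names (parse_query, validation_hash) used in the tests later in this thread, so treat it as a sketch rather than the PR's code:

// Hypothetical test: the two queries differ only in things the normalization
// is meant to erase (field order, operation name, an Int argument value, and
// an alias), so their hashes should be equal.
#[test]
fn normalized_queries_share_a_hash() {
    const Q1: &str = "query a { things(first: 10) { id name } }";
    const Q2: &str = "query b { things(first: 20) { name renamed: id } }";
    let q1 = parse_query(Q1)
        .expect("q1 is syntactically valid")
        .into_static();
    let q2 = parse_query(Q2)
        .expect("q2 is syntactically valid")
        .into_static();
    assert_eq!(validation_hash(&q1), validation_hash(&q2));
}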

@kamilkisiela self-assigned this Aug 15, 2022
@kamilkisiela force-pushed the kamil-query-normalization branch from 73149e8 to 33fd01f on August 15, 2022 at 13:37
@dotansimha requested review from leoyvens and lutter on August 16, 2022 at 13:40

@lutter (Collaborator) commented Aug 16, 2022

I just had a brief look at this, and it looks like it mostly rewrites the query, which requires allocations. Is there a way to write this as a visitor, without allocations, that passes a Hasher around and just hashes things according to what's in the query, e.g., when it sees an Int it adds a 0 to the hasher?

I'd need to look more into it, but it should be possible to (ab)use stable_hash and also avoid sorting by using its unordered() functionality.
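
For reference, a rough sketch (editorial, not from the thread) of the allocation-free shape described above: walk the AST and feed tokens straight into the hasher instead of rewriting the query. The graphql_parser type parameters and the helper name hash_value are assumptions, not the PR's code:

use std::hash::{Hash, Hasher};

use graphql_parser::query as q;

// Hash a single value without allocating: Int/Float/String collapse to a
// constant so their concrete values don't affect the hash, while booleans
// keep their value because they can flip `@include` / `@skip`.
fn hash_value<H: Hasher>(value: &q::Value<'static, String>, hasher: &mut H) {
    match value {
        q::Value::Int(_) | q::Value::Float(_) | q::Value::String(_) => 0u8.hash(hasher),
        q::Value::Boolean(b) => b.hash(hasher),
        q::Value::Enum(name) => name.hash(hasher),
        q::Value::Variable(name) => name.hash(hasher),
        q::Value::Null => 1u8.hash(hasher),
        q::Value::List(items) => {
            for item in items {
                hash_value(item, hasher);
            }
        }
        q::Value::Object(fields) => {
            for (key, item) in fields {
                key.hash(hasher);
                hash_value(item, hasher);
            }
        }
    }
}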

@lutter (Collaborator) commented Aug 16, 2022

@That3Percent I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting? I.e., what's the right way to add a list of things to StableHash so that order of the things does not affect the hash?

@lutter (Collaborator) left a comment:

I am not totally against doing this, though I am very concerned about the amount of cloning that this hash calculation requires. The big question to me is how we find out whether the cloning matters: query node usage is noisy enough that it will be hard to spot, say, a 5% increase in CPU load caused by cloning, even though I'd consider that a large increase for a hash calculation.

@@ -134,6 +135,7 @@ pub struct Query {
pub document: q::Document,
pub variables: Option<QueryVariables>,
pub shape_hash: u64,
pub hash: u64,

Collaborator:

Can we call this field something else, or at least add a comment explaining what the hash is for? I'd suggest validation_hash to make it clear that this is a hash we calculate for validations.

use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

pub fn query_hash(query: &q::Document) -> u64 {

Collaborator:

This function should have a comment that describes how this hash works, similar to what's in the description of this PR

@That3Percent (Contributor) commented:

@lutter

I just looked at the StableHash docs, and can't quite figure out the right way to do this with its API. Could you comment here how we could use it to avoid sorting?

I agree it is a good idea to avoid clones and sorting generally. I didn't carefully check this particular case, but a nested transform/clone loop has the danger of being N**2 in complexity, since the innermost items can be cloned once per level of nesting.

There's a private API that we could expose for this purpose. The private API is here: https://github.com/graphprotocol/stable-hash/blob/7a3a2969e05a3e373dd53ec41786ad52b3887d17/src/impls/mod.rs#L13 and looks like this:

pub(self) fn unordered_unique_stable_hash<H: StableHasher>(
    items: impl Iterator<Item = impl StableHash>,
    field_address: H::Addr,
    state: &mut H,
) { ... }

Would you like for stable-hash to expose a wrapper like so...?

struct Unordered<T>(T);
impl<T, I> StableHash for Unordered<T> where T: IntoIterator<Item = I>, I: StableHash { .. }

Then you could do something like...
Unordered(&self.selection_set).stable_hash(field_address, state);

Within a StableHash impl for a newtype like NormalizedQuery(&Query)

I think I would technically have to make a major version bump for this, because nothing documents that a StableHasher needs to support multi-sets. The major version bump wouldn't change the existing API, so graph-node could do a very simple upgrade.

use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

pub fn query_hash(query: &q::Document) -> u64 {

Contributor:

Unfortunately, query_hash needs to be at least a Level 2 hash: http://nohatcoder.dk/2019-05-19-1.html, keyed with a cryptographic-strength random number, and support a state space of at least u128 for security.

The reason for this is that an insecure hash would allow a competent attacker to force an indexer to attest to an incorrect result by finding a hash collision.

@That3Percent (Contributor) commented:

@lutter @kamilkisiela

It's difficult to reconcile these two:

  • use stable-hash to avoid cloning
  • a Level 2 hash is required for security

stable-hash gives you a choice of either a Level 1 hash or a Level 5 hash, but not a 2. If you choose the Level 5 hash it would come with a severe performance penalty. I'm not confident I can make a Level 2 implementation of StableHasher. I've got some ideas of how I might go about it, but don't want to have our security rest on unverified ideas.

I might instead suggest sorting but not cloning. For example, you can buffer references to parts of the query using second-stack very cheaply and then sort the references. You can pass the hasher in and feed it pieces of the query as you go rather than ever calling to_string. I think you can make this quite fast, but it will take some care. Let me know if the approach I'm describing is not clear; I can help if needed as well.

@That3Percent (Contributor) commented Aug 16, 2022

@lutter @kamilkisiela

For a Level 2 hash function, I'd recommend this one: XXH3-128 with a secret, using a 192-byte secret initialized as a lazy_static. It operates at RAM speeds and seemingly has no weaknesses.
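
A minimal sketch (editorial) of that setup, assuming the Xxh3Builder::with_secret API from xxhash-rust 0.8 plus the lazy_static and rand crates; the item names are hypothetical:

use lazy_static::lazy_static;
use rand::RngCore;
use xxhash_rust::xxh3::{Xxh3, Xxh3Builder};

lazy_static! {
    // 192-byte secret drawn from the OS RNG once per process; every hash is
    // keyed with it, so an attacker cannot precompute collisions offline.
    static ref VALIDATION_HASH_SECRET: [u8; 192] = {
        let mut secret = [0u8; 192];
        rand::rngs::OsRng.fill_bytes(&mut secret);
        secret
    };
}

fn keyed_validation_hasher() -> Xxh3 {
    Xxh3Builder::new().with_secret(*VALIDATION_HASH_SECRET).build()
}

fn keyed_hash(bytes: &[u8]) -> u128 {
    let mut hasher = keyed_validation_hasher();
    hasher.update(bytes);
    // 128-bit digest, matching the state-space requirement mentioned earlier.
    hasher.digest128()
}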

@kamilkisiela (Contributor, author) commented Aug 18, 2022

@lutter I renamed these:

- query_hash
+ validation_hash

- Query.hash
+ Query.validation_hash

and took the iteration logic (visitor) from the shape_hash function.


@That3Percent I tried to use second-stack, but I'm not sure if I use it correctly.
It feels wrong to wrap everything with a buffer. I tried to return the slice from the buffer function, but I'm not sure if that's correct.

Before

let mut next_definitions = self.definitions.clone();
next_definitions.sort_unstable_by(|a, b| compare_definitions(a, b));

for defn in &next_definitions {
  match defn {
    q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
    q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
  }
}

After

buffer(self.definitions.iter(), |slice| {
  slice.sort_unstable_by(|a, b| compare_definitions(a, b));
  for defn in slice {
    match defn {
      q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
      q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
    }
  }
});

@That3Percent (Contributor) commented Aug 18, 2022

@kamilkisiela

I tried to use second-stack, but I'm not sure if I use it correctly.
It feels wrong to wrap everything with a buffer. I tried to return the slice from the buffer function, but I'm not sure if that's correct.

The usage pattern in "After" is correct. The memory is freed once the buffer function returns, so it's not possible to return the slice. Wrapping in buffer is the only way to provide this fast API in safe Rust because we have to guarantee that buffers are freed in the reverse order they are created.

@kamilkisiela force-pushed the kamil-query-normalization branch from bf8265c to bd7e7d7 on September 7, 2022 at 09:58

@kamilkisiela (Contributor, author) commented:

Ready for a review

// - transforming the selection sets like `query name { things }` into `{ things }`
// - removing primitive values (like String, Int, Float except Boolean because it can change the body of the operation when used with `@include` or `@skip`)
// - ignoring aliases
pub fn validation_hash(query: &q::Document) -> u64 {

Contributor:

This comment still needs to be addressed: #3840 (comment) from #3840 (comment).

The docs for DefaultHasher state:

The internal algorithm is not specified, and so it and its hashes should not be relied upon over releases.

So, we don't know if it's at least a level 2 hash.

We can see from the source that DefaultHasher::new() hardcodes specific keys: https://doc.rust-lang.org/src/std/collections/hash/map.rs.html#3155-3157

pub fn new() -> DefaultHasher {
    DefaultHasher(SipHasher13::new_with_keys(0, 0))
}

This means that an attacker can find collisions and use those collisions to harm an Indexer.

For example, an attacker could search for two queries with the same hash - one which is valid and the other invalid. They would serve the invalid query first, followed by the valid query. When the invalid query is served it would create an entry in GRAPHQL_VALIDATION_CACHE containing errors. The Indexer would correctly serve and attest to that result. Unfortunately, when the Indexer serves the second request it would incorrectly retrieve the validation errors because of the hash collision. It would then serve and attest to an error response for a query which should have validated. This can result in the indexer being slashed.

Using XXHash in the way suggested by the comment won't incur a significant performance penalty, though I should have pointed you to the streaming implementation here: https://docs.rs/xxhash-rust/0.8.5/xxhash_rust/xxh3/struct.Xxh3Builder.html#method.with_secret

let q2 = parse_query(Q2)
    .expect("q2 is syntactically valid")
    .into_static();
let q3 = parse_query(Q3)

Contributor:

Nit: By making a helper function with #[track_caller] that does both parse_query and validation_hash you can make all the test cases more succinct.


impl QueryValidationHash for q::OperationDefinition {
    fn query_validation_hash(&self, hasher: &mut QueryValidationHasher) {
        // We want `[query|subscription|mutation] things { BODY }` to hash

Contributor:

Really? Any valid query is also a valid mutation and vice-versa?


assert_ne!(validation_hash(&q1), validation_hash(&q2));
}

@That3Percent (Contributor) commented Sep 7, 2022:

This is a test that should pass but fails:

#[test]
fn same_name_different_thing() {
    const Q1: &str = "{ a { b } c { d } }";
    const Q2: &str = "{ a { b c d } }";
    let q1 = parse_query(Q1)
        .expect("q1 is syntactically valid")
        .into_static();
    let q2 = parse_query(Q2)
        .expect("q2 is syntactically valid")
        .into_static();

    assert_ne!(validation_hash(&q1), validation_hash(&q2));
}

Fails with:

---- data::graphql::validation_hash::tests::same_name_different_thing stdout ----
thread 'data::graphql::validation_hash::tests::same_name_different_thing' panicked at 'assertion failed: `(left != right)`
  left: `14383803433003651353`,
 right: `14383803433003651353`', graph/src/data/graphql/validation_hash.rs:385:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The reason this fails is that the hash is not injective. I'd recommend reading EIP-712 for a primer on this concept: https://eips.ethereum.org/EIPS/eip-712, but the short of it is that both of these queries hash as "abcd" and make no distinction between types/nesting. We need to feed the rest of the syntax into the hash (like the "{" characters) so that these cases are disambiguated.
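
A small sketch (editorial) of the fix being suggested, over a toy field tree rather than the real AST: hashing the braces means "{ a { b } c { d } }" and "{ a { b c d } }" feed different token streams to the hasher instead of both collapsing to "abcd". The types and helper names are hypothetical:

use std::hash::{Hash, Hasher};

struct Field<'a> {
    name: &'a str,
    children: Vec<Field<'a>>,
}

// Feed the structural delimiters into the hasher along with the names so
// that nesting becomes part of the hash input.
fn hash_field<H: Hasher>(field: &Field, hasher: &mut H) {
    field.name.hash(hasher);
    if !field.children.is_empty() {
        "{".hash(hasher);
        for child in &field.children {
            hash_field(child, hasher);
        }
        "}".hash(hasher);
    }
}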

@kamilkisiela (Contributor, author):

I copied the algorithm from the shape_hash function.

@That3Percent (Contributor) commented Sep 7, 2022:

Then both should be fixed. cc @lutter

@dotansimha (Contributor) commented:

Closing for now, we'll restart this if needed.

@dotansimha closed this Sep 29, 2022
@saihaj deleted the kamil-query-normalization branch December 28, 2022 15:03