graph,graphql: Query.hash to represent a normalized query #3840
Conversation
Force-pushed 73149e8 to 33fd01f
I just had a brief look at that, and it looks like it's mostly rewriting the query, which requires allocations; is there a way to write this as a visitor without allocations that passes a hasher along? I'd need to look more into it, but it should be possible to (ab)use stable_hash and also avoid sorting by using its unordered hashing.
@That3Percent I just looked at the
I am not totally against doing this, though I am very concerned about the amount of cloning that this hash calculation requires. The big question to me is how we find out whether the cloning matters. Query node usage is noisy enough that it will be hard to spot a, say, 5% increase in CPU load caused by cloning, even though I'd consider that a lot of an increase for a hash calculation.
graph/src/data/query/query.rs (outdated)

```diff
@@ -134,6 +135,7 @@ pub struct Query {
     pub document: q::Document,
     pub variables: Option<QueryVariables>,
     pub shape_hash: u64,
+    pub hash: u64,
```
Can we call this field something else, or at least add a comment explaining what the hash is for? I'd suggest `validation_hash` to make it clear that this is a hash we calculate for validations.
graph/src/data/graphql/query_hash.rs (outdated)

```rust
use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

pub fn query_hash(query: &q::Document) -> u64 {
```
This function should have a comment that describes how this hash works, similar to what's in the description of this PR.
I agree it is a good idea to avoid clones and sorting generally. I didn't carefully check this particular case, but a nested transform/clone loop has the danger of being N**2 complexity, since the innermost items can be cloned depth times. There's a private API that we could expose for this purpose: https://github.com/graphprotocol/stable-hash/blob/7a3a2969e05a3e373dd53ec41786ad52b3887d17/src/impls/mod.rs#L13. It looks like this:

```rust
pub(self) fn unordered_unique_stable_hash<H: StableHasher>(
    items: impl Iterator<Item = impl StableHash>,
    field_address: H::Addr,
    state: &mut H,
) { ... }
```

Would you like for this to be exposed? For example, as a newtype:

```rust
struct Unordered<T>(T);

impl<T, I> StableHash for Unordered<T>
where
    T: IntoIterator<Item = I>,
    I: StableHash,
{ .. }
```

Then you could do something like... I think I would technically have to make a major version bump for this, because nothing documents that
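To make the idea concrete, here is a toy, std-only sketch of an order-independent `Unordered` wrapper. The real `StableHash`/`StableHasher` traits from the stable-hash crate are not reproduced; the commutative wrapping-add combiner below is only an illustration and is much weaker than what the crate actually does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for the proposed `Unordered` newtype from stable-hash.
pub struct Unordered<I>(pub I);

// Hash each item independently and combine with a commutative operation
// (wrapping add), so iteration order does not affect the result.
pub fn unordered_hash<I, T>(items: Unordered<I>) -> u64
where
    I: IntoIterator<Item = T>,
    T: Hash,
{
    let mut acc: u64 = 0;
    for item in items.0 {
        let mut h = DefaultHasher::new();
        item.hash(&mut h);
        acc = acc.wrapping_add(h.finish());
    }
    acc
}

fn main() {
    // Same definitions in a different order hash the same, no sort needed.
    let a = unordered_hash(Unordered(vec!["query", "fragment"]));
    let b = unordered_hash(Unordered(vec!["fragment", "query"]));
    assert_eq!(a, b);
}
```

This removes the need to sort before hashing, which is the point of the proposed API; the crate's real combiner would also have to satisfy the security requirements discussed further down.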
graph/src/data/graphql/query_hash.rs (outdated)

```rust
use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

pub fn query_hash(query: &q::Document) -> u64 {
```
Unfortunately, `query_hash` needs to be at least a Level 2 hash (http://nohatcoder.dk/2019-05-19-1.html), keyed with a cryptographic-strength random number, and support a state space of at least `u128` for security.

The reason for this is that an insecure hash would allow a competent attacker to force an indexer to attest to an incorrect result by finding a hash collision.
It's difficult to reconcile these two:
I might instead suggest sorting but not cloning. For example, you can buffer references to parts of the query using second-stack very cheaply and then sort the references. You can pass the hasher in and feed it pieces of the query as you go rather than ever calling
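A minimal std-only illustration of the "sort references, not clones" idea, using a plain `Vec<&T>` in place of second-stack's buffer (`Definition` here is a made-up stand-in, not the graphql-parser type):

```rust
#[derive(Debug)]
struct Definition {
    name: String,
    body: String,
}

// Sort *references* to the definitions rather than cloning them.
// Only a Vec of pointers is allocated; the definitions are never copied.
fn sorted_refs(defs: &[Definition]) -> Vec<&Definition> {
    let mut refs: Vec<&Definition> = defs.iter().collect();
    refs.sort_unstable_by(|a, b| a.name.cmp(&b.name));
    refs
}

fn main() {
    let defs = vec![
        Definition { name: "b".into(), body: "{ y }".into() },
        Definition { name: "a".into(), body: "{ x }".into() },
    ];
    let sorted = sorted_refs(&defs);
    assert_eq!(sorted[0].name, "a");
}
```

second-stack additionally reuses one growable buffer across calls, so even the `Vec` allocation is amortized away; the ordering logic is the same.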
For a Level 2 hash function, I'd recommend this one: XXH3 128 with secret, using a 192 byte secret initialized as a
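Since xxhash-rust is outside std, here is a concept-only sketch of the "keyed with a random secret" idea using std's `RandomState`, which seeds SipHash with per-process random keys; the actual recommendation above is XXH3-128 with a secret (xxhash-rust's `Xxh3Builder::with_secret`), and this block only shows why keying matters for a cache key.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// A keyed hash: `RandomState` draws random keys once per process, so an
// attacker cannot precompute colliding queries offline.
fn keyed_hash(state: &RandomState, query: &str) -> u64 {
    let mut hasher = state.build_hasher();
    query.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // One key for the process lifetime: the same query always hashes the
    // same within a process, so it still works as a cache key.
    let state = RandomState::new();
    assert_eq!(keyed_hash(&state, "{ things }"), keyed_hash(&state, "{ things }"));
}
```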
@lutter I renamed these:

```diff
- query_hash
+ validation_hash
- Query.hash
+ Query.validation_hash
```

and took the iteration logic (visitor) from the

@That3Percent I tried to use `second-stack`.

Before:

```rust
let mut next_difinitions = self.definitions.clone();
next_difinitions.sort_unstable_by(|a, b| compare_definitions(a, b));
for defn in &next_difinitions {
    match defn {
        q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
        q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
    }
}
```

After:

```rust
buffer(self.definitions.iter(), |slice| {
    slice.sort_unstable_by(|a, b| compare_definitions(a, b));
    for defn in slice {
        match defn {
            q::Definition::Operation(operation) => operation.query_validation_hash(hasher),
            q::Definition::Fragment(fragment) => fragment.query_validation_hash(hasher),
        }
    }
});
```
The usage pattern in "After" is correct. The memory is freed once the
Force-pushed bf8265c to bd7e7d7
Ready for a review
```rust
// - transforming the selection sets like `query name { things }` into `{ things }`
// - removing primitive values (like String, Int, Float except Boolean, because it
//   can change the body of the operation when used with `@include` or `@skip`)
// - ignoring aliases
pub fn validation_hash(query: &q::Document) -> u64 {
```
This comment still needs to be addressed: #3840 (comment) from #3840 (comment).
The docs for `DefaultHasher` state:

> The internal algorithm is not specified, and so it and its hashes should not be relied upon over releases.
So, we don't know if it's at least a level 2 hash.
We can see from the source that `DefaultHasher::new()` hardcodes specific keys (https://doc.rust-lang.org/src/std/collections/hash/map.rs.html#3155-3157):

```rust
pub fn new() -> DefaultHasher {
    DefaultHasher(SipHasher13::new_with_keys(0, 0))
}
```
This means that an attacker can find collisions and use those collisions to harm an Indexer.
For example, an attacker could search for two queries with the same hash, one of which is valid and the other invalid. They would serve the invalid query first, followed by the valid query. When the invalid query is served, it would create an entry in `GRAPHQL_VALIDATION_CACHE` containing errors. The Indexer would correctly serve and attest to that result. Unfortunately, when the Indexer serves the second request, it would incorrectly retrieve the validation errors because of the hash collision. It would then serve and attest to an error response for a query which should have validated. This can result in the indexer being slashed.
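A toy model of this attack, using a deliberately collision-prone stand-in hash (it hashes only the query's length) so that the collision is trivial to construct; all names here are illustrative, not the real cache or hash:

```rust
use std::collections::HashMap;

// Deliberately weak hash: any two queries of the same length collide.
fn bad_hash(query: &str) -> u64 {
    query.len() as u64
}

// Toy validation cache keyed by the weak hash. On a miss it stores the
// real validation result; on a hit it returns whatever was cached.
fn cached_is_valid(cache: &mut HashMap<u64, bool>, query: &str, really_valid: bool) -> bool {
    *cache.entry(bad_hash(query)).or_insert(really_valid)
}

fn main() {
    let mut validation_cache: HashMap<u64, bool> = HashMap::new();
    // Attacker serves an invalid query first; its result is cached.
    assert_eq!(cached_is_valid(&mut validation_cache, "{ x!!!! }", false), false);
    // A same-length *valid* query now collides and wrongly reads the
    // cached "invalid" entry, which the indexer would attest to.
    assert_eq!(cached_is_valid(&mut validation_cache, "{ okay1 }", true), false);
}
```

With a keyed, collision-resistant hash the attacker cannot find such a pair of queries, which is the whole point of the Level 2 requirement.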
Using XXHash in the way suggested by the comment won't incur a significant performance penalty. Though, I should have pointed you to the streaming implementation here: https://docs.rs/xxhash-rust/0.8.5/xxhash_rust/xxh3/struct.Xxh3Builder.html#method.with_secret
```rust
let q2 = parse_query(Q2)
    .expect("q2 is syntactically valid")
    .into_static();
let q3 = parse_query(Q3)
```
Nit: By making a helper function with `#[track_caller]` that does both `parse_query` and `validation_hash`, you can make all the test cases more succinct.
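A sketch of what that helper could look like. `parse_query` and `validation_hash` are stubbed with std-only stand-ins so the example is self-contained; the real ones come from graphql-parser and this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in: accepts anything containing a "{" as a "parsed document".
fn parse_query(q: &str) -> Result<String, String> {
    if q.contains('{') { Ok(q.to_string()) } else { Err("syntax error".to_string()) }
}

// Stand-in for the PR's validation_hash.
fn validation_hash(doc: &str) -> u64 {
    let mut h = DefaultHasher::new();
    doc.hash(&mut h);
    h.finish()
}

// With #[track_caller], the panic! below is reported at the test line
// that called hash_of, not inside this helper.
#[track_caller]
fn hash_of(q: &str) -> u64 {
    let doc = match parse_query(q) {
        Ok(doc) => doc,
        Err(e) => panic!("query is syntactically valid: {}", e),
    };
    validation_hash(&doc)
}

fn main() {
    // Each test case collapses to a single call.
    assert_eq!(hash_of("{ a }"), hash_of("{ a }"));
}
```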
```rust
impl QueryValidationHash for q::OperationDefinition {
    fn query_validation_hash(&self, hasher: &mut QueryValidationHasher) {
        // We want `[query|subscription|mutation] things { BODY }` to hash
```
Really? Any valid query is also a valid mutation and vice-versa?
```rust
    assert_ne!(validation_hash(&q1), validation_hash(&q2));
}
```
This is a test that should pass but fails:
```rust
#[test]
fn same_name_different_thing() {
    const Q1: &str = "{ a { b } c { d } }";
    const Q2: &str = "{ a { b c d } }";
    let q1 = parse_query(Q1)
        .expect("q1 is syntactically valid")
        .into_static();
    let q2 = parse_query(Q2)
        .expect("q2 is syntactically valid")
        .into_static();
    assert_ne!(validation_hash(&q1), validation_hash(&q2));
}
```
Fails with:
```text
---- data::graphql::validation_hash::tests::same_name_different_thing stdout ----
thread 'data::graphql::validation_hash::tests::same_name_different_thing' panicked at 'assertion failed: `(left != right)`
  left: `14383803433003651353`,
 right: `14383803433003651353`', graph/src/data/graphql/validation_hash.rs:385:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
The reason this fails is because the hash is not injective. I'd recommend reading EIP-712 for a primer on this concept: https://eips.ethereum.org/EIPS/eip-712 but the short of it is that both of these queries hash as "abcd" and make no distinction between types/nesting. We need to feed the rest of the syntax into the hash (like the "{" characters) so that these cases are disambiguated.
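A std-only sketch of the point: feeding the structural tokens into the hasher disambiguates the two queries, which otherwise collapse to the same field names. `DefaultHasher` is used purely for illustration here, given the keyed-hash concerns above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hashes only the field names: both queries below collapse to "abcd".
fn names_only_hash(query: &str) -> u64 {
    let mut h = DefaultHasher::new();
    for c in query.chars().filter(|c| c.is_alphanumeric()) {
        c.hash(&mut h);
    }
    h.finish()
}

// Also feeds the structural tokens ("{" and "}") into the hasher, so
// different nestings of the same names hash differently.
fn with_delimiters_hash(query: &str) -> u64 {
    let mut h = DefaultHasher::new();
    for c in query.chars().filter(|c| !c.is_whitespace()) {
        c.hash(&mut h);
    }
    h.finish()
}

fn main() {
    const Q1: &str = "{ a { b } c { d } }";
    const Q2: &str = "{ a { b c d } }";
    assert_eq!(names_only_hash(Q1), names_only_hash(Q2)); // the reported collision
    assert_ne!(with_delimiters_hash(Q1), with_delimiters_hash(Q2)); // disambiguated
}
```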
I copied the algo from the `shape_hash` function.
Then both should be fixed. cc @lutter
Closing for now, we'll restart this if needed.
Uses `Query.hash` as the cache key of #3759. The idea here is to avoid misses and increase the hit rate for the GraphQL validation cache by normalizing GraphQL operations. The normalization includes:

- transforming selection sets so that `{ things }` and `query normalized { things }` hash the same
- removing primitive values (like String, Int, Float, except Boolean, because it can change the body of the operation when used with `@include` or `@skip`)
- ignoring aliases
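A toy, text-level sketch of the first normalization step; the PR itself works on the graphql-parser AST, not strings, and `normalize` here is hypothetical:

```rust
// Toy normalization: drop the operation keyword and name, keep the body.
fn normalize(query: &str) -> String {
    // Hash only from the first "{" onward, so `query name { ... }` and
    // `{ ... }` normalize identically.
    let body = match query.find('{') {
        Some(i) => &query[i..],
        None => query,
    };
    // Collapse whitespace so formatting differences don't change the key.
    body.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    assert_eq!(normalize("query normalized { things }"), normalize("{ things }"));
}
```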