Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 86: Preserve schema field order #87

Merged
merged 14 commits into from
Sep 16, 2020
Merged

Issue 86: Preserve schema field order #87

merged 14 commits into from
Sep 16, 2020

Conversation

shiveshr
Copy link
Contributor

Signed-off-by: Shivesh Ranjan [email protected]

Change log description
Store the user supplied schema, although the comparison is performed with normalized form.

Purpose of the change
Fixes #86

What the code does
Stores the user supplied schema binary instead of the normalized form.
Since we check for schema's existence by comparing the sha 256 hash which has no practical chance of collision we will not compare the binaries if the hash matches.
This is because the lookup and comparison happens in the storage layer which does not have awareness of schema formats and parsers. Since the stored schema is non normalized, the business layer actually computes the normalized form and also computes the fingerprint and shares it with the storage layer.

How to verify it
Unit test added.

@shiveshr shiveshr requested review from ravisharda and fpj and removed request for ravisharda July 29, 2020 12:58
@fpj fpj changed the title Issue 86: Store non normalized schema so that users can retrieve the schema in the form they stored it in Issue 86: Preserve schema order Jul 30, 2020
@fpj
Copy link
Contributor

fpj commented Jul 31, 2020

On this comment in the PR description:

Since we check for schema's existence by comparing the sha 256 hash which has no practical chance of collision we will not compare the binaries if the hash matches.

I expect the hashes to match rarely, including legitimate reasons, and so comparing binaries when they do match doesn't seem to induce a high cost, so I'd suggest to do it to be safe.

@shiveshr
Copy link
Contributor Author

shiveshr commented Jul 31, 2020

I expect the hashes to match rarely, including legitimate reasons, and so comparing binaries when they do match doesn't seem to induce a high cost, so I'd suggest to do it to be safe.

But the hash is computed on normalized form and we are storing the schema in the original form.
So comparing binaries is insufficient.
And since storage layer doesnt know how to parse the schema but treats it as raw bytes, we can not compare them directly.
So as a fix what i have done is to is pass a comparator function to storage layer that can parse and normalize the binary and then compare if the hashes match

@shiveshr
Copy link
Contributor Author

@fpj
I am not sure if the intent of what we are doing is clear so i will explain here:

  1. we want to store schemas for a group such that if a schema is already added, we dont want to add it again in the group.
  2. we want to store the schema in a global pool of schemas as well

There are three apis where this is relevant:

  1. addSchema(schemaInfo)
  2. getSchemaVersion(schemaInfo)
  3. getSchemaReferences(schemaInfo)

So the original problem we were solving was - how do we compare and say two schemas are equal logically even if the user supplied schema binaries may be structured differently. To explain - for avro and json, where users supply schema strings, two schemas are considered equal even if the order of fields in the schema string was different or if the schema string contained extra whitespace chars.
So logically we want to parse schemas and compare the parsed forms.

Prior to this PR, instead of parsing and comparing, we were normalizing the schema and storing the normalized form in its binary representation. that had the advantage that we only had to normalize the incoming user schema and then from that point on all operations and queries were on the normalized form. And then we did not need to parse the schema for comparing, we could just compare the normalized binaries.

However, the requirement we are trying to incorporate is to store and return the schema as supplied by the user. While also ensuring that the comparison with a differently organized but logically identical schema should still treat them as identical.

So what we are doing is we are storing the user supplied binary form as is. And the service layer passes a comarator function which can be used by the storage layer to compare the stored schema binary with requested schema binary by parsing both.

Signed-off-by: Shivesh Ranjan <[email protected]>
@shiveshr shiveshr changed the title Issue 86: Preserve schema order Issue 86: Preserve schema field order Sep 2, 2020
Signed-off-by: Shivesh Ranjan <[email protected]>
* Issue 103: Class cast Exception

Signed-off-by: Shivesh Ranjan <[email protected]>

* Add README.md

Signed-off-by: Shivesh Ranjan <[email protected]>

* Update comment

Signed-off-by: Shivesh Ranjan <[email protected]>

* license

Signed-off-by: Shivesh Ranjan <[email protected]>

* PR comment

Signed-off-by: Shivesh Ranjan <[email protected]>
@chipmaurer
Copy link

Tested ECS Presto connector with this branch, and was successful. OK to merge.

@fpj fpj merged commit cea7bd1 into pravega:master Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preserve schema order for future retrievals
3 participants