Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements to UTF-8 statistics truncation #6870
base: main
Are you sure you want to change the base?
Improvements to UTF-8 statistics truncation #6870
Changes from 5 commits
0f7af02
52706f9
80fa0dd
f1726ab
7b88e91
c4d9474
7a7fd0e
400f5f8
006a388
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this isn't super perf critical, can we switch over to operating on codepoints?
(i assume this is for stats only, so not a hot path?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree switching to arithmetic on codepoints would be easier to reason about.
I double checked and this function is only called while writing stats (at most once per page, and once per column chunk ):
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, here's what I came up with (stealing liberally from the code in apache/datafusion#12978 😄) that avoids as much translation/allocation as I could manage.
This winds up being a little faster than the byte-by-byte implementation in this PR in the latter's best case, and as much as 30X faster for the worst case.
The check for increasing byte count can be skipped if we don't care about overshooting a little. This also doesn't include the check for non-characters as all we care about is a valid UTF-8 bound...the bounds shouldn't be interpreted by readers anyway as they are not exact. The only case this doesn't handle is incrementing U+D7FF. The range U+D800..U+DFFF overlaps with UTF-16, so those code points are not valid characters. The byte-by-byte implementation will increment to the next valid code point of U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This isn't really a practical concern, though, since the nearest assigned code point is U+D7FB.
@alamb @findepi if this looks reasonable I'll clean it up and push it.