Improvements to UTF-8 statistics truncation #6870
base: main
Conversation
Thanks @etseidl -- the tests in this PR are amazing
I am not sure about the logic to increment the UTF-8 bytes. Let me know what you think.
parquet/src/column/writer/mod.rs
Outdated
```rust
// its available bits. If it was a continuation byte (b10xxxxxx) then set to min
// continuation (b10000000). Otherwise it was the first byte, so reset the first
// byte back to its original value (so data remains a valid string) and reduce "len".
if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
```
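For context, the constants referenced in this snippet are presumably along these lines (the names come from the snippet; the exact values are an assumption based on the comment's b10xxxxxx description):

```rust
// Assumed definitions, inferred from the comment above -- not copied from the PR.
const UTF8_CONTINUATION: u8 = 0b1000_0000; // marker bits of a continuation byte (b10xxxxxx)
const UTF8_CONTINUATION_MASK: u8 = 0b1100_0000; // isolates the two marker bits

// A byte is a UTF-8 continuation byte iff its top two bits are `10`.
fn is_continuation(byte: u8) -> bool {
    byte & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION
}
```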
If this isn't super perf critical, can we switch over to operating on codepoints?
(i assume this is for stats only, so not a hot path?)
> If this isn't super perf critical, can we switch over to operating on codepoints? (i assume this is for stats only, so not a hot path?)
I agree switching to arithmetic on codepoints would be easier to reason about.
I double checked and this function is only called while writing stats (at most once per page, and once per column chunk):
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code
Ok, here's what I came up with (stealing liberally from the code in apache/datafusion#12978 😄) that avoids as much translation/allocation as I could manage.
```rust
/// caller guarantees that data.len() > length
fn truncate_and_inc_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    // UTF-8 is max 4 bytes, so start search 3 back from desired length
    let lower_bound = length.saturating_sub(3);
    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
    increment_utf8_str(data.get(..split)?)
}

fn increment_utf8_str(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
        if let Some(next_char) = char::from_u32(original + 1) {
            // do not allow increasing byte width of incremented char
            if next_char.len_utf8() == curr_len {
                let mut result = data.as_bytes()[..idx + curr_len].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}
```
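As a quick usage check of the sketch above (an illustrative example, not from the PR's test suite): truncating `"hello"` to 3 bytes yields `"hel"`, and incrementing the last character gives `"hem"`.

```rust
assert_eq!(truncate_and_inc_utf8("hello", 3), Some(b"hem".to_vec()));
```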
This winds up being a little faster than the byte-by-byte implementation in this PR in the latter's best case, and as much as 30X faster for the worst case.
The check for increasing byte count can be skipped if we don't care about overshooting a little. This also doesn't include the check for non-characters, as all we care about is a valid UTF-8 bound... the bounds shouldn't be interpreted by readers anyway as they are not exact.

The only case this doesn't handle is incrementing U+D7FF. The range U+D800..U+DFFF is reserved for UTF-16 surrogates, so those code points are not valid characters. The byte-by-byte implementation will increment to the next valid code point of U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This isn't really a practical concern, though, since the nearest assigned code point is U+D7FB.
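To illustrate the punt using only standard library behavior (not code from this PR):

```rust
// U+D800..=U+DFFF are surrogates and not valid `char`s, so from_u32 returns None;
// the loop therefore moves on to the previous character instead of producing U+E000.
assert_eq!(char::from_u32(0xD7FF + 1), None);
assert_eq!(char::from_u32(0xE000), Some('\u{e000}'));
```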
@alamb @findepi if this looks reasonable I'll clean it up and push it.
Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.
- It only copies the string values once (like the current code)
- I am convinced it is correct ❤️ (hopefully those will not be famous last words)
So all in all, thank you and great job.
```rust
        Err(_) => Some(data[..l].to_vec()),
    })
    .and_then(|l|
        // don't do extra work if this column isn't UTF-8
```
💯
```rust
if self.is_utf8() {
    match str::from_utf8(data) {
        Ok(str_data) => truncate_utf8(str_data, l),
        Err(_) => Some(data[..l].to_vec()),
```
it is a somewhat questionable move to truncate this on invalid data, but I see that is what the code used to do, so it seems good to me
Hmm, good point. The old code simply tried UTF-8 first, and then fell back. Here we're actually expecting valid UTF-8, so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.
I think we should leave it as is, and maybe document that if non-UTF-8 data is passed in it will be truncated at the byte level
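For example, such documentation might include a snippet like this (illustrative bytes, not from the PR):

```rust
// 0xFF can never appear in valid UTF-8, so this input takes the Err(_) arm
// above and is truncated bytewise rather than at a character boundary.
let data: &[u8] = b"hi\xFF\xFE";
assert!(std::str::from_utf8(data).is_err());
// what Some(data[..l].to_vec()) yields for l = 3:
assert_eq!(&data[..3], b"hi\xFF");
```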
```rust
let curr_len = code_point.len_utf8();
let original = code_point as u32;
if let Some(next_char) = char::from_u32(original + 1) {
    // do not allow increasing byte width of incremented char
```
I suppose it is never the case that `next_char.len_utf8()` is going to be shorter than the current length, as UTF-8 encodings of a larger codepoint will always be at least as large 🤔

I guess I am wondering whether this should be an inequality. However, I think it is easier to reason about the invariants if it is known to be equal.
Yes, there's no way incrementing a valid character will overflow a u32, so we can assume it only grows. I suppose we could change the test to something like `if idx + next_char.len_utf8() <= data.len()`. That way, if we've already removed some characters, creating space under the truncation limit, we can afford the character growing by a byte. If we want to take that tack, then we should probably pass in the truncation length, as we may have already gone under the limit due to not splitting a character.

I think that's maybe being too fussy and would prefer to keep this simple. Thoughts?
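For concreteness, a rough sketch of that relaxed variant (hypothetical; not what this PR adopts, and `truncation_len` would need to be passed in by the caller):

```rust
fn increment_utf8_relaxed(data: &str, truncation_len: usize) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        if let Some(next_char) = char::from_u32(code_point as u32 + 1) {
            // allow the incremented char to widen, as long as the result
            // still fits within the original truncation target
            if idx + next_char.len_utf8() <= truncation_len {
                let mut result = data.as_bytes()[..idx].to_vec();
                let mut buf = [0u8; 4];
                result.extend_from_slice(next_char.encode_utf8(&mut buf).as_bytes());
                return Some(result);
            }
        }
    }
    None
}
```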
parquet/src/column/writer/mod.rs
Outdated
```rust
fn increment_utf8(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
```
pedantic me would likely rename `original` --> `original_char` to mirror the naming of `next_char`
I shall out-pedant you and point out that `original` is a `u32` while `next_char` is a `char` 😄. I'm reworking this to eliminate the need for `original` and will rename things to be consistent.
parquet/src/column/writer/mod.rs
Outdated
```rust
let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 8).unwrap();
assert_eq!(&r, "𐀀𐀁".as_bytes());
```
Is there a test for incrementing that doesn't have space for 2 characters (aka exercises the loop twice)? Maybe something like truncating to 5 bytes, `truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap()`, which could only hold a single 4-byte UTF-8 code point.
Hmm, no, I don't think there is a test like that. There is a truncation test that does this, but it isn't followed by an increment. I'll modify some as you suggest.
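A sketch of what such a test could look like (the test name is illustrative; the expected value follows from the truncate-then-increment logic above, since a 5-byte budget keeps only the first 4-byte character, which then increments from U+10000 to U+10001):

```rust
#[test]
fn test_increment_after_tight_truncation() {
    let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap();
    assert_eq!(&r, "𐀁".as_bytes());
}
```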
Which issue does this PR close?
Closes #6867.
Rationale for this change
See issue.
What changes are included in this PR?
For max statistics, this PR replaces `truncate_utf8().and_then(increment_utf8)` with a new function `truncate_and_increment_utf8()`. This defers the creation of a new `Vec<u8>` until all processing is complete. This also changes `increment_utf8` to operate on entire Unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, this modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type is `String` (or converted type `UTF8`).

The new increment logic is up to 30X faster for pathological cases of strings that cannot be truncated, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.
Are there any user-facing changes?
No API changes, but will potentially produce different truncated max statistics.