Improvements to UTF-8 statistics truncation #6870

Open · etseidl wants to merge 9 commits into main
Conversation

@etseidl (Contributor) commented on Dec 11, 2024

Which issue does this PR close?

Closes #6867.

Rationale for this change

See issue.

What changes are included in this PR?

For max statistics, this replaces truncate_utf8().and_then(increment_utf8) with a new function truncate_and_increment_utf8(), which defers creating a new Vec<u8> until all processing is complete. It also changes increment_utf8 to operate on entire Unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, it modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type is String (or converted type UTF8).

The new increment logic is up to 30X faster for pathological cases of strings that cannot be incremented, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.
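For illustration, here is a minimal sketch of the code-point-based increment idea (my simplification; the PR's actual draft, which also defers allocation, appears later in this thread):

/// Walk chars from the end and bump the first one whose successor is a
/// valid char of the same encoded width (sketch only, not the PR code).
fn increment_last_code_point(s: &str) -> Option<String> {
    for (idx, c) in s.char_indices().rev() {
        if let Some(next) = char::from_u32(c as u32 + 1) {
            // Skip successors that would widen the UTF-8 encoding,
            // e.g. U+007F -> U+0080 grows from 1 byte to 2.
            if next.len_utf8() == c.len_utf8() {
                let mut out = String::with_capacity(idx + next.len_utf8());
                out.push_str(&s[..idx]);
                out.push(next);
                return Some(out);
            }
        }
    }
    None
}

For example, increment_last_code_point("abc") yields Some("abd".to_string()).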

Are there any user-facing changes?

No API changes, but this will potentially produce different truncated max statistics.

@github-actions bot added the parquet (Changes to the parquet crate) label on Dec 11, 2024
@alamb (Contributor) left a comment:
Thanks @etseidl -- the tests in this PR are amazing

I am not sure about the logic to increment the UTF-8 bytes. Let me know what you think.

(Review threads on parquet/src/column/writer/mod.rs follow; outdated threads were resolved.)
// its available bits. If it was a continuation byte (b10xxxxxx) then set to min
// continuation (b10000000). Otherwise it was the first byte, so reset the first
// byte back to its original value (so data remains a valid string) and reduce "len".
if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
A project member commented:

If this isn't super perf critical, can we switch over to operating on codepoints?
(i assume this is for stats only, so not a hot path?)

@alamb (Contributor) replied on Dec 12, 2024:

> If this isn't super perf critical, can we switch over to operating on codepoints? (i assume this is for stats only, so not a hot path?)

I agree switching to arithmetic on codepoints would be easier to reason about.

I double checked and this function is only called while writing stats (at most once per page, and once per column chunk ):
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code

@etseidl (author) replied:
OK, here's what I came up with (stealing liberally from the code in apache/datafusion#12978 😄); it avoids as much translation/allocation as I could manage.

/// caller guarantees that data.len() > length
fn truncate_and_inc_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    // UTF-8 is max 4 bytes, so start search 3 back from desired length
    let lower_bound = length.saturating_sub(3);
    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
    increment_utf8_str(data.get(..split)?)
}

fn increment_utf8_str(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
        if let Some(next_char) = char::from_u32(original + 1) {
            // do not allow increasing byte width of incremented char
            if next_char.len_utf8() == curr_len {
                let mut result = data.as_bytes()[..idx + curr_len].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}

This winds up being a little faster than the byte-by-byte implementation in this PR in the latter's best case, and as much as 30X faster for the worst case.

The check for increasing byte count can be skipped if we don't care about overshooting a little. This also doesn't include the check for non-characters, as all we care about is a valid UTF-8 bound; the bounds shouldn't be interpreted by readers anyway since they are not exact. The only case this doesn't handle is incrementing U+D7FF: the range U+D800..U+DFFF is reserved for UTF-16 surrogates, so those code points are not valid characters. The byte-by-byte implementation will increment to the next valid code point, U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This isn't really a practical concern, though, since the nearest assigned code point below is U+D7FB.
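For reference, the surrogate-gap behavior falls out of char::from_u32 directly:

// char::from_u32 rejects surrogate code points (U+D800..=U+DFFF), so the
// code above cannot step U+D7FF up to U+E000; it reports no increment.
assert!(char::from_u32(0xD7FF + 1).is_none()); // U+D800 is a surrogate
assert!(char::from_u32(0xE000).is_some()); // next valid code point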

@alamb @findepi if this looks reasonable I'll clean it up and push it.

@etseidl etseidl changed the title Fix some edge cases in UTF-8 incrementing Improvements to UTF-8 statistics truncation Dec 13, 2024
@alamb (Contributor) left a comment:

Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.

  1. It only copies the string values once (like the current code)
  2. I am convinced it is correct ❤️ (hopefully those will not be famous last words)

So all in all, thank you and great job.

Err(_) => Some(data[..l].to_vec()),
})
.and_then(|l|
// don't do extra work if this column isn't UTF-8
A contributor commented:

💯

if self.is_utf8() {
match str::from_utf8(data) {
Ok(str_data) => truncate_utf8(str_data, l),
Err(_) => Some(data[..l].to_vec()),
A contributor commented:
It is a somewhat questionable move to truncate this on invalid data, but I see that is what the code used to do, so this seems good to me.

@etseidl (author) replied:

Hmm, good point. The old code simply tried UTF-8 first and then fell back. Here we're actually expecting valid UTF-8, so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.

A contributor replied:

I think we should leave it as is, and maybe document that if non-UTF-8 data is passed in, it will be truncated on raw bytes.
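For concreteness, a commented sketch of that fallback (the else branch for non-String columns is my assumption from the surrounding discussion, not the PR's exact code):

if self.is_utf8() {
    match str::from_utf8(data) {
        // Valid UTF-8: truncate at a character boundary.
        Ok(str_data) => truncate_utf8(str_data, l),
        // Invalid UTF-8 in a String column: fall back to plain byte
        // truncation, as the suggested documentation note would describe.
        Err(_) => Some(data[..l].to_vec()),
    }
} else {
    // Non-String columns never need UTF-8 handling.
    Some(data[..l].to_vec())
}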

let curr_len = code_point.len_utf8();
let original = code_point as u32;
if let Some(next_char) = char::from_u32(original + 1) {
// do not allow increasing byte width of incremented char
A contributor commented:

I suppose it is never the case that next_char.len_utf8() will be shorter than the current length, since the UTF-8 encoding of a larger code point is always at least as long 🤔

I guess I am asking whether this should be an inequality. However, I think it is easier to reason about the invariants if it is known to be equal.

@etseidl (author) replied:

Yes, there's no way incrementing a valid character will overflow a u32, so we can assume it only grows. I suppose we could change the test to something like if idx + next_char.len_utf8() <= data.len(). That way, if we've already removed some characters and created space under the truncation limit, we can afford the character growing by a byte. If we want to take that tack, then we should probably pass in the truncation length, as we may have already gone under the limit due to not splitting a character.

I think that's maybe being too fussy and would prefer to keep this simple. Thoughts?
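For reference, a sketch of the relaxed-inequality variant being discussed (hypothetical; the PR keeps the equality check):

// Variant of increment_utf8 that allows the successor char to be wider,
// as long as the result still fits within the original byte length.
fn increment_utf8_relaxed(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        if let Some(next_char) = char::from_u32(code_point as u32 + 1) {
            if idx + next_char.len_utf8() <= data.len() {
                let mut result = data.as_bytes()[..idx + next_char.len_utf8()].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}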

fn increment_utf8(data: &str) -> Option<Vec<u8>> {
for (idx, code_point) in data.char_indices().rev() {
let curr_len = code_point.len_utf8();
let original = code_point as u32;
A contributor commented:

pedantic me would likely rename original --> original_char to mirror the naming of next_char

@etseidl (author) replied:

I shall out-pedant you and point out that original is a u32 while next_char is a char 😄. I'm reworking this to eliminate the need for original and will rename things to be consistent.

Comment on lines 3293 to 3294
let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 8).unwrap();
assert_eq!(&r, "𐀀𐀁".as_bytes());
A contributor commented:

Is there a test for incrementing that doesn't have space for 2 characters (i.e., one that exercises the loop more than once)?

Maybe something like truncating this to 5 bytes:

truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap();

which could only hold a single 4-byte UTF-8 code point.

@etseidl (author) replied:

Hmm, no, I don't think there is a test like that. There is a truncation test that does this, but it is not followed by an increment. I'll modify some tests as you suggest.
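Based on the draft implementation above, the suggested test would look something like this (the expected value is my reading of the draft, not the merged code):

// "𐀀" is U+10000, which encodes to 4 bytes in UTF-8, so truncating
// "𐀀𐀀𐀀" to 5 bytes keeps only one full code point; it is then
// incremented to U+10001.
let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap();
assert_eq!(&r, "𐀁".as_bytes());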

Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close: Parquet UTF-8 max statistics are overly pessimistic (#6867)