Improvements to UTF-8 statistics truncation #6870
base: main
Conversation
Thanks @etseidl -- the tests in this PR are amazing
I am not sure about the logic to increment the UTF-8 bytes. Let me know what you think.
parquet/src/column/writer/mod.rs
Outdated
```rust
// its available bits. If it was a continuation byte (b10xxxxxx) then set to min
// continuation (b10000000). Otherwise it was the first byte, so reset the first
// byte back to its original value (so data remains a valid string) and reduce "len".
if original & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION {
```
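For context, the constants referenced in this snippet are presumably along these lines (the names come from the snippet; the exact values are an assumption based on the comment's b10xxxxxx description):

```rust
// Assumed definitions, inferred from the comment above -- not copied from the PR.
const UTF8_CONTINUATION: u8 = 0b1000_0000; // marker bits of a continuation byte (b10xxxxxx)
const UTF8_CONTINUATION_MASK: u8 = 0b1100_0000; // isolates the two marker bits

// A byte is a UTF-8 continuation byte iff its top two bits are `10`.
fn is_continuation(byte: u8) -> bool {
    byte & UTF8_CONTINUATION_MASK == UTF8_CONTINUATION
}
```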
If this isn't super perf critical, can we switch over to operating on codepoints?
(i assume this is for stats only, so not a hot path?)
> If this isn't super perf critical, can we switch over to operating on codepoints? (i assume this is for stats only, so not a hot path?)
I agree switching to arithmetic on codepoints would be easier to reason about.
I double checked and this function is only called while writing stats (at most once per page, and once per column chunk):
https://github.com/search?q=repo%3Aapache%2Farrow-rs%20increment_utf8&type=code
Ok, here's what I came up with (stealing liberally from the code in apache/datafusion#12978 😄) that avoids as much translation/allocation as I could manage.
```rust
/// caller guarantees that data.len() > length
fn truncate_and_inc_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
    // UTF-8 is max 4 bytes, so start search 3 back from desired length
    let lower_bound = length.saturating_sub(3);
    let split = (lower_bound..=length).rfind(|x| data.is_char_boundary(*x))?;
    increment_utf8_str(data.get(..split)?)
}

fn increment_utf8_str(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
        if let Some(next_char) = char::from_u32(original + 1) {
            // do not allow increasing byte width of incremented char
            if next_char.len_utf8() == curr_len {
                let mut result = data.as_bytes()[..idx + curr_len].to_vec();
                next_char.encode_utf8(&mut result[idx..]);
                return Some(result);
            }
        }
    }
    None
}
```
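As a quick usage check of the sketch above (an illustrative example, not from the PR's test suite): truncating `"hello"` to 3 bytes yields `"hel"`, and incrementing the last character gives `"hem"`.

```rust
assert_eq!(truncate_and_inc_utf8("hello", 3), Some(b"hem".to_vec()));
```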
This winds up being a little faster than the byte-by-byte implementation in this PR in the latter's best case, and as much as 30X faster for the worst case.
The check for increasing byte count can be skipped if we don't care about overshooting a little. This also doesn't include the check for non-characters, as all we care about is a valid UTF-8 bound... the bounds shouldn't be interpreted by readers anyway as they are not exact.

The only case this doesn't handle is incrementing U+D7FF. The range U+D800..U+DFFF is reserved for UTF-16 surrogates, so those code points are not valid characters. The byte-by-byte implementation will increment to the next valid code point of U+E000, whereas this code will punt and say U+D7FF cannot be incremented. This isn't really a practical concern, though, since the nearest assigned code point is U+D7FB.
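To illustrate the punt using only standard library behavior (not code from this PR):

```rust
// U+D800..=U+DFFF are surrogates and not valid `char`s, so from_u32 returns None;
// the loop therefore moves on to the previous character instead of producing U+E000.
assert_eq!(char::from_u32(0xD7FF + 1), None);
assert_eq!(char::from_u32(0xE000), Some('\u{e000}'));
```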
@alamb @findepi if this looks reasonable I'll clean it up and push it.
Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.
- It only copies the string values once (like the current code)
- I am convinced it is correct ❤️ (hopefully those will not be famous last words)
So all in all, thank you and great job.
```rust
        Err(_) => Some(data[..l].to_vec()),
    })
    .and_then(|l|
        // don't do extra work if this column isn't UTF-8
```
💯
```rust
if self.is_utf8() {
    match str::from_utf8(data) {
        Ok(str_data) => truncate_utf8(str_data, l),
        Err(_) => Some(data[..l].to_vec()),
```
it is a somewhat questionable move to truncate this on invalid data, but I see that is what the code used to do, so it seems good to me
Hmm, good point. The old code simply tried UTF-8 first, and then fell back. Here we're actually expecting valid UTF-8, so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.
I think we should leave it as is, and maybe document that if non-UTF-8 data is passed in it will be truncated at the byte level
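For example, such documentation might include a snippet like this (illustrative bytes, not from the PR):

```rust
// 0xFF can never appear in valid UTF-8, so this input takes the Err(_) arm
// above and is truncated bytewise rather than at a character boundary.
let data: &[u8] = b"hi\xFF\xFE";
assert!(std::str::from_utf8(data).is_err());
// what Some(data[..l].to_vec()) yields for l = 3:
assert_eq!(&data[..3], b"hi\xFF");
```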
```rust
let curr_len = code_point.len_utf8();
let original = code_point as u32;
if let Some(next_char) = char::from_u32(original + 1) {
    // do not allow increasing byte width of incremented char
```
I suppose it is never the case that `next_char.len_utf8()` is going to be shorter than the current length, as UTF-8 encodings of a larger codepoint will always be at least as large 🤔

I guess I am wondering whether this should be an inequality. However, I think it is easier to reason about the invariants if it is known to be equal.
Yes, there's no way incrementing a valid character will overflow a u32, so we can assume it only grows. I suppose we could change the test to something like `if idx + next_char.len_utf8() <= data.len()`. That way, if we've already removed some characters, creating space under the truncation limit, we can afford the character growing by a byte. If we want to take that tack, then we should probably pass in the truncation length, as we may have already gone under the limit due to not splitting a character.

I think that's maybe being too fussy and would prefer to keep this simple. Thoughts?
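For concreteness, a rough sketch of that relaxed variant (hypothetical; not what this PR adopts, and `truncation_len` would need to be passed in by the caller):

```rust
fn increment_utf8_relaxed(data: &str, truncation_len: usize) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        if let Some(next_char) = char::from_u32(code_point as u32 + 1) {
            // allow the incremented char to widen, as long as the result
            // still fits within the original truncation target
            if idx + next_char.len_utf8() <= truncation_len {
                let mut result = data.as_bytes()[..idx].to_vec();
                let mut buf = [0u8; 4];
                result.extend_from_slice(next_char.encode_utf8(&mut buf).as_bytes());
                return Some(result);
            }
        }
    }
    None
}
```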
parquet/src/column/writer/mod.rs
Outdated
```rust
fn increment_utf8(data: &str) -> Option<Vec<u8>> {
    for (idx, code_point) in data.char_indices().rev() {
        let curr_len = code_point.len_utf8();
        let original = code_point as u32;
```
pedantic me would likely rename `original` --> `original_char` to mirror the naming of `next_char`
I shall out-pedant you and point out that `original` is a `u32` while `next_char` is a `char` 😄. I'm reworking this to eliminate the need for `original` and will rename things to be consistent.
parquet/src/column/writer/mod.rs
Outdated
```rust
let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 8).unwrap();
assert_eq!(&r, "𐀀𐀁".as_bytes());
```
Is there a test for incrementing that doesn't have space for 2 characters (aka exercises the loop twice)? Maybe something like truncating to 5 bytes, `truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap()`, which could only hold a single 4-byte UTF-8 code point.
Hmm, no, I don't think there is a test like that. There is a truncation test that does this, but it isn't followed by an increment. I'll modify some as you suggest.
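A sketch of what such a test could look like (the test name is illustrative; the expected value follows from the truncate-then-increment logic above, since a 5-byte budget keeps only the first 4-byte character, which then increments from U+10000 to U+10001):

```rust
#[test]
fn test_increment_after_tight_truncation() {
    let r = truncate_and_increment_utf8("𐀀𐀀𐀀", 5).unwrap();
    assert_eq!(&r, "𐀁".as_bytes());
}
```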
Which issue does this PR close?
Closes #6867.
Rationale for this change
See issue.
What changes are included in this PR?
For max statistics, this PR replaces `truncate_utf8().and_then(increment_utf8)` with a new function `truncate_and_increment_utf8()`. This defers the creation of a new `Vec<u8>` until all processing is complete. This also changes `increment_utf8` to operate on entire Unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, this modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type is `String` (or converted type `UTF8`).

The new increment logic is up to 30X faster for pathological cases of strings that cannot be truncated, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.
Are there any user-facing changes?
No API changes, but will potentially produce different truncated max statistics.