Parquet UTF-8 max statistics are overly pessimistic #6867

Open
etseidl opened this issue Dec 11, 2024 · 7 comments · May be fixed by #6870
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

Comments

etseidl commented Dec 11, 2024

Describe the bug
A recent comment regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the increment_utf8 function in the parquet::column::writer module isn't optimal.

Take for example the Unicode character U+00FF ('ÿ'), which is encoded in UTF-8 as 0xC3 0xBF (0b11000011 0b10111111). Incrementing this code point by one should yield U+0100 ('Ā', encoded as 0xC4 0x80). increment_utf8, however, instead yields U+013F ('Ŀ'), due to how overflow is handled:

// Inside increment_utf8's loop over byte positions, working from the end of the string:
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
    data[idx] = byte;
    if str::from_utf8(&data).is_ok() {
        return Some(data);
    }
    // Not valid UTF-8: restore this byte and move on to the next higher byte.
    data[idx] = original;
}

What happens in this case is that the lower byte 0xBF is incremented to 0xC0, which is not a valid UTF-8 continuation byte. That byte is then restored and the next higher byte 0xC3 is incremented to 0xC4, yielding a final UTF-8 encoding of 0xC4 0xBF, which is U+013F. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (0x4F) becoming '_' (0x5F) rather than 'P' (0x50).
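As a concrete illustration, the following standalone sketch (my own, not the parquet-rs code) reproduces those two steps for a single 'ÿ' and shows where the resulting bound lands:

fn main() {
    // U+00FF ('ÿ') is the two bytes [0xC3, 0xBF].
    let mut data = "\u{ff}".as_bytes().to_vec();

    // Step 1: increment the last byte: 0xBF -> 0xC0, which is not a valid
    // continuation byte, so [0xC3, 0xC0] is rejected by UTF-8 validation.
    data[1] = data[1].wrapping_add(1);
    assert!(std::str::from_utf8(&data).is_err());

    // Step 2: restore the last byte and increment the next higher byte instead:
    // 0xC3 -> 0xC4, giving [0xC4, 0xBF] = U+013F ('Ŀ') rather than U+0100 ('Ā').
    data[1] = 0xBF;
    data[0] = data[0].wrapping_add(1);
    assert_eq!(std::str::from_utf8(&data).unwrap(), "\u{13f}");
}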

The end result is still a valid upper bound, but it is not as tight as it could be, and it can cause some row groups or pages to be read when they could otherwise have been skipped.

To Reproduce
Add the following to test_increment_utf8 and it will pass.

        // "ÿÿ"
        let s = "\u{ff}\u{ff}";
        // "ÿĀ" -- the correct increment of s
        let s1 = "\u{ff}\u{100}";
        // "ÿĿ" -- what increment_utf8 currently produces
        let s2 = "\u{ff}\u{13f}";
        let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
        assert_eq!(v, s2.as_bytes());
        assert_ne!(v, s1.as_bytes());

Expected behavior
The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".

Additional context
One possible fix would be to do something more akin to what is being done in apache/datafusion#12978, where the string data is first converted to chars, incremented, and then converted back to UTF-8. This can result in some characters increasing in size, which may not be desired here considering we're trying to truncate to less than a certain number of bytes. Another alternative is to set overflowed continuation bytes to 0x80 (the minimum valid continuation byte) when moving on to the next higher byte. If the character is at a byte width boundary, the bytes for that character can be set to 0 and the next higher character can be incremented. This would not increase the byte count, but can result in less optimal upper bounds (but still better than the current ones).
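For reference, here is a rough sketch (mine, hypothetical, not code from either crate) of the first alternative — decode to chars, bump the last code point, and re-encode — which also shows why the result can grow in bytes:

// Increment the last character of `s` to the next valid Unicode scalar value,
// re-encoding as UTF-8. The output may be longer than the input in bytes
// (e.g. "\u{7ff}" is 2 bytes but its successor "\u{800}" is 3).
fn increment_last_char(s: &str) -> Option<String> {
    let mut chars: Vec<char> = s.chars().collect();
    while let Some(last) = chars.pop() {
        // Next scalar value after `last`, skipping the surrogate range.
        let mut next = last as u32 + 1;
        if (0xD800..0xE000).contains(&next) {
            next = 0xE000;
        }
        if let Some(c) = char::from_u32(next) {
            chars.push(c);
            return Some(chars.into_iter().collect());
        }
        // `last` was char::MAX; drop it and try the previous character.
    }
    None
}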

etseidl added the bug label Dec 11, 2024
etseidl commented Dec 11, 2024

cc @alamb, more fallout from dinner 🤣

tustvold added the enhancement label and removed the bug label Dec 11, 2024
tustvold changed the title from "Parquet UTF-8 max statistics are not properly incremented when truncated" to "Parquet UTF-8 max statistics are overly pessimistic" Dec 11, 2024
tustvold commented:

Changing to enhancement as the bound is technically still valid.

If memory serves, this logic was based on another implementation, but I can't remember precisely.

mapleFU commented Dec 11, 2024

alamb commented Dec 11, 2024

One possible fix would be to ...

It would be amazing if we figured out how to do this truncation / incrementing / decrementing correctly in one place (the parquet crate) and then just reused the same logic in datafusion

Thank you @etseidl @tustvold and @mapleFU for your insights

etseidl linked a pull request Dec 11, 2024 that will close this issue
etseidl commented Dec 11, 2024

It would be amazing if we figured out how to do this truncation / incrementing / decrementing correctly in one place (the parquet crate) and then just reused the same logic in datafusion

They are slightly different use cases, though. Here we're taking an N-character, M-byte string and truncating it to no larger than T bytes (it may end up smaller due to character boundaries), and then incrementing the final character if possible. Since we're constrained by the size of the vector of bytes we're operating over, we can't promote a 2-byte character to a 3-byte one, and so we get a less ideal bound. What @adriangb et al are doing in datafusion is a bit different. There they have a prefix that they want to increment, but they're not constrained by size, so they're free to switch to wider characters if necessary. We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is a best effort, just like with page and row group sizes).
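For the size-constrained side, here is a minimal sketch of the "truncate to at most T bytes on a character boundary" step (a hypothetical helper of my own, not the parquet-rs API):

// Return the longest prefix of `s` that is at most `max_len` bytes and ends on a
// character boundary; it may be shorter than `max_len` if a multi-byte character
// straddles the limit.
fn truncate_utf8(s: &str, max_len: usize) -> &str {
    if s.len() <= max_len {
        return s;
    }
    let mut end = max_len;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

// For example, truncating "aÿÿ" (5 bytes) to 4 bytes yields "aÿ" (3 bytes),
// because a cut at byte 4 would split the second 'ÿ'.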

I've submitted #6870 which continues with the do-no-overshoot approach. If we want to relax the bounds a bit, then we could adopt what's being proposed in apache/datafusion#12978. I'd also like to do some testing to see if there are performance impacts (although I'd expect these to be minimal given the truncation happens at most once per page and column chunk).

alamb commented Dec 12, 2024

We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is a best effort, just like with page and row group sizes).

This seems fine with me

Another approach might be to fall back to one fewer character if incrementing the truncated character at T bytes would grow past T due to promotion.

maybe something like

if let Some(incremented) = increment_utf8(input) {
  if incremented.len() > max_len {
    // promotion pushed the result past the byte budget, so retry on a shorter prefix
    increment_utf8(remove_last_char(input))
  }
}
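remove_last_char isn't an existing helper; one possible reading of it (a hedged sketch) is:

// Drop the final character of `s`, respecting character boundaries.
fn remove_last_char(s: &str) -> &str {
    match s.char_indices().last() {
        Some((idx, _)) => &s[..idx],
        None => s, // empty string stays empty
    }
}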

adriangb commented:

This seems fine with me

Agreed as well. FWIW the way we use the truncation is "make sure we don't put an entire stacktrace in stats or something pointless like that". I don't think 1 byte more or less matters.
