Parquet UTF-8 max statistics are overly pessimistic #6867

Open
etseidl opened this issue Dec 11, 2024 · 7 comments · May be fixed by #6870
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

Comments

etseidl commented Dec 11, 2024

Describe the bug
A recent comment regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the increment_utf8 function in the parquet::column::writer module isn't optimal.

Take for example the Unicode character U+00FF ('ÿ'), which is encoded in UTF-8 as 0xC3 0xBF (0b11000011 0b10111111). Incrementing this code point by one should yield U+0100 ('Ā', encoded as 0xC4 0x80). increment_utf8, however, instead yields U+013F ('Ŀ'), due to how overflow is handled:

// Inside increment_utf8's loop over byte positions, working from the end of the string:
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
    data[idx] = byte;
    if str::from_utf8(&data).is_ok() {
        return Some(data);
    }
    // Not valid UTF-8: restore this byte and move on to the next higher byte.
    data[idx] = original;
}

What happens in this case is that the lower byte 0xBF is incremented to 0xC0, which is not a valid UTF-8 continuation byte. That byte is then restored and the next higher byte 0xC3 is incremented to 0xC4, yielding a final UTF-8 encoding of 0xC4 0xBF, which is U+013F. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (0x4F) becoming '_' (0x5F) rather than 'P' (0x50).
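As a concrete illustration, the following standalone sketch (my own, not the parquet-rs code) reproduces those two steps for a single 'ÿ' and shows where the resulting bound lands:

fn main() {
    // U+00FF ('ÿ') is the two bytes [0xC3, 0xBF].
    let mut data = "\u{ff}".as_bytes().to_vec();

    // Step 1: increment the last byte: 0xBF -> 0xC0, which is not a valid
    // continuation byte, so [0xC3, 0xC0] is rejected by UTF-8 validation.
    data[1] = data[1].wrapping_add(1);
    assert!(std::str::from_utf8(&data).is_err());

    // Step 2: restore the last byte and increment the next higher byte instead:
    // 0xC3 -> 0xC4, giving [0xC4, 0xBF] = U+013F ('Ŀ') rather than U+0100 ('Ā').
    data[1] = 0xBF;
    data[0] = data[0].wrapping_add(1);
    assert_eq!(std::str::from_utf8(&data).unwrap(), "\u{13f}");
}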

The end result is still a valid upper bound, but it is not as tight as it could be, and it can cause some row groups or pages to be read when they could otherwise have been skipped.

To Reproduce
Add the following to test_increment_utf8 and it will pass.

        // "ÿÿ"
        let s = "\u{ff}\u{ff}";
        // "ÿĀ" -- the correct increment of s
        let s1 = "\u{ff}\u{100}";
        // "ÿĿ" -- what increment_utf8 currently produces
        let s2 = "\u{ff}\u{13f}";
        let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
        assert_eq!(v, s2.as_bytes());
        assert_ne!(v, s1.as_bytes());

Expected behavior
The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".

Additional context
One possible fix would be to do something more akin to what is being done in apache/datafusion#12978, where the string data is first converted to chars, incremented, and then converted back to UTF-8. This can result in some characters increasing in size, which may not be desired here considering we're trying to truncate to less than a certain number of bytes. Another alternative is to set overflowed continuation bytes to 0x80 (the minimum valid continuation byte) when moving on to the next higher byte. If the character is at a byte width boundary, the bytes for that character can be set to 0 and the next higher character can be incremented. This would not increase the byte count, but can result in less optimal upper bounds (but still better than the current ones).
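For reference, here is a rough sketch (mine, hypothetical, not code from either crate) of the first alternative — decode to chars, bump the last code point, and re-encode — which also shows why the result can grow in bytes:

// Increment the last character of `s` to the next valid Unicode scalar value,
// re-encoding as UTF-8. The output may be longer than the input in bytes
// (e.g. "\u{7ff}" is 2 bytes but its successor "\u{800}" is 3).
fn increment_last_char(s: &str) -> Option<String> {
    let mut chars: Vec<char> = s.chars().collect();
    while let Some(last) = chars.pop() {
        // Next scalar value after `last`, skipping the surrogate range.
        let mut next = last as u32 + 1;
        if (0xD800..0xE000).contains(&next) {
            next = 0xE000;
        }
        if let Some(c) = char::from_u32(next) {
            chars.push(c);
            return Some(chars.into_iter().collect());
        }
        // `last` was char::MAX; drop it and try the previous character.
    }
    None
}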

etseidl added the bug label Dec 11, 2024
etseidl commented Dec 11, 2024

cc @alamb, more fallout from dinner 🤣

tustvold added the enhancement label and removed the bug label Dec 11, 2024
tustvold changed the title from "Parquet UTF-8 max statistics are not properly incremented when truncated" to "Parquet UTF-8 max statistics are overly pessimistic" Dec 11, 2024
tustvold commented:

Changing to enhancement as the bound is technically still valid.

If memory serves, this logic was based on another implementation, but I can't remember precisely.

mapleFU commented Dec 11, 2024

alamb commented Dec 11, 2024

One possible fix would be to ...

It would be amazing if we figured out how to do this truncation / incrementing / decrementing correctly in one place (the parquet crate) and then just reused the same logic in datafusion

Thank you @etseidl @tustvold and @mapleFU for your insights

etseidl linked a pull request Dec 11, 2024 that will close this issue
etseidl commented Dec 11, 2024

It would be amazing if we figured out how to do this truncation / incrementing / decrementing correctly in one place (the parquet crate) and then just reused the same logic in datafusion

They are slightly different use cases, though. Here we're taking an N-character, M-byte string and truncating it to no larger than T bytes (it may end up smaller due to character boundaries), and then incrementing the final character if possible. Since we're constrained by the size of the vector of bytes we're operating over, we can't promote a 2-byte character to a 3-byte one, and so we get a less ideal bound. What @adriangb et al are doing in datafusion is a bit different. There they have a prefix that they want to increment, but they're not constrained by size, so they're free to switch to wider characters if necessary. We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is a best effort, just like with page and row group sizes).
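For the size-constrained side, here is a minimal sketch of the "truncate to at most T bytes on a character boundary" step (a hypothetical helper of my own, not the parquet-rs API):

// Return the longest prefix of `s` that is at most `max_len` bytes and ends on a
// character boundary; it may be shorter than `max_len` if a multi-byte character
// straddles the limit.
fn truncate_utf8(s: &str, max_len: usize) -> &str {
    if s.len() <= max_len {
        return s;
    }
    let mut end = max_len;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

// For example, truncating "aÿÿ" (5 bytes) to 4 bytes yields "aÿ" (3 bytes),
// because a cut at byte 4 would split the second 'ÿ'.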

I've submitted #6870 which continues with the do-no-overshoot approach. If we want to relax the bounds a bit, then we could adopt what's being proposed in apache/datafusion#12978. I'd also like to do some testing to see if there are performance impacts (although I'd expect these to be minimal given the truncation happens at most once per page and column chunk).

alamb commented Dec 12, 2024

We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is a best effort, just like with page and row group sizes).

This seems fine with me

Another approach might be to fall back to one fewer character if incrementing the truncated character at T bytes would grow past T due to promotion.

maybe something like

if let Some(incremented) = increment_utf8(input) {
  if incremented.len() > max_len {
    // promotion pushed the result past the byte budget, so retry on a shorter prefix
    increment_utf8(remove_last_char(input))
  }
}
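remove_last_char isn't an existing helper; one possible reading of it (a hedged sketch) is:

// Drop the final character of `s`, respecting character boundaries.
fn remove_last_char(s: &str) -> &str {
    match s.char_indices().last() {
        Some((idx, _)) => &s[..idx],
        None => s, // empty string stays empty
    }
}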

adriangb commented:

This seems fine with me

Agreed as well. FWIW the way we use the truncation is "make sure we don't put an entire stacktrace in stats or something pointless like that". I don't think 1 byte more or less matters.
