Parquet UTF-8 max statistics are overly pessimistic #6867
Comments
cc @alamb, more fallout from dinner 🤣
Changing to enhancement as the bound is technically still valid. If memory serves, this logic was based on another implementation, but I can't remember precisely which.
They are slightly different use cases, though. Here we're taking an N-character, M-byte string and truncating it to no larger than T bytes (it may come out smaller due to character boundaries), and then incrementing the final character if possible. Since we're constrained by the size of the vector of bytes we're operating over, we can't promote a 2-byte character to a 3-byte one, and so we get a less ideal bound.

What @adriangb et al. are doing in DataFusion is a bit different. There they have a prefix that they want to increment, but they're not constrained by size, so they are free to switch to wider characters if necessary. We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is best effort, just like with page and row group sizes).

I've submitted #6870, which continues with the do-not-overshoot approach. If we want to relax the bounds a bit, then we could adopt what's being proposed in apache/datafusion#12978. I'd also like to do some testing to see if there are performance impacts (although I'd expect these to be minimal given the truncation happens at most once per page and column chunk).
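The char-based, size-relaxed approach described above can be sketched in Rust. `increment_chars` below is a hypothetical helper written for illustration, in the spirit of apache/datafusion#12978, not the actual DataFusion or parquet-rs code:

```rust
/// Hypothetical sketch of a char-based increment (not the actual
/// parquet-rs or DataFusion implementation). The last char is bumped to
/// the next valid Unicode scalar value, which may widen its UTF-8
/// encoding (e.g. U+007F at 1 byte -> U+0080 at 2 bytes).
fn increment_chars(s: &str) -> Option<String> {
    let mut chars: Vec<char> = s.chars().collect();
    while let Some(last) = chars.pop() {
        // Find the next scalar value; the range + `char::from_u32`
        // naturally skips the surrogate gap.
        if let Some(next) = (last as u32 + 1..=char::MAX as u32).find_map(char::from_u32) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // `last` was char::MAX: drop it and increment the previous char.
    }
    None
}

fn main() {
    // 'ÿ' (U+00FF) increments cleanly to 'Ā' (U+0100), which the
    // byte-constrained implementation cannot always produce.
    assert_eq!(increment_chars("ÿ"), Some("Ā".to_string()));
    assert_eq!(increment_chars("ÿÿ"), Some("ÿĀ".to_string()));
    println!("ok");
}
```

Since incrementing a single char crosses at most one UTF-8 width boundary, the result can exceed the requested truncation length by at most one byte, which matches the trade-off discussed above.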
This seems fine to me. Another approach might be to fall back to one fewer character if incrementing the truncated string overshoots the length limit, maybe something like:

```rust
if let Some(incremented) = increment_utf8(input) {
    if incremented.len() > max_len {
        increment_utf8(remove_last_char(input))
    }
}
```
Agreed as well. FWIW, the way we use the truncation is "make sure we don't put an entire stack trace in stats or something pointless like that". I don't think 1 byte more or less matters.
Describe the bug
A recent comment regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the `increment_utf8` function in the `parquet::column::writer` module isn't optimal.

Take for example the Unicode character U+00FF ('ÿ'), which is encoded as `0xC3BF`, or `0b11000011 0b10111111`. Incrementing this code point by one should yield U+0100 (`0xC480`, or 'Ā'). `increment_utf8`, however, will instead yield U+013F ('Ŀ'), due to how overflow is handled.

arrow-rs/parquet/src/column/writer/mod.rs, lines 1444 to 1451 in 08ac587
What happens in this case is that the lower byte `0xBF` is incremented to `0xC0`, which is not a valid UTF-8 continuation byte. That byte is then left alone and the next higher byte `0xC3` is incremented to `0xC4`, yielding a final UTF-8 encoded character of `0xC4BF`, which is Unicode U+013F. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (`0x4F`) becoming '_' (`0x5F`) rather than 'P' (`0x50`).

The end result is still a valid upper bound, but it is not as tight as it could be, and could result in some row groups or pages being read when they don't actually need to be.
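To make the failure mode concrete, here is a small self-contained reconstruction of the byte-wise increment described above. `increment_utf8_naive` is a hypothetical stand-in for `increment_utf8`, not the actual parquet-rs code:

```rust
/// Hypothetical reconstruction of the byte-wise increment described in
/// this issue (a stand-in for parquet-rs's `increment_utf8`, not the real
/// code): bump the last byte; if the result is not valid UTF-8, restore
/// it and move on to the next higher byte.
fn increment_utf8_naive(data: &str) -> Option<String> {
    let mut bytes = data.as_bytes().to_vec();
    for idx in (0..bytes.len()).rev() {
        let original = bytes[idx];
        let (incremented, overflow) = original.overflowing_add(1);
        if !overflow {
            bytes[idx] = incremented;
            if let Ok(s) = String::from_utf8(bytes.clone()) {
                return Some(s);
            }
            bytes[idx] = original; // invalid UTF-8: restore and carry left
        }
    }
    None
}

fn main() {
    // 'ÿ' is 0xC3 0xBF. Incrementing 0xBF yields 0xC0, which is not a
    // valid continuation byte, so the algorithm bumps 0xC3 to 0xC4
    // instead, giving 0xC4 0xBF = 'Ŀ' (U+013F) rather than the tighter
    // bound 'Ā' (U+0100).
    assert_eq!(increment_utf8_naive("ÿ"), Some("Ŀ".to_string()));
    println!("ok");
}
```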
To Reproduce
Add the following to `test_increment_utf8` and it will pass.

Expected behavior

The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".
Additional context
One possible fix would be to do something more akin to what is being done in apache/datafusion#12978, where the string data is first converted to chars, incremented, and then converted back to UTF-8. This can result in some characters increasing in size, which may not be desired here considering we're trying to truncate to less than a certain number of bytes.

Another alternative is to set overflowed continuation bytes to `0x80` (the minimum valid continuation byte) when carrying into the next higher byte. If the character is at a byte-width boundary, the bytes for that character can be set to `0` and the next higher character can be incremented. This would not increase the byte count, but can result in less optimal upper bounds (though still better than the current ones).
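The second alternative can be sketched as follows. `increment_utf8_carry` is a hypothetical illustration, not the actual parquet-rs code, and for simplicity it gives up (returns `None`) at a byte-width boundary rather than implementing the set-to-zero step:

```rust
/// Hypothetical sketch of the "reset overflowed continuation bytes to
/// 0x80" alternative (not the actual parquet-rs code). For simplicity it
/// returns None when a character sits at a byte-width boundary instead
/// of implementing the set-to-zero carry step described in the issue.
fn increment_utf8_carry(data: &str) -> Option<String> {
    let mut bytes = data.as_bytes().to_vec();
    for idx in (0..bytes.len()).rev() {
        let original = bytes[idx];
        let (incremented, overflow) = original.overflowing_add(1);
        if !overflow {
            bytes[idx] = incremented;
            if let Ok(s) = String::from_utf8(bytes.clone()) {
                return Some(s);
            }
        }
        // Overflow or invalid UTF-8: reset a continuation byte to the
        // minimum (0x80) before carrying left; restore any other byte.
        bytes[idx] = if (0x80..=0xBF).contains(&original) { 0x80 } else { original };
    }
    None
}

fn main() {
    // With the reset-to-0x80 rule, 'ÿ' (0xC3 0xBF) increments to the
    // tight bound 'Ā' (0xC4 0x80, U+0100) instead of 'Ŀ' (0xC4 0xBF).
    assert_eq!(increment_utf8_carry("ÿ"), Some("Ā".to_string()));
    println!("ok");
}
```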