FSSTCompressor #664
Conversation
Ran the compress_taxi benchmark, got ~80% slower. I am a bit surprised that the biggest culprit seems to be creating new counters in the FSST training loop. That doesn't even scale w.r.t. the size of the input array, it's just a flat 2MB allocation. The zeroing of the vector seems to be the biggest problem. I think we can avoid that with a second bitmap, let me try that out.
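Roughly the shape of that bitmap trick, as a sketch (illustrative names only, not the actual fsst crate internals): keep a small bitmap of which counter slots have been touched, so a reset only clears the bitmap instead of memsetting the whole multi-megabyte counts vector.

```rust
/// Minimal sketch of a counter table that avoids a full memset on reset.
/// All names here are illustrative, not the actual fsst crate internals.
struct Counters {
    counts: Vec<u32>,  // one slot per symbol pair; never zeroed eagerly
    touched: Vec<u64>, // bitmap: bit i set => counts[i] is valid
}

impl Counters {
    fn new(len: usize) -> Self {
        Self {
            counts: vec![0; len],
            touched: vec![0; (len + 63) / 64],
        }
    }

    /// Increment slot `i`, lazily re-initializing it if it hasn't been
    /// touched since the last `clear`.
    fn increment(&mut self, i: usize) {
        let (word, bit) = (i / 64, 1u64 << (i % 64));
        if self.touched[word] & bit == 0 {
            self.touched[word] |= bit;
            self.counts[i] = 0;
        }
        self.counts[i] += 1;
    }

    fn get(&self, i: usize) -> u32 {
        let (word, bit) = (i / 64, 1u64 << (i % 64));
        if self.touched[word] & bit != 0 { self.counts[i] } else { 0 }
    }

    /// Resetting now only clears the (much smaller) bitmap instead of the
    /// full counts vector.
    fn clear(&mut self) {
        self.touched.fill(0);
    }
}
```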
Alright, using the change in spiraldb/fsst#21 helped a lot. New benchmark result:
Which is about 10ms or ~11% slower than running without FSST.
And I think we can go even lower; ideally we'd just use the trained compressor over the samples to compress the full array.
Just bear in mind that the samples can be very small compared to the data, e.g. 1024 elements. I would say just retrain it.
Ok I've done a few things today
Ok, I added a new benchmark now which just compresses the comments column in-memory via Vortex, and I'm seeing it take ~500ms, which is roughly 2-3x longer than just doing the compression without Vortex. I think the root of the performance difference is the chunking. Here's a comparison between running FSST over the comments column chunked as per our TPC loading infra (nchunks=192) and the canonicalized version of the comments array, which is not chunked:

So somewhere I guess there's some fixed-size overhead in FSST training (probably a combo of allocations and double tight loops over 0...511) that, when you try and run FSST hundreds of times, starts to add up and can skew your results. I'm curious how DuckDB and other folks deal with FSST + chunking; it seems like we might want to treat it as a special thing that can do its own sampling + have a shared symbol table across chunks.
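One way to share a single symbol table across chunks, as a sketch only: it assumes a train/compress API shaped roughly like `Compressor::train(&[&[u8]])` and `Compressor::compress(&[u8]) -> Vec<u8>`, which may not match the fsst crate exactly, and the per-chunk sample size of 16 is arbitrary.

```rust
// Sketch: train one FSST symbol table over samples drawn from every chunk,
// then reuse it to compress each chunk. The fsst API shape shown here
// (Compressor::train / compress) is assumed, not verbatim from the crate.
use fsst::Compressor;

fn compress_chunked(chunks: &[Vec<Vec<u8>>]) -> Vec<Vec<Vec<u8>>> {
    // Pull a small, fixed-size sample from each chunk so training cost stays
    // flat regardless of how many chunks there are.
    let samples: Vec<&[u8]> = chunks
        .iter()
        .flat_map(|chunk| chunk.iter().take(16).map(|s| s.as_slice()))
        .collect();

    // Train exactly once, instead of once per chunk.
    let compressor = Compressor::train(&samples);

    // Compress every chunk with the shared symbol table.
    chunks
        .iter()
        .map(|chunk| chunk.iter().map(|s| compressor.compress(s)).collect())
        .collect()
}
```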
I'm currently blocking this on some work in spiraldb/fsst#24.
encodings/fsst/src/array.rs (Outdated)
// so we transmute to kill the lifetime complaints.
// This is fine because the returned `Decompressor`'s lifetime is tied to the lifetime
// of these same arrays.
let symbol_lengths = unsafe { std::mem::transmute::<&[u8], &[u8]>(symbol_lengths) };
Curious for a sanity check here, or if there's another way I should be doing this. It feels a bit wrong, but I think it is currently the best way to do the thing I want...
nvm, this is wrong: if we actually canonicalize, this pointer is invalid.
Ok, this should be fixed now: instead of returning a decompressor, this constructs one on-the-fly and passes it to a provided function.
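Roughly the shape of that fix (type and method names here are illustrative, not the exact ones in encodings/fsst/src/array.rs): rather than handing out a `Decompressor` that borrows from temporaries, the array builds one inside the call and hands it to a caller-supplied closure, so the borrow can never outlive the backing buffers and no lifetime transmute is needed.

```rust
// Sketch of the callback-style accessor. Types and signatures are
// illustrative; the real fsst Decompressor constructor may differ.
use fsst::{Decompressor, Symbol};

struct FSSTArray {
    symbols: Vec<Symbol>,
    symbol_lengths: Vec<u8>,
    // ... codes, validity, etc.
}

impl FSSTArray {
    /// Build a `Decompressor` borrowing from this array and pass it to `f`.
    /// Because the decompressor never escapes the closure, no `unsafe`
    /// lifetime-transmute is needed.
    fn with_decompressor<F, R>(&self, f: F) -> R
    where
        F: FnOnce(Decompressor<'_>) -> R,
    {
        let decompressor = Decompressor::new(&self.symbols, &self.symbol_lengths);
        f(decompressor)
    }
}
```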
Adds a `metadata` field on `CompressionTree` to allow reuse between the sampling and compressing stages. For example, we can save the ALP exponents so we don't have to calculate them twice. This is very important for FSST so that we save the overhead of training the table twice.

Benchmarked against the `lineitem` table's `l_comment` column with scalefactor=1, which is just over 6 million rows. By default this is loaded as a ChunkedArray with 733 partitions. Compressing with FSST enabled takes 1.6s. Compressing the canonicalized array takes ~550ms. We should be able to speed this up by at least ~2x, see FSSTCompressor #664 (comment), and we can potentially do even better. We probably want to be able to FSST compress a ChunkedArray directly so that we avoid the overhead of training/compressing each chunk from scratch.