
FSSTCompressor #664

Merged: 16 commits into develop on Sep 3, 2024

Conversation

@a10y (Contributor, author) commented Aug 20, 2024

  • Implements a new codec, FSSTCompressor, using the latest version of the fsst-rs library.
  • Adds a new metadata field on CompressionTree to allow reuse between the sampling and compressing stages. For example, we can save the ALP exponents rather than calculating them twice. This matters most for FSST, where it saves the overhead of training the symbol table twice (see the sketch after this list).
  • Adds a new compression benchmark using the lineitem table's l_comment column at scale factor 1, which is just over 6 million rows. By default this is loaded as a ChunkedArray with 733 partitions. Compressing with FSST enabled takes 1.6s; compressing the canonicalized array takes ~550ms. We should be able to speed this up by at least ~2x (see FSSTCompressor #664 (comment)), and we can potentially do even better. We probably want to be able to FSST-compress a ChunkedArray directly so that we avoid the overhead of training/compressing each chunk from scratch.
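
For context, a rough sketch of how that metadata plumbing can work. The names here (`EncoderMetadata`, the `CompressionTree` fields) are illustrative stand-ins, not the exact Vortex API:

```rust
// Sketch only: illustrative shape of metadata reuse between the sampling
// and compression passes. Exact fields and signatures are assumptions.
use std::any::Any;
use std::sync::Arc;

/// Opaque, codec-specific state produced during sampling, e.g. a trained
/// FSST symbol table or precomputed ALP exponents.
pub trait EncoderMetadata: Any + Send + Sync {
    fn as_any(&self) -> &dyn Any;
}

pub struct CompressionTree {
    pub codec_name: &'static str,
    pub children: Vec<Option<CompressionTree>>,
    /// Filled in by the sampling pass and reused by the final compression
    /// pass, so expensive training runs only once.
    pub metadata: Option<Arc<dyn EncoderMetadata>>,
}

impl CompressionTree {
    /// Downcast the stored metadata back to a concrete codec-specific type.
    pub fn metadata_as<T: EncoderMetadata>(&self) -> Option<&T> {
        self.metadata
            .as_deref()
            .and_then(|m| m.as_any().downcast_ref::<T>())
    }
}
```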

@a10y commented Aug 21, 2024

[screenshot: compress_taxi profiling results]

Ran the compress_taxi benchmark and it got ~80% slower. I'm a bit surprised that the biggest culprit seems to be creating new counters in the FSST training loop. That cost doesn't even scale with the size of the input array; it's just a flat 2MB allocation, and the zeroing of the vector seems to be the biggest problem. I think we can avoid that with a second bitmap (see the sketch below); let me try that out.
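
For reference, a minimal sketch of the second-bitmap idea (illustrative only, not the actual change that landed in spiraldb/fsst#21): a counter slot is only trusted when its bit is set, so a reset touches the small bitmap rather than zeroing the whole counter buffer.

```rust
// Sketch only: counters backed by a validity bitmap so that clearing between
// FSST training rounds does not re-zero the large counter buffer.
pub struct SparseCounters {
    counts: Vec<u16>, // large buffer, zeroed once at construction
    live: Vec<u64>,   // one validity bit per counter slot
}

impl SparseCounters {
    pub fn new(len: usize) -> Self {
        Self {
            counts: vec![0; len],
            live: vec![0; (len + 63) / 64],
        }
    }

    #[inline]
    pub fn increment(&mut self, slot: usize) {
        let (word, bit) = (slot / 64, slot % 64);
        if self.live[word] & (1 << bit) == 0 {
            self.live[word] |= 1 << bit;
            self.counts[slot] = 0; // lazily clear a stale value from a prior round
        }
        self.counts[slot] += 1;
    }

    #[inline]
    pub fn get(&self, slot: usize) -> u16 {
        if self.live[slot / 64] & (1 << (slot % 64)) != 0 {
            self.counts[slot]
        } else {
            0
        }
    }

    /// Resetting between training rounds only clears the bitmap, which is a
    /// small fraction of the size of the counter buffer.
    pub fn clear(&mut self) {
        self.live.fill(0);
    }
}
```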

@a10y commented Aug 21, 2024

Alright, using the change in spiraldb/fsst#21 helped a lot.

New benchmark result:

end to end - taxi/compress
                        time:   [100.73 ms 101.72 ms 102.96 ms]
                        change: [-45.073% -43.470% -42.057%] (p = 0.00 < 0.05)
                        Performance has improved.

That's about 10ms or ~11% slower than running without FSST.

@a10y commented Aug 21, 2024

And I think we can go even lower; ideally we'd just reuse the compressor trained on the samples to compress the full array.

@robert3005 (Member) commented:

Just bear in mind that the samples can be very small compared to the data, e.g. 1024 elements. I would say just retrain it.

@a10y commented Aug 23, 2024

Ok, I've done a few things today:

  1. Introduced a way to reuse compressors in our SamplingCompressor code.
  2. Kept tweaking things on the FSST side, including matching how the paper authors sample the full input, and tried to reduce memory accesses/extraneous instructions as much as possible (in draft at feat: port in more from the C++ code fsst#24).
  3. Ran some comparisons against the C++ code. The screenshot below compares a microbenchmark for compressing the comments column from the TPC-H orders table (1.5M rows) using Rust vs. the C++ implementation. We seem roughly on par with C++ here; the timings were consistent across several runs.

[screenshot: Rust vs. C++ FSST compression microbenchmark]

@a10y commented Aug 23, 2024

Ok, I added a new benchmark which just compresses the comments column in-memory via Vortex, and I'm seeing it take ~500ms, which is roughly 2-3x longer than doing the compression without Vortex.

I think the root of the performance difference is the chunking. Here's a comparison between running FSST over the comments column chunked per our TPC-H loading infra (nchunks=192) and over the canonicalized (unchunked) version of the comments array:

[screenshot: chunked vs. canonicalized FSST benchmark]

So somewhere there's a fixed-size overhead in FSST training (probably a combination of allocations and doubly nested loops over 0..511) that, when you run FSST hundreds of times, starts to add up and can skew the results.

I'm curious how DuckDB and others handle FSST plus chunking; it seems like we might want to treat FSST as a special case that does its own sampling and shares a symbol table across chunks (sketched below).
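
For illustration, the shared-symbol-table idea might look roughly like this, assuming a train-then-compress API shaped like fsst-rs's (the exact signatures and the sampling policy here are assumptions):

```rust
// Sketch only: sample across all chunks, train one FSST symbol table, then
// compress every chunk with it, instead of paying the training cost per chunk.
use fsst::Compressor;

fn compress_chunked(chunks: &[Vec<&[u8]>]) -> Vec<Vec<Vec<u8>>> {
    // Draw a sample of strings from every chunk (every 16th string is an
    // arbitrary policy for this sketch).
    let sample: Vec<&[u8]> = chunks
        .iter()
        .flat_map(|chunk| chunk.iter().step_by(16).copied())
        .collect();

    // Train once: this is the fixed-size cost that dominated the per-chunk runs.
    let compressor = Compressor::train(&sample);

    // Compress each chunk with the shared symbol table.
    chunks
        .iter()
        .map(|chunk| chunk.iter().map(|s| compressor.compress(s)).collect())
        .collect()
}
```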

@a10y commented Aug 23, 2024

I'm currently blocking this on some work in spiraldb/fsst#24

@a10y commented Sep 3, 2024

[screenshot: fsst_compress profile]

Currently 59% of fsst_compress time is spent actually compressing; elsewhere we break out of the fast loop to do push_null and data copying. Something to improve in a follow-up.

// so we transmute to kill the lifetime complaints.
// This is fine because the returned `Decompressor`'s lifetime is tied to the lifetime
// of these same arrays.
let symbol_lengths = unsafe { std::mem::transmute::<&[u8], &[u8]>(symbol_lengths) };
@a10y commented on the code above:

Curious for a sanity check here, or if there's another way I should be doing this. It feels a bit wrong, but I think it's currently the best way to do what I want...

@a10y commented:

nvm, this is wrong; if we actually canonicalize, this pointer is invalid.

@a10y commented:

ok, this should be fixed now: instead of returning a decompressor, this constructs one on the fly and passes it to a provided function (see the sketch below).
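
For illustration, the shape of that fix is roughly the following. `FsstArrayParts` and `with_decompressor` are made-up names for this sketch, and the `Decompressor::new(symbols, lengths)` constructor is assumed from fsst-rs:

```rust
// Sketch only: rather than returning a `Decompressor` that borrows from
// buffers the caller might invalidate (e.g. by canonicalizing), build it
// inside the call and hand it to a caller-provided closure, so the borrow
// cannot outlive the backing arrays and no lifetime-erasing transmute is needed.
use fsst::{Decompressor, Symbol};

pub struct FsstArrayParts {
    symbols: Vec<Symbol>,
    symbol_lengths: Vec<u8>,
}

impl FsstArrayParts {
    pub fn with_decompressor<F, R>(&self, apply: F) -> R
    where
        F: for<'a> FnOnce(Decompressor<'a>) -> R,
    {
        apply(Decompressor::new(&self.symbols, &self.symbol_lengths))
    }
}
```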

@a10y a10y marked this pull request as ready for review September 3, 2024 19:30
@a10y a10y force-pushed the aduffy/fsst-compressor branch from 58a58eb to 79197e9 on September 3, 2024 21:26
@a10y a10y enabled auto-merge (squash) September 3, 2024 22:20
@a10y a10y merged commit fd49140 into develop Sep 3, 2024
4 checks passed
@a10y a10y deleted the aduffy/fsst-compressor branch September 3, 2024 22:29
@robert3005 robert3005 mentioned this pull request Sep 5, 2024