feat: port in more from the C++ code #24

a10y · 2024-08-23T02:44:50Z

This PR ports in some more functionality based on the MIT-licensed C++ code from CWI.

In particular, it implements the following:

- The makeSample function from C++, which builds a sample of ~16KB from the input data.
- The suffix limit optimization and the corresponding finalize method needed when building the symbol table, including changes to our compress_word function so that it corresponds more directly to compressVariant from the C++ code.
- The byteCodes from C++, which we implement here as codes_one_byte. Note that before this PR, one-byte codes would not be found unless the byte occurred at the end of the plaintext string.
- The Compressor build state is separated into a new CompressorBuilder struct, which holds all the methods that take &mut self. This also means that we can, in theory, now construct a Compressor from a symbol table, though that logic is not implemented.
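To illustrate the sampling idea in the first bullet, here is a minimal sketch, not the code in this PR. The name `make_sample_sketch`, the 512-byte chunk size, and the xorshift generator are all assumptions made for the example; only the ~16KB target comes from the description above.

```rust
/// Target sample size in bytes (the description says "~16KB").
const SAMPLE_TARGET: usize = 16 * 1024;
/// Bytes copied per randomly chosen chunk (hypothetical value).
const CHUNK_SIZE: usize = 512;

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Build a sample of roughly `SAMPLE_TARGET` bytes from `lines`.
/// Inputs at or below the target are used whole; larger inputs are
/// sampled by copying random chunks until the target is reached.
fn make_sample_sketch(lines: &[&[u8]], seed: u64) -> Vec<u8> {
    let total: usize = lines.iter().map(|l| l.len()).sum();
    if total <= SAMPLE_TARGET {
        return lines.concat();
    }

    // xorshift must not be seeded with zero.
    let mut rng = XorShift(seed.max(1));
    let mut sample = Vec::with_capacity(SAMPLE_TARGET + CHUNK_SIZE);
    while sample.len() < SAMPLE_TARGET {
        // Pick a random line, then a random starting offset inside it.
        let line = lines[(rng.next_u64() % lines.len() as u64) as usize];
        if line.is_empty() {
            continue;
        }
        let start = (rng.next_u64() % line.len() as u64) as usize;
        let end = (start + CHUNK_SIZE).min(line.len());
        sample.extend_from_slice(&line[start..end]);
    }
    sample
}
```

Training on a bounded sample keeps symbol-table construction cost roughly constant regardless of input size, at the price of possibly missing patterns that only occur outside the sampled chunks.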

Additional things in this PR:

- Added a micro benchmark for the compress_word method that compares the relative speeds of both code paths; see the comment in #24.
- Removed many of the old small-data benchmarks and added several of the dbtext compression benchmarks from the CWI paper. Here's a table of the compression factors:

| dbtext | C++ compress factor | fsst-rs compress factor |
|---|---|---|
| l_comment | 2.73 | 2.69 |
| urls | 2.33 | 2.27 |
| wikipedia | 1.81 | 1.75 |

I'll follow up to figure out how to close the gap; going by the table, fsst-rs is roughly 1.5-3% behind the C++ implementation.
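For reference, the relative gap per dataset can be computed directly from the table. This is a small sketch; `compression_factor` assumes the usual definition (uncompressed size divided by compressed size), which the PR does not spell out, and all function names are made up for the example.

```rust
/// Compression factor under the assumed definition: input bytes / output bytes.
fn compression_factor(uncompressed: usize, compressed: usize) -> f64 {
    uncompressed as f64 / compressed as f64
}

/// Relative shortfall of `ours` against `reference`, in percent.
fn gap_percent(reference: f64, ours: f64) -> f64 {
    (reference - ours) / reference * 100.0
}

/// (dataset, C++ factor, fsst-rs factor) rows copied from the table.
const RESULTS: [(&str, f64, f64); 3] = [
    ("l_comment", 2.73, 2.69),
    ("urls", 2.33, 2.27),
    ("wikipedia", 1.81, 1.75),
];

/// Gap of fsst-rs behind the C++ implementation for each dataset.
fn gaps() -> Vec<(&'static str, f64)> {
    RESULTS
        .iter()
        .map(|&(name, cpp, rs)| (name, gap_percent(cpp, rs)))
        .collect()
}
```

With the figures in the table this gives gaps of about 1.5% for l_comment, 2.6% for urls, and 3.3% for wikipedia.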

a10y · 2024-08-23T02:46:15Z

src/lib.rs

-    bytes: [u8; 8],
-    num: u64,
-}
+pub struct Symbol(u64);


Going from a union to a newtype struct shows up in the cargo asm output: before, the symbol was getting spilled onto the stack; now it's passed in registers.
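A rough sketch of what a `u64` newtype for symbols can look like is below. Only `pub struct Symbol(u64)` and a `from_slice` taking eight bytes appear in this PR; the little-endian packing, the accessor names, and the way `len` is derived are assumptions for illustration and may differ from the crate.

```rust
/// A symbol of up to 8 bytes, packed into a single `u64`.
/// Byte 0 of the symbol lives in the lowest 8 bits (assumed layout).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct Symbol(u64);

impl Symbol {
    /// Pack eight bytes into a symbol; unused trailing bytes are zero.
    pub fn from_slice(bytes: &[u8; 8]) -> Self {
        Symbol(u64::from_le_bytes(*bytes))
    }

    /// The raw packed value.
    pub fn as_u64(self) -> u64 {
        self.0
    }

    /// First byte of the symbol.
    pub fn first_byte(self) -> u8 {
        self.0 as u8
    }

    /// First two bytes of the symbol, as a little-endian `u16`.
    pub fn first_two_bytes(self) -> u16 {
        self.0 as u16
    }

    /// Length in bytes, found by counting zero bytes at the top of the word.
    /// An all-zero word is treated as a one-byte symbol (hypothetical rule).
    pub fn len(self) -> usize {
        let zero_bytes = (self.0.leading_zeros() / 8) as usize;
        (8 - zero_bytes).max(1)
    }
}
```

Because the struct wraps a single integer, it has the same size as a `u64`, and the byte-level accessors become plain integer casts rather than reads through a union field.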

a10y · 2024-09-03T00:01:47Z

src/builder.rs

-#[cfg(miri)]
-const MAX_GENERATIONS: usize = 2;
+/// Entrypoint for building a new `Compressor`.
+pub struct CompressorBuilder {


Much of this is just moving the old methods from Compressor into the new CompressorBuilder struct.

a10y · 2024-09-03T15:24:11Z

benches/micro.rs

+    group.bench_function("compress-hashtab", |b| {
+        // We create a symbol table and an input that will execute exactly one iteration,
+        // in the fast compress_word pathway.
+        let mut compressor = CompressorBuilder::new();
+        compressor.insert(Symbol::from_slice(b"abcdefgh"), 8);
+        let compressor = compressor.build();
+
+        b.iter(|| unsafe {
+            compressor.compress_into(
+                b"abcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefgh",
+                &mut output_buf,
+            );
+        });
+    });
+
+    group.bench_function("compress-twobytes", |b| {
+        // We create a symbol table and an input that will execute exactly one iteration,
+        // in the fast compress_word pathway.
+        let mut compressor = CompressorBuilder::new();
+        compressor.insert(Symbol::from_slice(&[b'a', b'b', 0, 0, 0, 0, 0, 0]), 8);
+        let compressor = compressor.build();
+
+        b.iter(|| unsafe {
+            compressor.compress_into(b"abababababababab", &mut output_buf);
+        });
+    });
+    group.finish();


These test the relative speeds of the two compression pathways: one where we're able to skip the hash table lookup and one where we are not. Skipping the lookup is nearly twice as fast.

a10y · 2024-09-03T16:49:00Z

src/lib.rs

-/// a load of `N` values from the pointer in a minimum number of instructions into
-/// an output `u64`.
-#[inline]
-unsafe fn extract_u64<const N: usize>(ptr: *const u8) -> u64 {


With better benchmarking it became clear that this had no effect on performance versus just doing a copy_nonoverlapping, and it was less clear in general.
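For context, a copy-based load of the kind mentioned here might look like the sketch below. The body of the removed `extract_u64` is not shown in this PR, so this is not a reconstruction of it; `load_prefix` is a hypothetical name, and the little-endian byte order is an assumption.

```rust
/// Load the first `N` bytes at `src` into the low bytes of a `u64`,
/// leaving the remaining high bytes zero.
///
/// # Safety
/// `src` must be valid for reads of `N` bytes, and `N` must be at most 8.
unsafe fn load_prefix<const N: usize>(src: *const u8) -> u64 {
    debug_assert!(N <= 8);
    let mut buf = [0u8; 8];
    // SAFETY: the caller guarantees `src` is readable for `N` bytes, and
    // `buf` is a fresh local array of 8 bytes, so the ranges cannot overlap.
    unsafe { std::ptr::copy_nonoverlapping(src, buf.as_mut_ptr(), N) };
    u64::from_le_bytes(buf)
}
```

Since `N` is a compile-time constant, the compiler is free to lower the copy to a small fixed sequence of loads, which is consistent with the observation that a hand-written variant made no measurable difference.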

## 🤖 New release
* `fsst-rs`: 0.2.3 -> 0.3.0

<details><summary>Changelog</summary>
<blockquote>

## [0.3.0](v0.2.3...v0.3.0) - 2024-09-03

### Added
- port in more from the C++ code ([#24](#24))

### Other
- centering ([#26](#26))
</blockquote>
</details>

---
This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

make_sample ported from c++

b334653

a10y commented Aug 23, 2024

a10y mentioned this pull request Aug 23, 2024

FSSTCompressor spiraldb/vortex#664

Merged

a10y added 4 commits August 23, 2024 11:39

save

f1702e7

always train in bulk

9f4448b

save

28b9c68

better

0fef992

a10y mentioned this pull request Aug 24, 2024

add single bytes as candidates #20

Closed

a10y added 3 commits August 28, 2024 20:55

more

9cb272c

the rest of the owl

e693ee7

appease docs

53b3413

a10y changed the title ~~make_sample ported from c++~~ port in more from the C++ code Sep 2, 2024

cleanup

0b727ac

a10y marked this pull request as ready for review September 2, 2024 20:42

more clean

41a4e57

a10y commented Sep 3, 2024

a10y added 2 commits September 3, 2024 11:04

better benchmarks, cleanup

358ef31

remove wrong comment

d595138

a10y commented Sep 3, 2024

minimize the benchmark for illustrative purposes

9c9ad25

a10y changed the title ~~port in more from the C++ code~~ feat: port in more from the C++ code Sep 3, 2024

remove method that does not affect perf

15f11b0

a10y commented Sep 3, 2024

a10y merged commit c944de6 into develop Sep 3, 2024
3 checks passed

a10y deleted the aduffy/make_sample branch September 3, 2024 17:55

github-actions bot mentioned this pull request Sep 3, 2024

chore: release v0.3.0 #27

Merged
