Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: port in more from the C++ code #24

Merged
merged 14 commits into from
Sep 3, 2024
Merged

feat: port in more from the C++ code #24

merged 14 commits into from
Sep 3, 2024

Conversation

a10y
Copy link
Contributor

@a10y a10y commented Aug 23, 2024

This PR ports in some more functionality based on the MIT-licensed C++ code from CWI.

In particular, it implements the following:

  • The makeSample function from C++ to build a sample of ~16KB from the input data
  • The suffix limit optimization and its corresponding finalize method needed when building the symbol table, including changes to the compress_word function we have that more directly corresponds to the compressVariant from the C++ code
  • The byteCodes from C++, which we implement here as codes_one_byte. Note that before this PR, one-byte codes would not be found unless the byte occurred at the end of the plaintext string
  • Separates the Compressor build state into a new CompressorBuilder struct, which has all methods that take &mut self. This also means that we can in theory construct a Compressor now from a symbol table, though that logic is not implemented.

Additional things in this PR:

  • Added a micro benchmark for compress_word method comparing the relative speeds of both code paths, see feat: port in more from the C++ code #24 (comment)
  • Removed many of the old small-data benchmarks. I've added several of the dbtext compression benchmarks from the CWI paper. Here's a table of the compression factors:
dbtext c++ compress factor fsst-rs compress factor
l_comment 2.73 2.69
urls 2.33 2.27
wikipedia 1.81 1.75

I'll follow up to figure out how to close the gap with those 1-2% differences

bytes: [u8; 8],
num: u64,
}
pub struct Symbol(u64);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going from union -> newtype struct shows up in the cargo asm. before symbol was getting spilled into stack, now it's passed in registers

src/lib.rs Outdated Show resolved Hide resolved
@a10y a10y changed the title make_sample ported from c++ port in more from the C++ code Sep 2, 2024
@a10y a10y marked this pull request as ready for review September 2, 2024 20:42
#[cfg(miri)]
const MAX_GENERATIONS: usize = 2;
/// Entrypoint for building a new `Compressor`.
pub struct CompressorBuilder {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much of this is just moving the old methods from Compressor into a new struct

src/builder.rs Outdated Show resolved Hide resolved
benches/micro.rs Outdated
Comment on lines 28 to 54
group.bench_function("compress-hashtab", |b| {
// We create a symbol table and an input that will execute exactly one iteration,
// in the fast compress_word pathway.
let mut compressor = CompressorBuilder::new();
compressor.insert(Symbol::from_slice(b"abcdefgh"), 8);
let compressor = compressor.build();

b.iter(|| unsafe {
compressor.compress_into(
b"abcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefgh",
&mut output_buf,
);
});
});

group.bench_function("compress-twobytes", |b| {
// We create a symbol table and an input that will execute exactly one iteration,
// in the fast compress_word pathway.
let mut compressor = CompressorBuilder::new();
compressor.insert(Symbol::from_slice(&[b'a', b'b', 0, 0, 0, 0, 0, 0]), 8);
let compressor = compressor.build();

b.iter(|| unsafe {
compressor.compress_into(b"abababababababab", &mut output_buf);
});
});
group.finish();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test the relative speeds of the two compression pathways: one where we're able to skip the hashtable lookup and one where we are not. Skipping the lookup is nearly twice as fast

image

@a10y a10y changed the title port in more from the C++ code feat: port in more from the C++ code Sep 3, 2024
/// a load of `N` values from the pointer in a minimum number of instructions into
/// an output `u64`.
#[inline]
unsafe fn extract_u64<const N: usize>(ptr: *const u8) -> u64 {
Copy link
Contributor Author

@a10y a10y Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With better benchmarking it became clear this had no effect on performance VS just doing a copy_non_overlapping, and was less clear in general

@a10y a10y merged commit c944de6 into develop Sep 3, 2024
3 checks passed
@a10y a10y deleted the aduffy/make_sample branch September 3, 2024 17:55
@github-actions github-actions bot mentioned this pull request Sep 3, 2024
a10y pushed a commit that referenced this pull request Sep 3, 2024
## 🤖 New release
* `fsst-rs`: 0.2.3 -> 0.3.0

<details><summary><i><b>Changelog</b></i></summary><p>

<blockquote>

## [0.3.0](v0.2.3...v0.3.0) -
2024-09-03

### Added
- port in more from the C++ code
([#24](#24))

### Other
- centering ([#26](#26))
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/MarcoIeni/release-plz/).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant