Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement crc32 on more targets #14

Open
6 of 7 tasks
folkertdev opened this issue Feb 1, 2024 · 13 comments
Open
6 of 7 tasks

implement crc32 on more targets #14

folkertdev opened this issue Feb 1, 2024 · 13 comments

Comments

@folkertdev
Copy link
Collaborator

folkertdev commented Feb 1, 2024

  • a basic implementation that is obviously correct to test against
  • an efficient rust version
  • move the std::arch::x86_64::_mm_crc32_u32 intrinsic into the crc32 module, make it platform-independent (/src/deflate/hash_calc.rs).

Then of course we also need simd-accelerated versions. zlib-ng has several in the /arch folder. In rust there is https://docs.rs/crc32fast/latest/crc32fast/. I don't think we want that as a dependency but it may provide inspiration.

  • SSE
  • AVX2
  • AVX512
  • Neon
@ahomescu
Copy link
Contributor

ahomescu commented Feb 7, 2024

I started prototyping this, have an MVP for the first item on the list but have some quick comments.

First, after reading the docs for _mm_crc32_u32 it looks to me like it implements CRC32-C which is a different polynomial than CRC32 that zlib uses. crc32fast actually uses pclmulqdq which is slightly more complicated. Do we need CRC-32C?

Second, how do you feel about quickcheck for testing? It would be pretty useful here.

Third, afaict adler32_rust is a manually unrolled version of naive_adler32. Did you have that in mind for crc32, or something more advanced (e.g. a larger table that processes 2-4 bytes at a time)?

ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 7, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 7, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
@folkertdev
Copy link
Collaborator Author

Cool that you're picking this up!


The mm_crc32_u32 function is used not for gzip, but for the hash chaining in zlib-ng on systems that support it. I suspect the particulars of what polynomial is used don't particularly matter here, you really just want a fast decent hash.

My proposal would be to just use mm_crc32_u32 as an intrinsic. The fallback is

#define HASH_CALC(s, h, val) h = ((val * 2654435761U) >> HASH_SLIDE);

which does look different to the intel docs for _mm_crc32_u32. But that does not really matter so long as the same function is used consistently. I was under the impression that the fallback would have the same behavior as _mm_crc32_u32 but that seems to not be the case.


We're very mindful of dependencies for the final crate/dynamic library, but dev-dependencies should be OK.

I haven't really used quickcheck (in rust, I have in Haskell) but let's give it a go!


I think it makes sense to look at the zlib-ng codebase next to see how they implement crc32. We can be reasonably confident they have one of the fastest implementations out there.

Navigating this code is not awesome, there is lots of macro expansion that makes it hard to understand what is happening. The generic crc32 implementation is here

I think that is a good jumping-off point. You'll need some other things like the huge tables that crc32 uses, they are in that repository as well. Also it's probably a good thing to look at what architecture-specific features are implemented (in particular the type signatures).

@thedataking
Copy link

We're very mindful of dependencies for the final crate/dynamic library

I don't think we want that as a dependency but it may provide inspiration.

Very sorry if this is a dumb question. Andrei and I (Per) were taking last night and we weren't sure what your concerns with crc32fast were. It seems widely used (including in Android) and is permissively licensed so it must be something else we're not aware of.

ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 8, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
@folkertdev
Copy link
Collaborator Author

No that is a very reasonable question. Many things are coming together here

  • from sudo-rs and ntpd-rs we know that having fewer dependencies makes the distribution process easier.
  • we believe we can have zero dependencies for this project (in a release build)
  • there is the risk of supply chain attack. I'm personally not too worried about that actually causing issues in this case, but we make that structurally impossible by having no dependencies
  • fewer dependencies -> faster builds

In this particular case, I think crc32fast also does not do exactly what we need. What I've noticed so far without diving deep into the exact implementation details:

  • no fused crc32/memcpy: the adler and crc implementations can first add a simd value into the checksum calculation and then copy it to a destination. this is configured with the COPY constant https://github.com/zlib-ng/zlib-ng/blob/develop/arch/x86/crc32_fold_pclmulqdq_tpl.h#L20
  • limited support, most relevantly that there is no avx512 version. We don't consider these targets in current milestones, but zlib-ng also supports powerpc and s390)

So, I'd be totally happy to use crc32fast if we'd just be creating a simple checksum somewhere, but crc is integral to what zlib does and it's used in some custom ways, so I think be in control ourselves.

Separately I'm a big fan of upstreaming features we do add.

does that make sense?

@thedataking
Copy link

thedataking commented Feb 12, 2024

does that make sense?

yes. Thank you for explaining the reasoning. I agree that supply chain attacks is something that must be prevented. However, as we've discussed, vendoring crates or relying on crate versions that have been vetted might be worth considering as a general approach. Google has open sourced their crate audits and it includes crc32fast version 1.3.2 [0]

Of course that doesn't address your concern that the crate doesn't do everything needed so I am not arguing that for this particular case, using an existing crate is optimal. Putting that aside, how do you think about a general policy of "have as few dependencies as a general rule but reuse functionality available from mature and audited crates"?

[0] https://github.com/google/rust-crate-audits/blob/main/manual-sources/google3-audits.toml

ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 13, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 13, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
@folkertdev
Copy link
Collaborator Author

I think my personal taste is to be quite hardcore about owning the logic that is core to your project. You should know how it works, and it's likely that you can be more efficient than a generic solution when you have a specific problem.

As an example, nptd-rs uses tokio and rustls, massive dependencies and not something we'd just write ourselves obviously. But for the parsing and configuring of sockets and the clock, that is all our own lowlevel code that builds on libc and the standard library. We did use nix at some point but we knew we could do a better job by doing it ourselves.

ntpd-rs also used clap for command line argument parsing. I think clap is fine, it is in widespread use and I trust its development team. But, it's a very large dependency, that for us did more than we needed. It also releases breaking changes quite frequently, so it is disliked by the debian/fedora package maintainers. It's big, so we didn't really want to vendor/maintain it, so we just ripped it out. The code is nicer now, it builds that little bit faster, the binary is that little bit smaller, and we're more in control of the code we actually ship.

Of course you have to be pragmatic, but given how easy it is to accumulate dependencies, I think being skeptical of adding one is the right baseline.

btw for what dependencies we use we mostly go by popularity in the rust ecosystem (clap, tokio, rustls are all widely used and actively maintained), occasionally informed by packaging concerns (this can also be a reason not to upgrade a dependency). So far we've not had reasons to stick to a particular list of audited libraries, but that seems like a decent approach that saves you quite a bit of work.

ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 14, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 14, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 14, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
ahomescu added a commit to immunant/zlib-rs that referenced this issue Feb 15, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
folkertdev pushed a commit that referenced this issue Feb 16, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
folkertdev pushed a commit that referenced this issue Feb 16, 2024
Add a naive implementation of CRC-32 based
on the C code from the PNG specification.
Add an optimized word-size implementation
based on Kadatch and Jenkins.
Add optimized interleaved implementation from
section 4.11 of the same paper.
@brian-pane
Copy link

brian-pane commented Dec 17, 2024

Is anyone still working on a pure-Rust crc32 implementation? If not, I'm willing to try creating two implementations: a simple, slow version as a baseline, and then a fast version inspired by the one in zlib-ng.

And should it be CRC32 or CRC32-C? It seems like the current implementation in hash_calc.rs does CRC32-C on x86 and CRC32 on ARM, although the ARM docs list both CRC32 and CRC32-C instructions available.

@folkertdev
Copy link
Collaborator Author

we have a pure-rust implementation, that has the expected simd support on the targets we care most about. The implementation lives here. However, if you believe you can make it faster, or support more platforms, by all means have a got at that.

For the logic in hash_calc.rs, we simply follow the behavior of zlib-ng, but for arm/aarch64 we haven't done sophisticated benchmarking for which is faster. I believe zlib-ng actually fell back to using a software implementation of crc32 because using the instruction is faster, but combined with the branching logic for whether to use the fallback or the instruction, the instruction is overall slower (for them, we should absolutely verify that for ourselves).

We've been working on proper benchmarking (on dedicated hardware (apple m2 and a recent amd), with graphs etc), so we'll be in a much better shape data-wise to evaluate the various options.

@nmoinvaz
Copy link

nmoinvaz commented Dec 22, 2024

The mm_crc32_u32 function is used not for gzip, but for the hash chaining in zlib-ng on systems that support it.

We abandoned using intrinsics for this purpose, due to this commit in Chromium. It also gave us more consistent number of compressed bytes across architectures.

@brian-pane
Copy link

brian-pane commented Dec 22, 2024

We abandoned using intrinsics for this purpose, due to this commit in Chromium. It also gave us more consistent number of compressed bytes across architectures.

I experimented with always using the software hash (StandardHashCalc, which I believe implements the same multiply-and-shift computation used in zlib-ng) instead of the hardware CRC-32 implementation. On x86_64, using the software hash currently yields a regression at low compression levels:

Benchmark 1 (63 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          79.3ms ± 2.26ms    77.7ms … 95.6ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 66.0KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          288M  ±  615K      287M  …  291M           1 ( 2%)        0%
  instructions        590M  ±  253       590M  …  590M           0 ( 0%)        0%
  cache_references    401K  ± 3.98K      397K  …  421K           5 ( 8%)        0%
  cache_misses        295K  ± 6.91K      277K  …  308K           7 (11%)        0%
  branch_misses      2.95M  ± 4.27K     2.94M  … 2.96M           1 ( 2%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.6ms ±  877us    80.2ms … 85.5ms          4 ( 6%)        💩+  3.0% ±  0.8%
  peak_rss           26.7MB ± 68.6KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          298M  ±  792K      296M  …  301M           2 ( 3%)        💩+  3.2% ±  0.1%
  instructions        573M  ±  254       573M  …  573M           0 ( 0%)        ⚡-  2.8% ±  0.0%
  cache_references    403K  ± 19.4K      394K  …  551K           5 ( 8%)          +  0.6% ±  1.2%
  cache_misses        296K  ± 13.0K      250K  …  327K           7 (11%)          +  0.1% ±  1.2%
  branch_misses      3.14M  ± 5.23K     3.13M  … 3.15M           0 ( 0%)        💩+  6.4% ±  0.1%

With that said, though, one way to look at it is that zlib-rs's use of the hardware CRC as a faster hash function is compensating for inefficiencies elsewhere. So as other parts of the codebase get faster, it might be possible to get comparable overall performance when using the software hash.

Also, it looks like the software hash outperforms the hardware CRC on AArch64 (at least the Apple implementation), but I need to create a microbenchmark suite to be sure.

@nmoinvaz
Copy link

I think the benchmark needs to be isolated to only the hash function, since each function will result in different hash tables/matches.

@brian-pane
Copy link

Right, the better deflate performance of CRC32 compared to the multiply-and-shift software hash is probably due more to the distribution of the hash results (fewer collisions, for example) than to the time spent in the hash calculation itself.

But I just realized that the hash function adopted by Chromium is different from the portable hash used in zlib-rs and zlib-ng.

  • zlib-rs: val.wrapping_mul(2654435761) >> 16
  • zlib-ng: ((val * 2654435761U) >> 16
  • Chromium: (value * 66521 + 66521) >> 16

I just tried a quick test of the Chromium hash in zlib-rs, and it seems to yield deflate performance comparable to the hardware CRC32 hash. I'll do more testing, and if everything looks good I'll submit a PR.

@brian-pane
Copy link

I just tried a quick test of the Chromium hash in zlib-rs, and it seems to yield deflate performance comparable to the hardware CRC32 hash. I'll do more testing, and if everything looks good I'll submit a PR.

Update: on x86_64, the Chromium hash performs well compared to CRC32 at compression level 2:

Benchmark 1 (39 runs): ./compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           131ms ± 3.36ms     128ms …  148ms          4 (10%)        0%
  peak_rss           25.0MB ± 64.6KB    24.9MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          513M  ± 4.61M      509M  …  537M           3 ( 8%)        0%
  instructions       1.11G  ±  242      1.11G  … 1.11G           0 ( 0%)        0%
  cache_references    385K  ± 16.2K      375K  …  456K           5 (13%)        0%
  cache_misses        294K  ± 11.0K      248K  …  305K           3 ( 8%)        0%
  branch_misses      6.85M  ± 12.3K     6.84M  … 6.91M           2 ( 5%)        0%
Benchmark 2 (38 runs): ./target/release/examples/compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           132ms ± 1.55ms     129ms …  139ms          4 (11%)          +  0.3% ±  0.9%
  peak_rss           25.0MB ± 66.0KB    24.9MB … 25.0MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          517M  ± 2.69M      515M  …  532M           1 ( 3%)          +  1.0% ±  0.3%
  instructions       1.09G  ±  297      1.09G  … 1.09G           0 ( 0%)        ⚡-  1.7% ±  0.0%
  cache_references    381K  ± 7.54K      374K  …  409K           3 ( 8%)          -  1.0% ±  1.5%
  cache_misses        296K  ± 4.37K      282K  …  302K           0 ( 0%)          +  0.6% ±  1.3%
  branch_misses      6.89M  ± 6.93K     6.88M  … 6.91M           0 ( 0%)          +  0.5% ±  0.1%

but still shows a regression at compression level 1:

Benchmark 1 (62 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.1ms ± 1.68ms    78.5ms … 91.6ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 66.0KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          289M  ±  761K      287M  …  291M           1 ( 2%)        0%
  instructions        590M  ±  276       590M  …  590M           0 ( 0%)        0%
  cache_references    404K  ± 8.54K      398K  …  452K           4 ( 6%)        0%
  cache_misses        304K  ± 8.30K      270K  …  316K           2 ( 3%)        0%
  branch_misses      2.95M  ± 4.74K     2.94M  … 2.96M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.8ms ±  895us    79.4ms … 83.7ms          1 ( 2%)          +  0.9% ±  0.6%
  peak_rss           26.7MB ± 65.5KB    26.6MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          295M  ±  731K      294M  …  297M           1 ( 2%)        💩+  2.2% ±  0.1%
  instructions        578M  ±  263       578M  …  578M           0 ( 0%)        ⚡-  1.9% ±  0.0%
  cache_references    402K  ± 6.59K      397K  …  441K           3 ( 5%)          -  0.4% ±  0.7%
  cache_misses        304K  ± 8.28K      267K  …  313K           1 ( 2%)          -  0.0% ±  1.0%
  branch_misses      3.05M  ± 6.63K     3.03M  … 3.06M           1 ( 2%)        💩+  3.5% ±  0.1%

@folkertdev folkertdev changed the title implement crc32 implement crc32 on more targets Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants