Parallelize the checking of the first two bytes of a potential match. #259

brian-pane · 2024-12-08T20:09:17Z

Before-and-after benchmark results on x86_64:

Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.4ms ± 1.12ms    89.9ms … 97.9ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 52.9KB    26.6MB … 26.7MB         11 (20%)        0%
  cpu_cycles          341M  ±  743K      340M  …  343M           0 ( 0%)        0%
  instructions        748M  ±  261       748M  …  748M           0 ( 0%)        0%
  cache_references    401K  ± 6.61K      398K  …  436K           8 (15%)        0%
  cache_misses        298K  ± 8.08K      273K  …  312K           9 (16%)        0%
  branch_misses      3.28M  ± 4.77K     3.27M  … 3.29M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.5ms ±  596us    88.1ms … 90.9ms          0 ( 0%)        ⚡-  2.1% ±  0.4%
  peak_rss           26.7MB ± 50.7KB    26.6MB … 26.7MB         10 (18%)          +  0.0% ±  0.1%
  cpu_cycles          334M  ±  657K      332M  …  335M           1 ( 2%)        ⚡-  2.3% ±  0.1%
  instructions        747M  ±  274       747M  …  747M           1 ( 2%)          -  0.1% ±  0.0%
  cache_references    400K  ± 3.67K      397K  …  418K           6 (11%)          -  0.3% ±  0.5%
  cache_misses        299K  ± 5.78K      278K  …  305K           5 ( 9%)          +  0.4% ±  0.9%
  branch_misses      3.16M  ± 5.78K     3.15M  … 3.18M           1 ( 2%)        ⚡-  3.6% ±  0.1%


folkertdev · 2024-12-08T21:09:29Z

Neat. I think I assumed at the time that the compiler would surely be smart enough to figure this out, but apparently not.

Now, on my machine the difference between the two versions is not significant, but the reduction in instruction count is real, and this is a hot part of the algorithm, so I'm happy to accept this.
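For readers without the diff at hand, here is a minimal sketch of the kind of change the PR title describes. It is not the code from this PR: the function names and slice-based signatures are hypothetical, and the actual change may be structured differently. The first function checks the two bytes one after the other with a short-circuiting `||`; the second packs each pair into a `u16` so that both bytes are compared in one operation.

```rust
/// Sequential check (hypothetical baseline): the second comparison only
/// runs if the first byte matched, so this may compile to two branches.
/// Both slices must hold at least two bytes, otherwise indexing panics.
fn first_two_differ_sequential(scan: &[u8], candidate: &[u8]) -> bool {
    scan[0] != candidate[0] || scan[1] != candidate[1]
}

/// Combined check (hypothetical): pack the first two bytes of each slice
/// into a u16 and compare both bytes with a single comparison.
/// Both slices must hold at least two bytes, otherwise indexing panics.
fn first_two_differ_combined(scan: &[u8], candidate: &[u8]) -> bool {
    let a = u16::from_ne_bytes([scan[0], scan[1]]);
    let b = u16::from_ne_bytes([candidate[0], candidate[1]]);
    a != b
}
```

Both functions return the same result for every input; whether the combined form is actually faster depends on the compiler and the target, which is why benchmark numbers like the ones above are needed to justify it.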

By the way, if you are on the hunt for optimization opportunities, our biggest gap with the existing implementations is at the lower compression levels. Run, for example:

poop "target/release/examples/blogpost-compress 2 ng silesia-small.tar" "target/release/examples/blogpost-compress 2 rs silesia-small.tar"

Here, ng is zlib-ng (https://github.com/zlib-ng/zlib-ng). If I remember correctly, the numbers are broadly similar for zlib-chromium.

folkertdev approved these changes Dec 8, 2024


folkertdev merged commit 47afe59 into trifectatechfoundation:main Dec 8, 2024
20 checks passed
