Parallelize the checking of the first two bytes of a potential match. #259

Merged: 1 commit, Dec 8, 2024

brian-pane

Before-and-after benchmark results on x86_64:
```
Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.4ms ± 1.12ms    89.9ms … 97.9ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 52.9KB    26.6MB … 26.7MB         11 (20%)        0%
  cpu_cycles          341M  ±  743K      340M  …  343M           0 ( 0%)        0%
  instructions        748M  ±  261       748M  …  748M           0 ( 0%)        0%
  cache_references    401K  ± 6.61K      398K  …  436K           8 (15%)        0%
  cache_misses        298K  ± 8.08K      273K  …  312K           9 (16%)        0%
  branch_misses      3.28M  ± 4.77K     3.27M  … 3.29M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.5ms ±  596us    88.1ms … 90.9ms          0 ( 0%)        ⚡-  2.1% ±  0.4%
  peak_rss           26.7MB ± 50.7KB    26.6MB … 26.7MB         10 (18%)          +  0.0% ±  0.1%
  cpu_cycles          334M  ±  657K      332M  …  335M           1 ( 2%)        ⚡-  2.3% ±  0.1%
  instructions        747M  ±  274       747M  …  747M           1 ( 2%)          -  0.1% ±  0.0%
  cache_references    400K  ± 3.67K      397K  …  418K           6 (11%)          -  0.3% ±  0.5%
  cache_misses        299K  ± 5.78K      278K  …  305K           5 ( 9%)          +  0.4% ±  0.9%
  branch_misses      3.16M  ± 5.78K     3.15M  … 3.18M           1 ( 2%)        ⚡-  3.6% ±  0.1%
```
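
For readers who have not seen the diff, here is a minimal sketch of the general idea in the title. It is an illustration only, not the code from this commit: the function names and the slice-based signatures are hypothetical. The serial version short-circuits, so the second byte comparison sits behind a branch on the first; the second version loads both bytes into a `u16` and decides with a single comparison.

```rust
/// Serial check: `&&` short-circuits, so the second byte is only
/// compared after the first comparison has been resolved.
/// (Hypothetical helper, for illustration.)
pub fn first_two_match_serial(scan: &[u8], candidate: &[u8]) -> bool {
    scan[0] == candidate[0] && scan[1] == candidate[1]
}

/// Combined check: pack the first two bytes of each slice into a u16
/// and compare both bytes with one equality test.
/// (Hypothetical helper, for illustration.)
pub fn first_two_match_parallel(scan: &[u8], candidate: &[u8]) -> bool {
    // Native endianness is fine here: both sides are packed the same
    // way and only equality is tested.
    let a = u16::from_ne_bytes([scan[0], scan[1]]);
    let b = u16::from_ne_bytes([candidate[0], candidate[1]]);
    a == b
}
```

Both functions return the same result for any pair of slices with at least two bytes, and both panic on shorter input. Whether the combined form is actually faster depends on the compiler and the target, which is why the benchmark above is the relevant evidence.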
@folkertdev
Collaborator

Neat! I think I assumed at the time that the compiler would surely be smart enough to figure this out, but no.

Now, on my machine the difference between the two versions is not significant, but the reduction in instruction count is real, and this is a hot part of the algorithm, so I'm happy to accept this.

Btw, if you are on the hunt for optimization opportunities, our biggest gap with the existing implementations is at the lower compression levels. Run e.g.

poop "target/release/examples/blogpost-compress 2 ng silesia-small.tar"  "target/release/examples/blogpost-compress 2 rs silesia-small.tar"

Here `ng` is zlib-ng (https://github.com/zlib-ng/zlib-ng). If I remember correctly, the numbers are broadly similar for zlib-chromium.

@folkertdev folkertdev merged commit 47afe59 into trifectatechfoundation:main Dec 8, 2024
20 checks passed