Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring to allow inlining of quick_insert_string #261

Closed

Conversation

brian-pane
Copy link

Before-and-after benchmarks from an x86_64 test system:

Benchmark 1 (56 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.8ms ± 2.42ms    88.2ms …  106ms          3 ( 5%)        0%
  peak_rss           26.7MB ± 65.7KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          333M  ±  836K      332M  …  335M           0 ( 0%)        0%
  instructions        747M  ±  254       747M  …  747M           0 ( 0%)        0%
  cache_references    400K  ± 6.41K      396K  …  434K           4 ( 7%)        0%
  cache_misses        299K  ± 4.24K      282K  …  311K           6 (11%)        0%
  branch_misses      3.15M  ± 5.68K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.3ms ±  582us    88.3ms … 90.8ms          2 ( 4%)          -  0.5% ±  0.7%
  peak_rss           26.7MB ± 78.2KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          333M  ± 1.45M      331M  …  341M           1 ( 2%)          -  0.1% ±  0.1%
  instructions        736M  ±  268       736M  …  736M           1 ( 2%)        ⚡-  1.5% ±  0.0%
  cache_references    400K  ± 3.33K      397K  …  411K           3 ( 5%)          +  0.1% ±  0.5%
  cache_misses        296K  ± 6.42K      277K  …  306K           6 (11%)          -  0.9% ±  0.7%
  branch_misses      3.09M  ± 7.74K     3.07M  … 3.11M           2 ( 4%)        ⚡-  1.9% ±  0.1%

…RC32 case

Before-and-after benchmarks from an x86_64 test system:
```
Benchmark 1 (56 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.8ms ± 2.42ms    88.2ms …  106ms          3 ( 5%)        0%
  peak_rss           26.7MB ± 65.7KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          333M  ±  836K      332M  …  335M           0 ( 0%)        0%
  instructions        747M  ±  254       747M  …  747M           0 ( 0%)        0%
  cache_references    400K  ± 6.41K      396K  …  434K           4 ( 7%)        0%
  cache_misses        299K  ± 4.24K      282K  …  311K           6 (11%)        0%
  branch_misses      3.15M  ± 5.68K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.3ms ±  582us    88.3ms … 90.8ms          2 ( 4%)          -  0.5% ±  0.7%
  peak_rss           26.7MB ± 78.2KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          333M  ± 1.45M      331M  …  341M           1 ( 2%)          -  0.1% ±  0.1%
  instructions        736M  ±  268       736M  …  736M           1 ( 2%)        ⚡-  1.5% ±  0.0%
  cache_references    400K  ± 3.33K      397K  …  411K           3 ( 5%)          +  0.1% ±  0.5%
  cache_misses        296K  ± 6.42K      277K  …  306K           6 (11%)          -  0.9% ±  0.7%
  branch_misses      3.09M  ± 7.74K     3.07M  … 3.11M           2 ( 4%)        ⚡-  1.9% ±  0.1%
```
@brian-pane
Copy link
Author

brian-pane commented Dec 12, 2024

This is a work in progress, and it still needs some cleanup, but I'm posting this snapshot for feedback because the approach seems to improve deflate performance.
I think the benefit I'm seeing is happening less because of the inlining itself and more because the inlining enables the compiler to reuse the result of a slightly expensive read operation that happens in both quick_insert_string and deflate_quick (see the "prefetch" part in quick.rs).

… least on x86_64

Benchmarks:
```
Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.5ms ± 2.51ms    89.1ms …  109ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          334M  ±  811K      333M  …  336M           0 ( 0%)        0%
  instructions        747M  ±  306       747M  …  747M           0 ( 0%)        0%
  cache_references    404K  ± 16.4K      399K  …  519K           3 ( 5%)        0%
  cache_misses        312K  ± 3.74K      301K  …  321K           5 ( 9%)        0%
  branch_misses      3.15M  ± 6.03K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (57 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          87.9ms ±  968us    86.0ms … 90.9ms          1 ( 2%)        ⚡-  3.9% ±  0.8%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB         13 (23%)          +  0.1% ±  0.1%
  cpu_cycles          319M  ± 1.33M      317M  …  327M           3 ( 5%)        ⚡-  4.4% ±  0.1%
  instructions        720M  ±  227       720M  …  720M           0 ( 0%)        ⚡-  3.6% ±  0.0%
  cache_references    403K  ± 3.14K      399K  …  410K           0 ( 0%)          -  0.5% ±  1.1%
  cache_misses        313K  ± 2.73K      302K  …  317K           2 ( 4%)          +  0.1% ±  0.4%
  branch_misses      3.15M  ± 6.11K     3.14M  … 3.17M           3 ( 5%)          +  0.1% ±  0.1%
```
@brian-pane
Copy link
Author

With that second change, I'm now seeing a surprisingly larger improvement:

Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.5ms ± 2.51ms    89.1ms …  109ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          334M  ±  811K      333M  …  336M           0 ( 0%)        0%
  instructions        747M  ±  306       747M  …  747M           0 ( 0%)        0%
  cache_references    404K  ± 16.4K      399K  …  519K           3 ( 5%)        0%
  cache_misses        312K  ± 3.74K      301K  …  321K           5 ( 9%)        0%
  branch_misses      3.15M  ± 6.03K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (57 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          87.9ms ±  968us    86.0ms … 90.9ms          1 ( 2%)        ⚡-  3.9% ±  0.8%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB         13 (23%)          +  0.1% ±  0.1%
  cpu_cycles          319M  ± 1.33M      317M  …  327M           3 ( 5%)        ⚡-  4.4% ±  0.1%
  instructions        720M  ±  227       720M  …  720M           0 ( 0%)        ⚡-  3.6% ±  0.0%
  cache_references    403K  ± 3.14K      399K  …  410K           0 ( 0%)          -  0.5% ±  1.1%
  cache_misses        313K  ± 2.73K      302K  …  317K           2 ( 4%)          +  0.1% ±  0.4%
  branch_misses      3.15M  ± 6.11K     3.14M  … 3.17M           3 ( 5%)          +  0.1% ±  0.1%

@folkertdev
Copy link
Collaborator

I can't really reproduce these results, comparing main and 40b56d1

with no RUSTFLAGS

Benchmark 1 (59 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          85.8ms ±  573us    84.7ms … 87.5ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 62.5KB    26.5MB … 26.7MB         14 (24%)        0%
  cpu_cycles          320M  ± 1.91M      318M  …  327M           3 ( 5%)        0%
  instructions        720M  ±  255       720M  …  720M           0 ( 0%)        0%
  cache_references   19.8M  ±  154K     19.6M  … 20.7M           1 ( 2%)        0%
  cache_misses        425K  ± 62.6K      345K  …  752K           1 ( 2%)        0%
  branch_misses      3.00M  ± 13.6K     2.99M  … 3.05M           4 ( 7%)        0%
Benchmark 2 (59 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          85.9ms ± 1.04ms    84.6ms … 90.8ms          5 ( 8%)          +  0.1% ±  0.4%
  peak_rss           26.7MB ± 56.2KB    26.6MB … 26.7MB         14 (24%)          +  0.0% ±  0.1%
  cpu_cycles          320M  ± 4.09M      316M  …  339M           5 ( 8%)          +  0.0% ±  0.4%
  instructions        720M  ±  270       720M  …  720M           1 ( 2%)          +  0.0% ±  0.0%
  cache_references   19.7M  ±  153K     19.4M  … 20.2M           2 ( 3%)          -  0.2% ±  0.3%
  cache_misses        380K  ± 62.8K      282K  …  522K           0 ( 0%)        ⚡- 10.6% ±  5.4%
  branch_misses      2.99M  ± 3.16K     2.99M  … 3.01M           1 ( 2%)          -  0.1% ±  0.1%

with RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread" (only native should be relevant for deflate, but the jump thread one is quite relevant for inflate).

Benchmark 1 (65 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.0ms ±  796us    76.8ms … 80.7ms          5 ( 8%)        0%
  peak_rss           26.7MB ± 77.2KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          283M  ± 2.68M      280M  …  293M           5 ( 8%)        0%
  instructions        650M  ±  260       650M  …  650M           0 ( 0%)        0%
  cache_references   20.0M  ±  214K     19.6M  … 21.0M           4 ( 6%)        0%
  cache_misses        445K  ±  102K      316K  … 1.01M           4 ( 6%)        0%
  branch_misses      2.99M  ± 3.37K     2.98M  … 3.00M           0 ( 0%)        0%
Benchmark 2 (65 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.0ms ±  880us    77.0ms … 82.1ms          3 ( 5%)          +  0.1% ±  0.4%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          284M  ± 3.93M      280M  …  305M           4 ( 6%)          +  0.2% ±  0.4%
  instructions        650M  ±  283       650M  …  650M           0 ( 0%)          +  0.0% ±  0.0%
  cache_references   20.0M  ±  227K     19.7M  … 21.0M           4 ( 6%)          -  0.1% ±  0.4%
  cache_misses        485K  ±  104K      363K  …  979K           2 ( 3%)        💩+  8.9% ±  7.9%
  branch_misses      2.99M  ± 3.38K     2.98M  … 3.00M           0 ( 0%)          -  0.0% ±  0.0%

this could totally be a CPU thing though. On the face of it I would expect the changes you made to be an improvement. did you benchmark with target-cpu=native?

@brian-pane
Copy link
Author

I was compiling without any RUSTFLAGS. When I build with

RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread" cargo build --release --example compress

there is still a difference on my system, but smaller than before:

Benchmark 1 (61 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          82.6ms ± 1.47ms    80.7ms … 89.1ms          6 (10%)        0%
  peak_rss           26.7MB ± 65.6KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          303M  ±  752K      302M  …  305M           0 ( 0%)        0%
  instructions        655M  ±  268       655M  …  655M           0 ( 0%)        0%
  cache_references    400K  ± 8.22K      396K  …  452K           3 ( 5%)        0%
  cache_misses        292K  ± 9.42K      270K  …  316K          13 (21%)        0%
  branch_misses      3.15M  ± 8.82K     3.14M  … 3.18M           3 ( 5%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.2ms ±  559us    80.3ms … 82.7ms          0 ( 0%)        ⚡-  1.7% ±  0.5%
  peak_rss           26.7MB ± 62.5KB    26.6MB … 26.7MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          299M  ±  581K      298M  …  301M           1 ( 2%)        ⚡-  1.2% ±  0.1%
  instructions        646M  ±  438       646M  …  646M           1 ( 2%)        ⚡-  1.3% ±  0.0%
  cache_references    398K  ± 3.19K      396K  …  413K           3 ( 5%)          -  0.5% ±  0.5%
  cache_misses        293K  ± 7.67K      261K  …  304K           4 ( 6%)          +  0.3% ±  1.0%
  branch_misses      3.19M  ± 4.23K     3.17M  … 3.19M           2 ( 3%)          +  1.0% ±  0.1%

@brian-pane
Copy link
Author

I've been testing on

model name      : 12th Gen Intel(R) Core(TM) i5-12400

@folkertdev
Copy link
Collaborator

for reference I'm on AMD Ryzen 7 5700G with Radeon Graphics.

This is not a regression anywhere as far as I can tell (tried it on aarch64 too, no significant change in walltime), and if it's an improvement for you I'd be ok with it if you clean up the code a bit.

Out of curiosity, how does zlib-rs compare to zlib-ng on your machine? I see

Benchmark 1 (68 runs): target/release/examples/blogpost-compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.4ms ± 1.22ms    72.8ms … 80.4ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 15.9KB    26.5MB … 26.6MB          1 ( 1%)        0%
  cpu_cycles          266M  ± 3.66M      263M  …  288M           3 ( 4%)        0%
  instructions        460M  ±  263       460M  …  460M           0 ( 0%)        0%
  cache_references   19.6M  ±  250K     19.3M  … 21.0M           3 ( 4%)        0%
  cache_misses        530K  ±  144K      352K  … 1.45M           3 ( 4%)        0%
  branch_misses      3.32M  ± 3.39K     3.32M  … 3.34M           5 ( 7%)        0%
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.3ms ± 1.13ms    76.5ms … 81.8ms          3 ( 5%)        💩+  5.2% ±  0.5%
  peak_rss           26.7MB ± 73.7KB    26.5MB … 26.7MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          284M  ± 3.85M      280M  …  301M           4 ( 6%)        💩+  6.7% ±  0.5%
  instructions        650M  ±  257       650M  …  650M           0 ( 0%)        💩+ 41.4% ±  0.0%
  cache_references   19.9M  ±  167K     19.7M  … 20.7M           4 ( 6%)        💩+  1.7% ±  0.4%
  cache_misses        453K  ±  103K      343K  … 1.03M           2 ( 3%)        ⚡- 14.5% ±  8.1%
  branch_misses      2.99M  ± 3.38K     2.98M  … 3.00M           0 ( 0%)        ⚡- 10.1% ±  0.0%

@brian-pane
Copy link
Author

Here's what I get when I benchmark against zlib-ng (after compiling with RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread") :

Benchmark 1 (64 runs): ./compress-baseline 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.5ms ± 2.09ms    75.8ms … 92.7ms          1 ( 2%)        0%
  peak_rss           26.6MB ±    0      26.6MB … 26.6MB          0 ( 0%)        0%
  cpu_cycles          278M  ± 1.17M      277M  …  285M           3 ( 5%)        0%
  instructions        481M  ±  265       481M  …  481M           0 ( 0%)        0%
  cache_references    400K  ± 8.68K      395K  …  448K           6 ( 9%)        0%
  cache_misses        295K  ± 10.6K      267K  …  318K           0 ( 0%)        0%
  branch_misses      3.21M  ± 7.95K     3.20M  … 3.23M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.5ms ±  770us    79.9ms … 84.2ms          5 ( 8%)        💩+  3.8% ±  0.7%
  peak_rss           26.7MB ± 64.4KB    26.6MB … 26.7MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          299M  ±  710K      298M  …  302M           1 ( 2%)        💩+  7.5% ±  0.1%
  instructions        646M  ±  248       646M  …  646M           0 ( 0%)        💩+ 34.4% ±  0.0%
  cache_references    397K  ± 2.80K      395K  …  416K           7 (11%)          -  0.6% ±  0.6%
  cache_misses        295K  ± 8.09K      269K  …  316K           7 (11%)          -  0.0% ±  1.1%
  branch_misses      3.19M  ± 4.07K     3.18M  … 3.20M           0 ( 0%)          -  0.9% ±  0.1%

I'll work on cleaning up the patch.

@brian-pane
Copy link
Author

I just found that this patch yields an improvement for compression level = 1, but a regression for 2 or higher.

Benchmark 1 (38 runs): ./compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           134ms ± 2.85ms     132ms …  148ms          3 ( 8%)        0%
  peak_rss           25.0MB ± 70.0KB    24.8MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          526M  ± 5.86M      522M  …  556M           3 ( 8%)        0%
  instructions       1.16G  ±  294      1.16G  … 1.16G           0 ( 0%)        0%
  cache_references    383K  ± 9.59K      375K  …  430K           2 ( 5%)        0%
  cache_misses        294K  ± 5.15K      282K  …  302K           0 ( 0%)        0%
  branch_misses      6.87M  ± 5.27K     6.86M  … 6.88M           0 ( 0%)        0%
Benchmark 2 (38 runs): ./target/release/examples/compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           133ms ± 1.25ms     132ms …  139ms          1 ( 3%)          -  0.5% ±  0.8%
  peak_rss           25.0MB ± 71.7KB    24.8MB … 25.0MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          526M  ±  807K      524M  …  527M           2 ( 5%)          -  0.1% ±  0.4%
  instructions       1.17G  ±  315      1.17G  … 1.17G           2 ( 5%)        💩+  1.5% ±  0.0%
  cache_references    381K  ± 4.09K      376K  …  397K           2 ( 5%)          -  0.5% ±  0.9%
  cache_misses        296K  ± 3.94K      289K  …  303K           0 ( 0%)          +  0.9% ±  0.7%
  branch_misses      6.92M  ± 4.98K     6.91M  … 6.93M           2 ( 5%)          +  0.7% ±  0.0%

I'll have to do more digging to find out what's happening at the higher levels.

@brian-pane
Copy link
Author

In the meantime, I created PR #262 which contains just the small part of this change that gives me an improvement at compression level 1 without causing a regression at higher levels.

@brian-pane
Copy link
Author

When rebased to pick up the other recent optimizations, this patch now yields a regression even at compression level 1. I'll abandon this PR and do another round of profiling to look for new hotspots.

@brian-pane brian-pane closed this Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants