Refactoring to allow inlining of quick_insert_string #261

brian-pane · 2024-12-12T01:15:21Z

Before-and-after benchmarks from an x86_64 test system:

Benchmark 1 (56 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.8ms ± 2.42ms    88.2ms …  106ms          3 ( 5%)        0%
  peak_rss           26.7MB ± 65.7KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          333M  ±  836K      332M  …  335M           0 ( 0%)        0%
  instructions        747M  ±  254       747M  …  747M           0 ( 0%)        0%
  cache_references    400K  ± 6.41K      396K  …  434K           4 ( 7%)        0%
  cache_misses        299K  ± 4.24K      282K  …  311K           6 (11%)        0%
  branch_misses      3.15M  ± 5.68K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          89.3ms ±  582us    88.3ms … 90.8ms          2 ( 4%)          -  0.5% ±  0.7%
  peak_rss           26.7MB ± 78.2KB    26.5MB … 26.7MB          0 ( 0%)          -  0.1% ±  0.1%
  cpu_cycles          333M  ± 1.45M      331M  …  341M           1 ( 2%)          -  0.1% ±  0.1%
  instructions        736M  ±  268       736M  …  736M           1 ( 2%)        ⚡-  1.5% ±  0.0%
  cache_references    400K  ± 3.33K      397K  …  411K           3 ( 5%)          +  0.1% ±  0.5%
  cache_misses        296K  ± 6.42K      277K  …  306K           6 (11%)          -  0.9% ±  0.7%
  branch_misses      3.09M  ± 7.74K     3.07M  … 3.11M           2 ( 4%)        ⚡-  1.9% ±  0.1%

…RC32 case Before-and-after benchmarks from an x86_64 test system: ``` Benchmark 1 (56 runs): ./compress-baseline 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 89.8ms ± 2.42ms 88.2ms … 106ms 3 ( 5%) 0% peak_rss 26.7MB ± 65.7KB 26.5MB … 26.7MB 0 ( 0%) 0% cpu_cycles 333M ± 836K 332M … 335M 0 ( 0%) 0% instructions 747M ± 254 747M … 747M 0 ( 0%) 0% cache_references 400K ± 6.41K 396K … 434K 4 ( 7%) 0% cache_misses 299K ± 4.24K 282K … 311K 6 (11%) 0% branch_misses 3.15M ± 5.68K 3.14M … 3.16M 0 ( 0%) 0% Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 89.3ms ± 582us 88.3ms … 90.8ms 2 ( 4%) - 0.5% ± 0.7% peak_rss 26.7MB ± 78.2KB 26.5MB … 26.7MB 0 ( 0%) - 0.1% ± 0.1% cpu_cycles 333M ± 1.45M 331M … 341M 1 ( 2%) - 0.1% ± 0.1% instructions 736M ± 268 736M … 736M 1 ( 2%) ⚡- 1.5% ± 0.0% cache_references 400K ± 3.33K 397K … 411K 3 ( 5%) + 0.1% ± 0.5% cache_misses 296K ± 6.42K 277K … 306K 6 (11%) - 0.9% ± 0.7% branch_misses 3.09M ± 7.74K 3.07M … 3.11M 2 ( 4%) ⚡- 1.9% ± 0.1% ```

brian-pane · 2024-12-12T01:19:44Z

This is a work in progress, and it still needs some cleanup, but I'm posting this snapshot for feedback because the approach seems to improve deflate performance.
I think the benefit I'm seeing is happening less because of the inlining itself and more because the inlining enables the compiler to reuse the result of a slightly expensive read operation that happens in both quick_insert_string and deflate_quick (see the "prefetch" part in quick.rs).

… least on x86_64 Benchmarks: ``` Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 91.5ms ± 2.51ms 89.1ms … 109ms 1 ( 2%) 0% peak_rss 26.7MB ± 64.3KB 26.6MB … 26.7MB 0 ( 0%) 0% cpu_cycles 334M ± 811K 333M … 336M 0 ( 0%) 0% instructions 747M ± 306 747M … 747M 0 ( 0%) 0% cache_references 404K ± 16.4K 399K … 519K 3 ( 5%) 0% cache_misses 312K ± 3.74K 301K … 321K 5 ( 9%) 0% branch_misses 3.15M ± 6.03K 3.14M … 3.16M 0 ( 0%) 0% Benchmark 2 (57 runs): ./target/release/examples/compress 1 rs silesia-small.tar measurement mean ± σ min … max outliers delta wall_time 87.9ms ± 968us 86.0ms … 90.9ms 1 ( 2%) ⚡- 3.9% ± 0.8% peak_rss 26.7MB ± 68.0KB 26.5MB … 26.7MB 13 (23%) + 0.1% ± 0.1% cpu_cycles 319M ± 1.33M 317M … 327M 3 ( 5%) ⚡- 4.4% ± 0.1% instructions 720M ± 227 720M … 720M 0 ( 0%) ⚡- 3.6% ± 0.0% cache_references 403K ± 3.14K 399K … 410K 0 ( 0%) - 0.5% ± 1.1% cache_misses 313K ± 2.73K 302K … 317K 2 ( 4%) + 0.1% ± 0.4% branch_misses 3.15M ± 6.11K 3.14M … 3.17M 3 ( 5%) + 0.1% ± 0.1% ```

brian-pane · 2024-12-12T02:57:04Z

With that second change, I'm now seeing a surprisingly larger improvement:

Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          91.5ms ± 2.51ms    89.1ms …  109ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          334M  ±  811K      333M  …  336M           0 ( 0%)        0%
  instructions        747M  ±  306       747M  …  747M           0 ( 0%)        0%
  cache_references    404K  ± 16.4K      399K  …  519K           3 ( 5%)        0%
  cache_misses        312K  ± 3.74K      301K  …  321K           5 ( 9%)        0%
  branch_misses      3.15M  ± 6.03K     3.14M  … 3.16M           0 ( 0%)        0%
Benchmark 2 (57 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          87.9ms ±  968us    86.0ms … 90.9ms          1 ( 2%)        ⚡-  3.9% ±  0.8%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB         13 (23%)          +  0.1% ±  0.1%
  cpu_cycles          319M  ± 1.33M      317M  …  327M           3 ( 5%)        ⚡-  4.4% ±  0.1%
  instructions        720M  ±  227       720M  …  720M           0 ( 0%)        ⚡-  3.6% ±  0.0%
  cache_references    403K  ± 3.14K      399K  …  410K           0 ( 0%)          -  0.5% ±  1.1%
  cache_misses        313K  ± 2.73K      302K  …  317K           2 ( 4%)          +  0.1% ±  0.4%
  branch_misses      3.15M  ± 6.11K     3.14M  … 3.17M           3 ( 5%)          +  0.1% ±  0.1%

folkertdev · 2024-12-12T14:17:59Z

I can't really reproduce these results, comparing main and 40b56d1

with no RUSTFLAGS

Benchmark 1 (59 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          85.8ms ±  573us    84.7ms … 87.5ms          1 ( 2%)        0%
  peak_rss           26.7MB ± 62.5KB    26.5MB … 26.7MB         14 (24%)        0%
  cpu_cycles          320M  ± 1.91M      318M  …  327M           3 ( 5%)        0%
  instructions        720M  ±  255       720M  …  720M           0 ( 0%)        0%
  cache_references   19.8M  ±  154K     19.6M  … 20.7M           1 ( 2%)        0%
  cache_misses        425K  ± 62.6K      345K  …  752K           1 ( 2%)        0%
  branch_misses      3.00M  ± 13.6K     2.99M  … 3.05M           4 ( 7%)        0%
Benchmark 2 (59 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          85.9ms ± 1.04ms    84.6ms … 90.8ms          5 ( 8%)          +  0.1% ±  0.4%
  peak_rss           26.7MB ± 56.2KB    26.6MB … 26.7MB         14 (24%)          +  0.0% ±  0.1%
  cpu_cycles          320M  ± 4.09M      316M  …  339M           5 ( 8%)          +  0.0% ±  0.4%
  instructions        720M  ±  270       720M  …  720M           1 ( 2%)          +  0.0% ±  0.0%
  cache_references   19.7M  ±  153K     19.4M  … 20.2M           2 ( 3%)          -  0.2% ±  0.3%
  cache_misses        380K  ± 62.8K      282K  …  522K           0 ( 0%)        ⚡- 10.6% ±  5.4%
  branch_misses      2.99M  ± 3.16K     2.99M  … 3.01M           1 ( 2%)          -  0.1% ±  0.1%

with RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread" (only native should be relevant for deflate, but the jump thread one is quite relevant for inflate).

Benchmark 1 (65 runs): target/release/examples/compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.0ms ±  796us    76.8ms … 80.7ms          5 ( 8%)        0%
  peak_rss           26.7MB ± 77.2KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          283M  ± 2.68M      280M  …  293M           5 ( 8%)        0%
  instructions        650M  ±  260       650M  …  650M           0 ( 0%)        0%
  cache_references   20.0M  ±  214K     19.6M  … 21.0M           4 ( 6%)        0%
  cache_misses        445K  ±  102K      316K  … 1.01M           4 ( 6%)        0%
  branch_misses      2.99M  ± 3.37K     2.98M  … 3.00M           0 ( 0%)        0%
Benchmark 2 (65 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.0ms ±  880us    77.0ms … 82.1ms          3 ( 5%)          +  0.1% ±  0.4%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          284M  ± 3.93M      280M  …  305M           4 ( 6%)          +  0.2% ±  0.4%
  instructions        650M  ±  283       650M  …  650M           0 ( 0%)          +  0.0% ±  0.0%
  cache_references   20.0M  ±  227K     19.7M  … 21.0M           4 ( 6%)          -  0.1% ±  0.4%
  cache_misses        485K  ±  104K      363K  …  979K           2 ( 3%)        💩+  8.9% ±  7.9%
  branch_misses      2.99M  ± 3.38K     2.98M  … 3.00M           0 ( 0%)          -  0.0% ±  0.0%

this could totally be a CPU thing though. On the face of it I would expect the changes you made to be an improvement. did you benchmark with target-cpu=native?

brian-pane · 2024-12-12T16:06:20Z

I was compiling without any RUSTFLAGS. When I build with

RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread" cargo build --release --example compress

there is still a difference on my system, but smaller than before:

Benchmark 1 (61 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          82.6ms ± 1.47ms    80.7ms … 89.1ms          6 (10%)        0%
  peak_rss           26.7MB ± 65.6KB    26.6MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          303M  ±  752K      302M  …  305M           0 ( 0%)        0%
  instructions        655M  ±  268       655M  …  655M           0 ( 0%)        0%
  cache_references    400K  ± 8.22K      396K  …  452K           3 ( 5%)        0%
  cache_misses        292K  ± 9.42K      270K  …  316K          13 (21%)        0%
  branch_misses      3.15M  ± 8.82K     3.14M  … 3.18M           3 ( 5%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.2ms ±  559us    80.3ms … 82.7ms          0 ( 0%)        ⚡-  1.7% ±  0.5%
  peak_rss           26.7MB ± 62.5KB    26.6MB … 26.7MB          0 ( 0%)          +  0.1% ±  0.1%
  cpu_cycles          299M  ±  581K      298M  …  301M           1 ( 2%)        ⚡-  1.2% ±  0.1%
  instructions        646M  ±  438       646M  …  646M           1 ( 2%)        ⚡-  1.3% ±  0.0%
  cache_references    398K  ± 3.19K      396K  …  413K           3 ( 5%)          -  0.5% ±  0.5%
  cache_misses        293K  ± 7.67K      261K  …  304K           4 ( 6%)          +  0.3% ±  1.0%
  branch_misses      3.19M  ± 4.23K     3.17M  … 3.19M           2 ( 3%)          +  1.0% ±  0.1%

brian-pane · 2024-12-12T16:07:55Z

I've been testing on

model name      : 12th Gen Intel(R) Core(TM) i5-12400

folkertdev · 2024-12-12T16:26:10Z

for reference I'm on AMD Ryzen 7 5700G with Radeon Graphics.

This is not a regression anywhere as far as I can tell (tried it on aarch64 too, no significant change in walltime), and if it's an improvement for you I'd be ok with it if you clean up the code a bit.

Out of curiosity, how does zlib-rs compare to zlib-ng on your machine? I see

Benchmark 1 (68 runs): target/release/examples/blogpost-compress 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          74.4ms ± 1.22ms    72.8ms … 80.4ms          1 ( 1%)        0%
  peak_rss           26.6MB ± 15.9KB    26.5MB … 26.6MB          1 ( 1%)        0%
  cpu_cycles          266M  ± 3.66M      263M  …  288M           3 ( 4%)        0%
  instructions        460M  ±  263       460M  …  460M           0 ( 0%)        0%
  cache_references   19.6M  ±  250K     19.3M  … 21.0M           3 ( 4%)        0%
  cache_misses        530K  ±  144K      352K  … 1.45M           3 ( 4%)        0%
  branch_misses      3.32M  ± 3.39K     3.32M  … 3.34M           5 ( 7%)        0%
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.3ms ± 1.13ms    76.5ms … 81.8ms          3 ( 5%)        💩+  5.2% ±  0.5%
  peak_rss           26.7MB ± 73.7KB    26.5MB … 26.7MB          0 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles          284M  ± 3.85M      280M  …  301M           4 ( 6%)        💩+  6.7% ±  0.5%
  instructions        650M  ±  257       650M  …  650M           0 ( 0%)        💩+ 41.4% ±  0.0%
  cache_references   19.9M  ±  167K     19.7M  … 20.7M           4 ( 6%)        💩+  1.7% ±  0.4%
  cache_misses        453K  ±  103K      343K  … 1.03M           2 ( 3%)        ⚡- 14.5% ±  8.1%
  branch_misses      2.99M  ± 3.38K     2.98M  … 3.00M           0 ( 0%)        ⚡- 10.1% ±  0.0%

brian-pane · 2024-12-12T17:26:40Z

Here's what I get when I benchmark against zlib-ng (after compiling with RUSTFLAGS="-Ctarget-cpu=native -Cllvm-args=-enable-dfa-jump-thread") :

Benchmark 1 (64 runs): ./compress-baseline 1 ng silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          78.5ms ± 2.09ms    75.8ms … 92.7ms          1 ( 2%)        0%
  peak_rss           26.6MB ±    0      26.6MB … 26.6MB          0 ( 0%)        0%
  cpu_cycles          278M  ± 1.17M      277M  …  285M           3 ( 5%)        0%
  instructions        481M  ±  265       481M  …  481M           0 ( 0%)        0%
  cache_references    400K  ± 8.68K      395K  …  448K           6 ( 9%)        0%
  cache_misses        295K  ± 10.6K      267K  …  318K           0 ( 0%)        0%
  branch_misses      3.21M  ± 7.95K     3.20M  … 3.23M           0 ( 0%)        0%
Benchmark 2 (62 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.5ms ±  770us    79.9ms … 84.2ms          5 ( 8%)        💩+  3.8% ±  0.7%
  peak_rss           26.7MB ± 64.4KB    26.6MB … 26.7MB          0 ( 0%)          +  0.3% ±  0.1%
  cpu_cycles          299M  ±  710K      298M  …  302M           1 ( 2%)        💩+  7.5% ±  0.1%
  instructions        646M  ±  248       646M  …  646M           0 ( 0%)        💩+ 34.4% ±  0.0%
  cache_references    397K  ± 2.80K      395K  …  416K           7 (11%)          -  0.6% ±  0.6%
  cache_misses        295K  ± 8.09K      269K  …  316K           7 (11%)          -  0.0% ±  1.1%
  branch_misses      3.19M  ± 4.07K     3.18M  … 3.20M           0 ( 0%)          -  0.9% ±  0.1%

I'll work on cleaning up the patch.

brian-pane · 2024-12-12T18:22:25Z

I just found that this patch yields an improvement for compression level = 1, but a regression for 2 or higher.

Benchmark 1 (38 runs): ./compress-baseline 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           134ms ± 2.85ms     132ms …  148ms          3 ( 8%)        0%
  peak_rss           25.0MB ± 70.0KB    24.8MB … 25.0MB          0 ( 0%)        0%
  cpu_cycles          526M  ± 5.86M      522M  …  556M           3 ( 8%)        0%
  instructions       1.16G  ±  294      1.16G  … 1.16G           0 ( 0%)        0%
  cache_references    383K  ± 9.59K      375K  …  430K           2 ( 5%)        0%
  cache_misses        294K  ± 5.15K      282K  …  302K           0 ( 0%)        0%
  branch_misses      6.87M  ± 5.27K     6.86M  … 6.88M           0 ( 0%)        0%
Benchmark 2 (38 runs): ./target/release/examples/compress 2 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           133ms ± 1.25ms     132ms …  139ms          1 ( 3%)          -  0.5% ±  0.8%
  peak_rss           25.0MB ± 71.7KB    24.8MB … 25.0MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles          526M  ±  807K      524M  …  527M           2 ( 5%)          -  0.1% ±  0.4%
  instructions       1.17G  ±  315      1.17G  … 1.17G           2 ( 5%)        💩+  1.5% ±  0.0%
  cache_references    381K  ± 4.09K      376K  …  397K           2 ( 5%)          -  0.5% ±  0.9%
  cache_misses        296K  ± 3.94K      289K  …  303K           0 ( 0%)          +  0.9% ±  0.7%
  branch_misses      6.92M  ± 4.98K     6.91M  … 6.93M           2 ( 5%)          +  0.7% ±  0.0%

I'll have to do more digging to find out what's happening at the higher levels.

brian-pane · 2024-12-12T18:45:22Z

In the meantime, I created PR #262 which contains just the small part of this change that gives me an improvement at compression level 1 without causing a regression at higher levels.

brian-pane · 2024-12-12T23:11:49Z

When rebased to pick up the other recent optimizations, this patch now yields a regression even at compression level 1. I'll abandon this PR and do another round of profiling to look for new hotspots.

brian-pane closed this Dec 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring to allow inlining of quick_insert_string #261

Refactoring to allow inlining of quick_insert_string #261

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024 •

edited

Loading

brian-pane commented Dec 12, 2024

folkertdev commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

folkertdev commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

Refactoring to allow inlining of quick_insert_string #261

Refactoring to allow inlining of quick_insert_string #261

Conversation

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024 • edited Loading

brian-pane commented Dec 12, 2024

folkertdev commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

folkertdev commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024

brian-pane commented Dec 12, 2024 •

edited

Loading