Proof of concept: caching part of the emit_dist computation #263

brian-pane · 2024-12-14T02:47:41Z

This is a quick test that isn't suitable for the main codebase, but I'm sharing it to get feedback on the general approach: saving previously computed values to reduce the time spent in emit_dist.

The key thing I noticed is that the first half of emit_dist computes a result that depends only on the ltree and the supplied lc value. I.e., for a given ltree, there are only 256 possible combinations. As a simple test, I added in a cache to cover one of the code paths that uses the STATIC_LTREE, and it seemed to help:

Benchmark 1 (62 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          81.1ms ± 1.21ms    79.3ms … 87.2ms          2 ( 3%)        0%
  peak_rss           26.7MB ± 70.2KB    26.5MB … 26.7MB          0 ( 0%)        0%
  cpu_cycles          292M  ±  577K      291M  …  294M           2 ( 3%)        0%
  instructions        612M  ±  279       612M  …  612M           0 ( 0%)        0%
  cache_references    400K  ± 2.39K      397K  …  411K           1 ( 2%)        0%
  cache_misses        305K  ± 5.88K      285K  …  318K           9 (15%)        0%
  branch_misses      3.05M  ± 3.25K     3.04M  … 3.05M           1 ( 2%)        0%
Benchmark 2 (63 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          79.5ms ± 1.41ms    77.6ms … 86.1ms          4 ( 6%)        ⚡-  1.9% ±  0.6%
  peak_rss           26.7MB ± 73.4KB    26.5MB … 26.7MB          0 ( 0%)          +  0.0% ±  0.1%
  cpu_cycles          286M  ± 1.12M      284M  …  293M           2 ( 3%)        ⚡-  2.1% ±  0.1%
  instructions        614M  ±  429       614M  …  614M           1 ( 2%)          +  0.2% ±  0.0%
  cache_references    402K  ± 6.69K      397K  …  439K           2 ( 3%)          +  0.6% ±  0.4%
  cache_misses        305K  ± 6.43K      281K  …  320K           8 (13%)          +  0.3% ±  0.7%
  branch_misses      2.86M  ± 6.04K     2.85M  … 2.87M           0 ( 0%)        ⚡-  6.1% ±  0.1%

This patch in its current form produces a small regression for compression levels higher than 1. I could fix that by replacing the cache with a static lookup table to cover the STATIC_LTREE case. Would a caching approach be useful for the code paths that use dynamic trees, though? I don't know enough about how the dynamic trees work.

brian-pane · 2024-12-14T23:04:42Z

PR #264 implements the static lookup table approach and seems to perform better.

Proof of concept: caching half of the emit_dist computation

c66ddc7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proof of concept: caching part of the emit_dist computation #263

Proof of concept: caching part of the emit_dist computation #263

brian-pane commented Dec 14, 2024

brian-pane commented Dec 14, 2024

Proof of concept: caching part of the emit_dist computation #263

Are you sure you want to change the base?

Proof of concept: caching part of the emit_dist computation #263

Conversation

brian-pane commented Dec 14, 2024

brian-pane commented Dec 14, 2024