Refactoring to allow inlining of quick_insert_string #261
Conversation
…RC32 case

Before-and-after benchmarks from an x86_64 test system:

```
Benchmark 1 (56 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers        delta
  wall_time          89.8ms ± 2.42ms    88.2ms …  106ms          3 ( 5%)       0%
  peak_rss           26.7MB ± 65.7KB    26.5MB … 26.7MB          0 ( 0%)       0%
  cpu_cycles           333M ±  836K       332M …  335M           0 ( 0%)       0%
  instructions         747M ±  254        747M …  747M           0 ( 0%)       0%
  cache_references     400K ± 6.41K       396K …  434K           4 ( 7%)       0%
  cache_misses         299K ± 4.24K       282K …  311K           6 (11%)       0%
  branch_misses       3.15M ± 5.68K      3.14M … 3.16M           0 ( 0%)       0%
Benchmark 2 (56 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers        delta
  wall_time          89.3ms ±  582us    88.3ms … 90.8ms          2 ( 4%)       -  0.5% ±  0.7%
  peak_rss           26.7MB ± 78.2KB    26.5MB … 26.7MB          0 ( 0%)       -  0.1% ±  0.1%
  cpu_cycles           333M ± 1.45M       331M …  341M           1 ( 2%)       -  0.1% ±  0.1%
  instructions         736M ±  268        736M …  736M           1 ( 2%)       ⚡-  1.5% ±  0.0%
  cache_references     400K ± 3.33K       397K …  411K           3 ( 5%)       +  0.1% ±  0.5%
  cache_misses         296K ± 6.42K       277K …  306K           6 (11%)       -  0.9% ±  0.7%
  branch_misses       3.09M ± 7.74K      3.07M … 3.11M           2 ( 4%)       ⚡-  1.9% ±  0.1%
```
This is a work in progress, and it still needs some cleanup, but I'm posting this snapshot for feedback because the approach seems to improve deflate performance.
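For context, quick_insert_string is the helper that inserts the string at the current window position into the hash table used for match finding, and it runs for essentially every byte of input in the hot deflate loop, so per-call overhead adds up. The sketch below only illustrates the general shape of such a function and why making it inlinable matters; the struct, field names, and hash function here are simplified assumptions for illustration, not the actual zlib-rs implementation.

```rust
// Hypothetical, simplified deflate state; the real zlib-rs types differ.
const HASH_BITS: u32 = 15;
const HASH_SIZE: usize = 1 << HASH_BITS;
const HASH_MASK: u32 = (HASH_SIZE as u32) - 1;

struct State {
    window: Vec<u8>,  // input window
    head: Vec<u16>,   // hash chain heads, indexed by hash value
}

impl State {
    /// Insert the string starting at `pos` into the hash table and return
    /// the previous head of its chain. Keeping the body this small and
    /// marking it `#[inline(always)]` lets the compiler inline it into the
    /// deflate loop instead of paying a call per input position.
    #[inline(always)]
    fn quick_insert_string(&mut self, pos: usize) -> u16 {
        // Toy multiplicative hash over the next four bytes; the real
        // implementation uses a different (possibly hardware-assisted) hash.
        let bytes = u32::from_le_bytes([
            self.window[pos],
            self.window[pos + 1],
            self.window[pos + 2],
            self.window[pos + 3],
        ]);
        let hash = (bytes.wrapping_mul(2654435761) >> (32 - HASH_BITS)) & HASH_MASK;
        let prev_head = self.head[hash as usize];
        self.head[hash as usize] = pos as u16;
        prev_head
    }
}

fn main() {
    let mut state = State {
        window: b"abcdefg".to_vec(),
        head: vec![0; HASH_SIZE],
    };
    // In deflate, this would be called for every position in the input window.
    let prev = state.quick_insert_string(0);
    println!("previous chain head: {prev}");
}
```

Whether the compiler actually inlines such a call depends on how the surrounding code is structured; the point of the refactoring is to make the call site simple enough that it can, with `#[inline(always)]` being the blunt way to force it.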
… least on x86_64

Benchmarks:

```
Benchmark 1 (55 runs): ./compress-baseline 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers        delta
  wall_time          91.5ms ± 2.51ms    89.1ms …  109ms          1 ( 2%)       0%
  peak_rss           26.7MB ± 64.3KB    26.6MB … 26.7MB          0 ( 0%)       0%
  cpu_cycles           334M ±  811K       333M …  336M           0 ( 0%)       0%
  instructions         747M ±  306        747M …  747M           0 ( 0%)       0%
  cache_references     404K ± 16.4K       399K …  519K           3 ( 5%)       0%
  cache_misses         312K ± 3.74K       301K …  321K           5 ( 9%)       0%
  branch_misses       3.15M ± 6.03K      3.14M … 3.16M           0 ( 0%)       0%
Benchmark 2 (57 runs): ./target/release/examples/compress 1 rs silesia-small.tar
  measurement          mean ± σ            min … max           outliers        delta
  wall_time          87.9ms ±  968us    86.0ms … 90.9ms          1 ( 2%)       ⚡-  3.9% ±  0.8%
  peak_rss           26.7MB ± 68.0KB    26.5MB … 26.7MB         13 (23%)       +  0.1% ±  0.1%
  cpu_cycles           319M ± 1.33M       317M …  327M           3 ( 5%)       ⚡-  4.4% ±  0.1%
  instructions         720M ±  227        720M …  720M           0 ( 0%)       ⚡-  3.6% ±  0.0%
  cache_references     403K ± 3.14K       399K …  410K           0 ( 0%)       -  0.5% ±  1.1%
  cache_misses         313K ± 2.73K       302K …  317K           2 ( 4%)       +  0.1% ±  0.4%
  branch_misses       3.15M ± 6.11K      3.14M … 3.17M           3 ( 5%)       +  0.1% ±  0.1%
```
With that second change, I'm now seeing a surprisingly larger improvement (see the benchmarks above).
I can't really reproduce these results, comparing with no
with
This could totally be a CPU thing, though. On the face of it, I would expect the changes you made to be an improvement. Did you benchmark with
I was compiling without any RUSTFLAGS. When I build with
there is still a difference on my system, but smaller than before:
I've been testing on
For reference, I'm on

This is not a regression anywhere as far as I can tell (I tried it on aarch64 too; no significant change in wall time), and if it's an improvement for you, I'd be OK with it if you clean up the code a bit.

Out of curiosity, how does zlib-rs compare to zlib-ng on your machine? I see
Here's what I get when I benchmark against zlib-ng (after compiling with
I'll work on cleaning up the patch.
I just found that this patch yields an improvement at compression level 1, but a regression at level 2 or higher.
I'll have to do more digging to find out what's happening at the higher levels.
In the meantime, I created PR #262, which contains just the small part of this change that gives me an improvement at compression level 1 without causing a regression at higher levels.
When rebased to pick up the other recent optimizations, this patch now yields a regression even at compression level 1. I'll abandon this PR and do another round of profiling to look for new hotspots.