Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize longest_match inner loop #35

Open
wants to merge 1 commit into
base: gcc.amd64
Choose a base branch
from

Conversation

fhanau
Copy link

@fhanau fhanau commented Dec 4, 2022

While working on a personal data compression project that uses a modified version of cloudflare/zlib, I took an in-depth look at longest_match and was able to find some possible improvements.
This patch unrolls the inner loop in longest_match(), which is used to find the length of a potential match and depending on the data and compression level represents a major bottleneck.
The existing loop checks 8 bytes at a time and the surrounding code guarantees that strend - scan = MAXMATCH - 4 = 254 when the loop is entered (i.e., there are initially 254 bytes for the loop to check), so the scan < strend condition only becomes false after the 32nd iteration of the loop and it is possible to unroll the loop to process 16 bytes per iteration and thus effectively ignore every odd-numbered check.

On an x86 PNG benchmark using settings close to level 9, the new version speeds up deflate() by 7% using clang. I expect a smaller impact for a lower compression level or files with lower redundancy, but there should still be improvements across the board. I did not see a performance improvement when unrolling the loop further or trying to get the clang to unroll the loop itself.

I have some more improvements for longest_match if you are interested, but those may increase code complexity and require more discussion.

@fhanau
Copy link
Author

fhanau commented Feb 4, 2023

Based on additional testing, the given change decreases performance on some data, notably files with high entropy, perhaps due to worse pipelining. I think there's still potential to improve longest_match(), I'll play around with it a bit more.

@nmoinvaz
Copy link

nmoinvaz commented Feb 5, 2023

@fhanau you may want to take a look at what we have done in match_tpl.h in zlib-ng.

@fhanau
Copy link
Author

fhanau commented Feb 5, 2023

@nmoinvaz Thanks for the suggestion! Some of the changes are perhaps too extensive to fit in here, but I think the vectorized COMPARE256() looks interesting, I looked at vectorizing the inner loop but didn't consider _mm_cmpeq_epi8 before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants