Optimize a scan of non-state-changing bytes with SSSE3 instructions #58
Conversation
Force-pushed 71da0d6 to 98cfe64
This commit optimizes the scan of non-state-changing bytes using SSSE3 instructions. The [_mm_cmpestri](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpestri) operation appears to be quite slow compared to an alternative approach: use [_mm_shuffle_epi8](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8) on the low and high nibbles of the input and bitwise-AND the results, performing 16 LUT lookups in one go (the other SSE2 operations involved all have good latency/throughput properties). The resulting 16 bytes of LUT values can then be analyzed (also vectorized) to get the index of the first byte, if any, that changes the state: it is the first byte whose LUT value is zero.

The tricky part is the following:

```
Find arrays A, B (uint8_t[16]) such that
 * `A[i] & B[j] == 0` if `LUT[i | (j << 4)] == 0`
 * `A[i] & B[j] != 0` if `LUT[i | (j << 4)] != 0`
// Note we don't need any specific non-zero value for all i, j = 0..15.
```

To find `A` and `B` satisfying the above conditions, the [Z3](https://github.com/Z3Prover/z3) solver is used. The npm package that wraps Z3 for use in TypeScript is not particularly friendly to the author of this change, so another package (synckit) was required to handle the async API of the Z3 wrapper.

Using llhttp as a benchmark framework, this change yields the following improvements:

```
Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

http: "seanmonstar/httparse" (C)
BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s
~20% improvement

http: "nodejs/http-parser" (C)
BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s
~11% improvement
```

For more header-fields-heavy messages the numbers might be even more convincing.
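A minimal sketch of the nibble-table classification described above, with illustrative hand-built tables rather than the PR's Z3-derived ones (here the only state-changing bytes are assumed to be 0x00, 0x0A and 0x0D), plus a scalar fallback so non-SSSE3 builds still work:

```c
#include <stdint.h>

/* Illustrative tables (NOT the PR's): bit1 keeps every byte with a
 * non-zero high nibble "non-changing"; bit0 separates low nibbles
 * 0/A/D inside the 0x00..0x0F row. */
static const uint8_t A_TAB[16] = {
    0x02, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03, 0x03,
    0x03, 0x03, 0x02, 0x03, 0x03, 0x02, 0x03, 0x03,
};
static const uint8_t B_TAB[16] = {
    0x01, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
    0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02, 0x02,
};

/* Scalar reference: byte c changes state iff A[c & 0xF] & B[c >> 4] == 0. */
static int scan16_scalar(const uint8_t *p) {
    for (int i = 0; i < 16; i++)
        if ((A_TAB[p[i] & 0x0F] & B_TAB[p[i] >> 4]) == 0)
            return i;
    return 16;
}

#if defined(__x86_64__) || defined(__i386__)
#include <tmmintrin.h> /* SSSE3 */

__attribute__((target("ssse3")))
static int scan16_ssse3(const uint8_t *p) {
    const __m128i a_tab = _mm_loadu_si128((const __m128i *)A_TAB);
    const __m128i b_tab = _mm_loadu_si128((const __m128i *)B_TAB);
    const __m128i in = _mm_loadu_si128((const __m128i *)p);
    const __m128i lo = _mm_and_si128(in, _mm_set1_epi8(0x0F));
    const __m128i hi =
        _mm_and_si128(_mm_srli_epi16(in, 4), _mm_set1_epi8(0x0F));
    /* One shuffle per nibble, then AND: 16 LUT lookups in three ops. */
    const __m128i classified = _mm_and_si128(_mm_shuffle_epi8(a_tab, lo),
                                             _mm_shuffle_epi8(b_tab, hi));
    /* State-changing bytes classify to zero; report the first one. */
    const int mask = _mm_movemask_epi8(
        _mm_cmpeq_epi8(classified, _mm_setzero_si128()));
    return mask ? __builtin_ctz(mask) : 16;
}
#endif

/* Index of the first state-changing byte in p[0..15], or 16 if none. */
static int scan16(const uint8_t *p) {
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("ssse3"))
        return scan16_ssse3(p);
#endif
    return scan16_scalar(p);
}
```

The real tables have to encode the parser's full byte classification for a given state, which is why the PR derives them with a solver rather than by hand.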
Force-pushed 98cfe64 to 5b7c3a9
Here are the details behind the change: https://gist.github.com/ngrodzitski/f9de8fb60ae44e7c5455f9b78a2e4134
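The property the solver-found tables must satisfy can be checked exhaustively in a few lines. Here is a sketch with a hypothetical LUT (state-changing only for 0x00, '\n' and '\r') and tables built by hand for that simple case:

```c
#include <stdint.h>

/* Exhaustive check of the decomposition: for every byte c with low
 * nibble `lo` and high nibble `hi`, A[lo] & B[hi] must be zero exactly
 * when LUT[c] is zero. */
static int tables_encode_lut(const uint8_t A[16], const uint8_t B[16],
                             const uint8_t lut[256]) {
    for (int hi = 0; hi < 16; hi++)
        for (int lo = 0; lo < 16; lo++)
            if (((A[lo] & B[hi]) != 0) != (lut[lo | (hi << 4)] != 0))
                return 0;
    return 1;
}

/* Hypothetical example: LUT is zero (state-changing) only for
 * 0x00, 0x0A ('\n') and 0x0D ('\r'). */
static void build_example(uint8_t A[16], uint8_t B[16], uint8_t lut[256]) {
    for (int c = 0; c < 256; c++)
        lut[c] = (c == 0x00 || c == 0x0A || c == 0x0D) ? 0 : 1;
    for (int i = 0; i < 16; i++) {
        /* bit1 keeps every byte with a non-zero high nibble non-changing;
         * bit0 distinguishes low nibbles 0/A/D within the 0x00..0x0F row. */
        A[i] = 0x02 | ((i == 0x0 || i == 0xA || i == 0xD) ? 0x00 : 0x01);
        B[i] = (i == 0) ? 0x01 : 0x02;
    }
}
```

For LUTs this simple a manual construction suffices; for the full parser tables the search space is why the PR reaches for Z3.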
The previous commit actually uses an SSSE3 instruction.
Force-pushed b9cd9fe to 7602ea1
@anonrig @RafaelGSS I'm not familiar with SSE. Can you please chime in here?
@ShogunPanda I'll have a look when I am back in Montreal.
This is very reasonable algorithmically. @ngrodzitski is using vectorized classification... See section "3.1.2 Vectorized Classification" in Parsing Gigabytes of JSON per Second. It is used extensively within simdutf, for example. The current implementation is specific to x64: one might want to extend it to other architectures like NEON (though that is not a limitation specific to this PR). My question at this stage is whether any of this matters when built in Node.js, because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may never be called. If I am correct, then this work might be in vain, as far as Node.js is concerned. So before I review the PR more carefully, I'd be interested in knowing whether we are making sure that this code is actually being put to good use. If not, then that's what we should fix first. (It is easily fixable, thankfully.) @ngrodzitski: are your benchmarks relevant to Node.js, or did you manually compile the code with your own flags?
That's correct. That would need to be done in the … I assume this is a huge effort, right @richardlau?
@lemire I can't say for Node.js, but I've recently ported restinio (which initially used the old nodejs/http-parser) to llhttp, so I've given llhttp a closer look, which eventually brings me here: my interest is that llhttp is fast (and correct, of course). Regarding the performance check, I used the llhttp repo, which has a simple benchmark in it. Here is how I did it:
So my data (20% and 11%) is for llhttp.
I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:
I meant, in case we want to support sse4.2, we'll need to build
I don't think we'd need an arch-specific build -- I think usually the dependency (in this case llhttp) would have to detect and select the code to use at runtime. I don't know how easy that will be to do in llhttp, so "huge effort" may or may not be correct (albeit I'd expect most of the effort to be on the llhttp side rather than the Node.js side).
And to follow up, the sse4.2 code would need to be in a separate compilation unit (i.e. a .c file).
@richardlau Yes. You definitely do not want to pass an architecture-specific flag to the compiler when building Node.js, unless you want to build your own private version.
That's one way to do it. If you look at simdutf (which is already in Node.js), you will find that there is just one compilation unit (literally just one .cpp). In any case, it is solvable here as well if we care.
See simdutf for a reference on how you can make it easy to build (just one .cpp or .c file would do).
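One way to sketch that single-translation-unit approach (function names here are hypothetical, and simdutf's actual mechanism differs): give the SIMD routine a `target` attribute so no architecture-specific compiler flags are needed, probe the CPU once on first call, and rebind a function pointer.

```c
#include <stddef.h>

/* Portable fallback. Stands in for the real scalar scan loop. */
static size_t scan_scalar(const char *p, size_t n) {
    size_t i = 0;
    while (i < n && p[i] != '\r')
        i++;
    return i;
}

#if defined(__x86_64__) || defined(__i386__)
/* The target attribute lets this compile in the same .c file without
 * passing -mssse3 for the whole build. */
__attribute__((target("ssse3")))
static size_t scan_ssse3(const char *p, size_t n) {
    /* The real SSSE3 body would live here, with identical observable
     * behavior; delegating keeps this sketch short. */
    return scan_scalar(p, n);
}
#endif

typedef size_t (*scan_fn)(const char *, size_t);

static size_t scan_dispatch(const char *p, size_t n);
static scan_fn active_scan = scan_dispatch;

/* First call probes the CPU, then rebinds the pointer so later calls
 * go straight to the chosen implementation. */
static size_t scan_dispatch(const char *p, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    active_scan = __builtin_cpu_supports("ssse3") ? scan_ssse3 : scan_scalar;
#else
    active_scan = scan_scalar;
#endif
    return active_scan(p, n);
}

size_t scan(const char *p, size_t n) { return active_scan(p, n); }
```

This relies on GCC/Clang extensions (`target` attribute, `__builtin_cpu_supports`); MSVC would need a different mechanism, e.g. `__cpuid`.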
@RafaelGSS Do you have a good HTTP benchmark already that could be used if one wants to demonstrate the benefits?
I'm not sure about the impacts of this optimization, but I believe https://github.com/nodejs/node/blob/main/benchmark/http/simple.js should be enough if it's a general parsing optimization. Using framework benchmarks should be good as well (see: https://github.com/fastify/benchmarks)