Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize a scan of non state-chaning bytes with SSSE3 instructions #58

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ngrodzitski
Copy link

@ngrodzitski ngrodzitski commented Oct 10, 2023

This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.

A _mm_cmpestri operation appears to be quite slow compared to alternative approach that involves
_mm_shuffle_epi8 for low/high nibble of the input and using bitwise-and for the results to get a 16 bytes of LUT in one go (it also involves a bunch of other SSE2 operations which all have nice latency/throughput properties). The resulting LUT of 16 bytes can be analyzed (also vectorized) to get the index of the first byte (if any) that changes the state. That is done by figuring out the first byte that LUTs to zero.

The tricky part here is the following:

Find A, B arrays (uint8_t[16]) such that
* `A[i] & B[j] == 0` if `LUT[i | (j <<4)] == 0`
* `A[i] & B[j] != 0` if `LUT[i | (j <<4)] != 0` // Note we don't need any specific non-zero value
for all i,j = 0..15.

To find A and B satisfying the above conditions a Z3 library is used. The npm package that wrapps z3 for using in ts is not particularly friendly to the author of this change so another package (synckit) was required to handle the async API for z3-wrapper.

Using llhttp as a benchmark framework this change draws the following improvemnts:

Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

http: "seanmonstar/httparse" (C)
BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s

~20% improvement

http: "nodejs/http-parser" (C)
BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s

~11% improvement

For more header-fields-heavy messages numbers might be even more convincing.

This commit optimizes the scan of non-state-changing bytes using SSE2 instructions.

A [_mm_cmpestri](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_cmpestri) operation appears to be quite slow
compared to alternative approach that involves [_mm_shuffle_epi8](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8)
for low/high nibble of the input and using bitwise-and for the results to get a 16 bytes of LUT in one go (it also involves a bunch of other SSE2 operations
which all have nice latency/throughput properties). The resulting LUT of 16 bytes can be analyzed (also vectorized) to get the index of the first byte (if any)
that changes the state. That is done by figuring out the first byte that LUTs to zero.

The tricky part here is the following:

```
Find A, B arrays (uint8_t[16]) such that
* `A[i] & B[j] == 0` if `LUT[i | (j <<4)] == 0`
* `A[i] & B[j] != 0` if `LUT[i | (j <<4)] != 0` // Note we don't need any specific non-zero value
for all i,j = 0..15.
```

To find `A` and `B` satisfying the above conditions a [Z3](https://github.com/Z3Prover/z3) library is used.
The npm package that wrapps z3 for using in ts is not particularly friendly to the author of this change so another package (synckit)
was required to handle the async API for z3-wrapper.

Using llhttp as a benchmark framework this change draws the following improvemnts:

```
Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz

http: "seanmonstar/httparse" (C)
BEFORE: 8192.00 mb | 1456.72 mb/s | 2172811.81 ops/sec | 5.62 s
AFTER:  8192.00 mb | 1752.90 mb/s | 2614577.82 ops/sec | 4.67 s

~20% improvement

http: "nodejs/http-parser" (C)
BEFORE: 8192.00 mb | 1050.60 mb/s | 2118535.14 ops/sec | 7.80 s
AFTER:  8192.00 mb | 1167.42 mb/s | 2354101.76 ops/sec | 7.02 s

~11% improvement
```

For more header-fields-heavy messages numbers might be even more convincing.
@ngrodzitski ngrodzitski changed the title Optimize a scan of non state-chaning bytes with SSE2 instructions Optimize a scan of non state-chaning bytes with SSE3 instructions Oct 10, 2023
@ngrodzitski
Copy link
Author

Here are the details behind the change: https://gist.github.com/ngrodzitski/f9de8fb60ae44e7c5455f9b78a2e4134

The previous commit actually uses SSSE3 instruction.
@ngrodzitski ngrodzitski changed the title Optimize a scan of non state-chaning bytes with SSE3 instructions Optimize a scan of non state-chaning bytes with SSSE3 instructions Oct 11, 2023
@ShogunPanda
Copy link
Contributor

@anonrig @RafaelGSS I'm not familiar with SSE. Can you please chime in here?

@lemire
Copy link
Member

lemire commented Nov 7, 2023

@ShogunPanda I'll have a look when I am back in Montreal.

@lemire
Copy link
Member

lemire commented Nov 8, 2023

This is very reasonable algorithmically. @ngrodzitski is using vectorized classification... See section "3.1.2 Vectorized Classification" in Parsing Gigabytes of JSON per Second. It is used extensively within simdutf, for example. The current implementation is specific to x64: one might want to extend it to other architectures like NEON (though it is not a limitation specific to this PR).

My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

If I am correct, then this work might be in vain, as far as Node.js is concerned.

So before I review more carefully the PR, I'd be interested in knowing whether we are making sure that this code is actually being put to good use.

If not, then that's what we should fix first. (It is easily fixable, thankfully.)

@ngrodzitski : are your benchmarks relevant to Node.js, or did you manually compile the code with your own flags?

@RafaelGSS
Copy link
Member

My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

That's correct. That would need to be done in the deps/llhttp first so we could, technically include Node.js, but I'm almost sure it would need a new dist creation (or at the very least - updating the current building flow) to have this support.

I assume this is a huge effort, right @richardlau?

@ngrodzitski
Copy link
Author

ngrodzitski commented Nov 8, 2023

@lemire
You are right, the source from which I learned this technique is simdjson. I've only added nice automation with z3 so that this works for generic LUT (if possible per se).

I can't say for Node.js, but I've recently ported restinio which initially used old nodejs/http_parser to llhttp, thus I've given llhttp a closer look which eventually brings me here because my interest is that llhttp is fast (and correct of course).

So regarding a performance check I use llhttp repo which has a simple benchmark in it.

Here How I did it:

  1. Cloned llparse and llhttp locally.
  2. Did the changes in llparse.
  3. Used npm-link so that my llocal llparse version is used in llhttp.
  4. Rebuild llhttp and run a benchmark.

So my data (20% and 11%) is for llhttp.

@richardlau
Copy link
Member

My question at this stage is whether any of this matters when built in Node.js because I don't see runtime dispatching anywhere (did I miss it?). So, in the current code, unless you compile with sse4.2 support, the fast path may not be called (ever).

That's correct. That would need to be done in the deps/llhttp first so we could, technically include Node.js, but I'm almost sure it would need a new dist creation (or at the very least - updating the current building flow) to have this support.

I assume this is a huge effort, right @richardlau?

I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:

@RafaelGSS
Copy link
Member

I'm not entirely sure what is being asked. I'm not very familiar with llhttp and how it is generated using llparse -- for Node.js the gyp files used to build llhttp are:

I meant, in case we want to support sse4.2, we'll need to build llhttp with that support on Node.js and considering this might be an arch-specific build, it should be a different distributed binary in the release section, right? If so, it should be hard to support I guess - mostly due to the maintenance burden.

@richardlau
Copy link
Member

I don't think we'd need an arch specific build -- I think usually the dependency (in this case llhttp) would have to to detect and select the code to use at runtime. I don't know how easy that will be to do in llhttp so "huge effort" may or not be correct (albeit I'd expect most of the effort to be on the llhttp side rather than the Node.js side).
sse4.2 in Node.js shows hits in base64 and zlib (although nodejs/node@ae115d6 commented this out for zlib).

@richardlau
Copy link
Member

And to follow up, the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).

@lemire
Copy link
Member

lemire commented Nov 8, 2023

@richardlau Yes. You definitively do not want to pass an architecture specific flag to the compiler when building Node.js, unless you want to build your own private version.

the sse4.2 code would need to be in a separate compilation unit (i.e. .c file).

That's one way to do it. If you look at simdutf (which is already in Node.js), you will find that there is just one compilation unit (literally just one .cpp).

In any case, it is solvable here as well if we care.

@lemire
Copy link
Member

lemire commented Nov 8, 2023

If so, it should be hard to support I guess - mostly due to the maintenance burden.

See simdutf for a reference on how you can make it easy to build (just one .cpp or .c file would do).

@lemire
Copy link
Member

lemire commented Nov 8, 2023

@RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?

@RafaelGSS
Copy link
Member

@RafaelGSS Do you have a good http benchmark already that could be used if one wants to demonstrate benefits?

I'm not sure about the impacts of this optimization, but I believe https://github.com/nodejs/node/blob/main/benchmark/http/simple.js should be enough if that's a general parsing optimization. Using framework benchmarks should be good as well (see: https://github.com/fastify/benchmarks)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants