
Native Multithreading for Quiche Send and Recv #54

Open
CMCDragonkai opened this issue Sep 25, 2023 · 2 comments
Labels
development Standard development r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments


CMCDragonkai commented Sep 25, 2023

Specification

Quiche's connection send and recv involve processing one or more QUIC packets (one-to-one with a UDP datagram), each of which can contain multiple QUIC frames.


The processing of a single packet ends up dispatching into a single connection to handle its data.

The QUICSocket.handleSocketMessage therefore needs to run as fast as possible in order to drain the kernel's UDP receive buffer.

However, at the moment all of the quiche operations run on the main thread, so their work is done synchronously and blocks Node's main thread.

This work can be quite heavy: processing the received frames involves cryptographic decryption, and the frames sent back immediately involve cryptographic encryption.

JS multi-threading can be quite slow, with overheads around 1.4 ms: MatrixAI/js-workers#1 (comment). That means an operation needs to take longer than 1.4 ms for offloading to be worth it. This will also need the zero-copy transfer capability to ensure buffers are shared rather than copied.

┌───────────────────────┐
│      Main Thread      │
│                       │
│     ┌───────────┐     │
│     │Node Buffer│     │
│     └─────┬─────┘     │          ┌────────────────────────┐
│           │           │          │                        │
│     Slice │ Copy      │          │      Worker Thread     │
│           │           │          │                        │
│  ┌────────▼────────┐  │ Transfer │  ┌──────────────────┐  │
│  │Input ArrayBuffer├──┼──────────┼──►                  │  │
│  └─────────────────┘  │          │  └─────────┬────────┘  │
│                       │          │            │           │
│                       │          │    Compute │           │
│                       │          │            │           │
│  ┌─────────────────┐  │          │  ┌─────────▼────────┐  │
│  │                 ◄──┼──────────┼──┤Output ArrayBuffer│  │
│  └─────────────────┘  │ Transfer │  └──────────────────┘  │
│                       │          │                        │
│                       │          │                        │
└───────────────────────┘          └────────────────────────┘
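The transfer flow in the diagram can be sketched with Node's `worker_threads`. This is a hypothetical illustration: `processOffThread` and the xor "compute" step are stand-ins, not js-quic APIs; the point is the slice-copy on the main thread followed by zero-copy transfers in both directions.

```typescript
import { Worker } from 'node:worker_threads';

// Worker body: receives a transferred ArrayBuffer, does CPU-bound work on
// it (standing in for quiche decrypt/encrypt), and transfers the result back.
const workerCode = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', (input) => {
    const view = new Uint8Array(input);
    for (let i = 0; i < view.length; i++) view[i] ^= 0xff; // fake "compute"
    // Transfer the result back; no copy crosses the thread boundary.
    parentPort.postMessage(view.buffer, [view.buffer]);
  });
`;

async function processOffThread(datagram: Buffer): Promise<Uint8Array> {
  // Slice-copy: pooled Node Buffers share one big ArrayBuffer, so the
  // relevant bytes must be copied into a fresh ArrayBuffer before that
  // backing store can be transferred (detached) to the worker.
  const input = new Uint8Array(datagram).buffer;
  const worker = new Worker(workerCode, { eval: true });
  try {
    return await new Promise<Uint8Array>((resolve, reject) => {
      worker.once('message', (output: ArrayBuffer) => {
        resolve(new Uint8Array(output));
      });
      worker.once('error', reject);
      worker.postMessage(input, [input]); // zero-copy transfer
    });
  } finally {
    await worker.terminate();
  }
}

void (async () => {
  const out = await processOffThread(Buffer.from([0x00, 0x0f, 0xf0]));
  console.log(Array.from(out)); // [255, 240, 15]
})();
```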

Native multi-threading is likely to be faster. The napi-rs bridge offers the ability to create native OS pthreads, which are separate from Node.js's libuv thread pool; the libuv pool is intended for IO, while quiche's operations are all CPU bound. Benchmarking this will be important to understand how fast the operations are.

Naive benchmarks of quiche recv and send pairs between client and server indicated these levels of performance:

Optimised native code: ~250,000 iterations per second
Optimised FFI native code: ~35,000 ops per second
Optimised js-quic: only ~1,896 ops per second

Each iteration is 2 recv-and-send pairs: one for the client and one for the server.

The goal is to get js-quic code as close as possible to FFI native code, and perhaps exceed it.

Another source of slowdowns might be the FFI of napi-rs itself. This would be a separate problem though.

Additional context

Tasks

  1. Investigate napi-rs threading.
  2. Create a separate thread on the Rust side, and run the quiche connection in it.
  3. Put all quiche connection processing into that single separate thread; don't create a new thread for each quiche connection. The primary goal is to unblock Node's main thread from processing the quiche operations, so that Node can continue processing the IO.
  4. Benchmark with the single separate thread.
  5. Then create a thread pool, and send every quiche connection execution to the thread pool.
  6. If a thread pool is used, all quiche connection memory will likely have to be shared, rather than being pinned to a long-running thread.
  7. Using a thread pool might get complicated to manage, as it means the native code holds state. If this is too complicated, try js-workers to see whether the 1.4 ms overhead is still worth it.
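Before committing to the native side, the single-separate-thread shape (tasks 2–4) can be prototyped in TypeScript with one long-lived worker that owns all per-connection state, leaving the main thread free for IO. This is a hypothetical sketch: the connection map and `packetsSeen` counter are illustrative stand-ins for quiche connections, and `demo` is not a js-quic API.

```typescript
import { Worker } from 'node:worker_threads';

// Worker body: owns all connection state; datagrams for every connection
// are routed here so the main thread never blocks on quiche processing.
const workerCode = `
  const { parentPort } = require('node:worker_threads');
  const connections = new Map(); // connectionId -> per-connection state
  parentPort.on('message', ({ connectionId, data }) => {
    let state = connections.get(connectionId);
    if (state === undefined) {
      state = { packetsSeen: 0 };
      connections.set(connectionId, state);
    }
    state.packetsSeen += 1;
    // A real implementation would call quiche's conn.recv() then conn.send()
    // here and post the resulting datagrams back for the socket to write out.
    parentPort.postMessage({ connectionId, packetsSeen: state.packetsSeen });
  });
`;

async function demo(): Promise<number[]> {
  const worker = new Worker(workerCode, { eval: true });
  const seen: number[] = [];
  const done = new Promise<void>((resolve) => {
    worker.on('message', (reply: { packetsSeen: number }) => {
      seen.push(reply.packetsSeen);
      if (seen.length === 2) resolve();
    });
  });
  // The main thread stays on IO: it just forwards datagrams to the worker.
  worker.postMessage({ connectionId: 'c1', data: new Uint8Array([1]) });
  worker.postMessage({ connectionId: 'c1', data: new Uint8Array([2]) });
  await done;
  await worker.terminate();
  return seen;
}

void demo().then((seen) => console.log(seen)); // [1, 2]
```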
@CMCDragonkai CMCDragonkai added the development Standard development label Sep 25, 2023
@CMCDragonkai
Member Author

~1,500 operations per second right now on QUIC, versus ~15,000 ops per second for just FFI to quiche. Plenty of room for optimisation.

One of the big things is optimising just how many send calls are needed.

I think it would be a good idea to add perf hooks to handleSocketMessage to measure exactly how long one execution of it takes.
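A minimal sketch of that measurement with Node's `perf_hooks`, where `handleSocketMessage` is a stand-in for `QUICSocket.handleSocketMessage`, not the real method:

```typescript
import { performance } from 'node:perf_hooks';

// Mark the start and end of each invocation, then create a measure entry
// whose duration is the wall-clock time of one handler execution.
function handleSocketMessage(data: Uint8Array): number {
  performance.mark('hsm:start');
  // ... quiche recv/send processing would go here; summing bytes is a
  // placeholder workload ...
  let sum = 0;
  for (const byte of data) sum += byte;
  performance.mark('hsm:end');
  // In Node, performance.measure() returns the PerformanceMeasure entry.
  const measure = performance.measure('handleSocketMessage', 'hsm:start', 'hsm:end');
  console.log(`handleSocketMessage took ${measure.duration.toFixed(3)} ms`);
  return sum;
}

handleSocketMessage(new Uint8Array([1, 2, 3])); // logs the duration; returns 6
```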

@CMCDragonkai
Member Author

Oh, and don't forget about adjusting the socket buffer. The longer handleSocketMessage takes, the more likely data in the UDP socket buffer is being dropped. The kernel drops the most recent incoming data, thus forcing re-sending at the QUIC protocol level.
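Enlarging the kernel's receive buffer (SO_RCVBUF) can be done directly through Node's `dgram` module. A sketch, where the 4 MiB figure is an illustrative assumption rather than a tuned value; the kernel may clamp or round the request, so the effective size is read back after binding:

```typescript
import * as dgram from 'node:dgram';

// Request a larger UDP receive buffer so that slow handleSocketMessage
// calls drop fewer datagrams while the main thread is busy.
const socket = dgram.createSocket({
  type: 'udp4',
  recvBufferSize: 4 * 1024 * 1024, // illustrative 4 MiB request
});

// SO_RCVBUF can only be read once the socket is bound.
socket.bind(0, () => {
  console.log(`effective SO_RCVBUF: ${socket.getRecvBufferSize()} bytes`);
  socket.close();
});
```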
