-
Notifications
You must be signed in to change notification settings - Fork 702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use MSG_ZEROCOPY for plaintext replication traffic #1543
base: unstable
Are you sure you want to change the base?
Conversation
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
…regardless of tcp memory buffers Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1543 +/- ##
============================================
+ Coverage 70.77% 70.88% +0.11%
============================================
Files 120 121 +1
Lines 65005 65274 +269
============================================
+ Hits 46005 46268 +263
- Misses 19000 19006 +6
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some highlevel comments.
I still did not dive deep into the review of this, but it does sound promising.
I also wonder how this will look like when we do integrate io-uring.
there would be several options to go:
- no io-uring AND no ZERO-COPY support
- io-uring support but NO ZERO-COPY support
- both io-uring AND ZERO-COPY support
I think we would probably want to support all 3 modes, but we could also just invest in io-uring+ZERO copy support which could simplify the tracker code (am I correct in this observation?)
if (invert) { | ||
event_order[0] = AE_ERROR_QUEUE; | ||
event_order[1] = AE_WRITABLE; | ||
event_order[2] = AE_READABLE; | ||
} else { | ||
event_order[0] = AE_ERROR_QUEUE; | ||
event_order[1] = AE_READABLE; | ||
event_order[2] = AE_WRITABLE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (invert) { | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = AE_WRITABLE; | |
event_order[2] = AE_READABLE; | |
} else { | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = AE_READABLE; | |
event_order[2] = AE_WRITABLE; | |
} | |
event_order[0] = AE_ERROR_QUEUE; | |
event_order[1] = invert ? AE_WRITABLE : AE_READABLE; | |
event_order[0] = invert ? AE_READABLE : AE_WRITABLE; |
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */ | ||
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */ | ||
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \ | ||
READABLE event already fired in the same event \ | ||
loop iteration. Useful when you want to persist \ | ||
things to disk before sending replies, and want \ | ||
to do that in a group fashion. */ | ||
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */ | |
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */ | |
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \ | |
READABLE event already fired in the same event \ | |
loop iteration. Useful when you want to persist \ | |
things to disk before sending replies, and want \ | |
to do that in a group fashion. */ | |
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \ | |
#define AE_READABLE (1 << 0) /* Fire when descriptor is readable. */ | |
#define AE_WRITABLE (1 << 1) /* Fire when descriptor is writable. */ | |
#define AE_BARRIER (1 << 2) /* With WRITABLE, never fire the event if the \ | |
READABLE event already fired in the same event \ | |
loop iteration. Useful when you want to persist \ | |
things to disk before sending replies, and want \ | |
to do that in a group fashion. */ | |
#define AE_ERROR_QUEUE (1 << 3) /* Fire when descriptor has a message on the \ |
@@ -3289,6 +3303,7 @@ standardConfig static_configs[] = { | |||
createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort), | |||
createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL), | |||
createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL), | |||
createIntConfig("tcp-zerocopy-min-write-size", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.tcp_zerocopy_min_write_size, CONFIG_DEFAULT_ZERO_COPY_MIN_WRITE_SIZE, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this falls under the "configurations we do not want users to mess with". agree 10K is the recommended sweet spot, but not sure if this needs any kind of tunability option ATM.
Consider dropping this config for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also didn't have it at first - but in tests it was much easier to deterministically trigger zero copy writes if I can play with the minimum write size.
configurations we do not want users to mess with
I can move it to a DEBUG command if we would prefer, if we have concerns about maintaining new configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why it makes the test much easier? I thiknk DEBUG command is fine and also a HIDDEN config which is named prefixed/suffixed with 'debug' is fine IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why it makes the test much easier?
Ultimately we have no control over if a call to send
will return <= the number of bytes we want it to send. So this leads to nondeterministic behavior - we can call SET
with a key of length 10 KiB, and sometimes see that this doesn't trigger a zero-copy write since it is pieced into two send
syscalls.
I think this falls under the "configurations we do not want users to mess with".
At a higher level - can I ask why it would be bad to expose such a config to the user? I can imagine that a combination of network hardware and kernel version may impact what setting makes sense here. I don't expect many folks to play with it - but it isn't a confusing setting and it isn't like maintenance cost will be high.
size_t data_len = o->used - c->repl_data->ref_block_pos; | ||
int use_zerocopy = shouldUseZeroCopy(c->conn, data_len); | ||
if (use_zerocopy) { | ||
/* Lazily enable zero copy at the socket level only on first use */ | ||
if (!c->zero_copy_tracker) { | ||
connSetZeroCopy(c->conn, 1); | ||
c->zero_copy_tracker = createZeroCopyTracker(); | ||
} | ||
nwritten = zeroCopyWriteToConn(c->conn, o->buf + c->repl_data->ref_block_pos, data_len); | ||
} else { | ||
nwritten = connWrite(c->conn, o->buf + c->repl_data->ref_block_pos, data_len); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it could be encapsulated in the connection abstraction maybe? I wonder if the tracker should be a client or a connection owned? I think it makes more sense that it will be part of the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question - on my initial pass I was thinking that the tracker fit better as a client-level abstraction simply because the logic is heavily tied to the replication backlog, and having the connection deal with replication backlog mechanics felt like a bad smell. But perhaps it would be better to start with a tracker that supports a more generic callback-based approach:
typedef void (*ZeroCopyWriteCompleteCallback)(void *);
/* Tracking struct used to trigger cleanup once a zero-copy write is finished */
typedef struct zeroCopyRecord {
int active;
ZeroCopyWriteCompleteCallback callback;
void *privdata;
} zeroCopyRecord;
Then maybe connection exposes an API like:
int connWriteZeroCopy(connections *conn, char *bug, size_t len, ZeroCopyWriteCompleteCallback callback, void *privdata) {
/* Do write and enqueue zeroCopyRecord */
/* Later, when zero copy is done, connection will trigger the ZeroCopyWriteCompleteCallback with privdata */
}
Is this what you had in mind?
Yeah I believe the polling mechanism will be different:
So the difference would be in having one vs. two handlers, but ideally once the completion notification is read we can merge into a common format and process it similarly for both. There are limitations regarding io-uring adoption (https://www.phoronix.com/news/Google-Restricting-IO_uring) so having both modes would be ideal for those who are unable/unwilling to adopt IO uring. |
Summary
This PR integrates with
MSG_ZEROCOPY
for plaintext replication traffic. MSG_ZEROCOPY is a Linux kernel functionality that allows writes to avoid copying data from user space to kernel space. Instead,MSG_ZEROCOPY
allows the kernel to pin the user space data into kernel space for asynchronous writing without any copying. In turn, users have to track the ongoing writes and ensure the data is kept stable, listening on the socket's message queue for a completion notification.Design
We track the ongoing writes through a dynamic circular buffer,
zeroCopyTracker
. Each ongoing write is indexed by its sequence number, matching the one assigned by the kernel, with each entry holding a refcount into the replication backlog. We also add a new event type to the event loop APIs (AE_ERROR_QUEUE
) to register forSO_EE_ORIGIN_ZEROCOPY
notifications from the kernel. When we choose to use zero-copy for writes, we append a tracking entry to the circular buffer and register for the AE_ERROR_QUEUE event, which later fires and notifies us of a batch of completions, allowing us to trim the tracker which in turn decrements the refcount and trims the replication backlog.To maintain a net improvement on performance,
MSG_ZEROCOPY
is only used in situations where it should perform better than our typicalwrite
syscalls:MSG_ZEROCOPY
is only used for writes that are over 10 KiB (by default, user configurable)MSG_ZEROCOPY
is only used for remote connectionsMSG_ZEROCOPY
is only used for plaintext connectionsThis PR enables
MSG_ZEROCOPY
as the new default behavior for writes from the replication backlog that meet the aforementioned criteria on builds that support it. It also introduces a config to disable it altogether via--tcp-tx-zerocopy no
.Although this PR only implements it for replication, the core concept should be reusable at various other places where we do writes, so long as reference counting can be used to defer cleanup until a later
epoll
cycle is notified of the write completion. The prospects of combining this with something like IO uring via IORING_OP_SENDZC are especially exciting. We should aim to continue to evolve the core concept of thezeroCopyTracker
to meet these needs.Some other notable call outs in the implementation:
close
operation, we instead go into a draining phase where we first callshutdown
, then await the zero copy tracker to reach zero length before later callingclose
and freeing the client in entirety.zeroCopyTracker
.zeroCopyTracker
. A single replication link with 4 billion writes is not unexpected on a long-running instance.Performance
Copying and pasting some initial performance comparisons from #1335:
closes #1335