Psync Zero-Copy Transmission #1335

Open
yairgott opened this issue Nov 21, 2024 · 0 comments

Problem Statement

In the current design, the primary maintains a replication buffer to record mutation commands for syncing the replicas. This replication buffer is implemented as a linked list of chunked buffers. The primary periodically transmits these recorded commands to each replica by issuing socket writes on the replica connections, which involve copying data from the user-space buffer to the kernel.
The transmission is performed by the writeToReplica function, which uses connWrite to send data over the socket.
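To make the copy concrete, here is a minimal sketch of a chunked replication buffer and a copying write path. The types `repl_chunk` and `replica_cursor` and the function `write_to_replica_sketch` are hypothetical, simplified stand-ins for illustration; they are not Valkey's actual structures or the real writeToReplica code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, simplified stand-ins; not Valkey's actual types. */
typedef struct repl_chunk {
    struct repl_chunk *next; /* next chunk in the linked list */
    size_t used;             /* bytes of recorded commands in buf */
    char buf[64];            /* tiny on purpose for the example */
} repl_chunk;

typedef struct replica_cursor {
    repl_chunk *chunk;       /* chunk this replica is currently reading */
    size_t offset;           /* bytes of that chunk already written */
} replica_cursor;

/* Write as much as the descriptor accepts. Every write() copies the
 * bytes from the user-space chunk into a kernel buffer.
 * Returns the total bytes written, or -1 if the first write fails. */
static ssize_t write_to_replica_sketch(int fd, replica_cursor *cur) {
    ssize_t total = 0;
    while (cur->chunk != NULL) {
        assert(cur->offset <= cur->chunk->used);
        size_t pending = cur->chunk->used - cur->offset;
        if (pending == 0) {
            if (cur->chunk->next == NULL) break; /* replica caught up */
            cur->chunk = cur->chunk->next;       /* advance to next chunk */
            cur->offset = 0;
            continue;
        }
        ssize_t n = write(fd, cur->chunk->buf + cur->offset, pending);
        if (n <= 0) return total > 0 ? total : -1;
        cur->offset += (size_t)n;
        total += n;
    }
    return total;
}
```

Each replica only needs its own cursor into the shared list, which is why the buffers are already managed in user space independently of any single socket.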

This user-space to kernel buffer copy consumes CPU cycles and increases the memory footprint. The overhead becomes more noticeable when a replica lags significantly behind the primary, as psync triggers a transmission burst. This burst may temporarily reduce the primary's responsiveness, with excessive copying and potential TCP write buffer exhaustion being major contributing factors.

Proposal

Modern Linux systems support zero-copy transmission, which operates by:

  • User space hands a buffer directly to the kernel for transmission.
  • The kernel transmits from this buffer without duplicating it.
  • Transmission progress is communicated back to user space through notifications. epoll can be used as the notification mechanism.
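The three steps above can be sketched in C. This assumes the mechanism meant here is Linux's MSG_ZEROCOPY socket flag (kernel 4.14 or later for TCP), which the issue does not name explicitly. The helper names `zc_enable`, `zc_send`, and `zc_read_completion` are hypothetical and error handling is reduced to the minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <linux/errqueue.h>

/* Fallbacks for older headers; values as in the asm-generic UAPI headers. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif
#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY 5
#endif

/* Opt in once per socket. Returns 0 on success, -1 on error. */
static int zc_enable(int fd) {
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Hand the buffer to the kernel. The caller must leave buf untouched
 * until a completion covering this send call has been read. */
static ssize_t zc_send(int fd, const void *buf, size_t len) {
    return send(fd, buf, len, MSG_ZEROCOPY);
}

/* Read one completion from the socket error queue (epoll reports its
 * readiness as EPOLLERR). On success stores the inclusive range of
 * completed send-call counters in *lo and *hi and returns 1.
 * Returns 0 if nothing is queued, -1 on any other failure. */
static int zc_read_completion(int fd, uint32_t *lo, uint32_t *hi) {
    union { char buf[256]; struct cmsghdr align; } control;
    struct msghdr msg;
    assert(lo != NULL && hi != NULL);
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
        return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm != NULL;
         cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err serr;
        memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
        if (serr.ee_errno == 0 && serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            *lo = serr.ee_info; /* first completed send call */
            *hi = serr.ee_data; /* last completed send call */
            return 1;
        }
    }
    return -1;
}
```

The kernel numbers successful MSG_ZEROCOPY send calls per socket starting from zero, so the caller can map a completed range back to the buffers it handed over and release them.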

The primary downside of zero-copy is that user space must manage the send buffer, keeping it intact until the kernel has finished with it. However, this limitation matters much less for the psync use case, as Valkey already manages the psync replication buffers.

It’s important to note that using zero-copy for psync requires careful adjustment of the replica clients' write buffer management logic, specifically the logic that ensures the total accumulated replication write buffer size, across all replica connections, stays within the value of client-output-buffer-limit replica.
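One way to picture the adjustment: with zero-copy, bytes already handed to the kernel still pin replication buffer memory until their completion arrives, so they would have to keep counting toward the limit. The sketch below is an illustration under that assumption; `zc_account` and its functions are hypothetical names, and summing across replicas simply follows the description above rather than Valkey's actual accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-replica counters; names are illustrative only. */
typedef struct zc_account {
    size_t unsent;    /* recorded, not yet handed to the kernel */
    size_t in_flight; /* handed to the kernel, completion not yet seen */
} zc_account;

/* New commands were appended for this replica. */
static void zc_on_append(zc_account *a, size_t n) { a->unsent += n; }

/* n bytes were passed to a zero-copy send; they remain pinned. */
static void zc_on_send(zc_account *a, size_t n) {
    assert(n <= a->unsent);
    a->unsent -= n;
    a->in_flight += n;
}

/* The kernel reported completion for n bytes; they can be released. */
static void zc_on_completion(zc_account *a, size_t n) {
    assert(n <= a->in_flight);
    a->in_flight -= n;
}

/* Bytes of buffer memory retained on behalf of one replica. */
static size_t zc_retained(const zc_account *a) {
    return a->unsent + a->in_flight;
}

/* Compare the total across replicas with the configured limit.
 * Assumption: a limit of 0 means "no limit". */
static bool zc_over_limit(const zc_account *replicas, size_t count,
                          size_t limit) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) total += zc_retained(&replicas[i]);
    return limit != 0 && total > limit;
}
```

With copying writes, a successful write releases the bytes immediately; here the release is deferred to `zc_on_completion`, which is the behavioral difference the limit logic would need to absorb.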

Further reading on zero-copy can be found here.
Note that this article states that zero-copy is most effective for large payloads, and experimentation is necessary to determine the minimum payload size. For Memorystore vector search cluster communication, enabling zero-copy in gRPC improved QPS by approximately 8.6%.

Zero-Copy Beyond Psync

Zero-copy can also optimize transmission to clients. In the current implementation, dictionary entries are first copied into the client object's write buffer and then copied again during transmission to the client socket, resulting in two memory copies. Using zero-copy eliminates the client socket copy.
As in the psync use case, implementing zero-copy for client transmission requires careful adjustments to the client’s write buffer management logic. The following considerations, while not exhaustive, outline key aspects to address:

  1. Adjustments to the upper-limit logic for the accumulated client send buffer size may be required, as in the psync use case. Note that the upper limit for the accumulated client send buffer is controlled by the client-output-buffer-limit normal parameter.
  2. Currently, for large dictionary entries, transmission to the client involves multiple iterations of copying chunks of data from the dictionary entry into the client send buffer, and subsequently from the send buffer to the socket. If the socket's TCP send buffer becomes full, the copying process is suspended until send buffer space becomes available, preventing excessive memory usage.
    Since zero-copy doesn’t consume TCP buffers, excessive memory usage prevention must be handled differently. One approach is to defer copying the next portion of the dictionary entry until a confirmation is received that a significant part of the pending write buffer has been received by the client.
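The deferral described in item 2 could take the form of a simple high/low watermark on in-flight bytes. This is a sketch of one possible policy, not a proposed implementation; `zc_pacer`, its fields, and the threshold values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pacing state for one client connection. */
typedef struct zc_pacer {
    size_t in_flight;  /* bytes handed to the kernel, not yet completed */
    size_t high_water; /* pause once in_flight reaches this */
    size_t low_water;  /* resume once in_flight drops to this */
    bool paused;
} zc_pacer;

/* May the next portion of the entry be copied and queued? */
static bool zc_pacer_may_send(const zc_pacer *p) { return !p->paused; }

/* Record n bytes handed to a zero-copy send. */
static void zc_pacer_on_send(zc_pacer *p, size_t n) {
    p->in_flight += n;
    if (p->in_flight >= p->high_water) p->paused = true;
}

/* Record a completion notification covering n bytes. */
static void zc_pacer_on_completion(zc_pacer *p, size_t n) {
    assert(n <= p->in_flight);
    p->in_flight -= n;
    if (p->in_flight <= p->low_water) p->paused = false;
}
```

Keeping the low watermark well below the high one means copying resumes only after a significant part of the pending data has been confirmed, instead of toggling on every notification.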