respect 500~1000 linger.ms for high throughput but medium latency use cases - fire-and-forget #863

Open
ericsun2 opened this issue Nov 27, 2024 · 2 comments

ericsun2 commented Nov 27, 2024

 ProducerLinger sets how long individual topic partitions will linger waiting
 for more records before triggering a request to be built.

 Note that this option should only be used in low volume producers. The only
 benefit of lingering is to potentially build a larger batch to reduce cpu
 usage on the brokers if you have many producers all producing small amounts.

 If a produce request is triggered by any topic partition, all partitions
 with a possible batch to be sent are used and all lingers are reset.

 As mentioned, the linger is specific to topic partition. A high volume
 producer will likely be producing to many partitions; it is both unnecessary
 to linger in this case and inefficient because the client will have many
 timers running (and stopping and restarting) unnecessarily.

Let's say we have a high-volume topic with 60 partitions and 700 MiB/sec peak ingress throughput.
We want to optimize broker efficiency with a bigger batch size. In theory, we expect roughly 11 MiB/sec per partition, with 1 MiB per chunk or batch. But franz-go typically sends chunks of only 1–2 KB, even if we set linger.ms to 1000.

Is there any way we can tweak franz-go to better batch events into 4–6 MB chunks before compression (and ~1 MB after compression)?
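
For concreteness, the relevant kgo options look roughly like the sketch below (the broker address, codec choice, and exact values are illustrative placeholders, not a known-good configuration):

```go
package main

import (
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// All values below are illustrative, not a recommendation.
	client, err := kgo.NewClient(
		kgo.SeedBrokers("broker-1:9092"),                    // placeholder broker address
		kgo.ProducerLinger(time.Second),                     // linger.ms ~ 1000
		kgo.ProducerBatchMaxBytes(6<<20),                    // allow up to ~6 MiB uncompressed per batch
		kgo.ProducerBatchCompression(kgo.ZstdCompression()), // assumed codec choice
		kgo.MaxBufferedRecords(1_000_000),                   // keep enough buffered to fill large batches
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	// ... produce with client.Produce / client.ProduceSync ...
}
```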

twmb (Owner) commented Nov 28, 2024

Trying to estimate batch size after compression can only be done via a heuristic and can be fraught with problems. Worst case, the client estimates poorly and creates a compressed batch that is larger than the max batch bytes. Instead, the client buffers by uncompressed size and once linger or max batch size is hit, creates a batch -- compressing in the process.

If you want to try working around this from a user perspective, you could try increasing the max batch bytes -- e.g. if you know you have a pretty consistent 50% compression ratio, you could double the max batch bytes.
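
A sketch of that workaround, assuming a consistent ~50% compression ratio (the exact values are illustrative, and the broker's max.message.bytes must still allow the resulting compressed batch):

```go
package producer

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// newClient doubles the uncompressed batch limit on the assumption that
// payloads compress to roughly 50%, so compressed batches land near 1 MiB.
func newClient(brokers ...string) (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers(brokers...),
		kgo.ProducerLinger(time.Second),
		kgo.ProducerBatchMaxBytes(2<<20), // 2 MiB uncompressed -> ~1 MiB after the assumed 50% compression
	)
}
```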

twmb added the waiting label Nov 28, 2024
baganokodo2022 commented Dec 1, 2024

Hi @twmb,

@ericsun2 and I have been using rand.Read(data) to generate random binary payloads for our tests, resulting in a limited compression ratio. With a publishing throughput of 600K messages per second, each 1KiB in size, the Kafka broker reports an ingestion rate of approximately 600 MiB per second for a 64-partition topic. On each partition, the ingestion speed is around 9 MiB or 9K messages per second.

To optimize batching, we adjusted the following producer configurations:

ProducerLinger: Increased to 1s, 2s, 5s, and 10s.
ProducerBatchMaxBytes: Set to 1 MiB.
MaxBufferedRecords: Set to 1 million.

Our goal was to achieve a batched message size of approximately 1 MiB or 1K messages per batch. However, we observed that the NumRecords in a batch is capped at 76, significantly below the expected size.
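
For reference, a sketch of the configuration described above together with one way to observe per-batch sizes (this assumes the kgo.HookProduceBatchWritten hook and the kgo.ProduceBatchMetrics fields; adjust to the client version in use):

```go
package producer

import (
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// batchLogger logs every record batch the client writes, which is one way to
// see NumRecords plateauing far below the configured maximums.
type batchLogger struct{}

func (batchLogger) OnProduceBatchWritten(_ kgo.BrokerMetadata, topic string, partition int32, m kgo.ProduceBatchMetrics) {
	log.Printf("topic=%s partition=%d records=%d uncompressed=%dB compressed=%dB",
		topic, partition, m.NumRecords, m.UncompressedBytes, m.CompressedBytes)
}

func newObservedClient(brokers ...string) (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers(brokers...),
		kgo.WithHooks(batchLogger{}),
		kgo.ProducerLinger(time.Second),   // also tried 2s, 5s, 10s
		kgo.ProducerBatchMaxBytes(1<<20),  // 1 MiB
		kgo.MaxBufferedRecords(1_000_000), // 1 million
	)
}
```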

While reviewing the Franz-go source code, I noticed that whenever sink.maybeDrain() triggers a call to createReq(), all recBufs are drained simultaneously. Even if only one recBuf reaches the maxRecordBatchBytes, the remaining recBufs are prematurely added to the request, leading to an early drain.

Is this behavior intentional, perhaps to improve throughput or reduce latency? I’m curious if my interpretation is correct.

Thanks!
