I am the maintainer of Telegraf, and we use sarama (v1.42.1) to both collect and send metrics. A situation has come to our attention where a user is sending a number of batches of messages to a remote Kafka server and getting throttled. The sarama logs show the throttling, but after these messages nothing further from sarama is logged:
2024-01-24T18:30:27Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.363749194s
2024-01-24T18:30:27Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 646.526732ms
2024-01-24T18:30:28Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 579.534051ms
2024-01-24T18:30:28Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 560.917867ms
2024-01-24T18:30:29Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 552.540762ms
2024-01-24T18:30:29Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 584.911966ms
2024-01-24T18:30:30Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 601.034029ms
2024-01-24T18:30:31Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 556.336685ms
2024-01-24T18:30:31Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 594.819161ms
2024-01-24T18:30:32Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 557.661247ms
2024-01-24T18:30:32Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 612.071602ms
2024-01-24T18:30:33Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 583.09675ms
2024-01-24T18:30:34Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 691.045103ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 128ms
2024-01-24T18:30:34Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 8ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 52ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 86ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 266ms
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 307ms
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 395ms
2024-01-24T18:30:35Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.153249261s
2024-01-24T18:30:35Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 530ms
2024-01-24T18:30:36Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 38ms
2024-01-24T18:30:36Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.039306873s
2024-01-24T18:30:36Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:36Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 611ms
2024-01-24T18:30:36Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:37Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 551ms
2024-01-24T18:30:37Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 481ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 464ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 171ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 28ms
2024-01-24T18:30:54Z W! [agent] ["outputs.kafka::remote_server"] did not complete within its flush interval
2024-01-24T18:30:54Z D! [outputs.kafka::remote_server] Buffer fullness: 89153 / 1000000 metrics
For a little background on Telegraf: we send data once a full batch of metrics is available. In this specific scenario, the user has multiple batches ready in quick succession, so we send them all at once. This means our calls to sarama.SyncProducer.SendMessages can overlap.
I realize these calls should block; however, the way the log messages are produced makes me wonder if something is getting mixed up. I am also concerned that some lock is being hit, as sarama makes no further attempts to send messages and produces no more logs at this point. The Telegraf message about not completing within its flush interval means that a call to send took longer than 10 seconds in this case and has not completed. We continue to get this message.
I am tempted to put a lock around the call to SendMessages to see if forcing one call at a time helps here, but I wanted to see if anyone else had ideas or thoughts on what might be at play.