I am the maintainer of Telegraf, and we use sarama (v1.42.1) to both collect and send metrics. It has come to our attention that a user who is sending a number of batches of messages to a remote Kafka server is getting throttled. The sarama logs show the throttling, but after these messages sarama logs nothing further:
2024-01-24T18:30:27Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.363749194s
2024-01-24T18:30:27Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 646.526732ms
2024-01-24T18:30:28Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 579.534051ms
2024-01-24T18:30:28Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 560.917867ms
2024-01-24T18:30:29Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 552.540762ms
2024-01-24T18:30:29Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 584.911966ms
2024-01-24T18:30:30Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 601.034029ms
2024-01-24T18:30:31Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 556.336685ms
2024-01-24T18:30:31Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 594.819161ms
2024-01-24T18:30:32Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 557.661247ms
2024-01-24T18:30:32Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 612.071602ms
2024-01-24T18:30:33Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 583.09675ms
2024-01-24T18:30:34Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 691.045103ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 128ms
2024-01-24T18:30:34Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 8ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 52ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 86ms
2024-01-24T18:30:34Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 266ms
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 307ms
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 395ms
2024-01-24T18:30:35Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.153249261s
2024-01-24T18:30:35Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:35Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 530ms
2024-01-24T18:30:36Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 38ms
2024-01-24T18:30:36Z D! [outputs.kafka::remote_server] Wrote batch of 10000 metrics in 1.039306873s
2024-01-24T18:30:36Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:36Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 611ms
2024-01-24T18:30:36Z D! [sarama] broker/11 waiting for throttle timer
2024-01-24T18:30:37Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 551ms
2024-01-24T18:30:37Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 481ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 464ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 171ms
2024-01-24T18:30:38Z D! [sarama] broker/11 *sarama.ProduceResponse throttled 28ms
2024-01-24T18:30:54Z W! [agent] ["outputs.kafka::remote_server"] did not complete within its flush interval
2024-01-24T18:30:54Z D! [outputs.kafka::remote_server] Buffer fullness: 89153 / 1000000 metrics
For a little background on Telegraf: we send data as soon as a full batch of metrics is available. In this specific scenario, the user has multiple batches ready in quick succession, so we send them all at once. This means our calls to sarama.SyncProducer.SendMessages can overlap and run concurrently.
I realize these calls should block; however, the way the log messages are produced makes me wonder if something is getting mixed up. I am also concerned that some lock is being hit, as sarama makes no further attempts to send messages and produces no further logs at this point. The Telegraf message about not completing within its flush interval means that a call to send took longer than 10 seconds in this case and has still not completed. We continue to get this message.
I am tempted to put a lock around the call to SendMessages to see if forcing one call at a time helps here, but I wanted to see if anyone else had any ideas or thoughts on what might be at play here.