Describe the bug
After the JetStream EventBus restarts, the Argo Events Sensor starts producing an excessive number of errors: over 200,000 in a few minutes. This significantly impacts system performance and stability. The issue is consistently reproducible, and the logs show a flood of "failed to fetch messages for subscription" errors caused by an invalid (closed) NATS subscription.
To Reproduce
Steps to reproduce the behavior:
- Deploy a JetStream EventBus
- Deploy a Sensor that connects to that EventBus
- Update the EventBus with any change, such as tolerations or affinity (the EventBus then rolls out its pods one by one)
- Check the Sensor logs, or check the log volume in Grafana with Loki :)
Expected behavior
Argo Events should gracefully handle JetStream restarts without producing an overwhelming number of errors. It should retry connections in a controlled manner rather than flooding logs and potentially overloading the system.
Screenshots
Log volume of one sensor:
Example of log:
{"level":"error","ts":1740569807.8508778,"logger":"argo-events.sensor","caller":"sensor/trigger_conn.go:202","msg":"failed to fetch messages for subscription &{mu:{state:0 sema:0} sid:2 Subject:_INBOX.vg8Q0Diy7UIus5nuVGjAcr.* Queue: jsi:0xc0015a2000 delivered:183 max:0 conn:0xc001501508 mcb:<nil> mch:<nil> errCh:<nil> closed:true sc:false connClosed:true draining:false status:0 statListeners:map[] permissionsErr:<nil> typ:1 pHead:<nil> pTail:<nil> pCond:<nil> pDone:<nil> pMsgs:0 pBytes:0 pMsgsMax:2 pBytesMax:0 pMsgsLimit:65536 pBytesLimit:67108864 dropped:0}, nats: invalid subscription\nnats: subscription closed, previousErr=nats: invalid subscription\nnats: subscription closed, previousErrTime=2025-02-26 11:36:47.850854739 +0000 UTC m=+435273.783129224","sensorName":"my-sensor-namegsm","triggerName":"my-sensor-namegsm","sensorName":"my-sensor-namegsm","stacktrace":"github.com/argoproj/argo-events/pkg/eventbus/jetstream/sensor.(*JetstreamTriggerConn).pullSubscribe\n\t/home/runner/work/argo-events/argo-events/pkg/eventbus/jetstream/sensor/trigger_conn.go:202"}
Environment (please complete the following information):
- Kubernetes: v1.27.7
- Argo: 3.6.4
- Argo Events: 1.9.5
- JetStream version: 2.10.10
Additional context
Honestly, I wouldn't even have noticed if I didn't have a few dozen sensors and a Loki that had to collect several million logs in those few minutes.
Message from the maintainers:
If you wish to see this enhancement implemented, please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.