Problem
`connect_to_deepgram_with_backoff()` in `backend/utils/stt/streaming.py:403-425` has two critical bugs that caused the Mar 11 reconnection storm (and amplified the Mar 9 incident):
Bug 1: Blocking `time.sleep()` in async context
```python
# streaming.py:423 — CURRENT (broken)
time.sleep(backoff_delay / 1000)
```
Backend-listen runs single-worker uvicorn (one async event loop per pod). `time.sleep()` blocks the entire event loop during DG retry backoff (1-8s per attempt). When multiple connections retry concurrently:
- 10 concurrent retries = 30-50s event loop stall
- During the stall: heartbeats stop, pusher pings time out, all `websocket.receive()` calls block
- Result: ALL connections on the pod die, not just the failing ones

This is the key amplifier that converts isolated DG failures into a pod-level cascade.
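The stall is easy to demonstrate in isolation. A minimal, self-contained sketch (illustrative only; the coroutine names and timings are invented, not the production code) shows a concurrent heartbeat freezing for the full duration of a blocking `time.sleep()`, while `await asyncio.sleep()` leaves its cadence intact:

```python
import asyncio
import time

async def heartbeat(ticks: list) -> None:
    # Stands in for a per-connection heartbeat loop: should tick every 50 ms.
    for _ in range(10):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)

async def blocking_retry() -> None:
    # The broken backoff: time.sleep() freezes the entire event loop.
    time.sleep(0.5)

async def nonblocking_retry() -> None:
    # The fixed backoff: asyncio.sleep() yields control to other tasks.
    await asyncio.sleep(0.5)

def max_gap(retry_coro) -> float:
    # Run the heartbeat alongside one retry and report the worst tick-to-tick gap.
    ticks: list = []

    async def main():
        await asyncio.gather(heartbeat(ticks), retry_coro())

    asyncio.run(main())
    return max(b - a for a, b in zip(ticks, ticks[1:]))

blocked = max_gap(blocking_retry)      # heartbeat stalls for the whole sleep
healthy = max_gap(nonblocking_retry)   # heartbeat keeps its ~50 ms cadence
print(f"blocking max gap: {blocked:.2f}s, non-blocking max gap: {healthy:.2f}s")
```

With 10 concurrent retries on one loop, these half-second stalls serialize, which is where the 30-50s figure above comes from.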
Bug 2: No client disconnect check during retry
`connect_to_deepgram_with_backoff()` has no `is_active` parameter — it retries blindly even after the client has disconnected (e.g. client timeout, user closes the app). Each wasted retry blocks the event loop for 1-8s, starving still-active connections.
Compare with `connect_to_trigger_pusher()` in `backend/utils/pusher.py:12-26`, which already does both correctly:
```python
# pusher.py — CORRECT pattern
async def connect_to_trigger_pusher(uid, ..., is_active=None):
    for attempt in range(retries):
        if is_active is not None and not is_active():  # abort if client gone
            return None
        try:
            return await _connect_to_trigger_pusher(uid, sample_rate)
        except Exception:
            ...
            await asyncio.sleep(backoff_delay / 1000)  # non-blocking
```
Fix
Make connect_to_deepgram_with_backoff match the pusher pattern:
- `time.sleep()` → `await asyncio.sleep()` — make the function `async def`
- Add `is_active` callback parameter — check before each retry attempt, abort if client disconnected
- Update callers (`process_audio_dg` at line 386) to pass `is_active=lambda: websocket_active`
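Putting the three changes together, the fixed retry loop would look roughly like the sketch below. This is a sketch under assumptions, not the actual patch: the real function presumably calls an internal Deepgram connection helper, so here the single attempt is injected as a `connect` callable to keep the example self-contained, and the retry count and doubling backoff policy are assumed.

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def connect_to_deepgram_with_backoff(
    connect: Callable[[], Awaitable],   # one connection attempt (injected here for the sketch)
    retries: int = 3,
    backoff_delay: float = 1000.0,      # ms, doubled after each failed attempt (assumed policy)
    is_active: Optional[Callable[[], bool]] = None,
):
    for attempt in range(retries):
        # Mirror pusher.py: stop retrying once the client has disconnected.
        if is_active is not None and not is_active():
            return None
        try:
            return await connect()
        except Exception:
            if attempt == retries - 1:
                raise
            # Non-blocking backoff: other connections on this event loop keep running.
            await asyncio.sleep(backoff_delay / 1000)
            backoff_delay *= 2
```

A caller such as `process_audio_dg` would then pass `is_active=lambda: websocket_active`, so retries stop the moment the client socket goes inactive instead of burning 1-8s per attempt.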
Impact
- P0 for reconnection storm prevention — this was the nonlinear tipping point in the Mar 11 incident
- Mar 10 had similar traffic spikes (up to ~4x baseline) but DG retries per pod stayed below the event-loop-starvation threshold (~4-5 concurrent). Mar 11 crossed it → pod-level cascade → 45-min sustained storm → thousands of 5xx errors across two waves
- Eliminating the blocking sleep removes the cascade amplifier entirely
- Aborting retries on client disconnect eliminates wasted DG connection attempts during storms
Related
- Part of reconnection storm fix set from Add terminationGracePeriodSeconds and preStop hook for backend-listen #5523 (DG connection limiter)
- Add rolling update strategy for backend-listen deployment #5522 (Flutter backoff + jitter)
- Add exponential backoff + jitter to WebSocket reconnection #5520 (PodDisruptionBudget)