
Fix blocking time.sleep() in Deepgram retry + abort retry on client disconnect #5577

@beastoin

Description


Problem

connect_to_deepgram_with_backoff() in backend/utils/stt/streaming.py:403-425 has two critical bugs that caused the Mar 11 reconnection storm (and amplified the Mar 9 incident):

Bug 1: Blocking time.sleep() in async context

# streaming.py:423 — CURRENT (broken)
time.sleep(backoff_delay / 1000)

Backend-listen runs single-worker uvicorn (one async event loop per pod). time.sleep() blocks the entire event loop during DG retry backoff (1-8s per attempt). When multiple connections retry concurrently:

  • 10 concurrent retries = 30-50s event loop stall
  • During the stall: heartbeats stop, pusher pings time out, all websocket.receive() calls block
  • Result: ALL connections on the pod die, not just the failing ones

This is the key amplifier that converts isolated DG failures into pod-level cascade.
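A minimal repro of the stall mechanism (names are illustrative, not from the Omi codebase): a "heartbeat" coroutine should tick every 100 ms, but a sibling coroutine that calls `time.sleep(1)` freezes the shared event loop, so the first tick cannot land until the blocking sleep finishes.

```python
import asyncio
import time

async def heartbeat(ticks: list) -> None:
    # Simulates a websocket heartbeat that should fire every 100 ms.
    for _ in range(3):
        await asyncio.sleep(0.1)
        ticks.append(time.monotonic())

async def blocking_retry() -> None:
    # What the blocking backoff does today: freezes the whole event loop.
    time.sleep(1)

async def main() -> float:
    ticks: list = []
    start = time.monotonic()
    await asyncio.gather(heartbeat(ticks), blocking_retry())
    return ticks[0] - start  # delay before the FIRST heartbeat landed

first_heartbeat_delay = asyncio.run(main())
print(f"first heartbeat after {first_heartbeat_delay:.2f}s (expected ~0.1s)")
```

Run it and the first heartbeat arrives ~1s late instead of at 0.1s. Scale that one second up to 1-8s per retry attempt across many concurrent connections and you get the pod-wide stall described above.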

Bug 2: No client disconnect check during retry

connect_to_deepgram_with_backoff() has no is_active parameter — it retries blindly even after the client has disconnected (e.g. client timeout, user closes app). Each wasted retry blocks the event loop for 1-8s, starving still-active connections.

Compare with connect_to_trigger_pusher() in backend/utils/pusher.py:12-26 which already does both correctly:

# pusher.py — CORRECT pattern
async def connect_to_trigger_pusher(uid, ..., is_active=None):
    for attempt in range(retries):
        if is_active is not None and not is_active():  # abort if client gone
            return None
        try:
            return await _connect_to_trigger_pusher(uid, sample_rate)
        except Exception:
            ...
        await asyncio.sleep(backoff_delay / 1000)  # non-blocking

Fix

Make connect_to_deepgram_with_backoff match the pusher pattern:

  1. Replace time.sleep() with await asyncio.sleep() and make the function async def
  2. Add is_active callback parameter — check before each retry attempt, abort if client disconnected
  3. Update callers (process_audio_dg at line 386) to pass is_active=lambda: websocket_active
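A sketch of the resulting function, mirroring the pusher pattern above. The function and parameter names follow streaming.py, but the retry count, backoff values, and `_connect_to_deepgram` placeholder are illustrative, not the repo's actual values:

```python
import asyncio
from typing import Callable, Optional

async def _connect_to_deepgram(*args):
    # Placeholder for the real Deepgram connection call.
    raise ConnectionError("simulated DG failure")

async def connect_to_deepgram_with_backoff(
    *args,
    retries: int = 3,
    backoff_ms: int = 1000,
    is_active: Optional[Callable[[], bool]] = None,
):
    for attempt in range(retries):
        # Bug 2 fix: abort early if the client already disconnected.
        if is_active is not None and not is_active():
            return None
        try:
            return await _connect_to_deepgram(*args)
        except Exception:
            if attempt == retries - 1:
                raise
        # Bug 1 fix: non-blocking exponential backoff that yields the loop.
        await asyncio.sleep(backoff_ms * (2 ** attempt) / 1000)

# Caller sketch (process_audio_dg) passing a liveness probe:
# socket = await connect_to_deepgram_with_backoff(
#     ..., is_active=lambda: websocket_active)
```

With `is_active` returning False, the function returns None immediately instead of burning 1-8s per attempt on a connection nobody is listening to.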

Impact

  • P0 for reconnection storm prevention — this was the nonlinear tipping point in the Mar 11 incident
  • Mar 10 had similar traffic spikes (up to ~4x baseline) but DG retries per pod stayed below the event-loop-starvation threshold (~4-5 concurrent). Mar 11 crossed it → pod-level cascade → 45-min sustained storm → thousands of 5xx errors across two waves
  • Eliminating the blocking sleep removes the cascade amplifier entirely
  • Aborting retries on client disconnect eliminates wasted DG connection attempts during storms


Labels

backend (Backend Task, python) · bug (Something isn't working) · p0 (Priority: Existential, score >=30)
