
Fix blocking time.sleep() in Deepgram retry + abort retry on client disconnect #5577

@beastoin

Description


Problem

connect_to_deepgram_with_backoff() in backend/utils/stt/streaming.py:403-425 has two critical bugs that caused the Mar 11 reconnection storm (and amplified the Mar 9 incident):

Bug 1: Blocking time.sleep() in async context

# streaming.py:423 — CURRENT (broken)
time.sleep(backoff_delay / 1000)

Backend-listen runs single-worker uvicorn (one async event loop per pod). time.sleep() blocks the entire event loop during DG retry backoff (1-8s per attempt). When multiple connections retry concurrently:

  • 10 concurrent retries = 30-50s event loop stall
  • During the stall: heartbeats stop, pusher pings time out, all websocket.receive() calls block
  • Result: ALL connections on the pod die, not just the failing ones

This is the key amplifier that converts isolated DG failures into pod-level cascade.
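A minimal repro of the stall mechanism (names are illustrative, not from the Omi codebase): a "heartbeat" coroutine should tick every 100 ms, but a sibling coroutine that calls `time.sleep(1)` freezes the shared event loop, so the first tick cannot land until the blocking sleep finishes.

```python
import asyncio
import time

async def heartbeat(ticks: list) -> None:
    # Simulates a websocket heartbeat that should fire every 100 ms.
    for _ in range(3):
        await asyncio.sleep(0.1)
        ticks.append(time.monotonic())

async def blocking_retry() -> None:
    # What the blocking backoff does today: freezes the whole event loop.
    time.sleep(1)

async def main() -> float:
    ticks: list = []
    start = time.monotonic()
    await asyncio.gather(heartbeat(ticks), blocking_retry())
    return ticks[0] - start  # delay before the FIRST heartbeat landed

first_heartbeat_delay = asyncio.run(main())
print(f"first heartbeat after {first_heartbeat_delay:.2f}s (expected ~0.1s)")
```

Run it and the first heartbeat arrives ~1s late instead of at 0.1s. Scale that one second up to 1-8s per retry attempt across many concurrent connections and you get the pod-wide stall described above.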

Bug 2: No client disconnect check during retry

connect_to_deepgram_with_backoff() has no is_active parameter — it retries blindly even after the client has disconnected (e.g. client timeout, user closes app). Each wasted retry blocks the event loop for 1-8s, starving still-active connections.

Compare with connect_to_trigger_pusher() in backend/utils/pusher.py:12-26 which already does both correctly:

# pusher.py — CORRECT pattern
async def connect_to_trigger_pusher(uid, ..., is_active=None):
    for attempt in range(retries):
        if is_active is not None and not is_active():  # abort if client gone
            return None
        try:
            return await _connect_to_trigger_pusher(uid, sample_rate)
        except Exception:
            ...
        await asyncio.sleep(backoff_delay / 1000)  # non-blocking

Fix

Make connect_to_deepgram_with_backoff match the pusher pattern:

  1. Replace time.sleep() with await asyncio.sleep() and make the function async def
  2. Add is_active callback parameter — check before each retry attempt, abort if client disconnected
  3. Update callers (process_audio_dg at line 386) to pass is_active=lambda: websocket_active
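A sketch of the resulting function, mirroring the pusher pattern above. The function and parameter names follow streaming.py, but the retry count, backoff values, and `_connect_to_deepgram` placeholder are illustrative, not the repo's actual values:

```python
import asyncio
from typing import Callable, Optional

async def _connect_to_deepgram(*args):
    # Placeholder for the real Deepgram connection call.
    raise ConnectionError("simulated DG failure")

async def connect_to_deepgram_with_backoff(
    *args,
    retries: int = 3,
    backoff_ms: int = 1000,
    is_active: Optional[Callable[[], bool]] = None,
):
    for attempt in range(retries):
        # Bug 2 fix: abort early if the client already disconnected.
        if is_active is not None and not is_active():
            return None
        try:
            return await _connect_to_deepgram(*args)
        except Exception:
            if attempt == retries - 1:
                raise
        # Bug 1 fix: non-blocking exponential backoff that yields the loop.
        await asyncio.sleep(backoff_ms * (2 ** attempt) / 1000)

# Caller sketch (process_audio_dg) passing a liveness probe:
# socket = await connect_to_deepgram_with_backoff(
#     ..., is_active=lambda: websocket_active)
```

With `is_active` returning False, the function returns None immediately instead of burning 1-8s per attempt on a connection nobody is listening to.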

Impact

  • P0 for reconnection storm prevention — this was the nonlinear tipping point in the Mar 11 incident
  • Mar 10 had similar traffic spikes (up to ~4x baseline) but DG retries per pod stayed below the event-loop-starvation threshold (~4-5 concurrent). Mar 11 crossed it → pod-level cascade → 45-min sustained storm → thousands of 5xx errors across two waves
  • Eliminating the blocking sleep removes the cascade amplifier entirely
  • Aborting retries on client disconnect eliminates wasted DG connection attempts during storms


Labels

backend (Backend Task, python) · bug (Something isn't working) · p0 (Priority: Existential, score >=30)
