
Commit 96124d5

Fixed prod issue
1 parent c49d392 commit 96124d5

5 files changed

Lines changed: 316 additions & 5 deletions


Lines changed: 88 additions & 0 deletions
# Fix: Production SSE Stream Never Returns (2026.3.8 → 2026.3.12)

## Context

Between tags `2026.3.8` and `2026.3.12`, production streaming broke: POST `/llm/stream` successfully returns 202 with a `run_id`, but the SSE stream from `GET /threads/{thread_id}/stream?run_id=X` never delivers data to the frontend. This only manifests in production (Redis 7, distributed workers via TaskIQ).

PR #862 (merged after `2026.3.12` into `2026.3.12.1`) fixed worker crash/reconnect issues (`ResilientRedisStreamBroker`, `_RunScopedStore`, `WorkerState` singleton). **Those fixes address worker-side failures but do NOT fix the root cause: a race condition between the frontend connecting and the worker writing its first event.**

## Root Cause

**Race condition in `stream_thread` route (thread.py:267-277):**

1. `POST /llm/stream` writes `stream_status: "running"` to thread metadata, enqueues TaskIQ task, returns 202
2. Frontend immediately connects to `GET /threads/{thread_id}/stream?run_id=X`
3. Route calls `distributed_stream_exists()` — checks if Redis stream key exists
4. Worker hasn't picked up the task yet → no Redis stream key → returns **false**
5. Route checks `active_run_id == run_id AND stream_status == "running"` → returns **503** (retryable)
6. **BUT** if the thread metadata write from step 1 hasn't propagated (new thread, DB lag), or if `active_run_id` doesn't match, it falls through to **404** (NOT retryable)
7. Frontend's `classifyError` (streamError.ts:49) only retries 500-503. **404 is terminal** → stream dies silently

The irony: `stream_from_redis()` already has a 30-second wait loop for the stream to appear (stream.py:437-446), but the route's pre-check at line 267 prevents ever reaching it.

## Fixes (ordered by priority)

### Fix 1 (P0): Remove premature stream existence pre-check

**File:** `backend/src/routes/v0/thread.py` (lines 267-277)

When `active_run_id == run_id` and `stream_status == "running"`, skip the `distributed_stream_exists()` check entirely and proceed directly to `stream_from_redis()`. The internal wait loop handles the stream not existing yet.

Only check `distributed_stream_exists()` when `stream_status != "running"` (for replaying completed streams still within Redis TTL).
```python
# Pseudocode for the new logic:
if active_run_id == run_id and stream_status == "running":
    # Worker is active — let stream_from_redis() wait for the stream
    return StreamingResponse(stream_from_redis(...))

# For completed/errored/aborted streams, check if Redis still has the data
stream_exists = await distributed_stream_exists(thread_id, run_id)
if not stream_exists:
    raise HTTPException(404, ...)

return StreamingResponse(stream_from_redis(...))
```
### Fix 2 (P0 safety net): Make 404 retryable for distributed streams

**File:** `frontend/src/lib/utils/streamSource.ts` (lines 157-186)

In `startWithRetry()`, treat 404 as retryable for the first 3 attempts. After a fresh `POST /llm/stream` 202, a 404 on the stream endpoint almost certainly means the worker hasn't created the key yet.

```typescript
// In startWithRetry, modify the canRetry logic:
const is404Race = status === 404 && attempt < 3;
const canRetry =
  isError &&
  (is404Race || this.isRecoverableClose(error) || isRetryableError(error, status)) &&
  attempt < MAX_ATTEMPTS - 1;
```
### Fix 3 (P1): Add max lifetime timeout to stream_from_redis

**File:** `backend/src/utils/stream.py` (line 448+)

The `while True` loop has no exit condition other than receiving `done`/`error` from Redis. If the worker crashes mid-stream, the consumer hangs forever.

Add (see the sketch after this list):

- `MAX_STREAM_LIFETIME_SECONDS` env var (default 600 = 10 minutes)
- Track `start_time = time.monotonic()` before the loop
- Check elapsed time each iteration; if exceeded, yield error + `[DONE]` and return
- Also track consecutive empty XREAD results; after 5 consecutive (= 5 minutes of silence), yield error and close
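A minimal sketch of the loop shape under these assumptions: XREAD blocks for 60 seconds per call, and `_error_event()` / `_format_sse()` are placeholder helpers for building SSE frames (not the actual names in `stream.py`). It omits the existing startup wait and the normal done/error handling:

```python
import os
import time

MAX_STREAM_LIFETIME_SECONDS = int(os.getenv("MAX_STREAM_LIFETIME_SECONDS", "600"))
MAX_CONSECUTIVE_EMPTY_READS = 5  # with a 60s XREAD block, roughly 5 minutes of silence


async def stream_from_redis(redis, stream_key: str):
    start_time = time.monotonic()
    last_id = "0-0"
    empty_reads = 0

    while True:
        # Hard cap on total stream lifetime
        if time.monotonic() - start_time > MAX_STREAM_LIFETIME_SECONDS:
            yield _error_event("stream exceeded max lifetime")
            yield "data: [DONE]\n\n"
            return

        entries = await redis.xread({stream_key: last_id}, block=60_000, count=100)
        if not entries:
            empty_reads += 1
            if empty_reads >= MAX_CONSECUTIVE_EMPTY_READS:
                yield _error_event("worker went silent")
                yield "data: [DONE]\n\n"
                return
            continue

        empty_reads = 0
        for _stream, messages in entries:
            for message_id, fields in messages:
                last_id = message_id
                yield _format_sse(fields)  # placeholder: emit the stored event as an SSE frame
```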
### Fix 4 (P2): Stale stream_status cleanup

**File:** `backend/src/routes/v0/thread.py`

If `stream_status == "running"` but `active_stream_started_at` is older than `MAX_STREAM_LIFETIME_SECONDS`, the worker is dead. Update `stream_status` to `"error"` and return 410 Gone.

This is a read-side check — no background task needed.
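A rough sketch of that read-side check, assuming thread metadata stores `active_stream_started_at` as a UNIX timestamp, that the repo exposes an async metadata update helper, and reusing `MAX_STREAM_LIFETIME_SECONDS` from Fix 3 (all assumptions, not the route's actual shape):

```python
import time

from fastapi import HTTPException


async def check_stale_running_status(thread, thread_repo) -> None:
    """If the 'running' status is older than the max stream lifetime, the worker is dead."""
    metadata = thread.metadata or {}
    started_at = metadata.get("active_stream_started_at")
    if metadata.get("stream_status") != "running" or started_at is None:
        return
    if time.time() - started_at > MAX_STREAM_LIFETIME_SECONDS:
        await thread_repo.update_metadata(thread.id, {"stream_status": "error"})  # assumed helper
        raise HTTPException(status_code=410, detail="Stream worker appears dead")
```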
## Files to Modify

| File | Change |
|------|--------|
| `backend/src/routes/v0/thread.py` | Fix 1: Restructure stream existence check. Fix 4: Stale status detection |
| `backend/src/utils/stream.py` | Fix 3: Add max lifetime timeout to `stream_from_redis` |
| `frontend/src/lib/utils/streamSource.ts` | Fix 2: Make 404 retryable for early attempts |

## Verification

1. **Manual test:** Submit a chat in distributed mode, verify stream connects and delivers data
2. **Race condition test:** Add artificial delay before worker writes initializing event; verify frontend retries and eventually connects
3. **Timeout test:** Kill worker mid-stream; verify `stream_from_redis` terminates after timeout instead of hanging forever
4. **Existing tests:** Run `make test` (backend) and `npm run test` (frontend) to verify no regressions
Lines changed: 120 additions & 0 deletions
# Plan: Resolve Production Worker Stability & SSE Streaming Issues

## Context

After 19+ hours of uptime, the production worker accumulates orphaned asyncio tasks and Pydantic serializer warnings. Additionally, both worker processes crash simultaneously due to an unhandled Redis `ConnectionError` in the TaskIQ broker's `listen()` loop. These issues compound so that SSE streaming responses are not delivered to users in production, despite healthy backend health checks. Four GitHub issues are needed.

---

## Issue 1: Redis Connection Drops Kill Worker Processes (CRITICAL)

**Problem:** `RedisStreamBroker.listen()` in taskiq-redis has **no error handling** for `ConnectionError`. When Redis closes the connection (server restart, network blip, idle timeout), the exception propagates up through TaskIQ's receiver, killing the worker process. Both worker-0 and worker-1 crash simultaneously. TaskIQ's process manager restarts them, but any in-flight tasks are lost and streams are never completed.

**Evidence:** Logs show `redis.exceptions.ConnectionError: Connection closed by server.` causing `ExceptionGroup: unhandled errors in a TaskGroup` in both workers.

**Note:** `ListQueueBroker.listen()` (line 135 of `redis_broker.py`) already handles this with `except ConnectionError: continue`, but `RedisStreamBroker` does not.

**Fix:** Configure the Redis connection pool with `retry_on_error` and `health_check_interval`, and add a custom broker subclass with connection error handling.

### Files to modify

1. **`backend/src/workers/broker.py`** — Subclass `RedisStreamBroker` to add `ConnectionError` handling in `listen()`, and pass `retry_on_error` + `health_check_interval` kwargs to the connection pool

### Implementation details
```python
import asyncio
import logging

from redis.exceptions import ConnectionError
from taskiq_redis import RedisStreamBroker

logger = logging.getLogger(__name__)


class ResilientRedisStreamBroker(RedisStreamBroker):
    """RedisStreamBroker with ConnectionError retry handling."""

    async def listen(self):
        """Listen with automatic reconnect on ConnectionError."""
        while True:
            try:
                async for message in super().listen():
                    yield message
            except ConnectionError as exc:
                logger.warning("Redis connection error in broker listen: %s. Reconnecting...", exc)
                await asyncio.sleep(1)  # Brief backoff before reconnect
                continue


broker = ResilientRedisStreamBroker(
    url=REDIS_URL,
    queue_name="orchestra_tasks",
    health_check_interval=30,  # Ping Redis every 30s to detect dead connections
    retry_on_error=[ConnectionError],
).with_result_backend(result_backend)
```
---

## Issue 2: Orphaned Asyncio Tasks in Worker (LangGraph Store Batch Loop)

**Problem:** Each worker task creates a new `AsyncPostgresStore` via `get_store_db()`. The store's internal batch loop task (`asyncio.create_task(_run())`) is not properly awaited on context manager exit, leaving orphaned tasks that accumulate over hours.

**Fix:** Add a singleton store to `WorkerState`, mirroring the existing checkpointer pattern in `state.py`.

### Files to modify

1. **`backend/src/workers/state.py`** — Add `_store` field, `_store_exit_stack`, initialize in `initialize()`, cleanup in `shutdown()`, add `get_store()` and `get_worker_store()` convenience function
2. **`backend/src/workers/tasks.py`** — Replace `async with get_store_db() as store:` with `store = await get_worker_store()`
3. **`backend/src/workers/broker.py`** — Ensure lifecycle hooks call store init/shutdown (already structured for this)

### Implementation details

- Use `contextlib.AsyncExitStack` to manage the store context manager lifecycle in `WorkerState` (see the sketch after this list)
- On shutdown, exit the stack (which calls `__aexit__`), then cancel the store's internal `_task` and await with timeout as defense-in-depth
- Keep `get_store_db()` in `db.py` unchanged for non-worker callers (API routes)
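A minimal sketch of that wiring, under these assumptions: `get_store_db()` is importable from the backend's `db` module (import path assumed), and `WorkerState` already follows a classmethod-singleton shape for the checkpointer; field names other than `_store` / `_store_exit_stack` are illustrative:

```python
import asyncio
import contextlib

from src.db import get_store_db  # assumed import path


class WorkerState:
    _store = None
    _store_exit_stack: contextlib.AsyncExitStack | None = None

    @classmethod
    async def initialize(cls) -> None:
        # Enter the store context manager once per worker process and keep it open.
        cls._store_exit_stack = contextlib.AsyncExitStack()
        cls._store = await cls._store_exit_stack.enter_async_context(get_store_db())

    @classmethod
    async def shutdown(cls) -> None:
        # Exit the stack (calls the store's __aexit__) ...
        if cls._store_exit_stack is not None:
            await cls._store_exit_stack.aclose()
        # ... then cancel the store's internal batch task as defense-in-depth.
        task = getattr(cls._store, "_task", None)
        if task is not None and not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError, asyncio.TimeoutError):
                await asyncio.wait_for(task, timeout=5)
        cls._store = None
        cls._store_exit_stack = None

    @classmethod
    def get_store(cls):
        return cls._store


async def get_worker_store():
    """Convenience accessor used by worker tasks in place of `async with get_store_db()`."""
    return WorkerState.get_store()
```

In this shape, a task simply calls `store = await get_worker_store()` and never owns the store's lifecycle.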
---

## Issue 3: Pydantic Serializer Warnings

**Problem:** Worker produces repeated Pydantic serializer warnings from legacy `.dict()` calls and `model_dump()` on LangChain `BaseMessage` objects (which extend `Serializable`, not Pydantic `BaseModel`).

**Fix:** Update serialization calls to use correct methods per object type.

### Files to modify

1. **`backend/src/agents/__init__.py`** (lines 64, 104) — Replace `memory.dict()` with `memory.model_dump()` or access `.value` directly
2. **`backend/src/repos/thread_repo.py`** (line 67) — Replace `.dict()` with `.model_dump()`
3. **`backend/src/utils/messages.py`** (line 20) — Use LangChain's `message_to_dict()` for `BaseMessage` instead of `.model_dump()`
4. **`backend/src/utils/stream.py`** (line 62) — Same fix in `_to_dict()`: type-check for `BaseMessage` before calling serialization (see the sketch after this list)
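A minimal sketch of the type-aware dispatch; the exact signature of `_to_dict()` in `stream.py` may differ, this only shows the ordering of the checks:

```python
from typing import Any

from langchain_core.messages import BaseMessage, message_to_dict
from pydantic import BaseModel


def _to_dict(obj: Any) -> Any:
    """Serialize without triggering Pydantic serializer warnings."""
    if isinstance(obj, BaseMessage):
        # LangChain messages get LangChain's own serializer, not model_dump().
        return message_to_dict(obj)
    if isinstance(obj, BaseModel):
        # Pydantic v2: model_dump() replaces the deprecated .dict().
        return obj.model_dump()
    return obj
```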
---

## Issue 4: Production SSE Streaming Not Returning Responses

**Problem:** Production uses distributed mode (nginx + Redis streams). Responses are not reaching users despite a healthy backend. Root causes are a combination of Issues 1-3 plus missing nginx buffering headers and stream race conditions.

**Fix:** Multiple targeted changes to eliminate buffering and race conditions.

### Files to modify

1. **`backend/src/routes/v0/llm.py`** (line 150) — Add `"X-Accel-Buffering": "no"` header to sync-mode `StreamingResponse` (already present on distributed endpoint in `thread.py:285`)
2. **`backend/src/workers/tasks.py`** (~line 105) — Write an "initializing" event to Redis stream immediately on task start, before heavy init work, to eliminate the race where client polls before stream exists
3. **`backend/src/utils/stream.py`** — In `stream_from_redis()`, add a wait-loop when stream doesn't exist yet (up to 30s with keep-alive comments) instead of immediately raising `FileNotFoundError` (see the sketch after this list)
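A minimal sketch of that wait-loop, assuming an async Redis client named `redis` and a caller that already knows the stream key (both assumptions; the real code in `stream_from_redis()` may differ):

```python
import asyncio

STREAM_WAIT_TIMEOUT_SECONDS = 30


async def _wait_for_stream(redis, stream_key: str):
    """Yield SSE keep-alive comments until the Redis stream key exists, or give up after 30s."""
    for _ in range(STREAM_WAIT_TIMEOUT_SECONDS):
        if await redis.exists(stream_key):
            return
        yield ": waiting\n\n"  # SSE comment line; keeps proxies and clients from timing out
        await asyncio.sleep(1)
    raise FileNotFoundError(f"Stream {stream_key} never appeared")
```

`stream_from_redis()` would iterate this generator before entering its normal XREAD loop.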
---

## Deployment Order

1. **Issue 1 (Redis resilience)** — CRITICAL. Workers crashing is the most likely cause of missing responses. Lowest risk fix (subclass + config).
2. **Issue 4 (SSE headers + race conditions)** — High user impact, low risk. `X-Accel-Buffering` header is a one-liner.
3. **Issue 3 (Pydantic warnings)** — Low risk, reduces log noise, aids debugging.
4. **Issue 2 (Singleton store)** — Medium risk, fixes long-running stability. Deploy after validating Issue 1 fix.

## Verification

- **Issue 1:** Simulate Redis restart (`docker restart orchestra-dev-redis-1`), verify workers reconnect without crashing
- **Issue 2:** Run worker for extended period, monitor for "Task was destroyed" warnings via `docker logs`
- **Issue 3:** Check worker logs for absence of Pydantic serializer warnings
- **Issue 4:** Send chat message in production, verify SSE stream delivers response. Test with `curl -N` behind nginx.
- All: `make test` passes, `make format` clean

## GitHub Issues

Create 4 separate issues on the repo, each referencing the relevant files and fix approach above. Label with `bug` and `backend`.
Lines changed: 43 additions & 0 deletions
# Plan: Fix #858 — Redis connection drops kill worker processes

## Context

Issue [#858](https://github.com/ruska-ai/orchestra/issues/858): When Redis closes connections (restart, network blip, idle timeout), `RedisStreamBroker.listen()` raises an unhandled `ConnectionError` that propagates up and kills the worker processes. Both worker-0 and worker-1 crash simultaneously.

The fix is **already implemented** as unstaged changes on `development`. The changes also include related improvements to worker resilience and streaming reliability that were developed alongside the broker fix.

## Changes (already implemented, need commit + PR)

### 1. `backend/src/workers/broker.py` — Core fix for #858
- Subclasses `RedisStreamBroker` as `ResilientRedisStreamBroker`
- Wraps `listen()` with `ConnectionError` catch → logs warning → 1s backoff → reconnects
- Swaps `broker` instantiation to use the new subclass

### 2. `backend/src/utils/stream.py` — Stream polling for worker init lag
- Instead of immediately raising `FileNotFoundError` when the Redis stream doesn't exist, polls up to 30s (1s interval) sending SSE `: waiting\n\n` keepalive comments
- Prevents client-side errors when the worker is still initializing

### 3. `backend/src/workers/state.py` — Shared `AsyncPostgresStore` singleton
- Adds `_store` and `_store_exit_stack` to `WorkerState` for a worker-level store singleton
- Adds `get_store()` class method and `get_worker_store()` convenience function
- Properly shuts down store (exit stack + internal task cancellation) in `shutdown()`

### 4. `backend/src/workers/tasks.py` — Use worker-level store + early stream init
- Writes an `initializing` event to the Redis stream immediately so clients see activity before heavy init (illustrated in the sketch below)
- Uses `get_worker_store()` instead of per-task `get_store_db()` context manager (eliminates per-task DB connection overhead)
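As an illustration only (the stream-key format and field names here are assumptions, not confirmed by this plan), the early write could look like:

```python
import json


async def mark_stream_initializing(redis, thread_id: str, run_id: str) -> None:
    """Create the Redis stream right away so GET /threads/{thread_id}/stream finds it before heavy init."""
    stream_key = f"stream:{thread_id}:{run_id}"  # assumed key format
    await redis.xadd(stream_key, {"event": json.dumps({"type": "initializing"})})
```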
### 5. `backend/src/routes/v0/llm.py` — Disable proxy buffering
- Adds `X-Accel-Buffering: no` header to SSE responses to prevent nginx/reverse proxy buffering (sketch below)
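For reference, a header set along these lines; the wrapper function, generator argument, and extra headers are placeholders rather than the route's exact code:

```python
from fastapi.responses import StreamingResponse


def sse_response(event_generator):
    """Wrap an SSE generator with headers that stop reverse-proxy buffering."""
    return StreamingResponse(
        event_generator,
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # tell nginx not to buffer the SSE body
        },
    )
```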
## Steps to complete

1. **Run tests**: `cd backend && make test` — verify all existing tests pass with these changes
2. **Run format/lint**: `cd backend && make format && make lint`
3. **Commit** all 5 changed files with a descriptive message referencing #858
4. **Create PR** targeting `development`

## Verification

1. Run `make test` — all backend tests pass
2. Manual: `docker restart orchestra-dev-redis-1` while workers are running — workers should log a warning and reconnect within ~1s, not crash
3. Manual: Start a streaming request immediately after kicking off a worker task — the client should receive `: waiting` keepalive comments until the stream appears
Lines changed: 64 additions & 0 deletions
# Fix: SSE Stream URL Construction Breaks in Production (Distributed Mode)

## Context

Since tag `2026.3.12`, submitting a chat message on **chat.ruska.ai** creates a thread but never redirects to `/thread/:threadId`. The response never streams. The browser console shows 5 retries of `Failed to construct 'URL': Invalid URL` before the frontend gives up.

**Working tag**: `2026.3.8`
**Broken tags**: `2026.3.12`, `2026.3.12.1`, `2026.3.13`

### Root Cause

Commit `ee578277` (March 10, "feat: add stream recovery and docker dev stack") changed `DistributedStreamSource.createAndStartReader()` from passing a plain string to `fetch()` to using `new URL()` to support query parameters (`run_id`, `after`):

```typescript
// Before ee578277 (working) — plain string, fetch() resolves relative paths natively
this.reader = new FetchStreamReader(`${VITE_API_URL}/threads/${this.threadId}/stream`, ...);

// After ee578277 (broken) — new URL() requires absolute URL or base parameter
const url = new URL(`${VITE_API_URL}/threads/${this.threadId}/stream`);
url.searchParams.set("run_id", this.runId);
```

`VITE_API_URL` defaults to `"/api"` (relative path). `new URL("/api/...")` throws `TypeError: Invalid URL`. The error is classified as "NETWORK" (retryable), causing 5 retry attempts that all fail identically.
**Why it works locally**: Vite dev server with `DISTRIBUTED_WORKERS=false` returns 200 (sync mode) → uses `SyncStreamSource` which reads from the POST response body directly. The broken `DistributedStreamSource` path is never hit.

## Fix

### 1. Add `window.location.origin` as base URL parameter

**File**: `frontend/src/lib/utils/streamSource.ts:201`

```typescript
// Before
const url = new URL(`${VITE_API_URL}/threads/${this.threadId}/stream`);

// After
const url = new URL(`${VITE_API_URL}/threads/${this.threadId}/stream`, window.location.origin);
```

This is the only instance of `new URL()` with `VITE_API_URL` in the codebase. The `window.location.origin` base makes `/api/threads/.../stream` resolve to `https://chat.ruska.ai/api/threads/.../stream`.

**Already applied** in the working tree.
### 2. Graceful decrypt handling (unrelated but included)

**File**: `backend/src/repos/user_settings_repo.py:42-51`

The `_decrypt_keys()` method now catches `ValueError` from `APP_SECRET_KEY` mismatch and returns empty dict with a warning log, instead of crashing with a 500.

**Already applied** in the working tree.
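A hedged illustration of the idea only; the method signature, field names, and cipher object are assumptions, not the repo's actual code:

```python
import logging

logger = logging.getLogger(__name__)


def _decrypt_keys(self, encrypted_keys: dict[str, str]) -> dict[str, str]:
    """Decrypt stored API keys; fall back to empty keys when APP_SECRET_KEY no longer matches."""
    try:
        return {name: self._cipher.decrypt(value) for name, value in encrypted_keys.items()}  # cipher assumed
    except ValueError:
        logger.warning("Could not decrypt user settings (APP_SECRET_KEY mismatch?); returning empty keys")
        return {}
```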
## Verification

1. **Typecheck**: `cd frontend && npx tsc --noEmit` — passes
2. **Local test**: Submit query on `localhost:5173` → redirects to `/thread/:threadId` with streamed response
3. **Prod test (post-deploy)**: Submit query on `chat.ruska.ai` → should redirect to `/thread/:threadId` instead of staying on `/chat`
4. **Console check**: No more `Failed to construct 'URL': Invalid URL` errors in browser console

## Files Modified

| File | Change |
|------|--------|
| `frontend/src/lib/utils/streamSource.ts` | Add `window.location.origin` base to `new URL()` |
| `backend/src/repos/user_settings_repo.py` | Graceful `APP_SECRET_KEY` mismatch handling |

docker-compose.dev.yml

Lines changed: 1 addition & 5 deletions
```diff
@@ -12,13 +12,9 @@ x-backend-common: &backend-common
   environment:
     APP_ENV: development
     APP_LOG_LEVEL: DEBUG
-    APP_SECRET_KEY: 6WhxV3XZiV8WEuIm956lU1UYcuLz8JQgGmTtGZnmNAk=
-    JWT_SECRET_KEY: docker-dev-jwt-secret
-    POSTGRES_CONNECTION_STRING: postgresql://admin:test1234@postgres:5432/lg_template_dev?sslmode=disable
-    POSTGRES_CONNECTION_STRING_SESSION: postgresql://admin:test1234@postgres:5432/lg_template_dev?sslmode=disable
     REDIS_URL: redis://redis:6379/0
-    DISTRIBUTED_WORKERS: "true"
     SEARX_SEARCH_HOST_URL: http://search_engine:8080
+    DISTRIBUTED_WORKERS: "true"
   volumes:
     - ./backend:/app
     - backend_venv:/app/.venv
```
