# Plan: Resolve Production Worker Stability & SSE Streaming Issues

## Context

After 19+ hours of uptime, the production worker accumulates orphaned asyncio tasks and Pydantic serializer warnings. In addition, both worker processes crash simultaneously due to an unhandled Redis `ConnectionError` in the TaskIQ broker's `listen()` loop. Together, these issues prevent SSE streaming responses from being delivered to users in production, even though backend health checks report healthy. Four GitHub issues are needed.

---

## Issue 1: Redis Connection Drops Kill Worker Processes (CRITICAL)

**Problem:** `RedisStreamBroker.listen()` in taskiq-redis has **no error handling** for `ConnectionError`. When Redis closes the connection (server restart, network blip, idle timeout), the exception propagates up through TaskIQ's receiver, killing the worker process. Both worker-0 and worker-1 crash simultaneously. TaskIQ's process manager restarts them, but any in-flight tasks are lost and streams are never completed.

**Evidence:** Logs show `redis.exceptions.ConnectionError: Connection closed by server.` causing `ExceptionGroup: unhandled errors in a TaskGroup` in both workers.

**Note:** `ListQueueBroker.listen()` (line 135 of `redis_broker.py`) already handles this with `except ConnectionError: continue`, but `RedisStreamBroker` does not.

**Fix:** Configure the Redis connection pool with `retry_on_error` and `health_check_interval`, and add a custom broker subclass with connection error handling.

### Files to modify

1. **`backend/src/workers/broker.py`** — Subclass `RedisStreamBroker` to add `ConnectionError` handling in `listen()`, and pass `retry_on_error` + `health_check_interval` kwargs to the connection pool

### Implementation details

```python
import asyncio
import logging

from redis.exceptions import ConnectionError
from taskiq_redis import RedisStreamBroker

logger = logging.getLogger(__name__)


class ResilientRedisStreamBroker(RedisStreamBroker):
    """RedisStreamBroker with ConnectionError retry handling."""

    async def listen(self):
        """Listen with automatic reconnect on ConnectionError."""
        while True:
            try:
                async for message in super().listen():
                    yield message
            except ConnectionError as exc:
                logger.warning(
                    "Redis connection error in broker listen: %s. Reconnecting...", exc
                )
                await asyncio.sleep(1)  # Brief backoff before reconnect
                continue


# REDIS_URL and result_backend are defined elsewhere in broker.py
broker = ResilientRedisStreamBroker(
    url=REDIS_URL,
    queue_name="orchestra_tasks",
    health_check_interval=30,  # Ping Redis every 30s to detect dead connections
    retry_on_error=[ConnectionError],
).with_result_backend(result_backend)
```

---

## Issue 2: Orphaned Asyncio Tasks in Worker (LangGraph Store Batch Loop)

**Problem:** Each worker task creates a new `AsyncPostgresStore` via `get_store_db()`. The store's internal batch-loop task (`asyncio.create_task(_run())`) is not properly awaited on context manager exit, leaving orphaned tasks that accumulate over hours.

**Fix:** Add a singleton store to `WorkerState`, mirroring the existing checkpointer pattern in `state.py`.

### Files to modify

1. **`backend/src/workers/state.py`** — Add `_store` and `_store_exit_stack` fields, initialize them in `initialize()`, clean them up in `shutdown()`, and add a `get_store()` accessor plus a `get_worker_store()` convenience function
2. **`backend/src/workers/tasks.py`** — Replace `async with get_store_db() as store:` with `store = await get_worker_store()`
3. **`backend/src/workers/broker.py`** — Ensure lifecycle hooks call store init/shutdown (already structured for this)

### Implementation details

- Use `contextlib.AsyncExitStack` to manage the store context manager lifecycle in `WorkerState` (sketched below)
- On shutdown, exit the stack (which calls `__aexit__`), then cancel the store's internal `_task` and await it with a timeout as defense-in-depth
- Keep `get_store_db()` in `db.py` unchanged for non-worker callers (API routes)
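
A minimal sketch of the `WorkerState` changes, assuming `get_store_db()` is an async context manager and that the worker exposes `initialize()`/`shutdown()` hooks; the class layout and import path below are illustrative, not the project's actual `state.py`:

```python
import asyncio
import contextlib

from src.db import get_store_db  # assumed import path for the existing helper


class WorkerState:
    """Process-wide singletons for the worker (checkpointer, store, ...)."""

    _store = None
    _store_exit_stack: contextlib.AsyncExitStack | None = None

    @classmethod
    async def initialize(cls) -> None:
        # Enter the store context manager once per worker process and keep it
        # open for the process lifetime instead of once per task.
        cls._store_exit_stack = contextlib.AsyncExitStack()
        cls._store = await cls._store_exit_stack.enter_async_context(get_store_db())

    @classmethod
    def get_store(cls):
        if cls._store is None:
            raise RuntimeError("WorkerState.initialize() was not called")
        return cls._store

    @classmethod
    async def shutdown(cls) -> None:
        # Grab the store's internal batch-loop task before teardown, in case the
        # library leaves it running (private attribute, may differ by version).
        task = getattr(cls._store, "_task", None)
        if cls._store_exit_stack is not None:
            await cls._store_exit_stack.aclose()  # runs the store's __aexit__
        # Defense-in-depth: cancel the batch-loop task if it is still alive.
        if task is not None and not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError, asyncio.TimeoutError):
                await asyncio.wait_for(task, timeout=5)
        cls._store = None
        cls._store_exit_stack = None


async def get_worker_store():
    """Convenience accessor used by tasks.py instead of `async with get_store_db()`."""
    return WorkerState.get_store()
```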

---

## Issue 3: Pydantic Serializer Warnings

**Problem:** The worker produces repeated Pydantic serializer warnings from legacy `.dict()` calls and from `model_dump()` on LangChain `BaseMessage` objects (which extend `Serializable`, not Pydantic `BaseModel`).

**Fix:** Update serialization calls to use the correct method for each object type.

### Files to modify

1. **`backend/src/agents/__init__.py`** (lines 64, 104) — Replace `memory.dict()` with `memory.model_dump()` or access `.value` directly
2. **`backend/src/repos/thread_repo.py`** (line 67) — Replace `.dict()` with `.model_dump()`
3. **`backend/src/utils/messages.py`** (line 20) — Use LangChain's `message_to_dict()` for `BaseMessage` instead of `.model_dump()`
4. **`backend/src/utils/stream.py`** (line 62) — Same fix in `_to_dict()`: type-check for `BaseMessage` before serializing (see the sketch below)
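
A hedged sketch of the type-aware serialization described in items 3 and 4; the real helper lives in `backend/src/utils/stream.py`, so the name and exact signature here are illustrative:

```python
from typing import Any

from langchain_core.messages import BaseMessage, message_to_dict
from pydantic import BaseModel


def _to_dict(obj: Any) -> Any:
    """Serialize LangChain messages and Pydantic models without serializer warnings."""
    # Check BaseMessage first: LangChain ships its own message serializer, and
    # calling model_dump() on messages is what triggers the warnings.
    if isinstance(obj, BaseMessage):
        return message_to_dict(obj)
    if isinstance(obj, BaseModel):
        return obj.model_dump()
    return obj
```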

---

## Issue 4: Production SSE Streaming Not Returning Responses

**Problem:** Production uses distributed mode (nginx + Redis streams). Responses are not reaching users despite a healthy backend. The root cause is a combination of Issues 1-3 plus missing nginx buffering headers and stream race conditions.

**Fix:** Multiple targeted changes to eliminate buffering and race conditions.

### Files to modify

1. **`backend/src/routes/v0/llm.py`** (line 150) — Add the `"X-Accel-Buffering": "no"` header to the sync-mode `StreamingResponse` (already present on the distributed endpoint in `thread.py:285`)
2. **`backend/src/workers/tasks.py`** (~line 105) — Write an "initializing" event to the Redis stream immediately on task start, before heavy init work, to eliminate the race where the client polls before the stream exists
3. **`backend/src/utils/stream.py`** — In `stream_from_redis()`, add a wait-loop when the stream doesn't exist yet (up to 30s with keep-alive comments) instead of immediately raising `FileNotFoundError` (sketched below)
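
A rough sketch of the wait-loop from item 3, assuming a `redis.asyncio` client and a stream addressed by `stream_key`; the function name, constants, and how it plugs into `stream_from_redis()` are assumptions, not the existing code:

```python
import asyncio
from collections.abc import AsyncIterator

from redis.asyncio import Redis

STREAM_WAIT_TIMEOUT = 30.0  # seconds to wait for the worker to create the stream
STREAM_POLL_INTERVAL = 1.0


async def wait_for_stream(redis: Redis, stream_key: str) -> AsyncIterator[str]:
    """Yield SSE keep-alive comments until the Redis stream exists.

    Replaces the old behavior of raising FileNotFoundError immediately, which
    lost the response whenever the client connected before the worker's first write.
    """
    waited = 0.0
    while not await redis.exists(stream_key):
        if waited >= STREAM_WAIT_TIMEOUT:
            raise TimeoutError(f"stream {stream_key} was never created")
        # SSE comment lines (leading colon) keep the connection alive through
        # nginx without surfacing a visible event to the client.
        yield ": waiting for worker\n\n"
        await asyncio.sleep(STREAM_POLL_INTERVAL)
        waited += STREAM_POLL_INTERVAL
```

`stream_from_redis()` would iterate this generator first (forwarding the keep-alive chunks to the client) and then fall through to its existing read loop once the stream exists.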

---

## Deployment Order

1. **Issue 1 (Redis resilience)** — CRITICAL. Workers crashing is the most likely cause of missing responses. Lowest-risk fix (subclass + config).
2. **Issue 4 (SSE headers + race conditions)** — High user impact, low risk. The `X-Accel-Buffering` header is a one-liner.
3. **Issue 3 (Pydantic warnings)** — Low risk, reduces log noise, aids debugging.
4. **Issue 2 (Singleton store)** — Medium risk, fixes long-running stability. Deploy after validating the Issue 1 fix.

## Verification

- **Issue 1:** Simulate a Redis restart (`docker restart orchestra-dev-redis-1`) and verify workers reconnect without crashing (a unit-level sketch follows this list)
- **Issue 2:** Run a worker for an extended period and monitor for "Task was destroyed" warnings via `docker logs`
- **Issue 3:** Check worker logs for the absence of Pydantic serializer warnings
- **Issue 4:** Send a chat message in production and verify the SSE stream delivers the response. Test with `curl -N` behind nginx.
- All: `make test` passes, `make format` clean
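
A hypothetical unit-level check for the Issue 1 fix, complementing the manual Redis-restart test above. It assumes the `ResilientRedisStreamBroker` from the Issue 1 sketch, pytest-asyncio, and an illustrative import path:

```python
import asyncio
from unittest.mock import patch

import pytest
from redis.exceptions import ConnectionError as RedisConnectionError

from src.workers.broker import ResilientRedisStreamBroker  # assumed module path


@pytest.mark.asyncio
async def test_listen_survives_connection_error():
    attempts = {"count": 0}

    async def flaky_listen(self):
        # First iteration simulates Redis dropping the connection; the retry succeeds.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RedisConnectionError("Connection closed by server.")
        yield b"task-message"

    broker = ResilientRedisStreamBroker(url="redis://localhost:6379")
    with patch("taskiq_redis.RedisStreamBroker.listen", flaky_listen):
        listener = broker.listen()
        message = await asyncio.wait_for(listener.__anext__(), timeout=5)
        await listener.aclose()

    assert message == b"task-message"
    assert attempts["count"] == 2  # first attempt failed, retry delivered the message
```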

## GitHub Issues

Create four separate issues in the repo, each referencing the relevant files and fix approach above. Label each with `bug` and `backend`.