Fix remote-device tool timing out on scheduled runs (Redis-backed broker) by dartpain · Pull Request #2511 · arc53/DocsGPT

dartpain · 2026-05-29T11:00:09Z

Problem

The remote-device tool (run shell commands on a paired machine) worked when invoked interactively but timed out on every scheduled run. DeviceBroker was an in-process, in-memory singleton; scheduled runs execute in the Celery worker, a separate process from the gunicorn web tier that holds the device's SSE session. A worker-side dispatch_invocation therefore never reached the device, and _collect_result always hit its deadline → "device did not respond (timed out)". Interactive worked only because the web tier is a single process (gunicorn -w 1) shared with the SSE session.

Fix

Make the broker Redis-backed so every hop crosses the process boundary (reusing the existing get_redis_instance() / CACHE_REDIS_URL — no new infra; both web and worker already point at the same Redis):

State	Now
queued commands	Redis list `dev:cmd:{device_id}` (BLPOP)
output chunks	Redis stream `dev:out:{invocation_id}` (XADD/XREAD)
invocation metadata/result	Redis hash `dev:inv:{invocation_id}`
SSE upgrade ticket	Redis key `dev:ticket:{device_id}` (TTL)

Per-connection SSE state stays local to the web process; the public broker API is preserved. This also makes the web tier safe to scale past one worker.

Concurrency hardening

The fix was reviewed adversarially and tested against real Redis (including a true cross-process subprocess reproduction). That caught — and this PR fixes — a race that had reintroduced the same false-timeout symptom:

Writer ordering + drain flush: XADD the output/control chunk before setting completed=1, and drain_output does a final non-blocking sweep after observing completion — so a reader can't see completion and stop before the control chunk is on the stream.
_collect_result: builds the result from drained chunks, checks the deadline after capturing each chunk (no dropped near-deadline control chunk), and falls back to the authoritative snapshot before cleanup when no control chunk was seen.
Audit: written from locally-captured fields so it survives the worker racing to delete the invocation; a denied command now records a terminal "denied" outcome instead of staying "dispatched".
cmd-queue TTL → 900s (≥ max drain deadline); dispatch-failure & reaped-invocation cleanup; UTF-8 byte counts.

Tests

New tests/devices/: conftest.py (in-memory FakeRedis double), test_broker_cross_process.py (regression for the worker→web gap), test_broker_race.py (8 race regressions that fail against the pre-fix code), test_submit_output_audit.py (route-level audit incl. the snapshot-deleted-by-race case + denial). Rewrote drain/cleanup/ticket tests for the Redis contract. ruff clean; device + tool-executor suites green (201 tests).

Deferred (low/nit, not regressions from this change)

Best-effort residual window in the cleanup-vs-BLPOP orphan guard (narrowed; full close needs an atomic Lua pop).
No "[output truncated]" marker when a >10k-chunk command overflows the output stream's MAXLEN.
Per-SSE BLPOP holds one of the 32 WSGI threads → ~32 concurrent device sessions per web worker (operational).
xread assumes RESP2 (safe at the redis-py protocol=2 default; commented).

Verification

Automated proof is the real-Redis cross-process e2e. A full end-to-end with a physically paired device + a live scheduled run is still worth doing manually before release — please attach a screenshot/recording of a scheduled remote-device run succeeding.

…he device The remote-device tool worked interactively but timed out on every scheduled run. DeviceBroker was an in-process, in-memory singleton, but scheduled runs execute in the Celery worker — a different process from the gunicorn web tier that holds the device's SSE session — so a worker-side dispatch never reached the device and the tool always hit its deadline. Make the broker Redis-backed so every hop crosses the process boundary: - queued commands -> Redis list dev:cmd:{device_id} - output chunks -> Redis stream dev:out:{invocation_id} - invocation metadata -> Redis hash dev:inv:{invocation_id} - SSE upgrade tickets -> Redis key dev:ticket:{device_id} Per-connection SSE session state stays in the web process. Reuses the existing get_redis_instance()/CACHE_REDIS_URL; no new infrastructure. Also makes the web tier safe to scale past one worker. Concurrency hardening (from adversarial review + real-Redis e2e): - XADD the output/control chunk before flipping completed=1, and have drain_output do a final non-blocking flush after observing completion, so a reader can't see completion and stop before the control chunk lands (this had reintroduced the false "device did not respond (timed out)" under a race). - _collect_result builds the result from drained chunks, checks the deadline only after capturing a chunk, and falls back to the authoritative snapshot (before cleanup) when no control chunk was observed. - Audit outcome is written from locally-known fields so it survives the worker racing to delete the invocation; a denied command now records a terminal "denied" outcome instead of staying "dispatched". - cmd-queue TTL raised to 900s (>= max drain deadline); dispatch-failure and reaped-invocation cleanup; UTF-8 byte counts. Tests: new tests/devices/{conftest (FakeRedis double), test_broker_cross_process, test_broker_race, test_submit_output_audit}; drain/cleanup/ticket tests rewritten for the Redis contract. The race tests fail against the pre-fix code. ruff clean; device + tool-executor suites green.

vercel · 2026-05-29T11:00:12Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
nextra-docsgpt	Building	Preview, Comment	May 29, 2026 11:28am
oss-docsgpt	Ready	Preview, Comment	May 29, 2026 11:28am

Copilot

Pull request overview

The DeviceBroker is rewritten from an in-process singleton to a Redis-backed broker so that remote-device invocations dispatched from a Celery worker (e.g., a scheduled run) can reach the device's SSE session held by the web process. The fix also hardens against a completion-vs-control-chunk race that could re-introduce the same false-timeout symptom, and rewrites the device test suite around an in-memory FakeRedis double.

Changes:

Move queued commands, output chunks, invocation metadata, and SSE tickets into Redis keys (dev:cmd:*, dev:out:*, dev:inv:*, dev:ticket:*), keeping only per-connection SessionState local.
Order writers so output XADDs land before completed=1, give drain_output a final non-blocking flush, and make _collect_result capture chunks before the deadline check with an authoritative-snapshot fallback; record audit (including a terminal "denied" outcome) from locally-captured control fields so it survives worker-side cleanup.
Add tests/devices/conftest.py FakeRedis plus cross-process, race, and route-level audit suites; rewrite drain/cleanup/ticket tests against the Redis contract.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
application/devices/broker.py	Redis-backed broker (commands/output/tickets/metadata) replacing the in-memory broker.
application/api/devices/session.py	SSE handler uses `next_command`; ack records denial audit; output handler writes audit from local control chunk.
application/agents/tools/remote_device.py	`_collect_result` drains via broker, capture-before-deadline, snapshot fallback when no control chunk.
application/core/settings.py	New `REMOTE_DEVICE_CMD_QUEUE_TTL_SECONDS` / `INVOCATION_TTL_SECONDS` / `OUTPUT_STREAM_MAXLEN` tunables.
tests/devices/conftest.py	New `FakeRedis` double + `broker_env` fixture for broker tests.
tests/devices/test_broker_cross_process.py	Regression that worker dispatch reaches a web-side session through shared Redis.
tests/devices/test_broker_race.py	Race regressions: final-flush, snapshot fallback, near-deadline capture, dispatch failure cleanup, reaped invocation, UTF-8 byte counts, queue TTL.
tests/devices/test_submit_output_audit.py	Route-level audit coverage incl. snapshot-deleted-by-race and denied-ack cases.
tests/devices/test_broker_drain.py	Rewritten drain tests against the Redis contract (control chunk, completion flag, unavailable Redis).
tests/devices/test_broker_cleanup.py	Cleanup tests asserting Redis hash/stream/list state.
tests/devices/test_session_ticket.py	Ticket tests use `FakeRedis`; expiry simulated by explicit delete; reuse-while-unexpired added.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2026-05-29T11:27:35Z

Codecov Report

❌ Patch coverage is 76.05178% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.55%. Comparing base (9b8fe2d) to head (ca0b22a).
⚠️ Report is 51 commits behind head on main.

Files with missing lines	Patch %	Lines
application/devices/broker.py	75.19%	65 Missing ⚠️
application/api/devices/session.py	65.21%	8 Missing ⚠️
application/agents/tools/remote_device.py	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2511      +/-   ##
==========================================
- Coverage   91.34%   89.55%   -1.79%     
==========================================
  Files         248      288      +40     
  Lines       20709    25694    +4985     
==========================================
+ Hits        18916    23010    +4094     
- Misses       1793     2684     +891

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Addresses the code-quality lint on the best-effort hash delete in dispatch_invocation's failure path: replace the bare `except: pass` with a logger.debug carrying the invocation_id. No behavior change — cleanup stays best-effort and still returns a failed Invocation.

github-actions Bot added application Application tests Tests labels May 29, 2026

vercel Bot deployed to Preview – oss-docsgpt May 29, 2026 11:00 View deployment

vercel Bot deployed to Preview – nextra-docsgpt May 29, 2026 11:01 View deployment

github-code-quality Bot found potential problems May 29, 2026

View reviewed changes

Comment thread application/devices/broker.py Fixed

dartpain requested a review from Copilot May 29, 2026 11:03

Copilot started reviewing on behalf of dartpain May 29, 2026 11:03 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

vercel Bot deployed to Preview – oss-docsgpt May 29, 2026 11:28 View deployment

vercel Bot deployed to Preview – nextra-docsgpt May 29, 2026 11:28 View deployment

dartpain merged commit 94bc4d1 into main May 29, 2026
17 of 19 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix remote-device tool timing out on scheduled runs (Redis-backed broker)#2511

Fix remote-device tool timing out on scheduled runs (Redis-backed broker)#2511
dartpain merged 2 commits into
mainfrom
fix/remote-device-broker-redis-cross-process

dartpain commented May 29, 2026 •

edited

Loading

Uh oh!

vercel Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

codecov Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

dartpain commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Concurrency hardening

Tests

Deferred (low/nit, not regressions from this change)

Verification

Uh oh!

vercel Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dartpain commented May 29, 2026 •

edited

Loading

vercel Bot commented May 29, 2026 •

edited

Loading

codecov Bot commented May 29, 2026 •

edited

Loading