Skip to content

Fix remote-device tool timing out on scheduled runs (Redis-backed broker)#2511

Merged
dartpain merged 2 commits into
mainfrom
fix/remote-device-broker-redis-cross-process
May 29, 2026
Merged

Fix remote-device tool timing out on scheduled runs (Redis-backed broker)#2511
dartpain merged 2 commits into
mainfrom
fix/remote-device-broker-redis-cross-process

Conversation

@dartpain

@dartpain dartpain commented May 29, 2026

Copy link
Copy Markdown
Contributor

Problem

The remote-device tool (run shell commands on a paired machine) worked when invoked interactively but timed out on every scheduled run. DeviceBroker was an in-process, in-memory singleton; scheduled runs execute in the Celery worker, a separate process from the gunicorn web tier that holds the device's SSE session. A worker-side dispatch_invocation therefore never reached the device, and _collect_result always hit its deadline → "device did not respond (timed out)". Interactive worked only because the web tier is a single process (gunicorn -w 1) shared with the SSE session.

Fix

Make the broker Redis-backed so every hop crosses the process boundary (reusing the existing get_redis_instance() / CACHE_REDIS_URL — no new infra; both web and worker already point at the same Redis):

State Now
queued commands Redis list dev:cmd:{device_id} (BLPOP)
output chunks Redis stream dev:out:{invocation_id} (XADD/XREAD)
invocation metadata/result Redis hash dev:inv:{invocation_id}
SSE upgrade ticket Redis key dev:ticket:{device_id} (TTL)

Per-connection SSE state stays local to the web process; the public broker API is preserved. This also makes the web tier safe to scale past one worker.

Concurrency hardening

The fix was reviewed adversarially and tested against real Redis (including a true cross-process subprocess reproduction). That caught — and this PR fixes — a race that had reintroduced the same false-timeout symptom:

  • Writer ordering + drain flush: XADD the output/control chunk before setting completed=1, and drain_output does a final non-blocking sweep after observing completion — so a reader can't see completion and stop before the control chunk is on the stream.
  • _collect_result: builds the result from drained chunks, checks the deadline after capturing each chunk (no dropped near-deadline control chunk), and falls back to the authoritative snapshot before cleanup when no control chunk was seen.
  • Audit: written from locally-captured fields so it survives the worker racing to delete the invocation; a denied command now records a terminal "denied" outcome instead of staying "dispatched".
  • cmd-queue TTL → 900s (≥ max drain deadline); dispatch-failure & reaped-invocation cleanup; UTF-8 byte counts.

Tests

New tests/devices/: conftest.py (in-memory FakeRedis double), test_broker_cross_process.py (regression for the worker→web gap), test_broker_race.py (8 race regressions that fail against the pre-fix code), test_submit_output_audit.py (route-level audit incl. the snapshot-deleted-by-race case + denial). Rewrote drain/cleanup/ticket tests for the Redis contract. ruff clean; device + tool-executor suites green (201 tests).

Deferred (low/nit, not regressions from this change)

  • Best-effort residual window in the cleanup-vs-BLPOP orphan guard (narrowed; full close needs an atomic Lua pop).
  • No "[output truncated]" marker when a >10k-chunk command overflows the output stream's MAXLEN.
  • Per-SSE BLPOP holds one of the 32 WSGI threads → ~32 concurrent device sessions per web worker (operational).
  • xread assumes RESP2 (safe at the redis-py protocol=2 default; commented).

Verification

Automated proof is the real-Redis cross-process e2e. A full end-to-end with a physically paired device + a live scheduled run is still worth doing manually before release — please attach a screenshot/recording of a scheduled remote-device run succeeding.

…he device

The remote-device tool worked interactively but timed out on every scheduled
run. DeviceBroker was an in-process, in-memory singleton, but scheduled runs
execute in the Celery worker — a different process from the gunicorn web tier
that holds the device's SSE session — so a worker-side dispatch never reached
the device and the tool always hit its deadline.

Make the broker Redis-backed so every hop crosses the process boundary:
- queued commands       -> Redis list   dev:cmd:{device_id}
- output chunks         -> Redis stream dev:out:{invocation_id}
- invocation metadata   -> Redis hash   dev:inv:{invocation_id}
- SSE upgrade tickets    -> Redis key    dev:ticket:{device_id}
Per-connection SSE session state stays in the web process. Reuses the existing
get_redis_instance()/CACHE_REDIS_URL; no new infrastructure. Also makes the web
tier safe to scale past one worker.

Concurrency hardening (from adversarial review + real-Redis e2e):
- XADD the output/control chunk before flipping completed=1, and have
  drain_output do a final non-blocking flush after observing completion, so a
  reader can't see completion and stop before the control chunk lands (this had
  reintroduced the false "device did not respond (timed out)" under a race).
- _collect_result builds the result from drained chunks, checks the deadline
  only after capturing a chunk, and falls back to the authoritative snapshot
  (before cleanup) when no control chunk was observed.
- Audit outcome is written from locally-known fields so it survives the worker
  racing to delete the invocation; a denied command now records a terminal
  "denied" outcome instead of staying "dispatched".
- cmd-queue TTL raised to 900s (>= max drain deadline); dispatch-failure and
  reaped-invocation cleanup; UTF-8 byte counts.

Tests: new tests/devices/{conftest (FakeRedis double), test_broker_cross_process,
test_broker_race, test_submit_output_audit}; drain/cleanup/ticket tests rewritten
for the Redis contract. The race tests fail against the pre-fix code. ruff clean;
device + tool-executor suites green.
@vercel

vercel Bot commented May 29, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
nextra-docsgpt Building Building Preview, Comment May 29, 2026 11:28am
oss-docsgpt Ready Ready Preview, Comment May 29, 2026 11:28am

Request Review

Comment thread application/devices/broker.py Fixed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

The DeviceBroker is rewritten from an in-process singleton to a Redis-backed broker so that remote-device invocations dispatched from a Celery worker (e.g., a scheduled run) can reach the device's SSE session held by the web process. The fix also hardens against a completion-vs-control-chunk race that could re-introduce the same false-timeout symptom, and rewrites the device test suite around an in-memory FakeRedis double.

Changes:

  • Move queued commands, output chunks, invocation metadata, and SSE tickets into Redis keys (dev:cmd:*, dev:out:*, dev:inv:*, dev:ticket:*), keeping only per-connection SessionState local.
  • Order writers so output XADDs land before completed=1, give drain_output a final non-blocking flush, and make _collect_result capture chunks before the deadline check with an authoritative-snapshot fallback; record audit (including a terminal "denied" outcome) from locally-captured control fields so it survives worker-side cleanup.
  • Add tests/devices/conftest.py FakeRedis plus cross-process, race, and route-level audit suites; rewrite drain/cleanup/ticket tests against the Redis contract.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
File Description
application/devices/broker.py Redis-backed broker (commands/output/tickets/metadata) replacing the in-memory broker.
application/api/devices/session.py SSE handler uses next_command; ack records denial audit; output handler writes audit from local control chunk.
application/agents/tools/remote_device.py _collect_result drains via broker, capture-before-deadline, snapshot fallback when no control chunk.
application/core/settings.py New REMOTE_DEVICE_CMD_QUEUE_TTL_SECONDS / INVOCATION_TTL_SECONDS / OUTPUT_STREAM_MAXLEN tunables.
tests/devices/conftest.py New FakeRedis double + broker_env fixture for broker tests.
tests/devices/test_broker_cross_process.py Regression that worker dispatch reaches a web-side session through shared Redis.
tests/devices/test_broker_race.py Race regressions: final-flush, snapshot fallback, near-deadline capture, dispatch failure cleanup, reaped invocation, UTF-8 byte counts, queue TTL.
tests/devices/test_submit_output_audit.py Route-level audit coverage incl. snapshot-deleted-by-race and denied-ack cases.
tests/devices/test_broker_drain.py Rewritten drain tests against the Redis contract (control chunk, completion flag, unavailable Redis).
tests/devices/test_broker_cleanup.py Cleanup tests asserting Redis hash/stream/list state.
tests/devices/test_session_ticket.py Ticket tests use FakeRedis; expiry simulated by explicit delete; reuse-while-unexpired added.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.05178% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.55%. Comparing base (9b8fe2d) to head (ca0b22a).
⚠️ Report is 51 commits behind head on main.

Files with missing lines Patch % Lines
application/devices/broker.py 75.19% 65 Missing ⚠️
application/api/devices/session.py 65.21% 8 Missing ⚠️
application/agents/tools/remote_device.py 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2511      +/-   ##
==========================================
- Coverage   91.34%   89.55%   -1.79%     
==========================================
  Files         248      288      +40     
  Lines       20709    25694    +4985     
==========================================
+ Hits        18916    23010    +4094     
- Misses       1793     2684     +891     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Addresses the code-quality lint on the best-effort hash delete in
dispatch_invocation's failure path: replace the bare `except: pass` with a
logger.debug carrying the invocation_id. No behavior change — cleanup stays
best-effort and still returns a failed Invocation.
@dartpain dartpain merged commit 94bc4d1 into main May 29, 2026
17 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

application Application tests Tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants