fix(delivery): isolate per-session failures so one bad session can't stall delivery for all#2797

Closed
mashkovtsevlx wants to merge 1 commit into
nanocoai:mainfrom
mashkovtsevlx:fix/isolate-per-session-delivery-errors

Conversation

@mashkovtsevlx

Contributor

Fixes #2796.

Problem

pollActive/pollSweep iterate every session in a plain for loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire delivery tick and silently halted message delivery for every other agent until a daemon restart.
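To make the failure mode concrete, here is a minimal sketch of a loop-wide try/catch. It is not the project's code: `Session`, `deliverOrThrow`, and `pollTick` are hypothetical stand-ins, and everything is synchronous for brevity.

```typescript
// Hypothetical stand-ins; the real pollActive/pollSweep and
// deliverSessionMessages are not shown in this PR.
type Session = { id: string; healthy: boolean };

// Simulates a per-session delivery that re-throws on failure.
function deliverOrThrow(session: Session, delivered: string[]): void {
  if (!session.healthy) {
    throw new Error("attempt to write a readonly database");
  }
  delivered.push(session.id);
}

// One try/catch around the whole loop: the first throw ends the tick,
// so every session ordered after the broken one is skipped.
function pollTick(sessions: Session[]): string[] {
  const delivered: string[] = [];
  try {
    for (const session of sessions) {
      deliverOrThrow(session, delivered);
    }
  } catch (err) {
    console.error("delivery tick failed:", (err as Error).message);
  }
  return delivered;
}

const deliveredThisTick = pollTick([
  { id: "a", healthy: true },
  { id: "broken", healthy: false },
  { id: "monitor", healthy: true },
]);
// Only "a" is delivered; "monitor" is starved although it is healthy.
```

Because the session order is the same on every tick, the same sessions are starved each time until the broken one recovers.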

The trigger we hit in production: a crashed container left an orphaned hot journal (outbound.db-journal). drainSession opens outbound.db read-only (single-writer invariant), but rolling back a hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. An unrelated monitoring agent stopped receiving scheduled tasks and delivering messages for hours, despite its own container being perfectly healthy.

Fix

Catch and log at the per-session boundary in deliverSessionMessages, so a single unhealthy session is contained instead of taking the whole install down. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

The change is minimal: healthy sessions behave exactly as before, and the existing inflightDeliveries cleanup stays in finally.
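The shape of the fix can be sketched as follows. This is an illustration under assumptions, not the actual diff: `inflightDeliveries` is assumed to be a set of session ids, `drain` is a hypothetical stand-in for the work that can throw, and the code is synchronous for brevity.

```typescript
type Session = { id: string; healthy: boolean };

// Assumed shape: ids of sessions with a delivery currently in progress.
const inflightDeliveries = new Set<string>();
const deliveredIds: string[] = [];
const loggedErrors: string[] = [];

// Hypothetical stand-in for the per-session work that can throw
// (for example, the SELECT in getDueOutboundMessages).
function drain(session: Session): void {
  if (!session.healthy) {
    throw new Error("attempt to write a readonly database");
  }
  deliveredIds.push(session.id);
}

function deliverSessionMessages(session: Session): void {
  if (inflightDeliveries.has(session.id)) return;
  inflightDeliveries.add(session.id);
  try {
    drain(session);
  } catch (err) {
    // Log and swallow at the per-session boundary instead of re-throwing,
    // so the caller's loop continues with the next session.
    loggedErrors.push(`${session.id}: ${(err as Error).message}`);
  } finally {
    // Cleanup runs whether the delivery succeeded or failed.
    inflightDeliveries.delete(session.id);
  }
}

function pollTick(sessions: Session[]): void {
  for (const session of sessions) {
    deliverSessionMessages(session);
  }
}

pollTick([
  { id: "a", healthy: true },
  { id: "broken", healthy: false },
  { id: "monitor", healthy: true },
]);
// "a" and "monitor" are delivered; "broken" is logged and skipped.
```

The trade-off is that a persistently broken session logs an error on every tick until it recovers, which is noisy but keeps the failure visible.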

Testing

Deployed to a live multi-agent Raspberry Pi install that was exhibiting the stall. After deploy: delivery errors dropped to zero, the previously-starved monitoring agent immediately resumed receiving its cron tasks and delivering messages, and the unhealthy session recovered on its next container spawn.

Follow-ups (not in this PR)

  • Proactively detect/clear orphaned hot journals for dead sessions.
  • Reconcile sessions.container_status = 'running' when the container no longer exists.

…stall delivery for all

The active and sweep delivery poll loops iterate every session in a plain
for-loop wrapped in a single try/catch. deliverSessionMessages re-threw on
failure, so an unhandled error for one session aborted the entire tick and
silently halted message delivery for every other agent until a daemon restart.

Observed failure: a crashed container left an orphaned hot journal
(outbound.db-journal) beside its outbound.db. drainSession opens outbound.db
read-only (single-writer invariant), but rolling back the hot journal requires
a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a
readonly database" on every tick (~1.3s), poisoning delivery for all sessions
ordered after the broken one. A monitoring agent on another session stopped
receiving its scheduled tasks and stopped delivering alerts for hours.

Catch and log per session in deliverSessionMessages so a single unhealthy
session is contained. The broken session self-heals on its next container
start, when the writer opens the DB read-write and rolls the journal back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mashkovtsevlx

Contributor Author

Closing as a duplicate of #2750, which is more complete (it actively recovers stranded journals via recoverOutboundJournal and classifies the transient hot-journal race, where this PR only isolated the symptom) and is already in review with tests. I should have searched existing issues/PRs first — #2640 and #2516 already cover this. Corroborating production data point and two follow-up notes left on #2750.


Development

Successfully merging this pull request may close these issues.

One unhealthy session stalls message delivery for all agents (unhandled throw aborts the delivery tick)
