fix(delivery): isolate per-session failures so one bad session can't stall delivery for all #2797
Closed
mashkovtsevlx wants to merge 1 commit into
…stall delivery for all

The active and sweep delivery poll loops iterate every session in a plain for-loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire tick and silently halted message delivery for every other agent until a daemon restart.

Observed failure: a crashed container left an orphaned hot journal (outbound.db-journal) beside its outbound.db. drainSession opens outbound.db read-only (single-writer invariant), but rolling back the hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. A monitoring agent on another session stopped receiving its scheduled tasks and stopped delivering alerts for hours.

Catch and log per session in deliverSessionMessages so a single unhealthy session is contained. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
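To make the failure mode concrete, here is a minimal sketch of a poll tick with a single try/catch around the whole loop. It is not the project's code: `PolledSession`, `deliverOrThrow`, and `unguardedTick` are hypothetical names, the `healthy` flag merely simulates the broken session, and the sketch is synchronous for brevity even though the real delivery path may be async.

```typescript
// Hypothetical stand-ins; none of these names or shapes come from the real codebase.
type PolledSession = { id: string; healthy: boolean };

function deliverOrThrow(session: PolledSession, delivered: string[]): void {
  if (!session.healthy) {
    // Simulates the SQLite error quoted in the commit message.
    throw new Error("attempt to write a readonly database");
  }
  delivered.push(session.id);
}

// One try/catch around the whole loop: the first throw ends the tick,
// so every session ordered after the failing one is skipped.
function unguardedTick(sessions: PolledSession[]): string[] {
  const delivered: string[] = [];
  try {
    for (const session of sessions) {
      deliverOrThrow(session, delivered);
    }
  } catch (err) {
    console.error("delivery tick failed:", (err as Error).message);
  }
  return delivered;
}

const polled: PolledSession[] = [
  { id: "a", healthy: true },
  { id: "broken", healthy: false },
  { id: "monitor", healthy: true },
];
const deliveredIds = unguardedTick(polled);
console.log(deliveredIds); // only "a" is delivered; "monitor" is never reached
```

Because the same session fails on every tick, the sessions behind it in iteration order are starved for as long as the failure persists.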
Closing as a duplicate of #2750, which is more complete (it actively recovers stranded journals via |
This was referenced Jun 18, 2026
Fixes #2796.
Problem
pollActive/pollSweep iterate every session in a plain for loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire delivery tick and silently halted message delivery for every other agent until a daemon restart.

The trigger we hit in production: a crashed container left an orphaned hot journal (outbound.db-journal). drainSession opens outbound.db read-only (single-writer invariant), but rolling back a hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. An unrelated monitoring agent stopped receiving scheduled tasks and delivering messages for hours, despite its own container being perfectly healthy.

Fix
Catch and log at the per-session boundary in deliverSessionMessages, so a single unhealthy session is contained instead of taking the whole install down. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

Minimal and surgical: no behavior change for healthy sessions; the existing inflightDeliveries cleanup stays in finally.

Testing
Deployed to a live multi-agent Raspberry Pi install that was exhibiting the stall. After deploy: delivery errors dropped to zero, the previously-starved monitoring agent immediately resumed receiving its cron tasks and delivering messages, and the unhealthy session recovered on its next container spawn.
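For readers who want the shape of the change without opening the diff, the per-session containment described under Fix might look roughly like the sketch below. Everything here is an assumption made for illustration: `SessionRef`, `Deliver`, `deliverContained`, and `tick` are hypothetical names, `inflight` is a plain `Set` standing in for inflightDeliveries (whose real type is not stated in this PR), and the code is synchronous for brevity.

```typescript
// Hypothetical names and shapes throughout; only the try/catch/finally
// structure mirrors what the PR text describes.
type SessionRef = { id: string };
type Deliver = (session: SessionRef) => void;

const inflight = new Set<string>(); // stands in for inflightDeliveries

// Catch and log at the per-session boundary; never rethrow to the poll loop.
function deliverContained(session: SessionRef, deliver: Deliver): boolean {
  if (inflight.has(session.id)) return false; // a delivery is already running
  inflight.add(session.id);
  try {
    deliver(session);
    return true;
  } catch (err) {
    console.error(`delivery failed for session ${session.id}:`, err);
    return false;
  } finally {
    inflight.delete(session.id); // cleanup runs on success and on failure
  }
}

// The poll loop no longer needs to survive a throw from any one session.
function tick(sessions: SessionRef[], deliver: Deliver): string[] {
  const ok: string[] = [];
  for (const session of sessions) {
    if (deliverContained(session, deliver)) ok.push(session.id);
  }
  return ok;
}

const result = tick([{ id: "a" }, { id: "broken" }, { id: "monitor" }], (s) => {
  if (s.id === "broken") throw new Error("attempt to write a readonly database");
});
console.log(result); // "a" and "monitor" are delivered; "broken" is logged and skipped
```

The trade-off of swallowing the error at this boundary is that a permanently broken session only shows up in logs, which is why the error is logged with the session id rather than dropped silently.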
Follow-ups (not in this PR)
sessions.container_status = 'running' when the container no longer exists.