fix(delivery): isolate per-session failures so one bad session can't stall delivery for all

mashkovtsevlx · claude · mashkovtsevlx · commit 28eb03846324 · 2026-06-17T21:27:41.000+08:00
The active and sweep delivery poll loops iterate every session in a plain
for-loop wrapped in a single try/catch. deliverSessionMessages re-threw on
failure, so an unhandled error for one session aborted the entire tick and
silently halted message delivery for every other agent until a daemon restart.

Observed failure: a crashed container left an orphaned hot journal
(outbound.db-journal) beside its outbound.db. drainSession opens outbound.db
read-only (single-writer invariant), but rolling back the hot journal requires
a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a
readonly database" on every tick (~1.3s), poisoning delivery for all sessions
ordered after the broken one. A monitoring agent on another session stopped
receiving its scheduled tasks and stopped delivering alerts for hours.

Catch and log per session in deliverSessionMessages so a single unhealthy
session is contained. The broken session self-heals on its next container
start, when the writer opens the DB read-write and rolls the journal back.
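
For illustration only, a minimal sketch of the containment pattern described
above. The Session shape, the in-memory failures/delivered arrays, the tick
helper, and the stand-in drainSession are all hypothetical; the real code in
src/delivery.ts logs through log.error and also tracks inflightDeliveries.

```typescript
// Hypothetical stand-in type; not the real Session from delivery.ts.
type Session = { id: string };

// In-memory records used here in place of a real logger and outbound DB.
const failures: { sessionId: string; message: string }[] = [];
const delivered: string[] = [];

// Stand-in for drainSession: session "b" simulates the orphaned-journal
// failure by throwing the error message quoted in this commit.
async function drainSession(session: Session): Promise<void> {
  if (session.id === "b") {
    throw new Error("attempt to write a readonly database");
  }
  delivered.push(session.id);
}

// Contained variant: catch and record per session, never re-throw, so the
// caller's loop keeps going.
async function deliverSessionMessages(session: Session): Promise<void> {
  try {
    await drainSession(session);
  } catch (err) {
    failures.push({
      sessionId: session.id,
      message: err instanceof Error ? err.message : String(err),
    });
  }
}

// One poll tick: a plain for-loop over every session, as in the poll loops.
async function tick(sessions: Session[]): Promise<void> {
  for (const session of sessions) {
    await deliverSessionMessages(session);
  }
}
```

Calling tick with sessions a, b, c in this sketch leaves a and c in delivered
and records a single failure for b; if deliverSessionMessages re-threw instead,
the loop would stop at b and c would never be reached.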

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/src/delivery.ts b/src/delivery.ts
@@ -159,6 +159,27 @@ export async function deliverSessionMessages(session: Session): Promise<void> {
 
   try {
     await drainSession(session);
+  } catch (err) {
+    // Isolate per-session delivery failures. The active/sweep poll loops call
+    // this for every session in a plain for-loop; an unhandled throw here
+    // aborts the entire tick, so a single unhealthy session silently stalls
+    // delivery for every other agent until the daemon restarts.
+    //
+    // The known trigger: a crashed container can leave an orphaned hot journal
+    // (outbound.db-journal) next to its outbound.db. drainSession opens that DB
+    // read-only (single-writer invariant), but even a read SELECT must roll the
+    // journal back — a write — which fails with "attempt to write a readonly
+    // database". That throw then poisons delivery for all sessions ordered
+    // after the broken one in the loop.
+    //
+    // Containment: log and move on. The broken session self-heals on its next
+    // container start (the writer opens the DB read-write and rolls the journal
+    // back), instead of taking the whole install down with it.
+    log.error('Session delivery failed, skipping until next tick', {
+      sessionId: session.id,
+      agentGroupId: session.agent_group_id,
+      err,
+    });
   } finally {
     inflightDeliveries.delete(session.id);
   }