
Commit 28eb038

mashkovtsevlx and claude committed
fix(delivery): isolate per-session failures so one bad session can't stall delivery for all
The active and sweep delivery poll loops iterate every session in a plain for-loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire tick and silently halted message delivery for every other agent until a daemon restart.

Observed failure: a crashed container left an orphaned hot journal (outbound.db-journal) beside its outbound.db. drainSession opens outbound.db read-only (single-writer invariant), but rolling back the hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. A monitoring agent on another session stopped receiving its scheduled tasks and stopped delivering alerts for hours.

Catch and log per session in deliverSessionMessages so a single unhealthy session is contained. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
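The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the project's code: `Session`, `drainSession`, `tick`, and the two `deliver*` helpers are hypothetical stand-ins, and the session named "broken" simply throws the error quoted in the commit message. It assumes the poll loop awaits each session in turn inside one try/catch, as the message describes.

```typescript
// Hypothetical stand-in for the real Session type in src/delivery.ts.
type Session = { id: string };

// Simulated drain: the "broken" session throws the error quoted in the commit message.
async function drainSession(session: Session, delivered: string[]): Promise<void> {
  if (session.id === "broken") {
    throw new Error("attempt to write a readonly database");
  }
  delivered.push(session.id);
}

// Before the fix (assumed shape): the error propagates to the poll loop.
async function deliverRethrowing(session: Session, delivered: string[]): Promise<void> {
  await drainSession(session, delivered);
}

// After the fix: the error is caught and recorded per session.
async function deliverContained(
  session: Session,
  delivered: string[],
  failed: string[],
): Promise<void> {
  try {
    await drainSession(session, delivered);
  } catch (err) {
    // The real code calls log.error here; this sketch records the failure instead.
    const message = err instanceof Error ? err.message : String(err);
    failed.push(`${session.id}: ${message}`);
  }
}

// One poll tick: a plain for-loop wrapped in a single try/catch.
async function tick(
  sessions: Session[],
  deliver: (s: Session) => Promise<void>,
): Promise<void> {
  try {
    for (const session of sessions) {
      await deliver(session);
    }
  } catch {
    // The tick is aborted; sessions after the failing one are skipped.
  }
}

async function demo(): Promise<{ before: string[]; after: string[]; failed: string[] }> {
  const sessions: Session[] = [{ id: "a" }, { id: "broken" }, { id: "b" }];

  const before: string[] = [];
  await tick(sessions, (s) => deliverRethrowing(s, before));

  const after: string[] = [];
  const failed: string[] = [];
  await tick(sessions, (s) => deliverContained(s, after, failed));

  return { before, after, failed };
}
```

With the rethrowing variant, session "b" is never reached because it is ordered after the broken one; with the contained variant, both healthy sessions are drained and the failure is recorded once.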
1 parent ee7f891 commit 28eb038

1 file changed

Lines changed: 21 additions & 0 deletions

src/delivery.ts

@@ -159,6 +159,27 @@ export async function deliverSessionMessages(session: Session): Promise<void> {
 
   try {
     await drainSession(session);
+  } catch (err) {
+    // Isolate per-session delivery failures. The active/sweep poll loops call
+    // this for every session in a plain for-loop; an unhandled throw here
+    // aborts the entire tick, so a single unhealthy session silently stalls
+    // delivery for every other agent until the daemon restarts.
+    //
+    // The known trigger: a crashed container can leave an orphaned hot journal
+    // (outbound.db-journal) next to its outbound.db. drainSession opens that DB
+    // read-only (single-writer invariant), but even a read SELECT must roll the
+    // journal back (a write), which fails with "attempt to write a readonly
+    // database". That throw then poisons delivery for all sessions ordered
+    // after the broken one in the loop.
+    //
+    // Containment: log and move on. The broken session self-heals on its next
+    // container start (the writer opens the DB read-write and rolls the journal
+    // back), instead of taking the whole install down with it.
+    log.error('Session delivery failed, skipping until next tick', {
+      sessionId: session.id,
+      agentGroupId: session.agent_group_id,
+      err,
+    });
   } finally {
     inflightDeliveries.delete(session.id);
   }
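
The hunk adds a catch between the existing try and finally, so the in-flight entry is still released on failure and the returned promise resolves instead of rejecting. A minimal sketch of that control flow follows. It assumes `inflightDeliveries` is a `Set<string>` keyed by session id and that the function guards on it before draining; only the `delete` in `finally` is visible in the hunk. The always-failing `drainSession` and the `errors` array are hypothetical stand-ins for the real drain and for `log.error`.

```typescript
// Hypothetical stand-in for the real Session type.
type Session = { id: string };

// Assumed shape: a set of session ids with a delivery currently in flight.
const inflightDeliveries = new Set<string>();

// Stand-in for log.error so the sketch has no logging dependency.
const errors: string[] = [];

// Simulated drain that always fails, to exercise the catch path.
async function drainSession(session: Session): Promise<void> {
  throw new Error(`drain failed for ${session.id}`);
}

async function deliverSessionMessages(session: Session): Promise<void> {
  // Assumed guard (not shown in the hunk): skip if a delivery is already running.
  if (inflightDeliveries.has(session.id)) return;
  inflightDeliveries.add(session.id);
  try {
    await drainSession(session);
  } catch (err) {
    // Contain the failure: record it and let the caller continue.
    errors.push(err instanceof Error ? err.message : String(err));
  } finally {
    // Runs on success and on failure, so the session can be retried next tick.
    inflightDeliveries.delete(session.id);
  }
}
```

Because the entry is removed in `finally`, the same session is attempted again on the next tick rather than being locked out by a stale in-flight marker.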
