fix(delivery): isolate per-session failures so one bad session can't stall delivery for all by mashkovtsevlx · Pull Request #2797 · nanocoai/nanoclaw

mashkovtsevlx · 2026-06-17T13:31:24Z

Problem

pollActive/pollSweep iterate every session in a plain for loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire delivery tick and silently halted message delivery for every other agent until a daemon restart.
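The failure mode above can be reproduced in a few lines. This is a minimal synchronous sketch, not the nanoclaw source: `Session`, `deliverOrThrow`, and `pollTickSingleCatch` are hypothetical names, and the real poll loops are presumably asynchronous.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not taken from the nanoclaw source.
type Session = { id: string };

// Stand-in for a delivery function that re-throws on failure.
function deliverOrThrow(session: Session, delivered: string[]): void {
  if (session.id === "broken") {
    throw new Error("attempt to write a readonly database");
  }
  delivered.push(session.id);
}

// One try/catch around the whole loop: the first throw ends the tick,
// so every session ordered after the failing one is skipped.
function pollTickSingleCatch(sessions: Session[]): string[] {
  const delivered: string[] = [];
  try {
    for (const session of sessions) {
      deliverOrThrow(session, delivered);
    }
  } catch (err) {
    console.error("delivery tick failed:", (err as Error).message);
  }
  return delivered;
}

const stalled = pollTickSingleCatch([
  { id: "a" },
  { id: "broken" },
  { id: "monitor" },
]);
// Only "a" is delivered; "monitor" is never reached.
```

Because the same session fails on every tick, sessions ordered after it are starved until the failing session is repaired or the process restarts.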

The trigger we hit in production: a crashed container left an orphaned hot journal (outbound.db-journal). drainSession opens outbound.db read-only (single-writer invariant), but rolling back a hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. An unrelated monitoring agent stopped receiving scheduled tasks and delivering messages for hours, despite its own container being perfectly healthy.

Fix

Catch and log at the per-session boundary in deliverSessionMessages, so a single unhealthy session is contained instead of taking the whole install down. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

Minimal and surgical — no behavior change for healthy sessions; the existing inflightDeliveries cleanup stays in finally.
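A rough sketch of the containment described above, under stated assumptions: the identifiers deliverSessionMessages, drainSession, and inflightDeliveries come from this PR's description, but their bodies, their signatures, the `Set` type, and the `DeliverySession` and `pollTick` names are invented here for illustration. The real code is likely asynchronous.

```typescript
// Hypothetical sketch: function bodies and types are assumptions, not the nanoclaw source.
type DeliverySession = { id: string };

// Assumed shape: a set of session ids with a delivery currently in progress.
const inflightDeliveries = new Set<string>();

// Stand-in for the drain step; throws for the session with the orphaned journal.
function drainSession(session: DeliverySession): string[] {
  if (session.id === "broken") {
    throw new Error("attempt to write a readonly database");
  }
  return [`message for ${session.id}`];
}

// Per-session boundary: catch and log instead of re-throwing.
// The inflight cleanup stays in finally, so it runs on success and on failure.
function deliverSessionMessages(session: DeliverySession, out: string[]): void {
  inflightDeliveries.add(session.id);
  try {
    out.push(...drainSession(session));
  } catch (err) {
    console.error(`delivery failed for session ${session.id}:`, (err as Error).message);
  } finally {
    inflightDeliveries.delete(session.id);
  }
}

// The poll loop no longer sees the error, so it visits every session.
function pollTick(sessions: DeliverySession[]): string[] {
  const out: string[] = [];
  for (const session of sessions) {
    deliverSessionMessages(session, out);
  }
  return out;
}

const deliveredAfterFix = pollTick([
  { id: "a" },
  { id: "broken" },
  { id: "monitor" },
]);
```

With the catch at this boundary, the broken session still fails and logs on each tick, but healthy sessions before and after it are delivered normally.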

Testing

Deployed to a live multi-agent Raspberry Pi install that was exhibiting the stall. After deploy: delivery errors dropped to zero, the previously-starved monitoring agent immediately resumed receiving its cron tasks and delivering messages, and the unhealthy session recovered on its next container spawn.

Follow-ups (not in this PR)

Proactively detect/clear orphaned hot journals for dead sessions.
Reconcile sessions.container_status = 'running' when the container no longer exists.
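One way the two follow-ups could be approached is sketched below. Everything here is an assumption: the `SessionRecord` shape, the per-session directory layout, and both helper functions are hypothetical. The filesystem and container checks are injected as predicates so the selection logic stays testable without touching disk or a container runtime.

```typescript
// Hypothetical sketch: record shape, directory layout, and helper names are assumptions.
type SessionRecord = { id: string; dir: string; containerStatus: string };

// Sessions whose container is gone but whose outbound.db-journal is still on disk.
// A journal next to a live container is left alone: it may belong to an active transaction.
function findOrphanedJournals(
  sessions: SessionRecord[],
  fileExists: (path: string) => boolean,
  containerExists: (sessionId: string) => boolean,
): SessionRecord[] {
  return sessions.filter(
    (s) => !containerExists(s.id) && fileExists(`${s.dir}/outbound.db-journal`),
  );
}

// Sessions still marked 'running' although their container no longer exists.
function findStaleRunning(
  sessions: SessionRecord[],
  containerExists: (sessionId: string) => boolean,
): SessionRecord[] {
  return sessions.filter(
    (s) => s.containerStatus === "running" && !containerExists(s.id),
  );
}

// Usage with in-memory fakes standing in for the filesystem and container runtime.
const records: SessionRecord[] = [
  { id: "a", dir: "/s/a", containerStatus: "running" },
  { id: "b", dir: "/s/b", containerStatus: "running" },
  { id: "c", dir: "/s/c", containerStatus: "stopped" },
];
const files = new Set(["/s/a/outbound.db-journal", "/s/b/outbound.db-journal"]);
const liveContainers = new Set(["a"]);

const orphaned = findOrphanedJournals(
  records,
  (p) => files.has(p),
  (id) => liveContainers.has(id),
).map((s) => s.id);

const staleRunning = findStaleRunning(records, (id) => liveContainers.has(id)).map(
  (s) => s.id,
);
```

What to do with a detected journal (for example, opening the database read-write so SQLite rolls it back) is left out here, since that recovery step is the subject of #2750.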

…stall delivery for all

The active and sweep delivery poll loops iterate every session in a plain for-loop wrapped in a single try/catch. deliverSessionMessages re-threw on failure, so an unhandled error for one session aborted the entire tick and silently halted message delivery for every other agent until a daemon restart.

Observed failure: a crashed container left an orphaned hot journal (outbound.db-journal) beside its outbound.db. drainSession opens outbound.db read-only (single-writer invariant), but rolling back the hot journal requires a write, so even the SELECT in getDueOutboundMessages threw "attempt to write a readonly database" on every tick (~1.3s), poisoning delivery for all sessions ordered after the broken one. A monitoring agent on another session stopped receiving its scheduled tasks and stopped delivering alerts for hours.

Catch and log per session in deliverSessionMessages so a single unhealthy session is contained. The broken session self-heals on its next container start, when the writer opens the DB read-write and rolls the journal back.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mashkovtsevlx · 2026-06-17T13:42:29Z

Closing as a duplicate of #2750, which is more complete (it actively recovers stranded journals via recoverOutboundJournal and classifies the transient hot-journal race, where this PR only isolated the symptom) and is already in review with tests. I should have searched existing issues/PRs first — #2640 and #2516 already cover this. Corroborating production data point and two follow-up notes left on #2750.

mashkovtsevlx requested review from gabi-simons and gavrielc as code owners June 17, 2026 13:31

mashkovtsevlx mentioned this pull request Jun 17, 2026

fix: recover stale outbound.db journals after container kills; classify hot-journal poll races (#2516, #2640) #2750

Open

mashkovtsevlx closed this Jun 17, 2026

This was referenced Jun 18, 2026

🦞 OpenClaw Ecosystem Daily Report 2026-06-18 zx0828/big_model_radar#142

Open

🦞 OpenClaw Ecosystem Daily Report 2026-06-18 96loveslife/big_model_radar#29

Open

mashkovtsevlx wants to merge 1 commit into nanocoai:main from mashkovtsevlx:fix/isolate-per-session-delivery-errors
