Inbox messages stuck in PENDING when receiving agent is already idle #131
Description
Summary
Inbox messages can get stuck in PENDING indefinitely when the receiving agent is already idle at the time the message is posted. This affects all providers — Kiro CLI, Claude Code, etc. — because the issue is in the delivery architecture, not in any specific provider's status detection.
Provider: observed with Kiro CLI (the cause appears to be provider-agnostic)
Impact
- Agent-to-agent messaging silently fails — messages stay PENDING forever
- Multi-agent workflows stall waiting for callbacks that were sent but never delivered
- Requires manual intervention (resending the message) to unblock
Reproduction
- Start cao-server and a multi-agent session with 3+ agents
- Agent A finishes work and goes idle (no more log output)
- Agent B calls send_message to Agent A
- Message stays PENDING; Agent A never receives it
This happens intermittently in long-running sessions (4-8 hours) with multiple concurrent agents. We observe it several times per session.
Root Cause
The inbox has two delivery paths:
Path 1 — Immediate delivery (on POST): POST /terminals/{id}/inbox/messages calls check_and_send_pending_messages(receiver_id), which calls provider.get_status(). If IDLE or COMPLETED, delivers immediately. This is a single-shot attempt with no retry. If get_status() returns a stale or incorrect status at that moment, delivery is skipped.
Path 2 — PollingObserver: Monitors TERMINAL_LOG_DIR for .log file changes every 5 seconds. On change → check pending → check idle → deliver. But if the agent is already idle and not producing output, the log file doesn't change, so the observer never fires again.
The gap: If Path 1 fails (stale status at the wrong moment) and the agent is already idle (Path 2 never triggers), the message is permanently orphaned. There is no fallback mechanism.
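To make the gap concrete, here is a minimal, self-contained model of the two paths. It is not the cao-server code: Terminal, LogWatcher, try_deliver, post_message and simulate are hypothetical names, and the stale status is simulated by hand. It only illustrates why a missed single-shot attempt is never retried when the log stops changing.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROCESSING = "processing"
    IDLE = "idle"


@dataclass
class Terminal:
    """Hypothetical stand-in for the receiving agent's terminal."""
    reported_status: Status   # what a get_status()-style call returns right now
    log_version: int = 0      # would increase whenever the agent writes output
    pending: list[str] = field(default_factory=list)
    delivered: list[str] = field(default_factory=list)


def try_deliver(term: Terminal) -> bool:
    """Deliver all pending messages, but only if the terminal reports IDLE."""
    if term.reported_status is not Status.IDLE or not term.pending:
        return False
    term.delivered.extend(term.pending)
    term.pending.clear()
    return True


def post_message(term: Terminal, text: str) -> None:
    """Path 1: queue the message and make exactly one delivery attempt."""
    term.pending.append(text)
    try_deliver(term)


class LogWatcher:
    """Path 2: attempts delivery only when the log has changed since last poll."""

    def __init__(self, term: Terminal) -> None:
        self.term = term
        self.seen_version = term.log_version

    def poll(self) -> bool:
        if self.term.log_version == self.seen_version:
            return False  # no log change, so no delivery attempt at all
        self.seen_version = self.term.log_version
        return try_deliver(self.term)


def simulate(polls: int = 3) -> Terminal:
    # Assumption: the agent is idle, but the status read at POST time is stale.
    term = Terminal(reported_status=Status.PROCESSING)
    watcher = LogWatcher(term)
    post_message(term, "callback from agent B")  # Path 1 misses
    term.reported_status = Status.IDLE           # status becomes correct later
    for _ in range(polls):
        watcher.poll()                           # log never changes, never fires
    return term
```

In this model the message is still in pending after any number of polls, even though the terminal reports IDLE the whole time the watcher is running.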
Possible Directions
- A periodic background check for orphaned PENDING messages (similar to the existing flow_daemon() pattern)
- Retry logic on the immediate delivery path (e.g., a few attempts with short delays)
- A fallback poll triggered when a new message is queued but the watcher hasn't fired within N seconds
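A rough sketch of the first two directions, assuming an asyncio-based server. The names deliver_with_retry, pending_sweeper, list_receivers and deliver are hypothetical; deliver stands in for something like check_and_send_pending_messages(receiver_id) and is assumed here to return True when the messages were delivered, which may not match its real signature.

```python
import asyncio
import logging
from typing import Callable, Iterable, Optional

logger = logging.getLogger("inbox_sweeper")

# Hypothetical callables, not part of cao-server:
#   deliver(receiver_id) -> True if pending messages were delivered
#   list_receivers()     -> ids of terminals that still have PENDING messages
Deliver = Callable[[str], bool]
ListReceivers = Callable[[], Iterable[str]]


async def deliver_with_retry(
    deliver: Deliver, receiver_id: str, attempts: int = 3, delay: float = 0.5
) -> bool:
    """Retry the immediate delivery a few times with a short, fixed delay."""
    for attempt in range(attempts):
        if deliver(receiver_id):
            return True
        if attempt < attempts - 1:
            await asyncio.sleep(delay)
    return False


async def pending_sweeper(
    list_receivers: ListReceivers,
    deliver: Deliver,
    interval: float = 10.0,
    stop: Optional[asyncio.Event] = None,
) -> None:
    """Periodically re-attempt delivery for every receiver with pending messages."""
    stop = stop or asyncio.Event()
    while not stop.is_set():
        for receiver_id in list(list_receivers()):
            try:
                deliver(receiver_id)
            except Exception:
                # One failing receiver should not stop the sweep for the others.
                logger.exception("sweep failed for %s", receiver_id)
        try:
            # Sleep for `interval`, but wake up early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

The retry narrows the window for a stale status at POST time, while the sweeper bounds how long a message can stay orphaned to roughly one interval, independent of log activity. If deliver blocks (for example on terminal I/O), it would likely need to run in a thread rather than directly on the event loop.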
Related Issues
- #104 (Claude Code inbox delivery can get stuck in PENDING due to stale PROCESSING status detection): covers stale PROCESSING detection in Claude Code specifically, which PR #106 (fix: prevent stale processing spinners from blocking Claude Code inbox delivery) appears to fix
- PR #62 (fix: Synchronize status detection with response completion): added position-based status comparison to Kiro CLI / Q CLI
Both improve get_status() accuracy, but this issue is distinct: even with perfect status detection, the single-shot immediate delivery can miss due to timing, and there is no fallback when it does.
Environment
- cao-server at commit 331e8d7
- macOS, Kiro CLI provider
- Observed across multiple multi-day sessions