Inbox messages stuck in PENDING when receiving agent is already idle #131

@kdzhang

Summary

Inbox messages can get stuck in PENDING indefinitely when the receiving agent is already idle at the time the message is posted. The problem appears to be provider-agnostic (Kiro CLI, Claude Code, etc.) because it lies in the delivery architecture, not in any specific provider's status detection.

Provider: observed with Kiro CLI; expected to affect other providers as well

Impact

  • Agent-to-agent messaging silently fails — messages stay PENDING forever
  • Multi-agent workflows stall waiting for callbacks that were sent but never delivered
  • Requires manual intervention (resending the message) to unblock

Reproduction

  1. Start cao-server and a multi-agent session with 3+ agents
  2. Agent A finishes work and goes idle (no more log output)
  3. Agent B calls send_message to Agent A
  4. Message stays PENDING — Agent A never receives it

This happens intermittently in long-running sessions (4-8 hours) with multiple concurrent agents. We observe it several times per session.

Root Cause

The inbox has two delivery paths:

Path 1 — Immediate delivery (on POST): POST /terminals/{id}/inbox/messages calls check_and_send_pending_messages(receiver_id), which calls provider.get_status(). If IDLE or COMPLETED, delivers immediately. This is a single-shot attempt with no retry. If get_status() returns a stale or incorrect status at that moment, delivery is skipped.

Path 2 — PollingObserver: Monitors TERMINAL_LOG_DIR for .log file changes every 5 seconds. On change → check pending → check idle → deliver. But if the agent is already idle and not producing output, the log file doesn't change, so the observer never fires again.

The gap: If Path 1 fails (stale status at the wrong moment) and the agent is already idle (Path 2 never triggers), the message is permanently orphaned. There is no fallback mechanism.
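The gap can be reproduced in a toy model. The sketch below is not the cao-server code: `ToyInbox`, `simulate`, and the `Status` enum are hypothetical stand-ins that only mirror the two paths as described above (one delivery attempt on POST, plus an observer callback that runs only when the log changes).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    IDLE = "idle"
    COMPLETED = "completed"
    PROCESSING = "processing"


DELIVERABLE = {Status.IDLE, Status.COMPLETED}


@dataclass
class ToyInbox:
    """Hypothetical stand-in for the inbox, for illustration only."""
    reported_status: Status  # what a get_status() call would return right now
    pending: list = field(default_factory=list)
    delivered: list = field(default_factory=list)

    def check_and_send_pending(self) -> None:
        # Deliver only when the receiver is reported as idle or completed.
        if self.reported_status in DELIVERABLE:
            self.delivered.extend(self.pending)
            self.pending.clear()

    def post(self, message: str) -> None:
        # Path 1: queue the message, then make exactly one delivery attempt.
        self.pending.append(message)
        self.check_and_send_pending()

    def on_log_change(self) -> None:
        # Path 2: invoked only when the receiver's log file changes.
        self.check_and_send_pending()


def simulate(log_changes_after_post: int) -> ToyInbox:
    # The receiver is really idle, but the status read at POST time is stale.
    inbox = ToyInbox(reported_status=Status.PROCESSING)
    inbox.post("callback from Agent B")
    inbox.reported_status = Status.IDLE  # status catches up afterwards
    for _ in range(log_changes_after_post):
        inbox.on_log_change()
    return inbox


stuck = simulate(log_changes_after_post=0)      # idle agent writes no more output
recovered = simulate(log_changes_after_post=1)  # any later log output rescues it
print(len(stuck.pending), len(recovered.pending))  # prints "1 0"
```

With zero log changes after the POST, nothing ever calls `check_and_send_pending` again, so the message stays pending even though the reported status has become IDLE.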

Possible Directions

  • A periodic background check for orphaned PENDING messages (similar to the existing flow_daemon() pattern)
  • Retry logic on the immediate delivery path (e.g., a few attempts with short delays)
  • A fallback poll triggered when a new message is queued but the watcher hasn't fired within N seconds

Related Issues

The related issues improve get_status() accuracy, but this issue is distinct: even with perfect status detection, the single-shot immediate delivery can miss due to timing, and there is no fallback when it does.

Environment

  • cao-server at commit 331e8d7
  • macOS, Kiro CLI provider
  • Observed across multiple multi-day sessions


Labels: bug
