Skip to content

Add resume-based retry mechanism to run_claude in archon-loop.sh #12

@surenny

Description

@surenny

Problem

archon-loop.sh's run_claude() has no retry mechanism. If claude -p crashes mid-session (API error, rate limit, network timeout), the call silently fails via || true and the iteration continues with missing output. This wastes the work the agent already completed before the crash.

Proposed Solution

Use Claude Code's --resume <session-id> + -p to resume crashed sessions instead of restarting from scratch.

Key design:

  1. Pre-assign session ID via uuidgen + --session-id on first attempt
  2. Detect crash by checking if session_end event exists in the JSONL log
  3. Resume with --resume <session-id> + -p "Continue your task from where you left off" — Claude picks up from the interruption point with full conversation context
  4. Exponential backoff between retries (default: 30s → 60s → 120s)
  5. Cap retries at 2 (3 total attempts)

Verified:

  • --resume <session-id> works with -p mode (tested: session context is preserved)
  • session_id is available in stream-json output's result event
  • Resume prompt + existing context is enough for Claude to continue without restarting

New CLI flags:

--max-retries N        Max retries per claude -p call (default: 2)
--retry-backoff N      Initial backoff seconds, doubles each retry (default: 30)

What this does NOT retry:

  • Normal completion with poor results (agent gave up, wrong approach) — not a crash
  • --resume itself failing (session expired) — exhausts retry budget and returns error

Impact

All phases benefit (plan, prover, review) since retry is inside run_claude(). Parallel provers retry independently in their subshells.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions