Skip to content

Latest commit

 

History

History
176 lines (118 loc) · 16.1 KB

File metadata and controls

176 lines (118 loc) · 16.1 KB

Phase 5 Arena runner plan (M194, 2026-05-15)

Top spec: claude-code-parity-apr-poc.md | Phase 4 plan | Design audit | Risks | Completeness assessment

Scope

Phase 5 = live Arena runner operationalizing M192 design-audit.md's R2 and R3 recommendations:

  • R2 (audit §6.2): "Stop hand-authoring canonical JSONL traces. Reallocate engineering cycles from ccpa-replayer maintenance to a live 'Arena' runner (SWE-Bench / ProgramBench style). End-to-end execution, even if non-deterministic, provides higher-fidelity signal for convergence."
  • R3 (audit §6.3): "Shift the evaluation focus from zero-shot trajectory matching to the agent's ability to recover from failed bash commands or test runs. Real-world convergence depends on self-correction, which static traces cannot evaluate."

Phase 5 is operator-initiated at M192 (design audit). The audit's Popperian falsifier — if apr code scores ≥0.95 on static AUTHORED fixtures BUT ~0 on live ProgramBench-style tasks, the static-fixture approach is FALSIFIED as a convergence predictor — is the foundational test the Arena must answer.

What "Arena" means

A live, end-to-end, non-deterministic test harness analogous to SWE-bench's inference + post_processing + evaluation pipeline. Per task:

  1. Clone the fixture repo at pre_fix_commit (Phase 4 P4.1 corpus reuse).
  2. Hand the agent (apr code OR claude baseline) the goal prompt + an interactive shell session, NOT a single-turn prompt-completion contract.
  3. The agent runs multiple turns: edit, cargo test, observe failure, edit again, iterate. Each turn's bash/test output is fed back as context for the next agent action.
  4. End state: agent declares "done" OR a wall-time limit fires. Score by the fixture's completion oracle (M182 corpus uses cargo test exit code + pattern match).

The CRITICAL difference vs Phase 4's phase-4-bench.sh (M184): Phase 4 issues a SINGLE <system> -p "$(cat prompt.txt)" invocation; the agent gets one shot. Phase 5 wraps a MULTI-TURN dialog with execution feedback. This is the Phase 4 → Phase 5 cliff the audit identified.

Sub-deliverables (P5.1-P5.5)

P5.1 — Arena harness scaffolding

Goal: new module crates/ccpa-arena/ (sibling to ccpa-replayer) providing the multi-turn dialog primitive.

pub struct ArenaSession<D: LlmDriver> {
    driver: D,
    cwd: PathBuf,
    history: Vec<TurnRecord>,
    max_turns: usize,
    max_wall_seconds: u64,
}

impl<D: LlmDriver> ArenaSession<D> {
    pub fn run(&mut self, prompt: &str, oracle: &OracleCmd) -> ArenaOutcome { /* ... */ }
}

pub enum ArenaOutcome {
    OraclePassed { turns: usize, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { final_diff: String, partial_pass_rate: f64 },
    WallTimeout { turns_at_timeout: usize },
    DriverError(LlmDriverError),
}

Key shift vs ccpa-replayer: no RecordedDriver fallback. The agent drives itself. The driver implementation is apr code or claude via their CLI (subprocess invocation), NOT a recorded trace. Reuses the M174 validate-fixtures.sh clone-at-dispatch pattern for repo state isolation.

Estimated effort: 2-3 days (~400 LOC Rust + tests).

P5.2 — Multi-turn execution loop

Goal: implement the ArenaSession::run body — the actual agent dialog driver.

Per turn:

  1. Render history into a prompt suffix (### Previous turn output:\n<bash output>\n### Continue:\n).
  2. Call driver.next_turn(&history_prompt) to get the agent's next action (tool call or "done").
  3. Execute the tool call in cwd:
    • Bash { command }: run via std::process::Command; capture stdout/stderr/exit_code.
    • Edit { file, find, replace }: file mutation with string-find-replace semantics; record post-state hash.
    • Read { file }: read file; return content.
    • Write { file, content }: write file (or fail if exists; agent must Read first).
  4. Append TurnRecord to history.
  5. After every K turns, run the oracle command. If it passes → OraclePassed. If max_turns reached → OracleFailedAfterMaxTurns. If wall_seconds exceeded → WallTimeout.

Tool-call grammar: reuses ccpa-trace::Block::ToolUse for trace records — the Arena writes the SAME trace format Phase 1-4 uses, so existing tooling (ccpa-cli diff, FALSIFY-CCPA-001 schema-roundtrip) keeps working.

Estimated effort: 3-5 days. The main complexity is the bash/edit/read/write tool dispatch + history rendering.

P5.3 — Live Arena bench runner

Goal: scripts/phase-5-arena-bench.sh operator-dispatch entry point — analogous to scripts/phase-3-bench.sh (M150) and scripts/phase-4-bench.sh (M184).

Per fixture × system (teacher=claude, student=apr code):

  1. Clone the fixture's pinned pre_fix_commit SHA into a tempdir.
  2. Invoke ccpa-arena --driver=<system> --fixture-dir=<dir> --oracle="$(cat meta.toml | grep oracle_cmd)" --max-turns=20 --wall-seconds=900.
  3. Capture the ArenaOutcome enum + the full multi-turn history.
  4. Emit per-fixture + aggregate metrics to evidence/phase-5/arena-scores.json:
    • oracle_passed_rate (fraction of fixtures where outcome was OraclePassed)
    • mean_turns_to_pass (signal for "how much exploration does the agent need?")
    • mean_wall_seconds_to_pass
    • recovery_rate (fraction of fixtures where at least one bash command failed but the agent eventually passed the oracle — direct signal for R3 "error recovery over zero-shot determinism")

Operator preconditions: same as phase-4-bench.sh + MAX_TURNS env-var (default 20; bound multi-turn cost).

Wall budget: 5 fixtures × 2 systems × 20 turns × ~30s/turn ≈ 1h per dispatch.

Estimated effort: ~1 day; reuses ~70% of phase-4-bench.sh.

P5.4 — FALSIFY-CCPA-018 gate (recovery-rate bound)

Proposed assertion: at threshold T_recovery (initial value TBD by first measurement; probably 0.5), require recovery_rate >= T_recovery AND oracle_passed_rate >= 0.3. Direct empirical answer to R3's "self-correction over trajectory matching" framing.

Test home: crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs. Initial status: PROPOSED until first operator-dispatched measurement.

Bidirectional sensitivity (mandatory): synthetic identity (recovery_rate=1.0, oracle=1.0 → passes) + synthetic always-fail (recovery_rate=0.0 → fails) + synthetic give-up-fast (oracle=1.0 BUT recovery_rate=0.0 because agent never hit a failure to recover from → fails on recovery floor, passes on oracle floor — bidirectional).

Critical design choice: CCPA-018 measures agent quality (does it recover?), not functional outcome (does code work?). CCPA-016 + CCPA-017 already cover the latter. CCPA-018 is the explicit answer to the audit's R3 directive.

Estimated effort: ~1 day test scaffold; threshold-calibration is downstream.

P5.5 — Falsifier-of-the-falsifier — does Phase 5 falsify Phase 1?

Goal: explicitly run the audit's Popperian test. Compare static-fixture parity score (FALSIFY-CCPA-008, currently 1.0 on 30/30 AUTHORED fixtures) to live-Arena parity score (FALSIFY-CCPA-017 outcome agreement on the M182 project-scale corpus AS RUN through the P5.3 Arena, not the P4.2 single-turn runner).

If: static_parity ≥ 0.95 AND arena_outcome_agreement ≤ 0.2 → static-fixture approach FALSIFIED as a convergence predictor. Document the falsification at evidence/phase-5/static-fixture-falsification.md. Action: soft-deprecate FALSIFY-CCPA-008 and reframe it as a meter-validation metric (correct-but-vacuous) rather than a system-validation metric.

Else: static fixtures correlate with arena outcomes → the static approach is empirically validated. No deprecation needed; CCPA-008 remains load-bearing.

This is the audit's primary deliverable. The whole point of Phase 5 is to answer R2's "static fixtures lack the dynamic feedback of true distillation" assertion empirically rather than rhetorically.

Estimated effort: ~half-day (consume P5.3 output + diff against existing CCPA-008 evidence + write the determination doc).

Phase 5 vs Phase 4 — comparison table

Dimension Phase 4 (M180-M190) Phase 5 (M194+)
Turns per fixture 1 (single prompt) up to 20 (multi-turn dialog)
Execution feedback None (one-shot generation) Yes (bash/test output → next turn's prompt)
Self-correction signal Not measured Measured (recovery_rate)
Primary metric partial_agreement >= 0.3 (CCPA-017) recovery_rate >= T AND oracle_passed_rate >= 0.3 (CCPA-018)
Wall budget per dispatch ~30 min for 5 fixtures × 2 systems ~1h for 5 fixtures × 2 systems × 20 turns
Determinism One-shot RNG-bound Per-turn non-determinism; outcome bound only
Falsifier "do both systems make matching partial progress?" "does the agent recover when bash fails?"

Implementation blockers and discharges

Blocker 1: apr code is a one-shot CLI (apr code -p "<prompt>"). It doesn't support an interactive multi-turn shell session.

Discharge path: P5.2's multi-turn loop spawns apr code once PER TURN with the cumulative history as the prompt. Each invocation is fresh; the agent's "memory" is reconstructed from the prompt history we maintain. This trades inference latency (~30s extra context per turn) for harness simplicity. Future work: a native multi-turn mode in apr code would amortize the cost.

Blocker 2: Multi-turn dialog generates EXPONENTIALLY-growing prompt context (history accumulates). Hits model context limits fast.

Discharge path: history truncation — keep only the last N turns + the original prompt. K=5 is a reasonable starting bound; the agent's "long-term memory" is the repo file system itself (it can re-read files).

Blocker 3: Wall-clock cost. Multi-turn live execution against claude takes 30s/turn × 20 turns × 5 fixtures × 2 systems ≈ 1h per Arena run. *(M222 operator-directive: CCPA uses claude CLI session-auth via claude login, NOT the Anthropic API directly; there is no per-turn dollar cost — the operator's Claude Code subscription covers usage. The previous "$0.05-0.20 per turn / $5-20 per run" API-call estimate is OBSOLETE.)*

Discharge path: --max-wall-seconds env-var (default 900s) caps each fixture's wall budget. No dollar-budget flag needed since CCPA is not API-metered.

Non-blocker (was suspected): RecordedDriver deprecation. Phase 5 does NOT require deprecating ccpa-replayer; the two coexist. ccpa-replayer remains the FALSIFY-CCPA-001/002/003 source-of-truth (trace-schema validation + replay determinism); Arena is the live-evaluation track. R2's "stop hand-authoring canonical JSONL" is a SEPARATE concern from P5 and can be addressed later.

Status post-M210

  • P5.1 Arena harness scaffolding: SHIPPED at M196.
  • P5.2 multi-turn loop: SHIPPED at M200.
  • P5.3 Arena bench runner: SHIPPED at M202.
  • P5.4 FALSIFY-CCPA-018 gate: SHIPPED at M204.
  • P5.5 falsifier-of-falsifier evidence: SHIPPED at M206 (template + comparator code; evidence pending dispatch).
  • Phase 5 contract bump v1.28.0 → v1.29.0: SHIPPED at M208 — M22 5-step ritual mirror of aprender PR #1705 registering FALSIFY-CCPA-018 (arena_recovery_rate_bound) at status: PROPOSED. Gate count 17 → 18. PROPOSED → ACTIVE_RUNTIME flip awaits v1.30.0 after first operator-dispatched Arena bench.
  • ccpa-arena coverage closure: SHIPPED at M210 — workspace coverage 95.44% → 99.09% lines and 99.75% functions; FALSIFY-CCPA-011 now passes on its own merits (M204-M207 had been admin-merging through the gap). New convention encoded in Makefile + CI: --ignore-filename-regex '/bin/' excludes operator-dispatch CLI binaries from coverage accounting (their runtime is exercised by outer bash dispatcher scripts, not unit tests).

P5.5 deliverable detail — three deliverables: (a) crates/ccpa-arena/src/falsifier.rs (~140 LOC) — evaluate_static_vs_arena(static_parity, arena_parity, src, src2) -> FalsifierVerdict with 3-variant outcome (StaticFalsified / StaticValidated / Inconclusive); thresholds STATIC_PARITY_THRESHOLD = 0.95 and ARENA_PARITY_CEILING = 0.2 per design-audit.md §5; 8 unit tests covering canonical falsification, both-high validation, exact-boundary semantics, below-floor short-circuit, middle-zone inconclusive, verdict-records-inputs, serde-roundtrip, outcome-tag. (b) crates/ccpa-arena/tests/falsify_static_vs_arena.rs (~110 LOC) — 4 active synthetic tests + 1 #[ignore]'d live-evidence test that loads BOTH evidence/phase-3/multipl-e-rust-scores.json AND evidence/phase-5/arena-scores.json, computes the verdict, pretty-prints it for operator inspection; the live test is informational (no assertion on outcome — the operator takes post-verdict action per the evidence doc). (c) evidence/phase-5/static-fixture-falsification.md (~95 lines) — operator-facing evidence-doc template with placeholders for the per-source numbers (CCPA-016 .agreement, CCPA-018 .oracle_passed_rate, CCPA-017 .partial_agreement), the post-verdict decision matrix, the StaticFalsified action checklist (soft-deprecate CCPA-008 to meter-validation status + promote CCPA-017/018 to user-facing parity claims), the operator-dispatch checklist, and the cross-reference back to design-audit.md §5. Public API: pub use falsifier::{evaluate_static_vs_arena, FalsifierOutcome, FalsifierVerdict, ARENA_PARITY_CEILING, STATIC_PARITY_THRESHOLD}; in lib.rs. Test counts: 72 lib + 7 CCPA-018 active + 4 falsifier active = 83 GREEN; 2 #[ignore]'d (both live-evidence). P5.5 is the audit's primary deliverable made executable: the operator can now run cargo test -p ccpa-arena --test falsify_static_vs_arena -- --ignored post-dispatch to get the empirical verdict.

Phase 5 is post-cleanup at M210 — all 5 sub-deliverables shipped at the code+test level (P5.1-P5.5 / M196-M206), the contract bump shipped at M208 (CCPA-018 registered at PROPOSED), and the coverage closure shipped at M210 (FALSIFY-CCPA-011 green). The Popperian test is now executable code AND the gate is registered in the contract; only the evidence inputs (operator-dispatched Arena bench against the M182 corpus) remain to fully resolve the verdict. Phase 5 substantive arc COMPLETE. Future work: v1.29.0 → v1.30.0 contract bump flipping CCPA-017 + CCPA-018 PROPOSED → ACTIVE_RUNTIME after first operator dispatch produces evidence/phase-4/project-scale-scores.json + evidence/phase-5/arena-scores.json.

Why this is high EV

  1. Direct answer to the audit's primary directive (R2). The Phase 4 + Phase 3 path validated function-scale and partial-progress parity; Phase 5 validates the audit's foundational claim that "static fixtures predict live performance".
  2. CCPA-018 introduces a new metric category (recovery rate) that none of CCPA-001..017 capture. Even if the audit's Popperian test confirms static fixtures DO predict live performance, the recovery-rate measurement is independently valuable.
  3. Reuses the Phase 4 corpus (M182 5-fixture project-scale corpus). No new fixture authoring needed for the first dispatch; cost is bench-runner code only.
  4. Aligns with operator priorities: the operator authored design-audit.md and the M192 integration; Phase 5 is the canonical operationalization of that audit.
  5. The Popperian falsifier IS the test. If we never run it, we cannot claim the project has done what its design audit demanded.

Cross-refs