Phase 5 Arena runner plan (M194, 2026-05-15)

Top spec: claude-code-parity-apr-poc.md | Phase 4 plan | Design audit | Risks | Completeness assessment

Scope

Phase 5 = live Arena runner operationalizing M192 design-audit.md's R2 and R3 recommendations:

R2 (audit §6.2): "Stop hand-authoring canonical JSONL traces. Reallocate engineering cycles from ccpa-replayer maintenance to a live 'Arena' runner (SWE-Bench / ProgramBench style). End-to-end execution, even if non-deterministic, provides higher-fidelity signal for convergence."
R3 (audit §6.3): "Shift the evaluation focus from zero-shot trajectory matching to the agent's ability to recover from failed bash commands or test runs. Real-world convergence depends on self-correction, which static traces cannot evaluate."

Phase 5 is operator-initiated at M192 (design audit). The audit's Popperian falsifier — if apr code scores ≥0.95 on static AUTHORED fixtures BUT ~0 on live ProgramBench-style tasks, the static-fixture approach is FALSIFIED as a convergence predictor — is the foundational test the Arena must answer.

What "Arena" means

A live, end-to-end, non-deterministic test harness analogous to SWE-bench's inference + post_processing + evaluation pipeline. Per task:

Clone the fixture repo at pre_fix_commit (Phase 4 P4.1 corpus reuse).
Hand the agent (apr code OR claude baseline) the goal prompt + an interactive shell session, NOT a single-turn prompt-completion contract.
The agent runs multiple turns: edit, cargo test, observe failure, edit again, iterate. Each turn's bash/test output is fed back as context for the next agent action.
End state: agent declares "done" OR a wall-time limit fires. Score by the fixture's completion oracle (M182 corpus uses cargo test exit code + pattern match).

The CRITICAL difference vs Phase 4's phase-4-bench.sh (M184): Phase 4 issues a SINGLE <system> -p "$(cat prompt.txt)" invocation; the agent gets one shot. Phase 5 wraps a MULTI-TURN dialog with execution feedback. This is the Phase 4 → Phase 5 cliff the audit identified.

Sub-deliverables (P5.1-P5.5)

P5.1 — Arena harness scaffolding

Goal: new module crates/ccpa-arena/ (sibling to ccpa-replayer) providing the multi-turn dialog primitive.

pub struct ArenaSession<D: LlmDriver> {
    driver: D,
    cwd: PathBuf,
    history: Vec<TurnRecord>,
    max_turns: usize,
    max_wall_seconds: u64,
}

impl<D: LlmDriver> ArenaSession<D> {
    pub fn run(&mut self, prompt: &str, oracle: &OracleCmd) -> ArenaOutcome { /* ... */ }
}

pub enum ArenaOutcome {
    OraclePassed { turns: usize, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { final_diff: String, partial_pass_rate: f64 },
    WallTimeout { turns_at_timeout: usize },
    DriverError(LlmDriverError),
}

Key shift vs ccpa-replayer: no RecordedDriver fallback. The agent drives itself. The driver implementation is apr code or claude via their CLI (subprocess invocation), NOT a recorded trace. Reuses the M174 validate-fixtures.sh clone-at-dispatch pattern for repo state isolation.

Estimated effort: 2-3 days (~400 LOC Rust + tests).

P5.2 — Multi-turn execution loop

Goal: implement the ArenaSession::run body — the actual agent dialog driver.

Per turn:

Render history into a prompt suffix (### Previous turn output:\n<bash output>\n### Continue:\n).
Call driver.next_turn(&history_prompt) to get the agent's next action (tool call or "done").
Execute the tool call in cwd:
- Bash { command }: run via std::process::Command; capture stdout/stderr/exit_code.
- Edit { file, find, replace }: file mutation with string-find-replace semantics; record post-state hash.
- Read { file }: read file; return content.
- Write { file, content }: write file (or fail if exists; agent must Read first).
Append TurnRecord to history.
After every K turns, run the oracle command. If it passes → OraclePassed. If max_turns reached → OracleFailedAfterMaxTurns. If wall_seconds exceeded → WallTimeout.

Tool-call grammar: reuses ccpa-trace::Block::ToolUse for trace records — the Arena writes the SAME trace format Phase 1-4 uses, so existing tooling (ccpa-cli diff, FALSIFY-CCPA-001 schema-roundtrip) keeps working.

Estimated effort: 3-5 days. The main complexity is the bash/edit/read/write tool dispatch + history rendering.

P5.3 — Live Arena bench runner

Goal: scripts/phase-5-arena-bench.sh operator-dispatch entry point — analogous to scripts/phase-3-bench.sh (M150) and scripts/phase-4-bench.sh (M184).

Per fixture × system (teacher=claude, student=apr code):

Clone the fixture's pinned pre_fix_commit SHA into a tempdir.
Invoke ccpa-arena --driver=<system> --fixture-dir=<dir> --oracle="$(cat meta.toml | grep oracle_cmd)" --max-turns=20 --wall-seconds=900.
Capture the ArenaOutcome enum + the full multi-turn history.
Emit per-fixture + aggregate metrics to evidence/phase-5/arena-scores.json:
- oracle_passed_rate (fraction of fixtures where outcome was OraclePassed)
- mean_turns_to_pass (signal for "how much exploration does the agent need?")
- mean_wall_seconds_to_pass
- recovery_rate (fraction of fixtures where at least one bash command failed but the agent eventually passed the oracle — direct signal for R3 "error recovery over zero-shot determinism")

Operator preconditions: same as phase-4-bench.sh + MAX_TURNS env-var (default 20; bound multi-turn cost).

Wall budget: 5 fixtures × 2 systems × 20 turns × ~30s/turn ≈ 1h per dispatch.

Estimated effort: ~1 day; reuses ~70% of phase-4-bench.sh.

P5.4 — FALSIFY-CCPA-018 gate (recovery-rate bound)

Proposed assertion: at threshold T_recovery (initial value TBD by first measurement; probably 0.5), require recovery_rate >= T_recovery AND oracle_passed_rate >= 0.3. Direct empirical answer to R3's "self-correction over trajectory matching" framing.

Test home: crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs. Initial status: PROPOSED until first operator-dispatched measurement.

Bidirectional sensitivity (mandatory): synthetic identity (recovery_rate=1.0, oracle=1.0 → passes) + synthetic always-fail (recovery_rate=0.0 → fails) + synthetic give-up-fast (oracle=1.0 BUT recovery_rate=0.0 because agent never hit a failure to recover from → fails on recovery floor, passes on oracle floor — bidirectional).

Critical design choice: CCPA-018 measures agent quality (does it recover?), not functional outcome (does code work?). CCPA-016 + CCPA-017 already cover the latter. CCPA-018 is the explicit answer to the audit's R3 directive.

Estimated effort: ~1 day test scaffold; threshold-calibration is downstream.

P5.5 — Falsifier-of-the-falsifier — does Phase 5 falsify Phase 1?

Goal: explicitly run the audit's Popperian test. Compare static-fixture parity score (FALSIFY-CCPA-008, currently 1.0 on 30/30 AUTHORED fixtures) to live-Arena parity score (FALSIFY-CCPA-017 outcome agreement on the M182 project-scale corpus AS RUN through the P5.3 Arena, not the P4.2 single-turn runner).

If: static_parity ≥ 0.95 AND arena_outcome_agreement ≤ 0.2 → static-fixture approach FALSIFIED as a convergence predictor. Document the falsification at evidence/phase-5/static-fixture-falsification.md. Action: soft-deprecate FALSIFY-CCPA-008 and reframe it as a meter-validation metric (correct-but-vacuous) rather than a system-validation metric.

Else: static fixtures correlate with arena outcomes → the static approach is empirically validated. No deprecation needed; CCPA-008 remains load-bearing.

This is the audit's primary deliverable. The whole point of Phase 5 is to answer R2's "static fixtures lack the dynamic feedback of true distillation" assertion empirically rather than rhetorically.

Estimated effort: ~half-day (consume P5.3 output + diff against existing CCPA-008 evidence + write the determination doc).

Phase 5 vs Phase 4 — comparison table

Dimension	Phase 4 (M180-M190)	Phase 5 (M194+)
Turns per fixture	1 (single prompt)	up to 20 (multi-turn dialog)
Execution feedback	None (one-shot generation)	Yes (bash/test output → next turn's prompt)
Self-correction signal	Not measured	Measured (`recovery_rate`)
Primary metric	`partial_agreement >= 0.3` (CCPA-017)	`recovery_rate >= T` AND `oracle_passed_rate >= 0.3` (CCPA-018)
Wall budget per dispatch	~30 min for 5 fixtures × 2 systems	~1h for 5 fixtures × 2 systems × 20 turns
Determinism	One-shot RNG-bound	Per-turn non-determinism; outcome bound only
Falsifier	"do both systems make matching partial progress?"	"does the agent recover when bash fails?"

Implementation blockers and discharges

Blocker 1: apr code is a one-shot CLI (apr code -p "<prompt>"). It doesn't support an interactive multi-turn shell session.

Discharge path: P5.2's multi-turn loop spawns apr code once PER TURN with the cumulative history as the prompt. Each invocation is fresh; the agent's "memory" is reconstructed from the prompt history we maintain. This trades inference latency (~30s extra context per turn) for harness simplicity. Future work: a native multi-turn mode in apr code would amortize the cost.

Blocker 2: Multi-turn dialog generates EXPONENTIALLY-growing prompt context (history accumulates). Hits model context limits fast.

Discharge path: history truncation — keep only the last N turns + the original prompt. K=5 is a reasonable starting bound; the agent's "long-term memory" is the repo file system itself (it can re-read files).

Blocker 3: Wall-clock cost. Multi-turn live execution against claude takes 30s/turn × 20 turns × 5 fixtures × 2 systems ≈ 1h per Arena run. *(M222 operator-directive: CCPA uses claude CLI session-auth via claude login, NOT the Anthropic API directly; there is no per-turn dollar cost — the operator's Claude Code subscription covers usage. The previous "$0.05-0.20 per turn / $5-20 per run" API-call estimate is OBSOLETE.)*

Discharge path: --max-wall-seconds env-var (default 900s) caps each fixture's wall budget. No dollar-budget flag needed since CCPA is not API-metered.

Non-blocker (was suspected): RecordedDriver deprecation. Phase 5 does NOT require deprecating ccpa-replayer; the two coexist. ccpa-replayer remains the FALSIFY-CCPA-001/002/003 source-of-truth (trace-schema validation + replay determinism); Arena is the live-evaluation track. R2's "stop hand-authoring canonical JSONL" is a SEPARATE concern from P5 and can be addressed later.

Status post-M210

P5.1 Arena harness scaffolding: SHIPPED at M196.
P5.2 multi-turn loop: SHIPPED at M200.
P5.3 Arena bench runner: SHIPPED at M202.
P5.4 FALSIFY-CCPA-018 gate: SHIPPED at M204.
P5.5 falsifier-of-falsifier evidence: SHIPPED at M206 (template + comparator code; evidence pending dispatch).
Phase 5 contract bump v1.28.0 → v1.29.0: SHIPPED at M208 — M22 5-step ritual mirror of aprender PR #1705 registering FALSIFY-CCPA-018 (arena_recovery_rate_bound) at status: PROPOSED. Gate count 17 → 18. PROPOSED → ACTIVE_RUNTIME flip awaits v1.30.0 after first operator-dispatched Arena bench.
ccpa-arena coverage closure: SHIPPED at M210 — workspace coverage 95.44% → 99.09% lines and 99.75% functions; FALSIFY-CCPA-011 now passes on its own merits (M204-M207 had been admin-merging through the gap). New convention encoded in Makefile + CI: --ignore-filename-regex '/bin/' excludes operator-dispatch CLI binaries from coverage accounting (their runtime is exercised by outer bash dispatcher scripts, not unit tests).

P5.5 deliverable detail — three deliverables: (a) `crates/ccpa-arena/src/falsifier.rs` (~140 LOC) — `evaluate_static_vs_arena(static_parity, arena_parity, src, src2) -> FalsifierVerdict` with 3-variant outcome (StaticFalsified / StaticValidated / Inconclusive); thresholds `STATIC_PARITY_THRESHOLD = 0.95` and `ARENA_PARITY_CEILING = 0.2` per design-audit.md §5; 8 unit tests covering canonical falsification, both-high validation, exact-boundary semantics, below-floor short-circuit, middle-zone inconclusive, verdict-records-inputs, serde-roundtrip, outcome-tag. (b) `crates/ccpa-arena/tests/falsify_static_vs_arena.rs` (~110 LOC) — 4 active synthetic tests + 1 `#[ignore]`'d live-evidence test that loads BOTH `evidence/phase-3/multipl-e-rust-scores.json` AND `evidence/phase-5/arena-scores.json`, computes the verdict, pretty-prints it for operator inspection; the live test is informational (no assertion on outcome — the operator takes post-verdict action per the evidence doc). (c) `evidence/phase-5/static-fixture-falsification.md` (~95 lines) — operator-facing evidence-doc template with placeholders for the per-source numbers (CCPA-016 `.agreement`, CCPA-018 `.oracle_passed_rate`, CCPA-017 `.partial_agreement`), the post-verdict decision matrix, the StaticFalsified action checklist (soft-deprecate CCPA-008 to meter-validation status + promote CCPA-017/018 to user-facing parity claims), the operator-dispatch checklist, and the cross-reference back to design-audit.md §5. Public API: `pub use falsifier::{evaluate_static_vs_arena, FalsifierOutcome, FalsifierVerdict, ARENA_PARITY_CEILING, STATIC_PARITY_THRESHOLD};` in lib.rs. Test counts: 72 lib + 7 CCPA-018 active + 4 falsifier active = 83 GREEN; 2 `#[ignore]`'d (both live-evidence). P5.5 is the audit's primary deliverable made executable: the operator can now run `cargo test -p ccpa-arena --test falsify_static_vs_arena -- --ignored` post-dispatch to get the empirical verdict.

Phase 5 is post-cleanup at M210 — all 5 sub-deliverables shipped at the code+test level (P5.1-P5.5 / M196-M206), the contract bump shipped at M208 (CCPA-018 registered at PROPOSED), and the coverage closure shipped at M210 (FALSIFY-CCPA-011 green). The Popperian test is now executable code AND the gate is registered in the contract; only the evidence inputs (operator-dispatched Arena bench against the M182 corpus) remain to fully resolve the verdict. Phase 5 substantive arc COMPLETE. Future work: v1.29.0 → v1.30.0 contract bump flipping CCPA-017 + CCPA-018 PROPOSED → ACTIVE_RUNTIME after first operator dispatch produces evidence/phase-4/project-scale-scores.json + evidence/phase-5/arena-scores.json.

Why this is high EV

Direct answer to the audit's primary directive (R2). The Phase 4 + Phase 3 path validated function-scale and partial-progress parity; Phase 5 validates the audit's foundational claim that "static fixtures predict live performance".
CCPA-018 introduces a new metric category (recovery rate) that none of CCPA-001..017 capture. Even if the audit's Popperian test confirms static fixtures DO predict live performance, the recovery-rate measurement is independently valuable.
Reuses the Phase 4 corpus (M182 5-fixture project-scale corpus). No new fixture authoring needed for the first dispatch; cost is bench-runner code only.
Aligns with operator priorities: the operator authored design-audit.md and the M192 integration; Phase 5 is the canonical operationalization of that audit.
The Popperian falsifier IS the test. If we never run it, we cannot claim the project has done what its design audit demanded.

Cross-refs

design-audit.md — operator-authored critique that motivates Phase 5
phase-4-project-scale-plan.md — Phase 4 plan (P4.1-P4.5); the static-fixture/single-turn baseline Phase 5 compares against
outcome-parity-plan.md — Phase 3 plan; the function-scale baseline that the audit's R1 critique targets
risks.md § M192 amendment — Popperian falsifier as a meta-risk
SWE-bench (arXiv:2310.06770) — Jimenez et al. 2024, the canonical live-execution benchmark Phase 5 emulates
ProgramBench (arXiv:2605.03546) — Yang et al. 2026, the 0%/200 SOTA-model baseline at project-scale

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Phase 5 Arena runner plan (M194, 2026-05-15)

Scope

What "Arena" means

Sub-deliverables (P5.1-P5.5)

P5.1 — Arena harness scaffolding

P5.2 — Multi-turn execution loop

P5.3 — Live Arena bench runner

P5.4 — FALSIFY-CCPA-018 gate (recovery-rate bound)

P5.5 — Falsifier-of-the-falsifier — does Phase 5 falsify Phase 1?

Phase 5 vs Phase 4 — comparison table

Implementation blockers and discharges

Status post-M210

Why this is high EV

Cross-refs

Uh oh!

FilesExpand file tree

phase-5-arena-runner-plan.md

Latest commit

History

phase-5-arena-runner-plan.md

File metadata and controls

Phase 5 Arena runner plan (M194, 2026-05-15)

Scope

What "Arena" means

Sub-deliverables (P5.1-P5.5)

P5.1 — Arena harness scaffolding

P5.2 — Multi-turn execution loop

P5.3 — Live Arena bench runner

P5.4 — FALSIFY-CCPA-018 gate (recovery-rate bound)

P5.5 — Falsifier-of-the-falsifier — does Phase 5 falsify Phase 1?

Phase 5 vs Phase 4 — comparison table

Implementation blockers and discharges

Status post-M210

Why this is high EV

Cross-refs