Top spec: claude-code-parity-apr-poc.md | Phase 4 plan | Design audit | Risks | Completeness assessment
Phase 5 = live Arena runner operationalizing M192 design-audit.md's R2 and R3 recommendations:
- R2 (audit §6.2): "Stop hand-authoring canonical JSONL traces. Reallocate engineering cycles from `ccpa-replayer` maintenance to a live 'Arena' runner (SWE-Bench / ProgramBench style). End-to-end execution, even if non-deterministic, provides higher-fidelity signal for convergence."
- R3 (audit §6.3): "Shift the evaluation focus from zero-shot trajectory matching to the agent's ability to recover from failed bash commands or test runs. Real-world convergence depends on self-correction, which static traces cannot evaluate."
Phase 5 is operator-initiated at M192 (design audit). The audit's Popperian falsifier — if apr code scores ≥0.95 on static AUTHORED fixtures BUT ~0 on live ProgramBench-style tasks, the static-fixture approach is FALSIFIED as a convergence predictor — is the foundational test the Arena must answer.
The Arena is a live, end-to-end, non-deterministic test harness analogous to SWE-bench's inference + post_processing + evaluation pipeline. Per task:
- Clone the fixture repo at `pre_fix_commit` (Phase 4 P4.1 corpus reuse).
- Hand the agent (`apr code` OR `claude` baseline) the goal prompt + an interactive shell session, NOT a single-turn prompt-completion contract.
- The agent runs multiple turns: edit, `cargo test`, observe failure, edit again, iterate. Each turn's bash/test output is fed back as context for the next agent action.
- End state: agent declares "done" OR a wall-time limit fires. Score by the fixture's completion oracle (M182 corpus uses `cargo test` exit code + pattern match).
The CRITICAL difference vs Phase 4's phase-4-bench.sh (M184): Phase 4 issues a SINGLE <system> -p "$(cat prompt.txt)" invocation; the agent gets one shot. Phase 5 wraps a MULTI-TURN dialog with execution feedback. This is the Phase 4 → Phase 5 cliff the audit identified.
Goal (P5.1, Arena harness scaffolding): new module crates/ccpa-arena/ (sibling to ccpa-replayer) providing the multi-turn dialog primitive.

```rust
use std::path::PathBuf;

pub struct ArenaSession<D: LlmDriver> {
    driver: D,
    cwd: PathBuf,
    history: Vec<TurnRecord>,
    max_turns: usize,
    max_wall_seconds: u64,
}

impl<D: LlmDriver> ArenaSession<D> {
    pub fn run(&mut self, prompt: &str, oracle: &OracleCmd) -> ArenaOutcome { /* ... */ }
}

pub enum ArenaOutcome {
    OraclePassed { turns: usize, wall_seconds: u64 },
    OracleFailedAfterMaxTurns { final_diff: String, partial_pass_rate: f64 },
    WallTimeout { turns_at_timeout: usize },
    DriverError(LlmDriverError),
}
```

Key shift vs ccpa-replayer: no RecordedDriver fallback. The agent drives itself. The driver implementation is apr code or claude via their CLI (subprocess invocation), NOT a recorded trace. Reuses the M174 validate-fixtures.sh clone-at-dispatch pattern for repo state isolation.
Estimated effort: 2-3 days (~400 LOC Rust + tests).
Goal (P5.2, multi-turn loop): implement the ArenaSession::run body, the actual agent dialog driver.
Per turn:
- Render history into a prompt suffix (`### Previous turn output:\n<bash output>\n### Continue:\n`).
- Call `driver.next_turn(&history_prompt)` to get the agent's next action (tool call or "done").
- Execute the tool call in `cwd`:
  - `Bash { command }`: run via `std::process::Command`; capture stdout/stderr/exit_code.
  - `Edit { file, find, replace }`: file mutation with string-find-replace semantics; record post-state hash.
  - `Read { file }`: read file; return content.
  - `Write { file, content }`: write file (or fail if exists; agent must Read first).
- Append `TurnRecord` to history.
- After every K turns, run the oracle command. If it passes → `OraclePassed`. If max_turns reached → `OracleFailedAfterMaxTurns`. If wall_seconds exceeded → `WallTimeout`.
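
The per-turn steps above can be sketched roughly as follows. This is a minimal illustration, not the ccpa-arena implementation: `Action`, `run_bash`, `render_suffix`, and `run_turns` are hypothetical names, the driver is stood in for by a closure, only the Bash tool is dispatched, and the oracle check and wall-time limit are left out. It assumes a POSIX `sh` is on the PATH.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical action type for this sketch; the plan reuses
/// ccpa-trace::Block::ToolUse for the real trace records.
pub enum Action {
    Bash { command: String },
    Done,
}

/// Hypothetical, simplified turn record (the real TurnRecord is not shown in the plan).
pub struct TurnRecord {
    pub command: String,
    pub output: String,
    pub exit_code: i32,
}

/// Render the prompt suffix in the format named in the step list.
pub fn render_suffix(last: &TurnRecord) -> String {
    format!(
        "### Previous turn output:\n{}\n### Continue:\n",
        last.output.trim_end()
    )
}

/// Run one bash command in `cwd`, capturing stdout, then stderr, and the exit code.
pub fn run_bash(command: &str, cwd: &Path) -> std::io::Result<TurnRecord> {
    let out = Command::new("sh").arg("-c").arg(command).current_dir(cwd).output()?;
    let mut output = String::from_utf8_lossy(&out.stdout).into_owned();
    output.push_str(&String::from_utf8_lossy(&out.stderr));
    Ok(TurnRecord {
        command: command.to_string(),
        output,
        // A process killed by a signal has no exit code; -1 is this sketch's placeholder.
        exit_code: out.status.code().unwrap_or(-1),
    })
}

/// Ask the driver for an action, execute it, feed the output back, repeat.
pub fn run_turns(
    mut next_action: impl FnMut(&str) -> Action,
    cwd: &Path,
    max_turns: usize,
) -> std::io::Result<Vec<TurnRecord>> {
    let mut history = Vec::new();
    let mut suffix = String::new();
    for _ in 0..max_turns {
        match next_action(suffix.as_str()) {
            Action::Done => break,
            Action::Bash { command } => {
                let record = run_bash(&command, cwd)?;
                suffix = render_suffix(&record);
                history.push(record);
            }
        }
    }
    Ok(history)
}
```

A failing command is not an error for the loop: its non-zero exit code and output are recorded and rendered into the next prompt, which is what gives the agent a chance to self-correct.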
Tool-call grammar: reuses ccpa-trace::Block::ToolUse for trace records — the Arena writes the SAME trace format Phase 1-4 uses, so existing tooling (ccpa-cli diff, FALSIFY-CCPA-001 schema-roundtrip) keeps working.
Estimated effort: 3-5 days. The main complexity is the bash/edit/read/write tool dispatch + history rendering.
Goal (P5.3, Arena bench runner): scripts/phase-5-arena-bench.sh operator-dispatch entry point, analogous to scripts/phase-3-bench.sh (M150) and scripts/phase-4-bench.sh (M184).
Per fixture × system (teacher=claude, student=apr code):
- Clone the fixture's pinned `pre_fix_commit` SHA into a tempdir.
- Invoke `ccpa-arena --driver=<system> --fixture-dir=<dir> --oracle="$(cat meta.toml | grep oracle_cmd)" --max-turns=20 --wall-seconds=900`.
- Capture the `ArenaOutcome` enum + the full multi-turn history.
- Emit per-fixture + aggregate metrics to `evidence/phase-5/arena-scores.json`:
  - `oracle_passed_rate` (fraction of fixtures where outcome was `OraclePassed`)
  - `mean_turns_to_pass` (signal for "how much exploration does the agent need?")
  - `mean_wall_seconds_to_pass`
  - `recovery_rate` (fraction of fixtures where at least one bash command failed but the agent eventually passed the oracle; direct signal for R3 "error recovery over zero-shot determinism")
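
As a rough illustration of how the four aggregate metrics could be derived from per-fixture results, consider the sketch below. `FixtureResult`, `ArenaScores`, and `aggregate` are hypothetical names for this example; the actual JSON schema of arena-scores.json is not specified here. It assumes the means are taken over passing fixtures only and that `recovery_rate` uses all fixtures as its denominator, as the metric definitions read.

```rust
/// Hypothetical per-fixture summary for one system.
pub struct FixtureResult {
    pub oracle_passed: bool,
    pub turns: usize,
    pub wall_seconds: u64,
    /// True if at least one bash command exited non-zero during the run.
    pub any_bash_failed: bool,
}

/// Hypothetical aggregate record mirroring the four metric names.
pub struct ArenaScores {
    pub oracle_passed_rate: f64,
    /// None when no fixture passed (the mean is undefined).
    pub mean_turns_to_pass: Option<f64>,
    pub mean_wall_seconds_to_pass: Option<f64>,
    pub recovery_rate: f64,
}

pub fn aggregate(results: &[FixtureResult]) -> ArenaScores {
    let passed: Vec<&FixtureResult> = results.iter().filter(|r| r.oracle_passed).collect();
    // Recovered = passed the oracle despite at least one failed bash command.
    let recovered = passed.iter().filter(|r| r.any_bash_failed).count();
    // Division helper that returns None instead of dividing by zero.
    let ratio = |num: f64, den: usize| if den == 0 { None } else { Some(num / den as f64) };
    let total_turns: usize = passed.iter().map(|r| r.turns).sum();
    let total_wall: u64 = passed.iter().map(|r| r.wall_seconds).sum();
    ArenaScores {
        oracle_passed_rate: ratio(passed.len() as f64, results.len()).unwrap_or(0.0),
        mean_turns_to_pass: ratio(total_turns as f64, passed.len()),
        mean_wall_seconds_to_pass: ratio(total_wall as f64, passed.len()),
        recovery_rate: ratio(recovered as f64, results.len()).unwrap_or(0.0),
    }
}
```

With this denominator choice, an agent that passes every fixture without ever hitting a failed command scores `oracle_passed_rate = 1.0` but `recovery_rate = 0.0`.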
Operator preconditions: same as phase-4-bench.sh + MAX_TURNS env-var (default 20; bound multi-turn cost).
Wall budget: 5 fixtures × 2 systems × 20 turns × ~30s/turn = 6,000 s, i.e. up to ~100 min per dispatch if every run uses all 20 turns.
Estimated effort: ~1 day; reuses ~70% of phase-4-bench.sh.
Proposed assertion (P5.4, FALSIFY-CCPA-018): at threshold T_recovery (initial value TBD by first measurement; probably 0.5), require recovery_rate >= T_recovery AND oracle_passed_rate >= 0.3. Direct empirical answer to R3's "self-correction over trajectory matching" framing.
Test home: crates/ccpa-arena/tests/falsify_ccpa_018_arena_recovery_rate.rs. Initial status: PROPOSED until first operator-dispatched measurement.
Bidirectional sensitivity (mandatory): synthetic identity (recovery_rate=1.0, oracle=1.0 → passes) + synthetic always-fail (recovery_rate=0.0 → fails) + synthetic give-up-fast (oracle=1.0 BUT recovery_rate=0.0 because agent never hit a failure to recover from → fails on recovery floor, passes on oracle floor — bidirectional).
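
The gate and its three synthetic cases can be sketched as below. This is an illustrative predicate, not the contents of falsify_ccpa_018_arena_recovery_rate.rs: `ccpa_018_passes` and `ORACLE_FLOOR` are hypothetical names, and `t_recovery` is passed in because its value is still to be calibrated.

```rust
/// Oracle floor from the proposed assertion (oracle_passed_rate >= 0.3).
pub const ORACLE_FLOOR: f64 = 0.3;

/// Both floors must hold: recovery_rate >= t_recovery AND oracle_passed_rate >= 0.3.
pub fn ccpa_018_passes(recovery_rate: f64, oracle_passed_rate: f64, t_recovery: f64) -> bool {
    recovery_rate >= t_recovery && oracle_passed_rate >= ORACLE_FLOOR
}

/// The give-up-fast case fails on the recovery floor alone while still clearing the oracle floor.
pub fn fails_only_on_recovery(recovery_rate: f64, oracle_passed_rate: f64, t_recovery: f64) -> bool {
    recovery_rate < t_recovery && oracle_passed_rate >= ORACLE_FLOOR
}
```

Using the provisional `t_recovery = 0.5`: the synthetic identity (1.0, 1.0) passes, always-fail (0.0, with an oracle rate assumed to be 0.0 here) fails, and give-up-fast (recovery 0.0, oracle 1.0) fails on the recovery floor alone.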
Critical design choice: CCPA-018 measures agent quality (does it recover?), not functional outcome (does code work?). CCPA-016 + CCPA-017 already cover the latter. CCPA-018 is the explicit answer to the audit's R3 directive.
Estimated effort: ~1 day test scaffold; threshold-calibration is downstream.
Goal (P5.5, falsifier-of-falsifier evidence): explicitly run the audit's Popperian test. Compare static-fixture parity score (FALSIFY-CCPA-008, currently 1.0 on 30/30 AUTHORED fixtures) to live-Arena parity score (FALSIFY-CCPA-017 outcome agreement on the M182 project-scale corpus AS RUN through the P5.3 Arena, not the P4.2 single-turn runner).
If: static_parity ≥ 0.95 AND arena_outcome_agreement ≤ 0.2 → static-fixture approach FALSIFIED as a convergence predictor. Document the falsification at evidence/phase-5/static-fixture-falsification.md. Action: soft-deprecate FALSIFY-CCPA-008 and reframe it as a meter-validation metric (correct-but-vacuous) rather than a system-validation metric.
Else: static fixtures correlate with arena outcomes → the static approach is empirically validated. No deprecation needed; CCPA-008 remains load-bearing.
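
The If/Else rule above reduces to a two-way decision, sketched here under that reading. `PlanVerdict` and `plan_level_verdict` are hypothetical names for this illustration; the shipped falsifier.rs described later in this document adds a third `Inconclusive` variant whose exact boundaries are not spelled out in this plan, so this sketch does not attempt to model it.

```rust
/// Thresholds as stated in the plan: static parity >= 0.95, arena agreement <= 0.2.
pub const STATIC_PARITY_THRESHOLD: f64 = 0.95;
pub const ARENA_PARITY_CEILING: f64 = 0.2;

/// Hypothetical two-variant verdict matching the plan-level If/Else only.
#[derive(Debug, PartialEq)]
pub enum PlanVerdict {
    /// High static parity but near-zero live agreement.
    StaticFalsified,
    /// Everything else, per the plan's Else branch.
    StaticValidated,
}

pub fn plan_level_verdict(static_parity: f64, arena_outcome_agreement: f64) -> PlanVerdict {
    if static_parity >= STATIC_PARITY_THRESHOLD
        && arena_outcome_agreement <= ARENA_PARITY_CEILING
    {
        PlanVerdict::StaticFalsified
    } else {
        PlanVerdict::StaticValidated
    }
}
```

Both comparisons are inclusive, so the exact boundary pair (0.95, 0.2) counts as a falsification under this reading.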
This is the audit's primary deliverable. The whole point of Phase 5 is to answer R2's "static fixtures lack the dynamic feedback of true distillation" assertion empirically rather than rhetorically.
Estimated effort: ~half-day (consume P5.3 output + diff against existing CCPA-008 evidence + write the determination doc).
| Dimension | Phase 4 (M180-M190) | Phase 5 (M194+) |
|---|---|---|
| Turns per fixture | 1 (single prompt) | up to 20 (multi-turn dialog) |
| Execution feedback | None (one-shot generation) | Yes (bash/test output → next turn's prompt) |
| Self-correction signal | Not measured | Measured (recovery_rate) |
| Primary metric | partial_agreement >= 0.3 (CCPA-017) | recovery_rate >= T AND oracle_passed_rate >= 0.3 (CCPA-018) |
| Wall budget per dispatch | ~30 min for 5 fixtures × 2 systems | up to ~100 min for 5 fixtures × 2 systems × 20 turns at ~30s/turn |
| Determinism | One-shot RNG-bound | Per-turn non-determinism; outcome bound only |
| Falsifier | "do both systems make matching partial progress?" | "does the agent recover when bash fails?" |
Blocker 1: apr code is a one-shot CLI (apr code -p "<prompt>"). It doesn't support an interactive multi-turn shell session.
Discharge path: P5.2's multi-turn loop spawns apr code once PER TURN with the cumulative history as the prompt. Each invocation is fresh; the agent's "memory" is reconstructed from the prompt history we maintain. This trades inference latency (~30s extra context per turn) for harness simplicity. Future work: a native multi-turn mode in apr code would amortize the cost.
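
A minimal sketch of the spawn-per-turn idea: each turn builds a fresh `apr code -p "<prompt>"` invocation whose prompt carries the accumulated history. `apr_code_turn` is a hypothetical helper name; only the `apr code -p` form quoted above is taken from this plan, and the function merely constructs the command without running it.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not spawn) one fresh `apr code -p <prompt>` invocation for a turn.
/// The caller is assumed to have already concatenated the goal prompt and history.
pub fn apr_code_turn(cumulative_prompt: &str, cwd: &Path) -> Command {
    let mut cmd = Command::new("apr");
    cmd.arg("code")
        .arg("-p")
        .arg(cumulative_prompt)
        // Run inside the cloned fixture repo so edits land in the right tree.
        .current_dir(cwd);
    cmd
}
```

The harness would call `.output()` on the returned command once per turn; because no process state survives between turns, everything the agent "remembers" has to be in `cumulative_prompt` or in the repo on disk.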
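Blocker 2: Multi-turn dialog generates steadily growing prompt context: history accumulates every turn, so per-turn prompt size grows linearly and cumulative tokens processed grow roughly quadratically. This hits model context limits fast.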
Discharge path: history truncation. Keep only the last N turns + the original prompt. N=5 is a reasonable starting bound; the agent's "long-term memory" is the repo file system itself (it can re-read files).
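
The truncation rule can be sketched as a pure function. `truncated_prompt` is a hypothetical name, and the suffix markers reuse the `### Previous turn output:` / `### Continue:` format from the multi-turn loop; how the real harness lays out several retained turns is an assumption here.

```rust
/// Build the next prompt from the original goal prompt plus only the last
/// `keep_last` turn outputs; older turns are dropped entirely.
pub fn truncated_prompt(original_prompt: &str, turn_outputs: &[String], keep_last: usize) -> String {
    // saturating_sub keeps the start index at 0 when there are fewer turns than keep_last.
    let start = turn_outputs.len().saturating_sub(keep_last);
    let mut prompt = String::from(original_prompt);
    for output in &turn_outputs[start..] {
        prompt.push_str("\n### Previous turn output:\n");
        prompt.push_str(output);
    }
    prompt.push_str("\n### Continue:\n");
    prompt
}
```

With a fixed `keep_last`, prompt size is bounded by the original prompt plus the sizes of the retained outputs, regardless of how many turns have run.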
Blocker 3: Wall-clock cost. Multi-turn live execution against claude takes 30s/turn × 20 turns × 5 fixtures × 2 systems = 6,000 s, i.e. up to ~100 min per Arena run. *(M222 operator-directive: CCPA uses claude CLI session-auth via claude login, NOT the Anthropic API directly; there is no per-turn dollar cost because the operator's Claude Code subscription covers usage. The previous "$0.05-0.20 per turn / $5-20 per run" API-call estimate is OBSOLETE.)*
Discharge path: --max-wall-seconds env-var (default 900s) caps each fixture's wall budget. No dollar-budget flag needed since CCPA is not API-metered.
Non-blocker (was suspected): RecordedDriver deprecation. Phase 5 does NOT require deprecating ccpa-replayer; the two coexist. ccpa-replayer remains the FALSIFY-CCPA-001/002/003 source-of-truth (trace-schema validation + replay determinism); Arena is the live-evaluation track. R2's "stop hand-authoring canonical JSONL" is a SEPARATE concern from P5 and can be addressed later.
- P5.1 Arena harness scaffolding: SHIPPED at M196.
- P5.2 multi-turn loop: SHIPPED at M200.
- P5.3 Arena bench runner: SHIPPED at M202.
- P5.4 FALSIFY-CCPA-018 gate: SHIPPED at M204.
- P5.5 falsifier-of-falsifier evidence: SHIPPED at M206 (template + comparator code; evidence pending dispatch).
- Phase 5 contract bump v1.28.0 → v1.29.0: SHIPPED at M208 — M22 5-step ritual mirror of aprender PR #1705 registering FALSIFY-CCPA-018 (arena_recovery_rate_bound) at status: PROPOSED. Gate count 17 → 18. PROPOSED → ACTIVE_RUNTIME flip awaits v1.30.0 after first operator-dispatched Arena bench.
- ccpa-arena coverage closure: SHIPPED at M210. Workspace coverage 95.44% → 99.09% lines and 99.75% functions; FALSIFY-CCPA-011 now passes on its own merits (M204-M207 had been admin-merging through the gap). New convention encoded in Makefile + CI: `--ignore-filename-regex '/bin/'` excludes operator-dispatch CLI binaries from coverage accounting (their runtime is exercised by outer bash dispatcher scripts, not unit tests).
P5.5 deliverable detail, three deliverables:

- (a) crates/ccpa-arena/src/falsifier.rs (~140 LOC): evaluate_static_vs_arena(static_parity, arena_parity, src, src2) -> FalsifierVerdict with 3-variant outcome (StaticFalsified / StaticValidated / Inconclusive); thresholds STATIC_PARITY_THRESHOLD = 0.95 and ARENA_PARITY_CEILING = 0.2 per design-audit.md §5; 8 unit tests covering canonical falsification, both-high validation, exact-boundary semantics, below-floor short-circuit, middle-zone inconclusive, verdict-records-inputs, serde-roundtrip, outcome-tag.
- (b) crates/ccpa-arena/tests/falsify_static_vs_arena.rs (~110 LOC): 4 active synthetic tests + 1 #[ignore]'d live-evidence test that loads BOTH evidence/phase-3/multipl-e-rust-scores.json AND evidence/phase-5/arena-scores.json, computes the verdict, and pretty-prints it for operator inspection; the live test is informational (no assertion on outcome; the operator takes post-verdict action per the evidence doc).
- (c) evidence/phase-5/static-fixture-falsification.md (~95 lines): operator-facing evidence-doc template with placeholders for the per-source numbers (CCPA-016 .agreement, CCPA-018 .oracle_passed_rate, CCPA-017 .partial_agreement), the post-verdict decision matrix, the StaticFalsified action checklist (soft-deprecate CCPA-008 to meter-validation status + promote CCPA-017/018 to user-facing parity claims), the operator-dispatch checklist, and the cross-reference back to design-audit.md §5.

Public API: pub use falsifier::{evaluate_static_vs_arena, FalsifierOutcome, FalsifierVerdict, ARENA_PARITY_CEILING, STATIC_PARITY_THRESHOLD}; in lib.rs.

Test counts: 72 lib + 7 CCPA-018 active + 4 falsifier active = 83 GREEN; 2 #[ignore]'d (both live-evidence).

P5.5 is the audit's primary deliverable made executable: the operator can now run cargo test -p ccpa-arena --test falsify_static_vs_arena -- --ignored post-dispatch to get the empirical verdict.
Phase 5 is post-cleanup at M210 — all 5 sub-deliverables shipped at the code+test level (P5.1-P5.5 / M196-M206), the contract bump shipped at M208 (CCPA-018 registered at PROPOSED), and the coverage closure shipped at M210 (FALSIFY-CCPA-011 green). The Popperian test is now executable code AND the gate is registered in the contract; only the evidence inputs (operator-dispatched Arena bench against the M182 corpus) remain to fully resolve the verdict. Phase 5 substantive arc COMPLETE. Future work: v1.29.0 → v1.30.0 contract bump flipping CCPA-017 + CCPA-018 PROPOSED → ACTIVE_RUNTIME after first operator dispatch produces evidence/phase-4/project-scale-scores.json + evidence/phase-5/arena-scores.json.
- Direct answer to the audit's primary directive (R2). The Phase 4 + Phase 3 path validated function-scale and partial-progress parity; Phase 5 tests the foundational assumption the audit questions, that "static fixtures predict live performance".
- CCPA-018 introduces a new metric category (recovery rate) that none of CCPA-001..017 capture. Even if the audit's Popperian test confirms static fixtures DO predict live performance, the recovery-rate measurement is independently valuable.
- Reuses the Phase 4 corpus (M182 5-fixture project-scale corpus). No new fixture authoring needed for the first dispatch; cost is bench-runner code only.
- Aligns with operator priorities: the operator authored design-audit.md and the M192 integration; Phase 5 is the canonical operationalization of that audit.
- The Popperian falsifier IS the test. If we never run it, we cannot claim the project has done what its design audit demanded.
- design-audit.md — operator-authored critique that motivates Phase 5
- phase-4-project-scale-plan.md — Phase 4 plan (P4.1-P4.5); the static-fixture/single-turn baseline Phase 5 compares against
- outcome-parity-plan.md — Phase 3 plan; the function-scale baseline that the audit's R1 critique targets
- risks.md § M192 amendment — Popperian falsifier as a meta-risk
- SWE-bench (arXiv:2310.06770) — Jimenez et al. 2024, the canonical live-execution benchmark Phase 5 emulates
- ProgramBench (arXiv:2605.03546) — Yang et al. 2026, the 0%/200 SOTA-model baseline at project-scale