Top spec: claude-code-parity-apr-poc.md | Outcome-parity plan (Phase 3) | Outcome-parity results | Completeness assessment
Phase 4 = project-scale outcome parity. Extends the Phase 3 outcome-parity arc (P3.1-P3.5, M150-M167) from single-function MultiPL-E-Rust HumanEval problems to multi-file, multi-step project-scale tasks. Each Phase 4 fixture is a small Cargo workspace with an explicit goal and a cargo test oracle; both claude and apr code are dispatched on the same starting state and the deltas are scored.
Prior art: ProgramBench (Yang et al. 2026, arXiv:2605.03546). 200 project-scale tasks; the headline finding is 0%/200 fully resolved across Claude Opus / Sonnet / Haiku, GPT, and Gemini. Phase 4 inherits ProgramBench's task-shape design (multi-file repo + goal + oracle) but operates at companion-tier scale (~5-10 tasks initially, not 200).
Why Phase 4 needs a separate plan from Phase 3: P3.6 was the "project-scale future-work" marker in outcome-parity-plan.md; this doc operationalizes it into P4.1-P4.5 sub-deliverables analogous to P3.1-P3.5. The single biggest design difference: Phase 3's pass@1 ≈ 95% on HumanEval-class problems for both systems (saturation regime); Phase 4 expects few-percent pass@1 at the project-scale layer (signal regime). The user-facing parity question therefore inverts: instead of "do they both pass?" the question becomes "where do they diverge on partial progress?" — a drift-record-density measurement, not a boolean.
If ProgramBench reports 0% fully-resolved across all SOTA models, then on a 5-10 task companion-tier Phase 4 corpus, the realistic first measurement is:
- claude agreement ≈ 0/5 or 1/5 (one easy task might resolve)
- apr code (Qwen2.5-Coder-1.5B) agreement ≈ 0/5
- Outcome-agreement = 1.0 when both fail every task (0.8 if claude alone resolves one task): vacuously high but uninformative
Phase 4's signal value is NOT in the binary agreement metric. It is in:
- Per-task drift records: which files did each system touch? Which tests did each attempt to write? Which approaches did each take?
- Partial-progress vector: how far along each got (lines edited, files touched, tests added, build status, test pass count).
- Failure-mode classification: where did each system get stuck (parser error, type error, logic loop, gave up)?
Phase 4 is more like SWE-bench instrumentation than HumanEval pass@1. The CCPA-016 outcome-parity gate at threshold 0.5 (Phase 3) is NOT the right gate for Phase 4 — Phase 4 needs a new gate definition (CCPA-017 candidate) that measures partial-progress agreement, not all-or-nothing agreement.
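To make the vacuity of the boolean metric concrete, here is a minimal sketch of outcome agreement computed as (both_pass + both_fail) / N. The function name `outcome_agreement` and the tuple encoding are illustrative assumptions, not the CCPA-016 implementation.

```rust
/// Outcome agreement = (both_pass + both_fail) / N over boolean
/// "fully resolved" flags. Each tuple is (teacher_resolved, student_resolved).
/// Returns None for an empty corpus (hypothetical convention for this sketch).
pub fn outcome_agreement(results: &[(bool, bool)]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    // A task "agrees" when both sides resolved it or both sides failed it.
    let agree = results.iter().filter(|(t, s)| t == s).count();
    Some(agree as f64 / results.len() as f64)
}
```

On a 5-task corpus where both systems fail everything, this returns 1.0 even though neither system accomplished anything, which is why a partial-progress metric is needed at this scale.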
Goal: define fixtures/project-scale/<id>/ layout per task. Per-task: a starting Cargo workspace + goal prompt + completion oracle (set of cargo test invocations).
Proposed layout:
fixtures/project-scale/<id>/
├── prompt.txt # natural-language ask (multi-paragraph; multi-file context)
├── meta.toml # id, source, difficulty_tier, expected_pass_rate_range
├── starting-state/ # cargo workspace at t=0 (committed for reproducibility)
│ ├── Cargo.toml
│ ├── src/
│ ├── tests/
│ └── ...
└── completion-oracle/ # the "done" check
    ├── tests/ # tests that must pass for the task to be "fully resolved"
    └── partial-checks.yaml # gradations: tests-pass-rate, files-touched, build-status
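A structural check for this layout could look like the following sketch. It uses only the standard library; the constant `REQUIRED_ENTRIES` and the function `missing_entries` are hypothetical names, and the list of required paths is read off the proposed tree above rather than from any shipped validator.

```rust
use std::path::Path;

/// Relative paths a fixture directory is expected to contain under the
/// proposed layout (assumption: every entry is mandatory).
pub const REQUIRED_ENTRIES: [&str; 5] = [
    "prompt.txt",
    "meta.toml",
    "starting-state/Cargo.toml",
    "completion-oracle/tests",
    "completion-oracle/partial-checks.yaml",
];

/// Returns the required entries that are missing from `fixture_dir`,
/// in the order they are listed in `REQUIRED_ENTRIES`.
pub fn missing_entries(fixture_dir: &Path) -> Vec<&'static str> {
    REQUIRED_ENTRIES
        .iter()
        .copied()
        .filter(|rel| !fixture_dir.join(rel).exists())
        .collect()
}
```

An empty return value means the fixture directory matches the proposed shape; anything else names exactly what the author still has to add.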
Initial corpus: 5 tasks, drawn from operator-curated real-world stretch goals (e.g. "implement a small CLI subcommand", "add an integration test", "refactor a module"). Aim for tasks where claude and apr code would both make partial progress but neither would fully resolve — that's the signal regime.
Estimated effort: 1-2 days authoring; each fixture is a real ~50-200 LOC Cargo workspace.
Goal: scripts/phase-4-bench.sh operator-dispatched runner analogous to scripts/phase-3-bench.sh.
Per task × system:
- cp -r starting-state /tmp/p4-run-<id>-<system>/
- cd /tmp/p4-run-<id>-<system> && <system> -p "$(cat prompt.txt)"
- (System gets ~5-15 minutes wall time; bounded by APR_TIMEOUT_S env-var with sensible default)
- Snapshot the final repo state to evidence/phase-4/captures/<id>/<system>/
- Compute per-task metrics: build status, test pass rate, files touched, lines edited
- Aggregate to evidence/phase-4/project-scale-scores.json
Operator preconditions: same as phase-3-bench.sh — claude logged in, apr on PATH with code subcommand, GGUF model available. Plus: APR_TIMEOUT_S defaults to 900s (15 min) per task (vs Phase 3's 300s default).
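The timeout default can be sketched as a small pure function. The names `DEFAULT_TIMEOUT_S`, `resolve_timeout_s`, and `timeout_from_env` are hypothetical, and the fallback policy for malformed or zero values is an assumption of this sketch, not documented runner behavior.

```rust
/// Default per-task wall-time budget in seconds (15 minutes), per the plan.
pub const DEFAULT_TIMEOUT_S: u64 = 900;

/// Resolve the per-task timeout from the raw value of `APR_TIMEOUT_S`.
/// Unset, non-numeric, or zero values fall back to the default
/// (assumed policy; the real runner may choose to reject them instead).
pub fn resolve_timeout_s(raw: Option<&str>) -> u64 {
    raw.and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|&secs| secs > 0)
        .unwrap_or(DEFAULT_TIMEOUT_S)
}

/// Convenience wrapper that reads the environment variable directly.
pub fn timeout_from_env() -> u64 {
    resolve_timeout_s(std::env::var("APR_TIMEOUT_S").ok().as_deref())
}
```

Keeping the parsing separate from the environment read makes the default easy to test without mutating process state.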
Estimated effort: ~1 day authoring; reuses 80% of phase-3-bench.sh structure.
Goal: new module crates/ccpa-differ/src/project_scale_diff.rs consuming per-task captures and emitting ProjectScaleParityReport. Per-task metrics:
| Metric | Range | What it tells us |
|---|---|---|
| build_status | {ok, warn, error} | Did the post-edit workspace compile? |
| test_pass_rate | 0.0..1.0 | Fraction of completion-oracle tests passing |
| files_touched_jaccard | 0.0..1.0 | Jaccard index of the teacher and student files-touched sets (size of intersection / size of union) |
| lines_edited_ratio | 0.0..∞ | Student LOC-delta / teacher LOC-delta |
| approach_match | bool | Did teacher + student touch the same primary file? |
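The two numeric per-task metrics can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: the function signatures are hypothetical, two empty file sets are scored 1.0 by assumption, and a zero teacher delta yields `None` rather than infinity.

```rust
use std::collections::BTreeSet;

/// Jaccard index |A ∩ B| / |A ∪ B| of the two files-touched sets.
/// Two empty sets score 1.0 here (assumption: "neither side touched
/// anything" counts as full agreement).
pub fn files_touched_jaccard(teacher: &BTreeSet<String>, student: &BTreeSet<String>) -> f64 {
    let union = teacher.union(student).count();
    if union == 0 {
        return 1.0;
    }
    teacher.intersection(student).count() as f64 / union as f64
}

/// Student LOC-delta divided by teacher LOC-delta.
/// Returns None when the teacher edited nothing, since the ratio is undefined.
pub fn lines_edited_ratio(teacher_loc: u64, student_loc: u64) -> Option<f64> {
    if teacher_loc == 0 {
        None
    } else {
        Some(student_loc as f64 / teacher_loc as f64)
    }
}
```

For example, if the teacher touches src/lib.rs and src/cli.rs while the student touches src/lib.rs and tests/cli.rs, the intersection has one file and the union has three, so the Jaccard index is 1/3.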
Aggregate (per-corpus):
- pass_rate_teacher = mean(test_pass_rate_teacher)
- pass_rate_student = mean(test_pass_rate_student)
- partial_agreement = mean over tasks of min(test_pass_rate_teacher, test_pass_rate_student): "how often do both make partial progress?"
- files_jaccard_corpus = mean(files_touched_jaccard)
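The corpus-level aggregation can be sketched as plain means over per-task scores. The types `TaskScore` and `CorpusAggregate` and the function `aggregate` are hypothetical stand-ins for illustration; returning `None` on an empty corpus is an assumption of the sketch.

```rust
/// Per-task inputs to the aggregate (hypothetical shape for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct TaskScore {
    pub teacher_pass_rate: f64,
    pub student_pass_rate: f64,
    pub files_touched_jaccard: f64,
}

/// The four corpus-level aggregates named in the plan.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CorpusAggregate {
    pub pass_rate_teacher: f64,
    pub pass_rate_student: f64,
    pub partial_agreement: f64,
    pub files_jaccard_corpus: f64,
}

fn mean(values: impl Iterator<Item = f64>, n: usize) -> f64 {
    values.sum::<f64>() / n as f64
}

/// Aggregate per-task scores; None for an empty corpus (means are undefined).
pub fn aggregate(tasks: &[TaskScore]) -> Option<CorpusAggregate> {
    if tasks.is_empty() {
        return None;
    }
    let n = tasks.len();
    Some(CorpusAggregate {
        pass_rate_teacher: mean(tasks.iter().map(|t| t.teacher_pass_rate), n),
        pass_rate_student: mean(tasks.iter().map(|t| t.student_pass_rate), n),
        // min() rewards only the progress that both sides achieved on a task.
        partial_agreement: mean(
            tasks.iter().map(|t| t.teacher_pass_rate.min(t.student_pass_rate)),
            n,
        ),
        files_jaccard_corpus: mean(tasks.iter().map(|t| t.files_touched_jaccard), n),
    })
}
```

Because of the per-task min, a corpus where one side scores 1.0 and the other 0.0 on every task yields partial_agreement = 0.0, not 0.5.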
Estimated effort: 2-3 days; the trickiest piece is the post-state delta extraction (git diff or rsync compare).
Proposed assertion: at threshold T_partial (initial value TBD by first measurement; probably 0.3), require partial_agreement ≥ T_partial AND files_jaccard_corpus ≥ 0.3. Bidirectional sensitivity via synthetic-regression fixture (always-zero teacher vs always-1.0 student) + synthetic-identity fixture (both systems touching identical files with identical pass rate).
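The proposed assertion and its two synthetic fixtures can be sketched as below. `CorpusMetrics` and `gate_passes` are illustrative names, the 0.3 values are the tentative thresholds from the text, and the rule that an empty corpus fails is included as a guard so the gate cannot pass without a real dispatch.

```rust
/// Pre-computed corpus-level metrics (hypothetical shape for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct CorpusMetrics {
    pub n_tasks: usize,
    pub partial_agreement: f64,
    pub files_jaccard_corpus: f64,
}

/// Gate predicate: both metrics must reach their thresholds (inclusive),
/// and an empty corpus never passes.
pub fn gate_passes(m: &CorpusMetrics, t_partial: f64, t_files: f64) -> bool {
    m.n_tasks > 0 && m.partial_agreement >= t_partial && m.files_jaccard_corpus >= t_files
}

/// Synthetic regression: always-zero teacher vs always-1.0 student on
/// different files, so the per-task min and the Jaccard index are both 0.
pub const SYNTHETIC_REGRESSION: CorpusMetrics = CorpusMetrics {
    n_tasks: 5,
    partial_agreement: 0.0,
    files_jaccard_corpus: 0.0,
};

/// Synthetic identity: identical files and identical pass rate of 1.0.
pub const SYNTHETIC_IDENTITY: CorpusMetrics = CorpusMetrics {
    n_tasks: 5,
    partial_agreement: 1.0,
    files_jaccard_corpus: 1.0,
};
```

Bidirectional sensitivity then means the gate fails on `SYNTHETIC_REGRESSION` and passes on `SYNTHETIC_IDENTITY` at the same thresholds; a gate that passed both, or failed both, would not be measuring anything.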
Test home: crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs. Initial status: PROPOSED until measurement-calibrated. First real Phase 4 dispatch sets the empirical threshold; once stable, gate flips PROPOSED → ACTIVE_RUNTIME at v1.28.0.
Why threshold is TBD: Phase 3's CCPA-016 threshold (0.5) was set anticipating pass@1 saturation. CCPA-017 must be set AFTER first measurement reveals the actual partial-progress regime. Without empirical floor data, picking 0.3 vs 0.5 vs 0.7 a priori is guessing.
Estimated effort: ~1 day test scaffold; threshold-calibration is a downstream step.
M22 5-step ritual mirroring an aprender PR that:
- Adds FALSIFY-CCPA-017 to the gate registry (status: PROPOSED initially, ACTIVE_RUNTIME after first measurement)
- Registers the project-scale corpus schema as a recognized fixture_corpus_path candidate
- Records first measured project_scale_parity block under CCPA-013's evidence list (or a new CCPA-017 evidence list, depending on registry design)
- Bumps version 1.27.0 → 1.28.0
Companion side: standard M22 ritual — pin.lock refresh + contract YAML mirror + 5 cross-reference surface bumps + new falsification-conditions.md row.
Estimated effort: ~half-day companion side; ~1 day aprender side (contract YAML authoring + status_history entry).
| Dimension | Phase 3 (M150-M167, SHIPPED) | Phase 4 (M180+, PROPOSED) |
|---|---|---|
| Corpus | 21 MultiPL-E-Rust fixtures (HumanEval/0..20) | 5-10 project-scale tasks (multi-file Cargo workspaces) |
| Per-fixture scope | single function, <50 LOC reference | small workspace, ~50-200 LOC starting state |
| Oracle | cargo test exit code | partial-progress vector + build/test/touched-files |
| Expected pass@1 | ≥0.95 (saturation regime) | <0.10 (signal regime per ProgramBench prior-art) |
| Primary metric | outcome agreement = (both_pass + both_fail) / N | partial-agreement = mean min(teacher_pass_rate, student_pass_rate) |
| Gate threshold | 0.5 (CCPA-016) | TBD empirically (CCPA-017 candidate) |
| Wall time per dispatch | ~10-30 min for 21 fixtures | ~1-3 hours for 5 tasks |
| Contract bump | v1.25 → v1.26 (M164) + v1.26 → v1.27 (M167) | v1.27 → v1.28 (M180+ candidate) |
Blocker 1: No project-scale tasks exist yet that are well-scoped enough to be CCPA fixtures.
Discharge path: P4.1 authoring is the work. Operator could seed from real GitHub issues against this companion repo, an aprender-side issue, or a public benchmark like ProgramBench's 200-task corpus (license permitting).
Blocker 2: apr code wall-time per task may be prohibitive on the operator's Qwen2.5-Coder-1.5B GGUF setup. Multi-file edits at ~30 tokens/sec ≈ 100 LOC/min generation; for a 500-LOC task that's 5+ minutes pure inference.
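The wall-time estimate above is simple arithmetic, sketched here so it can be rerun with other throughput figures. The constant of 18 tokens per line is not a measurement; it is the value implied by equating 30 tokens/sec with 100 LOC/min, and both function names are hypothetical.

```rust
/// Assumed average tokens per line of code. 18 is the value implied by
/// "30 tokens/sec ≈ 100 LOC/min" (30 * 60 / 100); it is not measured.
pub const ASSUMED_TOKENS_PER_LOC: f64 = 18.0;

/// Generation speed in lines of code per minute at a given token throughput.
pub fn loc_per_minute(tokens_per_sec: f64) -> f64 {
    tokens_per_sec * 60.0 / ASSUMED_TOKENS_PER_LOC
}

/// Pure-inference minutes needed to emit `task_loc` lines, ignoring
/// prompt processing, retries, and tool calls.
pub fn inference_minutes(task_loc: f64, tokens_per_sec: f64) -> f64 {
    task_loc / loc_per_minute(tokens_per_sec)
}
```

Under these assumptions a 500-LOC task at 30 tokens/sec needs about 5 minutes of pure generation, and the same task at 150 tokens/sec needs about 1 minute; real runs will take longer because the estimate excludes everything except token emission.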
Discharge path: per-task APR_TIMEOUT_S env-var (default 900s = 15 min) caps the worst case. If wall time becomes infeasible, the operator can either upgrade to a larger Qwen model on GPU (e.g. Qwen2.5-Coder-7B at ~150 tok/s) or reduce per-task LOC scope.
Blocker 3: claude wall-time per task may be non-trivial: a 5-task Phase 4 run takes 10-30 min wall depending on task complexity. *(M222 operator-directive: CCPA uses claude CLI session-auth via claude login, NOT the Anthropic API directly; there is no per-API-call dollar cost because the operator's Claude Code subscription covers the usage. The previous "$1-3 in API calls" estimate is OBSOLETE.)*
Discharge path: wall-time-aware operator dispatch; the bench-runner already supports a --max-wall-seconds budget flag that aborts after the wall-clock threshold. No dollar-budget flag needed since CCPA is not API-metered.
Non-blocker (was suspected): SWE-bench-class infrastructure (Docker containers, runtime isolation). Phase 4 fixtures are small enough that a cp -r + tempdir is sufficient isolation; no container layer needed.
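The cp -r + tempdir isolation can be sketched with the standard library alone. `copy_dir_all` and `run_dir` are hypothetical helpers; the directory naming mirrors the /tmp/p4-run-<id>-<system>/ pattern from the runner plan, and symlinks to directories are not handled in this sketch.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively copy `src` into `dst`, creating `dst` if needed (a minimal
/// `cp -r`). Regular files are copied with `fs::copy`; symlinks that point
/// at directories are not supported here and would surface as an error.
pub fn copy_dir_all(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_dir_all(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Per-run scratch directory under the system temp dir, one per task and system.
pub fn run_dir(task_id: &str, system: &str) -> PathBuf {
    std::env::temp_dir().join(format!("p4-run-{}-{}", task_id, system))
}
```

Giving each (task, system) pair its own directory means the two systems never see each other's edits, which is the only isolation property the comparison depends on.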
- P4.1 corpus structure: SHIPPED at M182. 5-fixture initial corpus at fixtures/project-scale/ drawn from real open issues across paiml/decy + paiml/bashrs + paiml/depyler. Each fixture: prompt.txt (verbatim from issue body) + meta.toml (id, source URL, difficulty, repo + pre-fix commit SHA, completion oracle command). Structural validation test at crates/ccpa-differ/tests/project_scale_corpus_structure.rs (5 tests; 5/5 GREEN). Design deviation from M180 plan: the plan envisioned starting-state/ + completion-oracle/ subdirs per fixture for full filesystem-level reproducibility. For real-repo issues against decy / bashrs / depyler (685+ Rust files), snapshotting into each fixture is impractical. Instead each fixture pins repo.pre_fix_commit and the P4.2 runner clones at dispatch time; this trades filesystem-level reproducibility for fixture-dir tractability, but commit-level reproducibility is preserved via the SHA pin.
- P4.2 runner: SHIPPED at M184. scripts/phase-4-bench.sh (288 lines bash) implements the operator-dispatch entry point. Per fixture × system (teacher=claude, student=apr code): clones the pinned pre_fix_commit SHA into a tempdir, dispatches the system with timeout ${APR_TIMEOUT_S} (default 900s = 15 min), snapshots the resulting diff vs SHA, runs the fixture's oracle_cmd in the post-edit state, records exit code + pattern match. Emits per-fixture + aggregate metrics to evidence/phase-4/project-scale-scores.json (teacher_pass_rate, student_pass_rate, agreement, partial_progress, per-fixture files_touched_jaccard via jq set-arithmetic). Preflight verifies claude + apr binaries + git + jq. Operator dispatches via bash scripts/phase-4-bench.sh.
- P4.3 partial-progress scoring: SHIPPED at M186. New module crates/ccpa-differ/src/project_scale_diff.rs (~310 lines) consuming the P4.2 runner's JSON output. Types: ProjectScaleParityReport (corpus-level), PerFixtureScore (per-task), SideScore (per-side), RepoInfo. Loader: ProjectScaleParityReport::from_json_str() parses raw JSON + enriches with 3 derived corpus-level metrics (partial_agreement, files_jaccard_corpus, approach_match_rate) + 2 derived per-fixture metrics (approach_match, lines_edited_ratio). Gate predicate: passes_threshold(partial_threshold, files_threshold) returns true iff partial_agreement >= partial_threshold AND files_jaccard_corpus >= files_threshold. 14 unit tests; 14/14 GREEN. Public API: pub use project_scale_diff::{PerFixtureScore, ProjectScaleParityReport, RepoInfo as ProjectScaleRepoInfo, SideScore}; in crates/ccpa-differ/src/lib.rs.
- P4.4 CCPA-017 gate: SHIPPED at M188. Test scaffold at crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs with 7 active tests + 1 #[ignore]'d live-evidence test (fires only after operator dispatch). Tentative thresholds: PARTIAL_AGREEMENT_THRESHOLD = 0.3, FILES_JACCARD_THRESHOLD = 0.3. Bidirectional sensitivity verified: synthetic identity corpus (both pass on same files) → gate passes; synthetic regression corpus (one pass / one fail, different files) → gate fails. Edge cases covered: exactly-at-threshold passes (>= not >); just-below-partial fails; just-below-files-jaccard fails; empty corpus vacuously fails (forces operator to actually dispatch). MIN_CORPUS_SIZE = 3 enforced in the live-evidence test (the M182 corpus of 5 satisfies). Threshold-recalibration awaits first operator-dispatched run of bash scripts/phase-4-bench.sh.
- P4.5 contract bump: SHIPPED at M190 (v1.27.0 → v1.28.0; M22 ritual mirror of aprender PR #1684; gate count 16 → 17; CCPA-017 registered at status: PROPOSED).
Phase 4 is substantively COMPLETE post-M190 — all five sub-deliverables (P4.1 corpus M182, P4.2 runner M184, P4.3 scorer M186, P4.4 CCPA-017 gate scaffold M188, P4.5 contract bump v1.27.0 → v1.28.0 M190) SHIPPED. CCPA-017 enters at status: PROPOSED at v1.28.0; the PROPOSED → ACTIVE_RUNTIME flip awaits the operator-dispatched first measurement via bash scripts/phase-4-bench.sh to calibrate the empirical threshold, then a v1.30.0 contract bump. Phase 5 (M194-M210) added the live-Arena complement at CCPA-018 (PROPOSED at v1.29.0 / M208); both CCPA-017 + CCPA-018 share the v1.30.0 flip path post-dispatch.
- Closes the M159 ProgramBench prior-art into an actionable track. The M159 row noted "validates the 'function-level 1.0 does not extrapolate to project-scale' caveat" — but didn't operationalize it. Phase 4 does.
- The Phase 3 outcome-parity question is settled ("YES on 5-problem POC; 21-fixture recalibration awaits operator dispatch"). The next honest parity question is project-scale.
- Even without P4.1+ code shipping, the plan doc anchors future work. M0-M50's milestone rows had similar "plan first, execute later" structure (e.g. M2.3 rescope was documented before its consequences shipped).
- Aligns with the operator's framing: M149's "outcome parity is the user-facing measure" extended naturally from "does the function work?" to "does the project work?" — Phase 4 is the project-scale extension of the same question.
- outcome-parity-plan.md — Phase 3 plan (P3.1-P3.5); P3.6 future-work marker that this doc operationalizes
- outcome-parity-results.md — M157 consolidated Phase 3 results (Axis 2 ~85% post-M177)
- completeness-assessment.md — § Axis 2 closure path (Phase 4 is the next 5-15% increment)
- ProgramBench (arXiv:2605.03546) — Yang et al. 2026, project-scale parity prior art
- SWE-bench (arXiv:2310.06770) — Jimenez et al. 2024, related project-scale benchmark