Phase 4 project-scale parity plan (M180, 2026-05-15)

Top spec: claude-code-parity-apr-poc.md | Outcome-parity plan (Phase 3) | Outcome-parity results | Completeness assessment

Scope

Phase 4 = project-scale outcome parity. Extends the Phase 3 outcome-parity arc (P3.1-P3.5, M150-M167) from single-function MultiPL-E-Rust HumanEval problems to multi-file, multi-step project-scale tasks. Each Phase 4 fixture is a small Cargo workspace with an explicit goal and a cargo test oracle; both claude and apr code are dispatched on the same starting state and the deltas are scored.

Prior art: ProgramBench (Yang et al. 2026, arXiv:2605.03546). 200 project-scale tasks; the headline finding is 0 of 200 tasks (0%) fully resolved across Claude Opus / Sonnet / Haiku, GPT, and Gemini. Phase 4 inherits ProgramBench's task-shape design (multi-file repo + goal + oracle) but operates at companion-tier scale (~5-10 tasks initially, not 200).

Why Phase 4 needs a separate plan from Phase 3: P3.6 was the "project-scale future-work" marker in outcome-parity-plan.md; this doc operationalizes it into P4.1-P4.5 sub-deliverables analogous to P3.1-P3.5. The single biggest design difference is the expected pass rate: in Phase 3, pass@1 is ≈ 95% on HumanEval-class problems for both systems (saturation regime), whereas Phase 4 expects a pass@1 of a few percent at the project-scale layer (signal regime). The user-facing parity question therefore inverts: instead of "do they both pass?" the question becomes "where do they diverge on partial progress?", which is a drift-record-density measurement, not a boolean.

Honest scoping caveat

If ProgramBench reports 0% fully-resolved across all SOTA models, then on a 5-10 task companion-tier Phase 4 corpus, the realistic first measurement is:

  • claude fully resolved ≈ 0/5 or 1/5 (one easy task might resolve)
  • apr code (Qwen2.5-Coder-1.5B) fully resolved ≈ 0/5
  • Outcome-agreement ≈ 1.0 (both fail every task; 0.8 if claude resolves one task that apr code does not): vacuously high but uninformative
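A minimal sketch of why the binary metric is vacuous here, assuming outcome agreement is computed as in Phase 3, (both_pass + both_fail) / N. The function name and the boolean-per-task representation are illustrative assumptions, not the project's actual API.

```rust
/// Outcome agreement over a corpus: the fraction of tasks on which both
/// systems have the same binary outcome (both fully resolved, or both not).
/// Hypothetical helper; returns None for an empty or mismatched corpus.
fn outcome_agreement(teacher: &[bool], student: &[bool]) -> Option<f64> {
    if teacher.is_empty() || teacher.len() != student.len() {
        return None;
    }
    let agree = teacher
        .iter()
        .zip(student)
        .filter(|(t, s)| t == s)
        .count();
    Some(agree as f64 / teacher.len() as f64)
}
```

With five tasks that both systems fail, the function returns 1.0 even though neither system resolved anything; if the teacher alone resolves one task, it drops to 0.8. Neither number says anything about how far each system got.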

Phase 4's signal value is NOT in the binary agreement metric. It is in:

  • Per-task drift records: which files did each system touch? Which tests did each attempt to write? Which approaches did each take?
  • Partial-progress vector: how far along each got (lines edited, files touched, tests added, build status, test pass count).
  • Failure-mode classification: where did each system get stuck (parser error, type error, logic loop, gave up)?

Phase 4 is more like SWE-bench instrumentation than HumanEval pass@1. The CCPA-016 outcome-parity gate at threshold 0.5 (Phase 3) is NOT the right gate for Phase 4 — Phase 4 needs a new gate definition (CCPA-017 candidate) that measures partial-progress agreement, not all-or-nothing agreement.

Sub-deliverables (P4.1-P4.5)

P4.1 — Project-scale corpus structure

Goal: define fixtures/project-scale/<id>/ layout per task. Per-task: a starting Cargo workspace + goal prompt + completion oracle (set of cargo test invocations).

Proposed layout:

fixtures/project-scale/<id>/
├── prompt.txt           # natural-language ask (multi-paragraph; multi-file context)
├── meta.toml            # id, source, difficulty_tier, expected_pass_rate_range
├── starting-state/      # cargo workspace at t=0 (committed for reproducibility)
│   ├── Cargo.toml
│   ├── src/
│   ├── tests/
│   └── ...
└── completion-oracle/   # the "done" check
    ├── tests/           # tests that must pass for the task to be "fully resolved"
    └── partial-checks.yaml  # gradations: tests-pass-rate, files-touched, build-status
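The proposed layout could be checked structurally with something like the sketch below. Which entries count as mandatory is an assumption made for illustration (the plan does not say), and `REQUIRED_ENTRIES` and `missing_entries` are hypothetical names.

```rust
use std::path::Path;

/// Entries assumed to be mandatory in every fixture directory, taken from
/// the proposed layout. The exact required set is an assumption.
const REQUIRED_ENTRIES: [&str; 6] = [
    "prompt.txt",
    "meta.toml",
    "starting-state/Cargo.toml",
    "starting-state/src",
    "completion-oracle/tests",
    "completion-oracle/partial-checks.yaml",
];

/// Returns the required entries that do not exist under `fixture_dir`.
/// An empty result means the fixture matches the assumed layout.
fn missing_entries(fixture_dir: &Path) -> Vec<&'static str> {
    REQUIRED_ENTRIES
        .iter()
        .copied()
        .filter(|rel| !fixture_dir.join(rel).exists())
        .collect()
}
```

A structural test could iterate over fixtures/project-scale/ and assert that `missing_entries` is empty for each subdirectory.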

Initial corpus: 5 tasks, drawn from operator-curated real-world stretch goals (e.g. "implement a small CLI subcommand", "add an integration test", "refactor a module"). Aim for tasks where claude and apr code would both make partial progress but neither would fully resolve — that's the signal regime.

Estimated effort: 1-2 days authoring; each fixture is a real ~50-200 LOC Cargo workspace.

P4.2 — Project-scale runner

Goal: scripts/phase-4-bench.sh operator-dispatched runner analogous to scripts/phase-3-bench.sh.

Per task × system:

  1. cp -r starting-state /tmp/p4-run-<id>-<system>/
  2. cd /tmp/p4-run-<id>-<system> && <system> -p "$(cat prompt.txt)"
  3. (System gets ~5-15 minutes wall time; bounded by APR_TIMEOUT_S env-var with sensible default)
  4. Snapshot the final repo state to evidence/phase-4/captures/<id>/<system>/
  5. Compute per-task metrics: build status, test pass rate, files touched, lines edited
  6. Aggregate to evidence/phase-4/project-scale-scores.json

Operator preconditions: same as phase-3-bench.sh (claude logged in, apr on PATH with code subcommand, GGUF model available). In addition, APR_TIMEOUT_S defaults to 900s (15 min) per task (vs Phase 3's 300s default).
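The runner itself is planned as a bash script; the sketch below only illustrates, in Rust, the two pieces of per-task bookkeeping described above: the scratch directory name and the APR_TIMEOUT_S default of 900 s. Falling back to the default on an unparsable or zero value is an assumption, and both function names are hypothetical.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Default per-task wall-time budget stated in the plan (15 minutes).
const DEFAULT_TIMEOUT_S: u64 = 900;

/// Resolves the per-task timeout from the raw APR_TIMEOUT_S value.
/// Assumption: missing, unparsable, or zero values fall back to the default.
fn resolve_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Scratch directory for one task × system run, following the
/// /tmp/p4-run-<id>-<system> pattern from the step list.
fn scratch_dir(task_id: &str, system: &str) -> PathBuf {
    PathBuf::from("/tmp").join(format!("p4-run-{task_id}-{system}"))
}
```

A caller would typically pass `std::env::var("APR_TIMEOUT_S").ok().as_deref()` as the raw value.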

Estimated effort: ~1 day authoring; reuses 80% of phase-3-bench.sh structure.

P4.3 — Partial-progress scoring

Goal: new module crates/ccpa-differ/src/project_scale_diff.rs consuming per-task captures and emitting ProjectScaleParityReport. Per-task metrics:

| Metric | Range | What it tells us |
|---|---|---|
| build_status | {ok, warn, error} | Did the post-edit workspace compile? |
| test_pass_rate | 0.0..1.0 | Fraction of completion-oracle tests passing |
| files_touched_jaccard | 0.0..1.0 | Jaccard index of the two files-touched sets: size of (teacher ∩ student) divided by size of (teacher ∪ student) |
| lines_edited_ratio | 0.0..∞ | Student LOC-delta / teacher LOC-delta |
| approach_match | bool | Did teacher + student touch the same primary file? |
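Two of these metrics can be sketched directly. The example below assumes file paths are compared as plain strings and that LOC deltas are non-negative line counts; the treatment of the degenerate cases (nobody touched any file, teacher edited zero lines) is an assumption, since the plan does not define them. `approach_match` is left out because "primary file" is not defined here.

```rust
use std::collections::BTreeSet;

/// Jaccard index of the two files-touched sets: |A ∩ B| / |A ∪ B|.
/// Assumption: if neither system touched any file, the score is 0.0.
fn files_touched_jaccard(teacher: &BTreeSet<&str>, student: &BTreeSet<&str>) -> f64 {
    let union = teacher.union(student).count();
    if union == 0 {
        return 0.0;
    }
    teacher.intersection(student).count() as f64 / union as f64
}

/// Student LOC-delta divided by teacher LOC-delta.
/// Assumption: undefined (None) when the teacher edited zero lines.
fn lines_edited_ratio(teacher_loc_delta: u64, student_loc_delta: u64) -> Option<f64> {
    if teacher_loc_delta == 0 {
        None
    } else {
        Some(student_loc_delta as f64 / teacher_loc_delta as f64)
    }
}
```

For example, if the teacher touches three files and the student touches three files, two of them shared, the union has four files and the Jaccard index is 2/4 = 0.5.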

Aggregate (per-corpus):

  • pass_rate_teacher = mean(test_pass_rate_teacher)
  • pass_rate_student = mean(test_pass_rate_student)
  • partial_agreement = mean over tasks of min(test_pass_rate_teacher, test_pass_rate_student): "how often do both make partial progress?"
  • files_jaccard_corpus = mean(files_touched_jaccard)
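The corpus-level aggregates can be sketched as plain means over per-task values. The `TaskRates` struct and the function names below are hypothetical; returning None for an empty corpus is an assumption.

```rust
/// Per-task completion-oracle pass rates for the two systems (0.0..=1.0).
/// Hypothetical type for illustration.
struct TaskRates {
    teacher: f64,
    student: f64,
}

/// Arithmetic mean; None when there are no values.
fn mean(values: impl Iterator<Item = f64>) -> Option<f64> {
    let (sum, n) = values.fold((0.0, 0usize), |(s, n), v| (s + v, n + 1));
    if n == 0 {
        None
    } else {
        Some(sum / n as f64)
    }
}

fn pass_rate_teacher(tasks: &[TaskRates]) -> Option<f64> {
    mean(tasks.iter().map(|t| t.teacher))
}

fn pass_rate_student(tasks: &[TaskRates]) -> Option<f64> {
    mean(tasks.iter().map(|t| t.student))
}

/// Mean over tasks of the per-task minimum of the two pass rates.
fn partial_agreement(tasks: &[TaskRates]) -> Option<f64> {
    mean(tasks.iter().map(|t| t.teacher.min(t.student)))
}
```

Because the minimum is taken per task before averaging, partial_agreement can never exceed either system's own mean pass rate, and a task where one side scores 1.0 and the other 0.0 contributes nothing.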

Estimated effort: 2-3 days; the trickiest piece is the post-state delta extraction (git diff or rsync compare).

P4.4 — FALSIFY-CCPA-017 gate (project-scale parity bound)

Proposed assertion: at threshold T_partial (initial value TBD by first measurement; probably 0.3), require partial_agreement ≥ T_partial AND files_jaccard_corpus ≥ 0.3. Bidirectional sensitivity is checked with two synthetic fixtures: a synthetic-regression fixture (always-zero teacher vs always-1.0 student), which must fail the gate, and a synthetic-identity fixture (both systems touching identical files with identical pass rate), which must pass it.
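A minimal sketch of the proposed gate predicate, assuming both comparisons are inclusive (≥) as written above. The `CorpusMetrics` struct and `gate_passes` name are hypothetical, and rejecting an empty corpus is an assumption added so that the gate cannot pass without any measurement.

```rust
/// Corpus-level inputs to the proposed CCPA-017 check. Hypothetical type.
struct CorpusMetrics {
    n_tasks: usize,
    partial_agreement: f64,
    files_jaccard_corpus: f64,
}

/// True iff the corpus is non-empty and both metrics meet their thresholds.
/// Comparisons are inclusive, so a value exactly at the threshold passes.
fn gate_passes(m: &CorpusMetrics, t_partial: f64, t_files: f64) -> bool {
    m.n_tasks > 0
        && m.partial_agreement >= t_partial
        && m.files_jaccard_corpus >= t_files
}
```

Under this sketch the synthetic-regression fixture fails regardless of file overlap, because min(0.0, 1.0) = 0.0 on every task drives partial_agreement to 0.0, while the synthetic-identity fixture yields 1.0 on both metrics and passes.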

Test home: crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs. Initial status: PROPOSED until measurement-calibrated. First real Phase 4 dispatch sets the empirical threshold; once stable, gate flips PROPOSED → ACTIVE_RUNTIME at v1.28.0.

Why threshold is TBD: Phase 3's CCPA-016 threshold (0.5) was set anticipating pass@1 saturation. CCPA-017 must be set AFTER first measurement reveals the actual partial-progress regime. Without empirical floor data, picking 0.3 vs 0.5 vs 0.7 a priori is guessing.

Estimated effort: ~1 day test scaffold; threshold-calibration is a downstream step.

P4.5 — Contract bump v1.27.0 → v1.28.0

M22 5-step ritual mirroring an aprender PR that:

  • Adds FALSIFY-CCPA-017 to the gate registry (status: PROPOSED initially, ACTIVE_RUNTIME after first measurement)
  • Registers the project-scale corpus schema as a recognized fixture_corpus_path candidate
  • Records first measured project_scale_parity block under CCPA-013's evidence list (or a new CCPA-017 evidence list, depending on registry design)
  • Bumps version v1.27.0 → v1.28.0

Companion side: standard M22 ritual — pin.lock refresh + contract YAML mirror + 5 cross-reference surface bumps + new falsification-conditions.md row.

Estimated effort: ~half-day companion side; ~1 day aprender side (contract YAML authoring + status_history entry).

Phase 4 vs Phase 3 — comparison table

| Dimension | Phase 3 (M150-M167, SHIPPED) | Phase 4 (M180+, PROPOSED) |
|---|---|---|
| Corpus | 21 MultiPL-E-Rust fixtures (HumanEval/0..20) | 5-10 project-scale tasks (multi-file Cargo workspaces) |
| Per-fixture scope | single function, <50 LOC reference | small workspace, ~50-200 LOC starting state |
| Oracle | cargo test exit code | partial-progress vector + build/test/touched-files |
| Expected pass@1 | ≥0.95 (saturation regime) | <0.10 (signal regime per ProgramBench prior-art) |
| Primary metric | outcome agreement = (both_pass + both_fail) / N | partial-agreement = mean min(teacher_pass_rate, student_pass_rate) |
| Gate threshold | 0.5 (CCPA-016) | TBD empirically (CCPA-017 candidate) |
| Wall time per dispatch | ~10-30 min for 21 fixtures | ~1-3 hours for 5 tasks |
| Contract bump | v1.25 → v1.26 (M164) + v1.26 → v1.27 (M167) | v1.27 → v1.28 (M180+ candidate) |

Implementation blockers and discharges

Blocker 1: No project-scale tasks exist yet that are well-scoped enough to be CCPA fixtures.

Discharge path: P4.1 authoring is the work. Operator could seed from real GitHub issues against this companion repo, an aprender-side issue, or a public benchmark like ProgramBench's 200-task corpus (license permitting).

Blocker 2: apr code wall-time per task may be prohibitive on the operator's Qwen2.5-Coder-1.5B GGUF setup. Multi-file edits at ~30 tokens/sec ≈ 100 LOC/min generation; for a 500-LOC task that's 5+ minutes pure inference.

Discharge path: per-task APR_TIMEOUT_S env-var (default 900s = 15 min) caps the worst case. If wall time becomes infeasible, the operator can either upgrade to a larger Qwen model on GPU (e.g. Qwen2.5-Coder-7B at ~150 tok/s) or reduce per-task LOC scope.
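The wall-time estimate in Blocker 2 can be reproduced with a back-of-the-envelope calculation. The figure of 18 tokens per line of code is not stated in the plan; it is back-derived here from "~30 tokens/sec ≈ 100 LOC/min" (30 × 60 / 100) and should be treated as an assumption, as should the function name.

```rust
/// Rough pure-inference time, in seconds, to generate `loc` lines of code.
/// `tokens_per_loc` is an assumed average; 18.0 matches the plan's
/// "~30 tokens/sec ≈ 100 LOC/min" figure.
fn inference_seconds(loc: f64, tokens_per_loc: f64, tokens_per_sec: f64) -> f64 {
    loc * tokens_per_loc / tokens_per_sec
}

/// Whether the estimate fits inside a per-task budget such as the
/// 900 s APR_TIMEOUT_S default.
fn fits_budget(estimate_s: f64, budget_s: f64) -> bool {
    estimate_s <= budget_s
}
```

Under these assumptions a 500-LOC task costs 500 × 18 / 30 = 300 s (5 minutes) at 30 tok/s and 60 s at 150 tok/s. Both fit in the 900 s default, but the estimate covers generation only and excludes tool calls, builds, and retries.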

Blocker 3: claude wall-time per task may be non-trivial — a 5-task Phase 4 run takes 10-30 min wall depending on task complexity. *(M222 operator-directive: CCPA uses claude CLI session-auth via claude login, NOT the Anthropic API directly; there is no per-API-call dollar cost — the operator's Claude Code subscription covers the usage. The previous "$1-3 in API calls" estimate is OBSOLETE.)*

Discharge path: wall-time-aware operator dispatch; the bench-runner already supports a --max-wall-seconds budget flag that aborts after the wall-clock threshold. No dollar-budget flag needed since CCPA is not API-metered.

Non-blocker (was suspected): SWE-bench-class infrastructure (Docker containers, runtime isolation). Phase 4 fixtures are small enough that a cp -r + tempdir is sufficient isolation; no container layer needed.

Status post-M188

  • P4.1 corpus structure: SHIPPED at M182 — 5-fixture initial corpus at fixtures/project-scale/ drawn from real open issues across paiml/decy + paiml/bashrs + paiml/depyler. Each fixture: prompt.txt (verbatim from issue body) + meta.toml (id, source URL, difficulty, repo + pre-fix commit SHA, completion oracle command). Structural validation test at crates/ccpa-differ/tests/project_scale_corpus_structure.rs (5 tests; 5/5 GREEN). Design deviation from M180 plan: the plan envisioned starting-state/ + completion-oracle/ subdirs per fixture for full filesystem-level reproducibility. For real-repo issues against decy / bashrs / depyler (685+ Rust files), snapshotting into each fixture is impractical. Instead each fixture pins repo.pre_fix_commit and the P4.2 runner clones at dispatch time — trades filesystem-level reproducibility for fixture-dir tractability, but commit-level reproducibility is preserved via the SHA pin.
  • P4.2 runner: SHIPPED at M184. scripts/phase-4-bench.sh (288 lines of bash) implements the operator-dispatch entry point. Per fixture × system (teacher=claude, student=apr code): clones the pinned pre_fix_commit SHA into a tempdir, dispatches the system with timeout ${APR_TIMEOUT_S} (default 900s = 15 min), snapshots the resulting diff vs SHA, runs the fixture's oracle_cmd in the post-edit state, records exit code + pattern match. Emits per-fixture + aggregate metrics to evidence/phase-4/project-scale-scores.json (teacher_pass_rate, student_pass_rate, agreement, partial_progress, per-fixture files_touched_jaccard via jq set-arithmetic). Preflight verifies claude + apr binaries + git + jq. Operator dispatches via bash scripts/phase-4-bench.sh.
  • P4.3 partial-progress scoring: SHIPPED at M186 — new module crates/ccpa-differ/src/project_scale_diff.rs (~310 lines) consuming the P4.2 runner's JSON output. Types: ProjectScaleParityReport (corpus-level), PerFixtureScore (per-task), SideScore (per-side), RepoInfo. Loader: ProjectScaleParityReport::from_json_str() parses raw JSON + enriches with 3 derived corpus-level metrics (partial_agreement, files_jaccard_corpus, approach_match_rate) + 2 derived per-fixture metrics (approach_match, lines_edited_ratio). Gate predicate: passes_threshold(partial_threshold, files_threshold) returns true iff partial_agreement >= partial_threshold AND files_jaccard_corpus >= files_threshold. 14 unit tests; 14/14 GREEN. Public API: pub use project_scale_diff::{PerFixtureScore, ProjectScaleParityReport, RepoInfo as ProjectScaleRepoInfo, SideScore}; in crates/ccpa-differ/src/lib.rs.
  • P4.4 CCPA-017 gate: SHIPPED at M188 — test scaffold at crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs with 7 active tests + 1 #[ignore]'d live-evidence test (fires only after operator dispatch). Tentative thresholds: PARTIAL_AGREEMENT_THRESHOLD = 0.3, FILES_JACCARD_THRESHOLD = 0.3. Bidirectional sensitivity verified: synthetic identity corpus (both pass on same files) → gate passes; synthetic regression corpus (one pass / one fail, different files) → gate fails. Edge cases covered: exactly-at-threshold passes (>= not >); just-below-partial fails; just-below-files-jaccard fails; empty corpus vacuously fails (forces operator to actually dispatch). MIN_CORPUS_SIZE = 3 enforced in the live-evidence test (the M182 corpus of 5 satisfies). Threshold-recalibration awaits first operator-dispatched run of bash scripts/phase-4-bench.sh.
  • P4.5 contract bump: SHIPPED at M190 (v1.27.0 → v1.28.0; M22 ritual mirror of aprender PR #1684; gate count 16 → 17; CCPA-017 registered at status: PROPOSED).

Phase 4 is substantively COMPLETE post-M190 — all five sub-deliverables (P4.1 corpus M182, P4.2 runner M184, P4.3 scorer M186, P4.4 CCPA-017 gate scaffold M188, P4.5 contract bump v1.27.0 → v1.28.0 M190) SHIPPED. CCPA-017 enters at status: PROPOSED at v1.28.0; the PROPOSED → ACTIVE_RUNTIME flip awaits the operator-dispatched first measurement via bash scripts/phase-4-bench.sh to calibrate the empirical threshold, then a v1.30.0 contract bump. Phase 5 (M194-M210) added the live-Arena complement at CCPA-018 (PROPOSED at v1.29.0 / M208); both CCPA-017 + CCPA-018 share the v1.30.0 flip path post-dispatch.

Why this is high EV

  1. Closes the M159 ProgramBench prior-art into an actionable track. The M159 row noted "validates the 'function-level 1.0 does not extrapolate to project-scale' caveat" — but didn't operationalize it. Phase 4 does.
  2. The Phase 3 outcome-parity question is settled ("YES on 5-problem POC; 21-fixture recalibration awaits operator dispatch"). The next honest parity question is project-scale.
  3. Even without P4.1+ code shipping, the plan doc anchors future work. M0-M50's milestone rows had similar "plan first, execute later" structure (e.g. M2.3 rescope was documented before its consequences shipped).
  4. Aligns with the operator's framing: M149's "outcome parity is the user-facing measure" extended naturally from "does the function work?" to "does the project work?" — Phase 4 is the project-scale extension of the same question.

Cross-refs