Top spec: claude-code-parity-apr-poc.md | Phase 2 execution plan | Axis-2 closure plan | Completeness assessment
Operator reframe (2026-05-12): "so we can ask apr code to generate same code as claude code and 'it works'".
This shifts the parity question from procedural parity (do both systems make the same syscalls? — measured by OS-level Jaccard at M148 = 0.3333) to outcome parity (do both systems produce code that does the same thing? — measured by "does it compile and pass a test?"). Outcome parity is closer to what a user actually cares about: "I asked for X; did I get something that does X?".
The axis-2-closure-plan.md M113 brainstorm evaluated 5 paths:
| # | Idea | Measures | Status |
|---|---|---|---|
| 1 | HTTPS proxy | API-level message traces | Phase 1 OOS (M118 prior art validates feasibility) |
| 2 | CLI subprocess instrumentation (strace) | OS-level syscalls — procedural parity | Phase 2 shipped M136-M148; first score 0.3333 |
| 3 | SWE-bench differential evaluation | Outcome parity on real GitHub issues | M149 reframe → primary direction |
| 4 | Metamorphic relations | Behavioral invariants | Backlog |
| 5 | Statistical fingerprint | Population-level skew | Cheap health check |
The operator's reframe elevates idea (3) — but simpler than SWE-bench: instead of 2,294 curated GitHub issues, start with a tiny self-contained code-gen prompt set, run both systems, run the generated code, check it works. This is idea (3.1) lightweight outcome parity — the cheapest path to a user-meaningful parity number.
Procedural parity (M148 score = 0.3333) tells us what each system did at OS level. That's diagnostic. But the parity claim users actually want is:

> "If I ask `apr code` to do X, will I get the same outcome as if I ask `claude code` to do X?"
That's outcome parity. Two systems can have wildly different procedural traces (different libc, different exec patterns, different tmp-file naming) and still produce identical outcomes — the user wouldn't see the difference. Conversely, two systems can have nearly-identical OS traces and still produce different outcomes if they make different decisions inside their respective LLMs.
The 0.3333 OS-event score is interesting but not load-bearing for the "does apr code work the same as claude code" claim. Outcome parity is load-bearing.
For a code-generation prompt (the bread-and-butter agent task):
- Functional outcome: does the generated code compile + pass a basic test?
- Semantic outcome: does it do what the prompt asked (verified by hand or by an oracle)?
- Diff outcome: are the generated diffs byte-identical or semantically equivalent (per arXiv:2310.06770 SWE-bench methodology)?
The cheapest first test: (1) functional outcome. Both systems generate code → both outputs are run through a compile + test harness → score = fraction of prompts where both pass.
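The scoring rule just described can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: `PromptResult` and `functional_outcome_score` are hypothetical names, and returning `None` for an empty corpus is an assumption made here so that "no data" is never read as a score of 0.0.

```rust
/// Per-prompt result: did each system's generated code compile and pass its tests?
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PromptResult {
    pub claude_passed: bool,
    pub apr_passed: bool,
}

/// Fraction of prompts where BOTH systems pass.
/// Returns None for an empty corpus (an assumption of this sketch).
pub fn functional_outcome_score(results: &[PromptResult]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    let both = results
        .iter()
        .filter(|r| r.claude_passed && r.apr_passed)
        .count();
    Some(both as f64 / results.len() as f64)
}
```

With four prompts of which two pass on both systems, the score is 0.5. A prompt where only one system passes counts against the score in exactly the same way as a prompt where both fail.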
(Numbered P3.x to distinguish from P2.x OS-level work.)
Distinct from M144's OS-event prompts. Each fixture asks for VERIFIABLE code generation:
- `0001-fib`: "Write a Rust function `fib(n: u32) -> u64` returning the nth Fibonacci number. Include `#[test]` asserting fib(10) == 55."
- `0002-palindrome`: "Write a function `is_palindrome(s: &str) -> bool`. Tests: 'racecar' → true, 'hello' → false."
- `0003-fizzbuzz`: "Write a function `fizzbuzz(n: u32) -> Vec<String>` returning lines 1..=n with the standard FizzBuzz rules."
- `0004-binary-search`: "Write `bsearch(haystack: &[i32], needle: i32) -> Option<usize>` using binary search."
- `0005-multi-step`: composite: "Write a small CLI in Rust that reads numbers from stdin, sorts them, and prints them. Include tests."
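To make "verifiable" concrete, here is one output that would satisfy the `0001-fib` prompt. It is an illustrative sketch, not a capture from either system and not the checked-in oracle. It assumes the indexing fib(0) = 0, fib(1) = 1, which is the convention under which fib(10) == 55 holds.

```rust
/// Returns the nth Fibonacci number, with fib(0) = 0 and fib(1) = 1.
/// This sketch is valid for n <= 92; larger n overflows the u64 arithmetic.
pub fn fib(n: u32) -> u64 {
    let (mut a, mut b) = (0u64, 1u64);
    for _ in 0..n {
        let next = a + b;
        a = b;
        b = next;
    }
    a
}

#[test]
fn fib_10_is_55() {
    assert_eq!(fib(10), 55);
}
```

The embedded `#[test]` is what lets a plain `cargo test` exit code act as the functional-outcome signal for this fixture.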
Each `fixtures/outcome-prompts/<id>/` contains:
- `prompt.txt`: natural-language ask
- `oracle/Cargo.toml` + `oracle/src/lib.rs`: REFERENCE solution that the test harness uses as a CHECK (separate from what the systems generate)
- `cwd-tree/`: starting state (empty for greenfield prompts)
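A fixture directory with this layout is easy to check mechanically. The sketch below is a hypothetical helper (`missing_fixture_entries` is not a name from the repo) that reports which of the expected entries are absent. It uses only the standard library and assumes the layout is exactly the three files and one directory listed above.

```rust
use std::path::Path;

/// Files every fixture directory is expected to contain, per the layout above.
const REQUIRED_FILES: [&str; 3] = ["prompt.txt", "oracle/Cargo.toml", "oracle/src/lib.rs"];
/// Directories every fixture directory is expected to contain.
const REQUIRED_DIRS: [&str; 1] = ["cwd-tree"];

/// Returns the missing entries; an empty list means the fixture is well-formed.
pub fn missing_fixture_entries(fixture_dir: &Path) -> Vec<String> {
    let mut missing = Vec::new();
    for rel in REQUIRED_FILES.iter() {
        if !fixture_dir.join(rel).is_file() {
            missing.push(rel.to_string());
        }
    }
    for rel in REQUIRED_DIRS.iter() {
        if !fixture_dir.join(rel).is_dir() {
            // Trailing slash marks the entry as a directory in the report.
            missing.push(format!("{}/", rel));
        }
    }
    missing
}
```

An empty `cwd-tree/` still counts as present, which matches the greenfield case where the starting state has no files.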
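```bash
# Resolve the fixture root to an absolute path so it stays valid after each cd.
root="$(pwd)/fixtures/outcome-prompts"
for p in "$root"/*/; do
  p="${p%/}"
  id=$(basename "$p")
  for system in claude apr; do
    cwd="/tmp/outcome-$id-$system"
    mkdir -p "$cwd"
    cp -r "$p/cwd-tree/." "$cwd/"
    (
      cd "$cwd" || exit 1
      # Invocation as written in the plan; adjust to each CLI's actual syntax.
      "$system" code -p "$(cat "$p/prompt.txt")"
      # Redirect stdout first, then stderr, so both land in the outcome file.
      cargo test --quiet > "$p/$system.outcome.txt" 2>&1
      echo $? > "$p/$system.outcome.exit_code"
    )
  done
done
```

For each prompt × system: collect the compile-and-test exit code. Score = fraction of prompts where both systems' generated code passes its own embedded tests.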
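Once the runner has written the two `.outcome.exit_code` files for a fixture, each fixture falls into one of four cells. The sketch below shows one way to classify them. `PairOutcome`, `passed`, and `classify` are hypothetical names, and treating unparsable or empty file contents as a failure is an assumption of this sketch.

```rust
/// Outcome of one fixture across the two systems (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum PairOutcome {
    BothPass,
    OnlyClaude,
    OnlyApr,
    BothFail,
}

/// Interprets the contents of a `<system>.outcome.exit_code` file.
/// Only exit code 0 counts as a pass; unparsable text counts as a failure.
pub fn passed(exit_code_file_contents: &str) -> bool {
    exit_code_file_contents
        .trim()
        .parse::<i32>()
        .map(|code| code == 0)
        .unwrap_or(false)
}

/// Classifies one fixture from the two exit-code file contents.
pub fn classify(claude_exit: &str, apr_exit: &str) -> PairOutcome {
    match (passed(claude_exit), passed(apr_exit)) {
        (true, true) => PairOutcome::BothPass,
        (true, false) => PairOutcome::OnlyClaude,
        (false, true) => PairOutcome::OnlyApr,
        (false, false) => PairOutcome::BothFail,
    }
}
```

Keeping the four cells separate, rather than collapsing straight to a single fraction, makes it possible to tell "apr regressed" apart from "both systems struggle with this prompt".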
Partial-SHIPPED at M153 as crates/ccpa-differ/src/outcome_diff.rs (cross_output_equivalence()) + crates/ccpa-differ/tests/phase_3_cross_output_equivalence.rs (integration test against live captures). Ships line-set Jaccard over trimmed non-empty lines — the simplest semantically-meaningful similarity metric, robust to whitespace and comment-only churn. Live M150 evidence aggregate: 0.5201 over 5 fixtures (range 0.33–0.83); evidence checked in at evidence/phase-3/cross-output-equivalence.json. Honest interpretation: the README's "nearly-byte-identical" claim about HumanEval_3 was a 1-fixture cherrypick; aggregate similarity sits barely above the 0.5 POC threshold because HumanEval_1 (0.38), HumanEval_2 (0.45), and HumanEval_4 (0.33) show meaningful structural divergence between claude's and apr code's solutions even when both pass.
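For readers unfamiliar with the metric, line-set Jaccard over trimmed non-empty lines can be sketched as follows. This is an independent illustration, not the body of `cross_output_equivalence()`, whose signature is not reproduced here. `line_set_jaccard` is a hypothetical name, and scoring two empty outputs as 1.0 is an assumption.

```rust
use std::collections::HashSet;

/// Set of trimmed, non-empty lines in a source text.
fn line_set(src: &str) -> HashSet<&str> {
    src.lines().map(str::trim).filter(|l| !l.is_empty()).collect()
}

/// |A ∩ B| / |A ∪ B| over the two line sets.
pub fn line_set_jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (line_set(a), line_set(b));
    let union = sa.union(&sb).count();
    if union == 0 {
        // Two empty outputs are treated as identical (assumption of this sketch).
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

Because lines are trimmed and collected into a set, re-indentation, blank lines, and line reordering do not change the score in this sketch, while a single renamed variable changes every line it appears on.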
Future P3.3 sub-deliverables (NOT in M153):
- Files-touched Jaccard: which paths got written? (needs P2.3 OS-event captures, not just generated `.src.rs`)
- Test-survival rate: if we swap the test files between the two outputs, do both still pass? SHIPPED at M154: `scripts/phase-3-test-survival.sh` + `crates/ccpa-differ/tests/phase_3_test_survival_gate.rs` + `evidence/phase-3/test-survival.json`. Live result: survival_rate = 1.0000 (10/10 swaps pass on the 5-fixture M150 corpus). Proves the M153 structural divergence is purely STYLISTIC: every test from either system passes against either implementation.
- Diff similarity: text-level Levenshtein or AST-level diff between the two outputs (heavier weight; line-set Jaccard is the M153 stand-in)
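The text-level Levenshtein option in the last bullet could look like the sketch below. It is a standard two-row dynamic-programming edit distance over characters, plus a normalization to [0, 1]. Both function names are hypothetical, and dividing by the longer input's length is one common normalization chosen here as an assumption. Nothing of this kind is stated to exist in the repo.

```rust
/// Character-level Levenshtein edit distance (insert, delete, substitute; cost 1 each).
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between a[..i] and b[..j] for the previous row i.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitute or match
                .min(prev[j + 1] + 1)   // delete from a
                .min(cur[j] + 1);       // insert into a
            cur.push(best);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// 1.0 for identical texts, approaching 0.0 as the texts diverge.
pub fn normalized_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0; // two empty texts are treated as identical (assumption)
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}
```

Unlike line-set Jaccard, this metric is sensitive to ordering and gives partial credit for lines that differ by a single identifier. The trade-off is quadratic time in the input length.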
SHIPPED at M152 as crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs. Threshold = 0.5 (POC-tier per the original P3.4 design note); 4 test assertions cover live-evidence pass + per-fixture exit-code consistency + synthetic-regression-below-threshold (bidirectional sensitivity) + synthetic-identity-passes (false-positive guard). Gate is DRAFT until P3.5 / M153+ contract bump registers CCPA-016 in claude-code-parity-apr-v1.yaml.
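The "bidirectional sensitivity" idea can be illustrated with a toy gate. This is a sketch of the concept only, not the contents of `falsify_ccpa_016_outcome_parity.rs`. `gate_passes` and `gate_is_sensitive` are hypothetical names, and treating NaN as a failure is an assumption.

```rust
/// POC-tier threshold quoted in the text above.
pub const POC_THRESHOLD: f64 = 0.5;

/// A gate passes only for a finite score at or above the threshold.
/// Treating NaN or infinity as a failure is an assumption of this sketch.
pub fn gate_passes(score: f64, threshold: f64) -> bool {
    score.is_finite() && score >= threshold
}

/// Bidirectional sensitivity: a meaningful gate must reject a synthetic
/// regression AND accept a synthetic identity. Returns true when both hold.
pub fn gate_is_sensitive(threshold: f64) -> bool {
    let synthetic_regression = 0.0; // the two systems agree on nothing
    let synthetic_identity = 1.0; // the two systems agree on everything
    !gate_passes(synthetic_regression, threshold) && gate_passes(synthetic_identity, threshold)
}
```

Under this sketch a threshold of 0.5 is sensitive in both directions. A threshold of 0.0 is not, because it can never fail, and a threshold above 1.0 is not, because it can never pass.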
SHIPPED at M164 (2026-05-13) via the M22 5-step ritual mirroring aprender#1665 (squash 9cbac28b5). v1.26.0 promotes both CCPA-015 (output purity, was PROPOSED at M147) and CCPA-016 (outcome parity bound, was PROPOSED at M152) from PROPOSED → ACTIVE_RUNTIME in the canonical gate registry. Gate count: 14 → 16.
Prior art: ProgramBench (Yang et al., Meta FAIR + Stanford + Harvard, 2026 — added at M159). 200 real-world programs (FFmpeg / SQLite / PHP interpreter / etc.) where LMs reconstruct a codebase from only the executable + documentation, scored via coverage-guided fuzzing-generated behavioral tests. Headline result: 0% of 200 tasks fully resolved; best model 95%-test-pass on 3% of tasks.
Relevance to CCPA: ProgramBench extends our M154 test-survival pattern from function-level (HumanEval/0..4) to project-level (rebuild SQLite). The methodology is:
- Reference: compile gold executable from a real GitHub repo.
- Test generation: agent-driven systematic exploration + coverage-guided iteration produces a behavioral-equivalence test suite from the executable.
- Reconstruction task: hand the candidate LM only the executable + docs; require a working codebase.
- Evaluation: candidate codebase passes/fails the hidden test suite.
Why this is the natural M161+ direction: M150-M154 demonstrated outcome parity is real and measurable on function-level tasks (5 HumanEval problems, agreement 1.0000, test-survival 1.0000). The user-facing parity question scales up: "does apr code rebuild SQLite as well as claude code does?" That question is exactly ProgramBench's question for the multi-model case. M161+ would adapt their pipeline to bilateral CCPA bench (teacher = claude, student = apr code, both run on the same ProgramBench task; scored against the same hidden test suite).
Future gate: hypothetical FALSIFY-CCPA-017 — project-scale outcome-parity bound on a ProgramBench-class corpus. Threshold TBD; if M161+ surfaces ProgramBench's 0% saturation pattern, CCPA-017 would gate at "any non-zero pass" (POC) or "match within 1pp of claude" (production-tier).
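The two candidate tiers can be written down as a small decision function. Everything here is hypothetical, as is the gate itself: the `Tier` enum, the function name, and in particular the reading of "match within 1pp" as an absolute difference of at most one percentage point in either direction.

```rust
/// Candidate gate tiers for a project-scale parity bound (hypothetical).
pub enum Tier {
    Poc,
    Production,
}

/// Pass rates are given in percent, in the range 0.0..=100.0.
pub fn ccpa_017_passes(tier: Tier, apr_pass_pct: f64, claude_pass_pct: f64) -> bool {
    match tier {
        // POC: any non-zero pass rate for apr is enough.
        Tier::Poc => apr_pass_pct > 0.0,
        // Production: within 1 percentage point of claude (two-sided reading,
        // which is an assumption of this sketch).
        Tier::Production => (claude_pass_pct - apr_pass_pct).abs() <= 1.0,
    }
}
```

One consequence of the two-sided reading is that apr outperforming claude by more than 1pp would also fail the production tier. A one-sided bound (`apr >= claude - 1.0`) is the obvious alternative if that behavior is not wanted.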
Caveats inherited from ProgramBench:
- Project-scale tasks are operator-dispatched (~6h wall-clock per fixture per system in ProgramBench).
- Test-suite quality matters: ProgramBench reports 3.7% dummy-pass-rate with linting vs 18.5% without — naïve test suites give false positives.
- Architectural divergence is the rule, not exception: ProgramBench found 67% of models prefer shallower directory depths than reference code; single-file monolithic implementations dominate. The M153 line-set Jaccard (0.5201 on functions) would likely score even lower at project scale.
| Track | Measures | First number | What it tells us |
|---|---|---|---|
| P2 OS-level Jaccard | Syscall set overlap | 0.3333 (M148) | Procedural divergence; diagnostic |
| P3 outcome parity | Does generated code work? | 1.0000 agreement (M150, original 5 fixtures) | User-facing equivalence; load-bearing for "apr code works like claude code" |
Both have value. P2 catches procedural drift (e.g. "apr writes to a different tmp dir") — useful for debugging. P3 catches outcome drift (e.g. "apr's code doesn't compile") — useful for shipping. P3 is what matters for the operator's question.
Original M149 framing (preserved as audit trail): "Both P2.3 capture AND P3.2 outcome runner need apr code to actually exist as an invokable CLI. That's M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 — pending in aprender. Until that lands, P3 sub-deliverables M151+ stay in 'designed, not run' state."
M150 finding (2026-05-12): this framing was WRONG. M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 is about `LlmDriver` trait visibility (`pub(crate)` → `pub`), which was already satisfied. The real blocker was a feature-flag config in `apr-cli/Cargo.toml`: `code = ["dep:batuta"]` gated the `Code { ... }` subcommand variant behind `--features code`, so the default `cargo install apr-cli` produced a binary WITHOUT `apr code`. aprender#1638 removes the gate (`code = []` + `batuta` becomes non-optional); the default build now ships `apr code`. M150-M154 SHIPPED the full Phase 3 outcome-parity sequence using a locally-built `apr` with this flag removed; the upstream PR is the formalization, not a prerequisite.
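The mechanism behind the missing subcommand is ordinary Cargo feature gating. The toy below is not apr-cli's source: the `Run` variant, the `prompt` field, and `available_subcommands` are invented for illustration. It shows how a `#[cfg(feature = "code")]` attribute makes a variant disappear from builds that do not enable the feature.

```rust
/// Toy subcommand enum; only the `Code` variant name comes from the text above.
#[derive(Debug, PartialEq)]
pub enum Command {
    Run,
    // Compiled in only when the crate is built with `--features code`.
    #[cfg(feature = "code")]
    Code { prompt: String },
}

/// Lists the subcommands present in this particular build.
pub fn available_subcommands() -> Vec<&'static str> {
    let mut names = vec!["run"];
    // cfg! evaluates to a compile-time boolean for the same feature flag.
    if cfg!(feature = "code") {
        names.push("code");
    }
    names
}
```

In a default build of this toy, `available_subcommands()` returns only `["run"]`. That is the same shape of failure described above, where the binary builds and runs but the subcommand does not exist.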
Phase 3 sub-deliverable status post-M210:
- M150 P3.1 outcome-prompt corpus: SHIPPED at `fixtures/multipl-e-rust/` (initially 5 HumanEval fixtures; M168 extended to 21 fixtures HumanEval/0..20).
- M150 P3.2 outcome runner: SHIPPED at `scripts/phase-3-bench.sh`; real bilateral bench produced `evidence/phase-3/multipl-e-rust-scores.json` with `agreement = 1.0000` on the original 5 fixtures; recalibrated curve awaits next operator dispatch on the 21-fixture corpus.
- M153 P3.3 line-set Jaccard: SHIPPED at `crates/ccpa-differ/src/outcome_diff.rs` + `evidence/phase-3/cross-output-equivalence.json` (aggregate 0.5201).
- M154 P3.3 test-survival: SHIPPED at `scripts/phase-3-test-survival.sh` + `evidence/phase-3/test-survival.json` (1.0000; 10/10 cross-swaps pass).
- M152 P3.4 FALSIFY-CCPA-016 gate: SHIPPED at `crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs` (threshold 0.5, bidirectional sensitivity).
- M164 P3.5 contract bump v1.25.0 → v1.26.0: SHIPPED via M22 5-step ritual (aprender#1665 squash `9cbac28b5`); CCPA-015 + CCPA-016 promoted PROPOSED → ACTIVE_RUNTIME.
- M167 contract bump v1.26.0 → v1.27.0: SHIPPED (aprender#1666 squash `e4b673336`); CCPA-013 OPEN → ACTIVE_RUNTIME + fixture_corpus_path extended to accept `evidence/phase-3/captures/`.
- M172 fixture corpus structural validation: SHIPPED at `crates/ccpa-differ/tests/fixture_corpus_structure.rs` (4 tests; <1s default cargo test).
- M174 fixture deep-correctness validator: SHIPPED at `scripts/validate-fixtures.sh` (21/21 PASS in ~30s).
- M176 pre-commit hook integration: SHIPPED in `scripts/install-hooks.sh` (conditional fire on fixture-touching commits).
Remaining Phase 3 future work (M178+):
- Operator-dispatched recalibrated bench: `bash scripts/phase-3-bench.sh` on the 21-fixture corpus produces the calibrated pass@1 + agreement curve (M168 corpus now ready).
- Bench expansion 21 → full 164 MultiPL-E-Rust problems (operator-dispatched ~5-10 hour wall on a fast host).
- AST-level diff sub-metric (line-set Jaccard is the M153 stand-in; `syn`-based AST diff is heavier-signal future work).
- P3.6 project-scale parity gate (CCPA-017 candidate): ProgramBench prior art (M159) at arXiv:2605.03546 reports 0%/200 fully-resolved across Claude Opus/Sonnet/Haiku + GPT + Gemini, validating the "function-level 1.0 does not extrapolate to project-scale" caveat.
- phase-2-execution-plan.md: P2.x OS-level (procedural)
- axis-2-closure-plan.md: original M113 5-idea brainstorm
- completeness-assessment.md: § "Are we at parity with Claude Code?"
- `evidence/phase-2/measured-os-parity.json`: first procedural parity number (M148)
- aprender PMAT-CODE-LLM-DRIVER-PUBLIC-001: the blocker for both P2.3 student capture AND P3.2 outcome run (M150 finding: WAS NOT the actual blocker; see § Implementation blocker above)
- aprender#1638: feature-flag-removal PR that ships `apr code` in default `cargo install apr-cli` build (the real blocker for upstream ergonomics; locally workaroundable)