Status: SHIPPED on 5-fixture MultiPL-E-Rust POC corpus. Consolidates the empirical findings from M150-M154 + the spec-honesty refresh at M155-M156 into one citable place.
On the 5-fixture MultiPL-E-Rust HumanEval/0..4 POC corpus, the answer to the operator's question "so we can ask apr code to generate same code as claude code and 'it works'?" is YES:
- Outcome parity = `1.0000`: both real `claude` (Anthropic 2.1.139) and real `apr code` (aprender 0.32.0 + Qwen2.5-Coder-1.5B-Instruct GGUF Q4_K_M) produce Rust that compiles and passes the test oracle on all 5 problems.
- Test-survival = `1.0000`: every test from either system runs correctly against either implementation; the two systems are functionally interchangeable.
- Structural similarity = `0.5201`: code looks ~50% similar at the line level; the divergence is purely stylistic (variable names, type annotations, idiom choice), not semantic.
The current parity claim is 5/5 BOTH_PASS scoped to this POC corpus. Bench expansion 5 → full 164 problems (M158+) will tighten the claim to a pass@1 + agreement curve.
| Metric | Value | Source | What it tells us |
|---|---|---|---|
| Outcome parity | 1.0000 (5/5 BOTH_PASS) | M150: `scripts/phase-3-bench.sh` → `evidence/phase-3/multipl-e-rust-scores.json` | Both systems pass their own MultiPL-E-Rust tests |
| Outcome parity gate | >= 0.5 (current 1.0) | M152: FALSIFY-CCPA-016 in ccpa-differ (PROPOSED; threshold enforced via test) | POC-tier threshold enforced; bidirectional sensitivity via synthetic regression + identity fixtures |
| Structural similarity | 0.5201 (range 0.33–0.83, zero byte-identical pairs) | M153: `ccpa_differ::cross_output_equivalence()` → `evidence/phase-3/cross-output-equivalence.json` | Line-set Jaccard over trimmed non-empty lines |
| Test-survival | 1.0000 (10/10 cross-swaps) | M154: `scripts/phase-3-test-survival.sh` → `evidence/phase-3/test-survival.json` | Every test from either system passes against either implementation (SEMANTIC equivalence) |
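The structural-similarity row describes its metric as a line-set Jaccard over trimmed non-empty lines. A minimal sketch of that computation follows. The function names `line_set` and `line_jaccard` are hypothetical and this is not the body of `ccpa_differ::cross_output_equivalence()`; treating two empty inputs as identical is also an assumption.

```rust
use std::collections::HashSet;

/// Collect the set of trimmed, non-empty lines of a source text.
/// Duplicate lines collapse because this is a set, not a multiset.
fn line_set(src: &str) -> HashSet<&str> {
    src.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Line-set Jaccard: |A ∩ B| / |A ∪ B|.
/// Assumption: two inputs with no non-empty lines count as identical (1.0).
fn line_jaccard(a: &str, b: &str) -> f64 {
    let (set_a, set_b) = (line_set(a), line_set(b));
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0;
    }
    let shared = set_a.intersection(&set_b).count();
    shared as f64 / union as f64
}
```

Under this definition, indentation and blank lines do not affect the score, while a renamed variable changes every line it appears on. That is consistent with two passing implementations scoring well below 1.0.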
| Fixture | Outcome (T/S) | Structural Jaccard | Cross-swap (a / b) |
|---|---|---|---|
| HumanEval_0 has_close_elements | PASS / PASS | 0.8333 (10/12 lines shared) | swap_a=0 / swap_b=0 |
| HumanEval_1 separate_paren_groups | PASS / PASS | 0.3793 (11/29) | swap_a=0 / swap_b=0 |
| HumanEval_2 truncate_number | PASS / PASS | 0.4545 (5/11) | swap_a=0 / swap_b=0 |
| HumanEval_3 below_zero | PASS / PASS | 0.6000 (9/15) | swap_a=0 / swap_b=0 |
| HumanEval_4 mean_absolute_deviation | PASS / PASS | 0.3333 (4/12) | swap_a=0 / swap_b=0 |
| Aggregate | 5/5 / 5/5 BOTH_PASS | 0.5201 | 10/10 PASS |
swap_a = teacher's function compiled with student's tests. swap_b = student's function compiled with teacher's tests. Exit code 0 = cargo test PASS.
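The aggregate row can be re-derived from the per-fixture rows. The sketch below assumes the aggregate structural similarity is the unweighted mean of the per-fixture ratios (the source does not say so, but the arithmetic matches 0.5201) and that test-survival is the fraction of cross-swaps with exit code 0. The constant and function names are hypothetical.

```rust
/// Per-fixture (shared lines, total distinct lines), copied from the table above.
const LINE_COUNTS: [(u32, u32); 5] = [(10, 12), (11, 29), (5, 11), (9, 15), (4, 12)];

/// Per-fixture (swap_a, swap_b) exit codes, copied from the table above.
const SWAP_EXITS: [(i32, i32); 5] = [(0, 0); 5];

/// Unweighted mean of the per-fixture Jaccard ratios.
/// Assumption: this is how the 0.5201 aggregate is formed.
/// Returns NaN for an empty slice.
fn mean_jaccard(counts: &[(u32, u32)]) -> f64 {
    let sum: f64 = counts
        .iter()
        .map(|&(shared, union)| f64::from(shared) / f64::from(union))
        .sum();
    sum / counts.len() as f64
}

/// Fraction of cross-swaps whose `cargo test` exit code is 0.
/// Each fixture contributes two swaps (swap_a and swap_b).
fn test_survival(exits: &[(i32, i32)]) -> f64 {
    let passed = exits
        .iter()
        .flat_map(|&(swap_a, swap_b)| [swap_a, swap_b])
        .filter(|&code| code == 0)
        .count();
    passed as f64 / (2 * exits.len()) as f64
}
```

With the table's values, `mean_jaccard(&LINE_COUNTS)` comes to about 0.5201 and `test_survival(&SWAP_EXITS)` to 1.0 (10/10 swaps).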
- `apr code` works end-to-end on real coding tasks: the orchestration loop (LLM driver + agent harness + tool dispatch) is functional with a local Qwen2.5-Coder-1.5B GGUF model on consumer hardware.
- The user-facing parity question has a measured answer on this corpus, not a theoretical one. The 4 metrics are computed from real bilateral bench runs on the operator's `noah-Lambda-Vector` host, not synthesized or extrapolated.
- Outcome parity, structural similarity, and semantic equivalence are orthogonal: two systems can BOTH pass while looking only ~50% alike at line level (M153), and they can be semantically equivalent at the test-survival level even when stylistically divergent (M154). The 4-metric grid surfaces these distinctions explicitly.
- `apr code` ships in the default build: aprender#1638 MERGED 2026-05-13 (squash `b61b76b4`), so `cargo install apr-cli` now ships `apr code` without `--features code`. M150-M154 used a locally-built apr with the flag manually removed; M162 confirmed the upstream landing. The previously-suspected blocker `PMAT-CODE-LLM-DRIVER-PUBLIC-001` (LlmDriver visibility) turned out not to gate the work; see `completeness-assessment.md` § Axis 3 for the discharge.
- Parity on the full 164-problem MultiPL-E-Rust corpus — the 5-problem POC sits at near-saturation territory (HumanEval problems with simple specs and clear test oracles); expansion will produce a tighter pass@1 + agreement curve and may surface divergence.
- Parity on prompts with structural ambiguity — these 5 fixtures all have well-specified contracts (input/output types + example case). Open-ended prompts ("refactor this codebase", "design a new module") would test a different aspect of orchestration.
- Parity on multi-turn / tool-using workflows: these fixtures use `--max-turns 1` for `apr code`; multi-turn agent-loop tasks (file editing, codebase navigation, test-driven development) are outside the M150 bench scope.
- Procedural parity at the OS level: separately measured at M148 (`evidence/phase-2/measured-os-parity.json`) and currently `0.3333` (well below the M139 FALSIFY-CCPA-014 threshold 0.95). Procedural parity is the diagnostic; outcome parity is the user-facing measure. They disagree by design: different runtimes (Node.js claude vs Rust apr) produce different syscall sets even when generating equivalent code.
- Parity over time: model versions drift; Anthropic ships new claude releases; aprender ships new apr versions; bench numbers from 2026-05-12 are a snapshot, not a permanent claim.
- Parity at project scale — see ProgramBench (Yang et al., 2026, arXiv:2605.03546): when LMs are asked to reconstruct full real-world programs (FFmpeg / SQLite / PHP interpreter) from executable + docs, 0% of 200 tasks are fully resolved and the best model passes ≥95% tests on only 3% of tasks. M150-M154's 1.0000 result is for function-level tasks (HumanEval-style single-function generation); project-scale outcome parity is the natural M161+ P3.6 future-work — see outcome-parity-plan.md § P3.6.
| Milestone | Axis 2 score | Source |
|---|---|---|
| Pre-M136 (machinery TBD) | ~30% | M111 honest assessment |
| Post-M141 (machinery shipped, no real captures) | ~45% | M140 honest assessment |
| Post-M149 (Phase 3 plan documented) | ~50% | M149 reframe |
| Post-M154 (4-metric real-binary evidence) | ~70% | M155 honest re-assessment |
| Post-M162 (aprender#1638 MERGED; LlmDriver-adapter row FULLY DISCHARGED) | ~75% | M162 |
| Post-M164 (P3.5 contract bump v1.26.0; CCPA-015 + 016 ACTIVE_RUNTIME) | ~80% | M164 |
| Post-M167 (v1.27.0; CCPA-013 OPEN → ACTIVE_RUNTIME; last OPEN gate closed) | ~82% | M167 |
| Post-M177 (corpus 5 → 21; fixture validation layered; pre-commit integration) | ~85% | M177 |
| Post-M190 (Phase 4 P4.1-P4.5 SHIPPED — project-scale corpus + runner + scorer + CCPA-017 gate scaffold + contract bump v1.27.0 → v1.28.0 PROPOSED) | ~87% | M190 |
| Post-M210 (Phase 5 P5.1-P5.5 SHIPPED — Arena harness + multi-turn loop + bench runner + CCPA-018 gate scaffold + falsifier-of-falsifier comparator; contract bump v1.28.0 → v1.29.0 SHIPPED M208; coverage closure M210) | ~90% | M210 |
| One-number summary post-M210 | ~90% | M210 refresh |
Remaining ~10% to a fully closed Axis 2:
- (a) operator-dispatched recalibrated bench on the 21-fixture MultiPL-E-Rust corpus via `bash scripts/phase-3-bench.sh` (one command; produces the next agreement curve);
- (b) operator-dispatched first project-scale bench on the 5-fixture corpus via `bash scripts/phase-4-bench.sh` (calibrates the CCPA-017 threshold from real data; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0);
- (c) operator-dispatched first Arena bench on the M182 corpus via `bash scripts/phase-5-arena-bench.sh` (calibrates the CCPA-018 thresholds and answers the design-audit.md §5 Popperian test; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0);
- (d) bench expansion 21 → full 164 MultiPL-E-Rust (operator-dispatched, ~5-10h wall);
- (e) optional AST-level diff sub-metric (syn-based, complements the M153 line-set Jaccard).
All numbers in this doc are reproducible from these checked-in evidence files:
- `evidence/phase-3/multipl-e-rust-scores.json`: outcome parity (M150)
- `evidence/phase-3/cross-output-equivalence.json`: structural similarity (M153)
- `evidence/phase-3/test-survival.json`: test-survival rate (M154)
- `evidence/phase-3/captures/<id>/{teacher,student}.{src.rs,test-output,exit_code}`: per-fixture per-side audit trail
- `evidence/phase-2/measured-os-parity.json`: procedural parity (M148, diagnostic)
- `evidence/phase-2/captures/<id>/{teacher,student}.{ccpa-os-trace.jsonl,stderr,exit_code}`: OS-event per-fixture (M147)
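For readers auditing the per-fixture captures by hand, the sketch below shows one way to address the files and classify a fixture from its two exit codes. It rests on assumptions: that each `exit_code` file holds a single decimal integer, and that fixture ids look like `HumanEval_0`. The helper names and every `Outcome` variant other than the source's BOTH_PASS label are hypothetical, not part of the repo.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side {
    Teacher,
    Student,
}

/// Hypothetical classification; only BOTH_PASS appears in the source.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    BothPass,
    TeacherOnly,
    StudentOnly,
    BothFail,
}

/// Build `<phase_dir>/captures/<fixture_id>/<side>.<kind>`,
/// following the layout listed above.
fn capture_file(phase_dir: &Path, fixture_id: &str, side: Side, kind: &str) -> PathBuf {
    let stem = match side {
        Side::Teacher => "teacher",
        Side::Student => "student",
    };
    phase_dir
        .join("captures")
        .join(fixture_id)
        .join(format!("{stem}.{kind}"))
}

/// Assumption: the exit_code file contains one decimal integer,
/// possibly with surrounding whitespace.
fn parse_exit_code(contents: &str) -> Option<i32> {
    contents.trim().parse().ok()
}

/// Exit code 0 is treated as PASS, matching the cross-swap convention.
fn classify(teacher_exit: i32, student_exit: i32) -> Outcome {
    match (teacher_exit == 0, student_exit == 0) {
        (true, true) => Outcome::BothPass,
        (true, false) => Outcome::TeacherOnly,
        (false, true) => Outcome::StudentOnly,
        (false, false) => Outcome::BothFail,
    }
}
```

Reading a file's contents with `std::fs::read_to_string` and passing them to `parse_exit_code` would complete the loop; that step is left out so the sketch does not depend on the evidence tree being present.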
| Gate | Status | Source |
|---|---|---|
| FALSIFY-CCPA-014 OS-event parity bound (threshold 0.95) | ACTIVE_RUNTIME at v1.25.0 | claude-code-parity-apr-v1.yaml |
| FALSIFY-CCPA-015 output purity | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M147) | crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs |
| FALSIFY-CCPA-016 outcome parity bound (threshold 0.5) | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M152) | crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs |
P3.5 contract bump v1.25.0 → v1.26.0 SHIPPED at M164 (2026-05-13) via the M22 5-step ritual mirroring aprender#1665 squash 9cbac28b5. Gate count: 14 → 16.
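The threshold gates above reduce to a score-versus-bound comparison. The sketch below illustrates that shape using the two thresholds quoted in the table. The names are hypothetical, the inclusive `>=` comparison is stated in the source only for the outcome-parity gate (it is assumed here for the OS-parity gate), and the real checks live in the listed test files.

```rust
/// FALSIFY-CCPA-016 outcome parity bound, as quoted in the gates table.
const OUTCOME_PARITY_MIN: f64 = 0.5;
/// FALSIFY-CCPA-014 OS-event parity bound, as quoted in the gates table.
const OS_PARITY_MIN: f64 = 0.95;

/// Outcome parity = fixtures where both sides pass / total fixtures.
/// Returns None for an empty corpus rather than dividing by zero.
fn outcome_parity(both_pass: usize, total: usize) -> Option<f64> {
    if total == 0 {
        None
    } else {
        Some(both_pass as f64 / total as f64)
    }
}

/// A gate holds when the measured score meets its bound.
/// Assumption: the comparison is inclusive for both gates.
fn gate_holds(score: f64, min: f64) -> bool {
    score >= min
}
```

With the numbers reported in this doc, the measured outcome parity of 5/5 clears `OUTCOME_PARITY_MIN`, a synthetic regression to 2/5 would not, and the measured OS-level parity of 0.3333 does not clear `OS_PARITY_MIN`.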
- outcome-parity-plan.md — original P3.1-P3.5 design + sub-deliverable status
- phase-2-execution-plan.md — procedural (OS-level) parity plan
- axis-2-closure-plan.md — M113 5-idea brainstorm
- completeness-assessment.md § "Are we at parity?" — M155 honest re-assessment
- risks.md R6 — empirical discharge of the LlmDriver-pub-stable risk via M150