Outcome-parity results (Phase 3, M150-M156)

Status: SHIPPED on the 5-fixture MultiPL-E-Rust POC corpus. This document consolidates the empirical findings from M150-M154 and the spec-honesty refresh at M155-M156 into one citable place.

Executive summary

On the 5-fixture MultiPL-E-Rust HumanEval/0..4 POC corpus, the answer to the operator's question "so we can ask apr code to generate same code as claude code and 'it works'?" is YES:

Outcome parity = 1.0000: both real claude (Anthropic 2.1.139) and real apr code (aprender 0.32.0 + Qwen2.5-Coder-1.5B-Instruct GGUF Q4_K_M) produce Rust that compiles and passes the test oracle on all 5 problems.
Test-survival = 1.0000: every test from either system passes against either implementation (10/10 cross-swaps), so on this corpus the two systems are functionally interchangeable.
Structural similarity = 0.5201: the generated code is about 52% similar at the line level (line-set Jaccard). Because every cross-swapped test still passes, the divergence is stylistic (variable names, type annotations, idiom choice) rather than semantic.

The current parity claim is 5/5 BOTH_PASS, scoped to this POC corpus. Expanding the bench from 5 to the full 164 problems (M158+) will tighten the claim into a pass@1 + agreement curve.
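As a rough illustration of how the headline number can be read, the sketch below computes outcome parity as BOTH_PASS / total. That definition is an assumption inferred from the "5/5 BOTH_PASS" wording; the real scorer behind `scripts/phase-3-bench.sh` may define it differently (for example, by counting both-fail as agreement). The `Outcome` struct and `outcome_parity` function are hypothetical names, not part of `ccpa-differ`.

```rust
/// Per-fixture result: did each side's generated code pass its own test oracle?
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy)]
struct Outcome {
    teacher_pass: bool,
    student_pass: bool,
}

/// Fraction of fixtures where BOTH sides pass (assumed definition).
/// Returns None for an empty corpus rather than dividing by zero.
fn outcome_parity(outcomes: &[Outcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let both_pass = outcomes
        .iter()
        .filter(|o| o.teacher_pass && o.student_pass)
        .count();
    Some(both_pass as f64 / outcomes.len() as f64)
}

fn main() {
    // The POC corpus: five fixtures, both sides pass on each.
    let poc = [Outcome { teacher_pass: true, student_pass: true }; 5];
    match outcome_parity(&poc) {
        Some(p) => println!("outcome parity = {:.4}", p),
        None => println!("empty corpus"),
    }
}
```

Under this reading, a single fixture where either side fails would drop the 5-fixture score to 0.8, which is why a corpus this small cannot distinguish "always passes" from "usually passes".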

The 4-metric grid

| Metric | Value | Source | What it tells us |
| --- | --- | --- | --- |
| Outcome parity | 1.0000 (5/5 BOTH_PASS) | M150: `scripts/phase-3-bench.sh` → `evidence/phase-3/multipl-e-rust-scores.json` | Both systems pass their own MultiPL-E-Rust tests |
| Outcome parity gate | `>= 0.5` (current 1.0) | M152: `FALSIFY-CCPA-016` in `ccpa-differ` (PROPOSED at M152, ACTIVE_RUNTIME since v1.26.0 / M164; threshold enforced via test) | POC-tier threshold enforced; bidirectional sensitivity via synthetic regression + identity fixtures |
| Structural similarity | 0.5201 (range 0.33–0.83, zero byte-identical pairs) | M153: `ccpa_differ::cross_output_equivalence()` → `evidence/phase-3/cross-output-equivalence.json` | Line-set Jaccard over trimmed non-empty lines |
| Test-survival | 1.0000 (10/10 cross-swaps) | M154: `scripts/phase-3-test-survival.sh` → `evidence/phase-3/test-survival.json` | Every test from either system passes against either implementation (semantic equivalence on this corpus) |
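The structural-similarity row describes a line-set Jaccard over trimmed non-empty lines. A minimal sketch of that metric follows; it is not the actual body of `ccpa_differ::cross_output_equivalence()`, and the helper names `line_set` and `line_jaccard`, as well as the choice to score two empty inputs as 1.0, are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Set of trimmed, non-empty lines in a source string.
fn line_set(src: &str) -> HashSet<&str> {
    src.lines().map(str::trim).filter(|l| !l.is_empty()).collect()
}

/// Jaccard index |A ∩ B| / |A ∪ B| over the two line sets.
/// Assumption: two inputs with no non-empty lines count as identical (1.0).
fn line_jaccard(a: &str, b: &str) -> f64 {
    let (set_a, set_b) = (line_set(a), line_set(b));
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0;
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}

fn main() {
    // Two functionally equivalent implementations that differ in style.
    let teacher = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n";
    let student = "fn add(a: i32, b: i32) -> i32 {\n    let sum = a + b;\n    sum\n}\n";
    // Shared lines: the signature and the closing brace (2 of 5 distinct lines).
    println!("{:.4}", line_jaccard(teacher, student)); // prints 0.4000
}
```

Because the metric works on sets of whole lines, renaming one variable changes every line that mentions it. That makes the score sensitive to style and explains how two implementations that pass the same tests can still score well below 1.0.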

Per-fixture detail

| Fixture | Outcome (teacher / student) | Structural Jaccard | Cross-swap (a / b) |
| --- | --- | --- | --- |
| HumanEval_0 has_close_elements | PASS / PASS | 0.8333 (10/12 lines shared) | swap_a=0 / swap_b=0 |
| HumanEval_1 separate_paren_groups | PASS / PASS | 0.3793 (11/29) | swap_a=0 / swap_b=0 |
| HumanEval_2 truncate_number | PASS / PASS | 0.4545 (5/11) | swap_a=0 / swap_b=0 |
| HumanEval_3 below_zero | PASS / PASS | 0.6000 (9/15) | swap_a=0 / swap_b=0 |
| HumanEval_4 mean_absolute_deviation | PASS / PASS | 0.3333 (4/12) | swap_a=0 / swap_b=0 |
| Aggregate | 5/5 / 5/5 BOTH_PASS | 0.5201 | 10/10 PASS |

swap_a = teacher's function compiled and run against the student's tests. swap_b = student's function compiled and run against the teacher's tests. Each cell reports the exit code of cargo test; 0 means PASS.
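The aggregate row can be re-derived from the per-fixture rows. The sketch below assumes the aggregate Jaccard is the unweighted mean of the per-fixture scores and that test-survival is the fraction of cross-swaps with exit code 0; both assumptions reproduce the reported 0.5201 and 10/10, but the real scripts may aggregate differently. `Row`, `poc_rows`, `mean_jaccard`, and `test_survival` are hypothetical names.

```rust
/// One row of the per-fixture table (hypothetical struct, for illustration).
struct Row {
    shared_lines: u32,
    union_lines: u32,
    swap_a_exit: i32,
    swap_b_exit: i32,
}

/// The five POC rows, transcribed from the per-fixture table.
fn poc_rows() -> Vec<Row> {
    [(10, 12), (11, 29), (5, 11), (9, 15), (4, 12)]
        .iter()
        .map(|&(shared_lines, union_lines)| Row {
            shared_lines,
            union_lines,
            swap_a_exit: 0,
            swap_b_exit: 0,
        })
        .collect()
}

/// Unweighted mean of per-fixture Jaccard scores (assumes a non-empty slice).
fn mean_jaccard(rows: &[Row]) -> f64 {
    let total: f64 = rows
        .iter()
        .map(|r| f64::from(r.shared_lines) / f64::from(r.union_lines))
        .sum();
    total / rows.len() as f64
}

/// Fraction of cross-swaps (two per fixture) whose exit code is 0.
fn test_survival(rows: &[Row]) -> f64 {
    let passed = rows
        .iter()
        .flat_map(|r| [r.swap_a_exit, r.swap_b_exit])
        .filter(|&code| code == 0)
        .count();
    passed as f64 / (2 * rows.len()) as f64
}

fn main() {
    let rows = poc_rows();
    println!("mean Jaccard  = {:.4}", mean_jaccard(&rows)); // 0.5201
    println!("test-survival = {:.4}", test_survival(&rows)); // 1.0000
}
```

A mean weighted by line count would give a different aggregate, since HumanEval_1 contributes 29 of the 79 union lines, so the unweighted form is the one consistent with the table.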

What this PROVES

apr code works end-to-end on real coding tasks: the orchestration loop (LLM driver + agent harness + tool dispatch) is functional with a local Qwen2.5-Coder-1.5B GGUF model on consumer hardware.
The user-facing parity question has a measured answer on this corpus, not a theoretical one. The 4 metrics are computed from real bilateral bench runs on the operator's noah-Lambda-Vector host, not synthesized or extrapolated.
Outcome parity, structural similarity, and semantic equivalence measure different things: two systems can BOTH pass while looking only ~50% alike at line level (M153), and they can be semantically equivalent at the test-survival level even when stylistically divergent (M154). The 4-metric grid surfaces these distinctions explicitly.
apr code ships in the default build: aprender#1638 MERGED 2026-05-13 (squash b61b76b4), so cargo install apr-cli now ships apr code without --features code. M150-M154 used a locally built apr with the feature flag manually removed; M162 confirmed the upstream landing. The previously suspected blocker PMAT-CODE-LLM-DRIVER-PUBLIC-001 (LlmDriver visibility) turned out not to gate the work; see completeness-assessment.md § Axis 3 for the discharge.

What this does NOT prove

Parity on the full 164-problem MultiPL-E-Rust corpus: the 5-problem POC sits in near-saturation territory (HumanEval problems with simple specs and clear test oracles); expansion will produce a tighter pass@1 + agreement curve and may surface divergence.
Parity on prompts with structural ambiguity: these 5 fixtures all have well-specified contracts (input/output types + example case). Open-ended prompts ("refactor this codebase", "design a new module") would test a different aspect of orchestration.
Parity on multi-turn / tool-using workflows: these fixtures use --max-turns 1 for apr code; multi-turn agent-loop tasks (file editing, codebase navigation, test-driven development) are outside the M150 bench scope.
Procedural parity at the OS level: measured separately at M148 (evidence/phase-2/measured-os-parity.json) and currently 0.3333, well below the M139 FALSIFY-CCPA-014 threshold of 0.95. Procedural parity is the diagnostic; outcome parity is the user-facing measure. They disagree by design: different runtimes (Node.js claude vs Rust apr) produce different syscall sets even when generating equivalent code.
Parity over time: model versions drift, Anthropic ships new claude releases, and aprender ships new apr versions. The bench numbers from 2026-05-12 are a snapshot, not a permanent claim.
Parity at project scale: see ProgramBench (Yang et al., 2026, arXiv:2605.03546). When LMs are asked to reconstruct full real-world programs (FFmpeg / SQLite / PHP interpreter) from executable + docs, 0% of 200 tasks are fully resolved and the best model passes ≥95% of tests on only 3% of tasks. M150-M154's 1.0000 result is for function-level tasks (HumanEval-style single-function generation); project-scale outcome parity is the natural M161+ P3.6 future work (see outcome-parity-plan.md § P3.6).

Axis 2 score interpretation

| Milestone | Axis 2 score | Source |
| --- | --- | --- |
| Pre-M136 (machinery TBD) | ~30% | M111 honest assessment |
| Post-M141 (machinery shipped, no real captures) | ~45% | M140 honest assessment |
| Post-M149 (Phase 3 plan documented) | ~50% | M149 reframe |
| Post-M154 (4-metric real-binary evidence) | ~70% | M155 honest re-assessment |
| Post-M162 (aprender#1638 MERGED; LlmDriver-adapter row FULLY DISCHARGED) | ~75% | M162 |
| Post-M164 (P3.5 contract bump v1.26.0; CCPA-015 + 016 ACTIVE_RUNTIME) | ~80% | M164 |
| Post-M167 (v1.27.0; CCPA-013 OPEN → ACTIVE_RUNTIME; last OPEN gate closed) | ~82% | M167 |
| Post-M177 (corpus 5 → 21; fixture validation layered; pre-commit integration) | ~85% | M177 |
| Post-M190 (Phase 4 P4.1-P4.5 SHIPPED: project-scale corpus + runner + scorer + CCPA-017 gate scaffold + contract bump v1.27.0 → v1.28.0 PROPOSED) | ~87% | M190 |
| Post-M210 (Phase 5 P5.1-P5.5 SHIPPED: Arena harness + multi-turn loop + bench runner + CCPA-018 gate scaffold + falsifier-of-falsifier comparator; contract bump v1.28.0 → v1.29.0 SHIPPED M208; coverage closure M210) | ~90% | M210 |
| One-number summary post-M210 | ~90% | M210 refresh |

Remaining ~10% to fully closed Axis 2:

(a) operator-dispatched recalibrated bench on the 21-fixture MultiPL-E-Rust corpus via bash scripts/phase-3-bench.sh (one command; produces the next agreement curve);
(b) operator-dispatched first project-scale bench on the 5-fixture corpus via bash scripts/phase-4-bench.sh (calibrates the CCPA-017 threshold from real data; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0);
(c) operator-dispatched first Arena bench on the M182 corpus via bash scripts/phase-5-arena-bench.sh (calibrates CCPA-018 thresholds and answers the design-audit.md §5 Popperian test; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0);
(d) bench expansion 21 → full 164 MultiPL-E-Rust (operator-dispatched, ~5-10h wall);
(e) optional AST-level diff sub-metric (syn-based, complements the M153 line-set Jaccard).

Evidence files

All numbers in this doc are reproducible from these checked-in evidence files:

evidence/phase-3/multipl-e-rust-scores.json — outcome parity (M150)
evidence/phase-3/cross-output-equivalence.json — structural similarity (M153)
evidence/phase-3/test-survival.json — test-survival rate (M154)
evidence/phase-3/captures/<id>/{teacher,student}.{src.rs,test-output,exit_code} — per-fixture per-side audit trail
evidence/phase-2/measured-os-parity.json — procedural parity (M148, diagnostic)
evidence/phase-2/captures/<id>/{teacher,student}.{ccpa-os-trace.jsonl,stderr,exit_code} — OS-event per-fixture (M147)
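For auditing a single fixture, the brace pattern in the Phase 3 captures entry expands to six files per fixture. The sketch below enumerates them. The function name `phase3_capture_paths` is hypothetical, and the fixture id `HumanEval_0` used in `main` is only a guess at what `<id>` looks like on disk.

```rust
use std::path::PathBuf;

/// Expand evidence/phase-3/captures/<id>/{teacher,student}.{src.rs,test-output,exit_code}
/// into concrete relative paths for one fixture id.
fn phase3_capture_paths(fixture_id: &str) -> Vec<PathBuf> {
    let base = PathBuf::from("evidence/phase-3/captures").join(fixture_id);
    let mut paths = Vec::new();
    for side in ["teacher", "student"] {
        for suffix in ["src.rs", "test-output", "exit_code"] {
            paths.push(base.join(format!("{}.{}", side, suffix)));
        }
    }
    paths
}

fn main() {
    // "HumanEval_0" is an assumed id format, not confirmed by the evidence tree.
    for path in phase3_capture_paths("HumanEval_0") {
        let status = if path.exists() { "present" } else { "missing" };
        println!("{} ({})", path.display(), status);
    }
}
```

Run from the repository root, this would list which parts of the per-side audit trail are present for a given fixture.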

Gate registration status

| Gate | Status | Source |
| --- | --- | --- |
| FALSIFY-CCPA-014 OS-event parity bound (threshold 0.95) | ACTIVE_RUNTIME at v1.25.0 | `claude-code-parity-apr-v1.yaml` |
| FALSIFY-CCPA-015 output purity | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M147) | `crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs` |
| FALSIFY-CCPA-016 outcome parity bound (threshold 0.5) | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M152) | `crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs` |

P3.5 contract bump v1.25.0 → v1.26.0 SHIPPED at M164 (2026-05-13) via the M22 5-step ritual mirroring aprender#1665 squash 9cbac28b5. Gate count: 14 → 16.
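The two threshold gates can be read as simple lower bounds on a measured score. The sketch below is one way to picture that check, not the code in the `falsify_ccpa_*` test files. The `Gate` struct is hypothetical, and the `>=` comparison is confirmed by this document only for CCPA-016 (`>= 0.5`); applying it to CCPA-014 is an assumption.

```rust
/// A lower-bound gate: it holds when the measured score is at or above the threshold.
/// (Hypothetical type, for illustration only.)
struct Gate {
    id: &'static str,
    threshold: f64,
}

impl Gate {
    fn holds(&self, measured: f64) -> bool {
        measured >= self.threshold
    }
}

fn main() {
    let outcome = Gate { id: "FALSIFY-CCPA-016", threshold: 0.5 };
    let os_event = Gate { id: "FALSIFY-CCPA-014", threshold: 0.95 };

    // Measured values quoted in this document: outcome parity 1.0, OS-event parity 0.3333.
    for (gate, measured) in [(&outcome, 1.0), (&os_event, 0.3333)] {
        println!(
            "{} measured={:.4} threshold={:.2} holds={}",
            gate.id,
            measured,
            gate.threshold,
            gate.holds(measured)
        );
    }
}
```

The bidirectional sensitivity described for CCPA-016 corresponds to checking both directions of `holds`: an identity fixture (score 1.0) must pass, and a synthetic regression that scores below 0.5 must fail.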

Cross-refs

outcome-parity-plan.md — original P3.1-P3.5 design + sub-deliverable status
phase-2-execution-plan.md — procedural (OS-level) parity plan
axis-2-closure-plan.md — M113 5-idea brainstorm
completeness-assessment.md § "Are we at parity?" — M155 honest re-assessment
risks.md R6 — empirical discharge of the LlmDriver-pub-stable risk via M150
