Outcome-parity results (Phase 3, M150-M156)

Status: SHIPPED on 5-fixture MultiPL-E-Rust POC corpus. Consolidates the empirical findings from M150-M154 + the spec-honesty refresh at M155-M156 into one citable place.

Executive summary

On the 5-fixture MultiPL-E-Rust HumanEval/0..4 POC corpus, the answer to the operator's question "so we can ask apr code to generate same code as claude code and 'it works'?" is YES:

  • Outcome parity = 1.0000 — both real claude (Anthropic 2.1.139) and real apr code (aprender 0.32.0 + Qwen2.5-Coder-1.5B-Instruct GGUF Q4_K_M) produce Rust that compiles and passes the test oracle on all 5 problems.
  • Test-survival = 1.0000 — every test from either system runs correctly against either implementation; the two systems are functionally interchangeable.
  • Structural similarity = 0.5201 — code looks ~50% similar at the line level; the divergence is purely stylistic (variable names, type annotations, idiom choice), not semantic.

The current parity claim is 5/5 BOTH_PASS scoped to this POC corpus. Bench expansion 5 → full 164 problems (M158+) will tighten the claim to a pass@1 + agreement curve.
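One possible reading of the "pass@1 + agreement" framing is sketched below. The record shape, the helper names `pass_at_1` and `agreement`, and the definition of agreement (the fraction of problems on which both systems have the same pass/fail outcome) are assumptions for illustration, not the bench's actual schema. With one sample per problem, pass@1 reduces to the fraction of problems passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One problem's result for both systems (field names are hypothetical)."""
    problem: str
    claude_pass: bool
    apr_pass: bool


def pass_at_1(outcomes, side):
    """Fraction of problems passed by `side` ("claude" or "apr"), one sample each."""
    flags = [getattr(o, side + "_pass") for o in outcomes]
    return sum(flags) / len(flags)


def agreement(outcomes):
    """Fraction of problems where both systems land on the same outcome (assumed definition)."""
    same = [o.claude_pass == o.apr_pass for o in outcomes]
    return sum(same) / len(same)


# The five POC fixtures, all BOTH_PASS as reported in the summary.
poc = [Outcome(f"HumanEval_{i}", True, True) for i in range(5)]

# A made-up sixth problem, only to show how a divergence would move the numbers.
hypothetical = poc + [Outcome("made_up_problem", True, False)]

print(pass_at_1(poc, "claude"), pass_at_1(poc, "apr"), agreement(poc))
print(pass_at_1(hypothetical, "apr"), agreement(hypothetical))
```

On a larger corpus, plotting these two quantities as problems are added would give one version of the curve the expansion is expected to produce.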

The 4-metric grid

| Metric | Value | Source | What it tells us |
|---|---|---|---|
| Outcome parity | 1.0000 (5/5 BOTH_PASS) | M150: scripts/phase-3-bench.sh, evidence/phase-3/multipl-e-rust-scores.json | Both systems pass their own MultiPL-E-Rust tests |
| Outcome parity gate | >= 0.5 (current 1.0) | M152: FALSIFY-CCPA-016 in ccpa-differ (PROPOSED at M152, ACTIVE_RUNTIME as of v1.26.0 / M164; threshold enforced via test) | POC-tier threshold enforced; bidirectional sensitivity via synthetic regression + identity fixtures |
| Structural similarity | 0.5201 (range 0.33–0.83, zero byte-identical pairs) | M153: ccpa_differ::cross_output_equivalence(), evidence/phase-3/cross-output-equivalence.json | Line-set Jaccard over trimmed non-empty lines |
| Test-survival | 1.0000 (10/10 cross-swaps) | M154: scripts/phase-3-test-survival.sh, evidence/phase-3/test-survival.json | Every test from either system passes against either implementation (SEMANTIC equivalence) |
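The structural-similarity row describes a line-set Jaccard over trimmed non-empty lines. A minimal sketch of that computation follows. The function names and the choice to score two empty inputs as 1.0 are assumptions, and this is not the `ccpa_differ::cross_output_equivalence()` implementation itself. The two Rust snippets are toy inputs, not the captured outputs.

```python
def line_set(source):
    """Set of lines after trimming whitespace and dropping empty lines."""
    return {line.strip() for line in source.splitlines() if line.strip()}


def line_set_jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the trimmed non-empty line sets of two sources."""
    sa, sb = line_set(a), line_set(b)
    union = sa | sb
    if not union:
        return 1.0  # assumption: two empty sources count as identical
    return len(sa & sb) / len(union)


# Toy inputs: same signature and closing brace, different bodies.
left = """
fn truncate_number(number: f32) -> f32 {
    number % 1.0
}
"""
right = """
fn truncate_number(number: f32) -> f32 {
    let whole = number.trunc();
    number - whole
}
"""

score = line_set_jaccard(left, right)
print(f"{score:.4f}")  # 2 shared lines out of 5 distinct lines -> 0.4000
```

Because the comparison is over sets, duplicate lines and line order do not affect the score, which is one reason a line-level metric like this says nothing about semantic equivalence.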

Per-fixture detail

| Fixture | Outcome (teacher / student) | Structural Jaccard | Cross-swap (a / b) |
|---|---|---|---|
| HumanEval_0 has_close_elements | PASS / PASS | 0.8333 (10/12 lines shared) | swap_a=0 / swap_b=0 |
| HumanEval_1 separate_paren_groups | PASS / PASS | 0.3793 (11/29) | swap_a=0 / swap_b=0 |
| HumanEval_2 truncate_number | PASS / PASS | 0.4545 (5/11) | swap_a=0 / swap_b=0 |
| HumanEval_3 below_zero | PASS / PASS | 0.6000 (9/15) | swap_a=0 / swap_b=0 |
| HumanEval_4 mean_absolute_deviation | PASS / PASS | 0.3333 (4/12) | swap_a=0 / swap_b=0 |
| Aggregate | 5/5 / 5/5 BOTH_PASS | 0.5201 | 10/10 PASS |

swap_a = teacher's function compiled with student's tests. swap_b = student's function compiled with teacher's tests. Exit code 0 = cargo test PASS.
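The aggregate row can be re-derived from the per-fixture rows. The sketch below hard-codes the values from the table (shared and total line counts, swap exit codes) and assumes, consistent with the reported 0.5201, that the aggregate structural score is the unweighted mean of the per-fixture ratios. The tuple layout is hypothetical and is not the schema of the evidence JSON files.

```python
# (fixture, teacher_pass, student_pass, shared_lines, total_lines, swap_a_exit, swap_b_exit)
ROWS = [
    ("HumanEval_0", True, True, 10, 12, 0, 0),
    ("HumanEval_1", True, True, 11, 29, 0, 0),
    ("HumanEval_2", True, True, 5, 11, 0, 0),
    ("HumanEval_3", True, True, 9, 15, 0, 0),
    ("HumanEval_4", True, True, 4, 12, 0, 0),
]


def aggregate(rows):
    """Recompute the three headline metrics from per-fixture rows."""
    n = len(rows)
    both_pass = sum(1 for r in rows if r[1] and r[2])
    # Unweighted mean of per-fixture Jaccard ratios (assumed aggregation rule).
    mean_jaccard = sum(r[3] / r[4] for r in rows) / n
    # Each fixture contributes two cross-swaps; exit code 0 means cargo test passed.
    swap_codes = [code for r in rows for code in (r[5], r[6])]
    survived = sum(1 for code in swap_codes if code == 0)
    return {
        "outcome_parity": both_pass / n,
        "structural_similarity": round(mean_jaccard, 4),
        "swaps_survived": survived,
        "swaps_total": len(swap_codes),
        "test_survival": survived / len(swap_codes),
    }


summary = aggregate(ROWS)
print(summary)
```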

What this PROVES

  1. apr code works end-to-end on real coding tasks — the orchestration loop (LLM driver + agent harness + tool dispatch) is functional with a local Qwen2.5-Coder-1.5B GGUF model on consumer hardware.
  2. The user-facing parity question has a measured answer on this corpus — not a theoretical one. The 4 metrics are computed from real bilateral bench runs on the operator's noah-Lambda-Vector host, not synthesized or extrapolated.
  3. Outcome parity, structural similarity, and semantic equivalence are orthogonal — two systems can BOTH pass while looking only ~50% alike at line level (M153), and they can be semantically equivalent at the test-survival level even when stylistically divergent (M154). The 4-metric grid surfaces these distinctions explicitly.
  4. apr code ships in the default build — aprender#1638 MERGED 2026-05-13 (squash b61b76b4) — cargo install apr-cli now ships apr code without --features code. M150-M154 used a locally-built apr with the flag manually removed; M162 confirmed the upstream landing. The previously-suspected blocker PMAT-CODE-LLM-DRIVER-PUBLIC-001 (LlmDriver visibility) turned out to not gate the work; see completeness-assessment.md § Axis 3 for the discharge.

What this does NOT prove

  1. Parity on the full 164-problem MultiPL-E-Rust corpus: the 5-problem POC sits in near-saturation territory (HumanEval problems with simple specs and clear test oracles); expansion will produce a tighter pass@1 + agreement curve and may surface divergence.
  2. Parity on prompts with structural ambiguity — these 5 fixtures all have well-specified contracts (input/output types + example case). Open-ended prompts ("refactor this codebase", "design a new module") would test a different aspect of orchestration.
  3. Parity on multi-turn / tool-using workflows — these fixtures use --max-turns 1 for apr code; multi-turn agent-loop tasks (file editing, codebase navigation, test-driven development) are outside the M150 bench scope.
  4. Procedural parity at the OS level — separately measured at M148 (evidence/phase-2/measured-os-parity.json) and currently 0.3333 (well below the M139 FALSIFY-CCPA-014 threshold 0.95). Procedural parity is the diagnostic; outcome parity is the user-facing measure. They disagree by design — different runtimes (Node.js claude vs Rust apr) produce different syscall sets even when generating equivalent code.
  5. Parity over time — model versions drift; Anthropic ships new claude releases; aprender ships new apr versions; bench numbers from 2026-05-12 are a snapshot, not a permanent claim.
  6. Parity at project scale: see ProgramBench (Yang et al., 2026, arXiv:2605.03546). When LMs are asked to reconstruct full real-world programs (FFmpeg / SQLite / PHP interpreter) from an executable plus docs, 0% of 200 tasks are fully resolved and the best model passes ≥95% of tests on only 3% of tasks. The 1.0000 result from M150-M154 covers function-level tasks (HumanEval-style single-function generation); project-scale outcome parity is future work for M161+ (P3.6), see outcome-parity-plan.md § P3.6.

Axis 2 score interpretation

| Milestone | Axis 2 score | Source |
|---|---|---|
| Pre-M136 (machinery TBD) | ~30% | M111 honest assessment |
| Post-M141 (machinery shipped, no real captures) | ~45% | M140 honest assessment |
| Post-M149 (Phase 3 plan documented) | ~50% | M149 reframe |
| Post-M154 (4-metric real-binary evidence) | ~70% | M155 honest re-assessment |
| Post-M162 (aprender#1638 MERGED; LlmDriver-adapter row FULLY DISCHARGED) | ~75% | M162 |
| Post-M164 (P3.5 contract bump v1.26.0; CCPA-015 + 016 ACTIVE_RUNTIME) | ~80% | M164 |
| Post-M167 (v1.27.0; CCPA-013 OPEN → ACTIVE_RUNTIME; last OPEN gate closed) | ~82% | M167 |
| Post-M177 (corpus 5 → 21; fixture validation layered; pre-commit integration) | ~85% | M177 |
| Post-M190 (Phase 4 P4.1-P4.5 SHIPPED: project-scale corpus + runner + scorer + CCPA-017 gate scaffold + contract bump v1.27.0 → v1.28.0 PROPOSED) | ~87% | M190 |
| Post-M210 (Phase 5 P5.1-P5.5 SHIPPED: Arena harness + multi-turn loop + bench runner + CCPA-018 gate scaffold + falsifier-of-falsifier comparator; contract bump v1.28.0 → v1.29.0 SHIPPED M208; coverage closure M210) | ~90% | M210 |
| One-number summary post-M210 | ~90% | M210 refresh |

Remaining ~10% to a fully closed Axis 2:

  (a) Operator-dispatched recalibrated bench on the 21-fixture MultiPL-E-Rust corpus via bash scripts/phase-3-bench.sh (one command; produces the next agreement curve).
  (b) Operator-dispatched first project-scale bench on the 5-fixture corpus via bash scripts/phase-4-bench.sh (calibrates the CCPA-017 threshold from real data; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0).
  (c) Operator-dispatched first Arena bench on the M182 corpus via bash scripts/phase-5-arena-bench.sh (calibrates the CCPA-018 thresholds and answers the design-audit.md §5 Popperian test; flips PROPOSED → ACTIVE_RUNTIME at v1.30.0).
  (d) Bench expansion 21 → full 164 MultiPL-E-Rust (operator-dispatched, ~5-10h wall).
  (e) Optional AST-level diff sub-metric (syn-based; complements the M153 line-set Jaccard).

Evidence files

All numbers in this doc are reproducible from these checked-in evidence files:

  • evidence/phase-3/multipl-e-rust-scores.json — outcome parity (M150)
  • evidence/phase-3/cross-output-equivalence.json — structural similarity (M153)
  • evidence/phase-3/test-survival.json — test-survival rate (M154)
  • evidence/phase-3/captures/<id>/{teacher,student}.{src.rs,test-output,exit_code} — per-fixture per-side audit trail
  • evidence/phase-2/measured-os-parity.json — procedural parity (M148, diagnostic)
  • evidence/phase-2/captures/<id>/{teacher,student}.{ccpa-os-trace.jsonl,stderr,exit_code} — OS-event per-fixture (M147)

Gate registration status

| Gate | Status | Source |
|---|---|---|
| FALSIFY-CCPA-014 OS-event parity bound (threshold 0.95) | ACTIVE_RUNTIME at v1.25.0 | claude-code-parity-apr-v1.yaml |
| FALSIFY-CCPA-015 output purity | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M147) | crates/ccpa-subproc/tests/falsify_ccpa_015_output_purity.rs |
| FALSIFY-CCPA-016 outcome parity bound (threshold 0.5) | ACTIVE_RUNTIME at v1.26.0 (M164; was PROPOSED at v1.25.0 / M152) | crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs |
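The outcome-parity gate reduces to a threshold check on the BOTH_PASS fraction. The sketch below illustrates the bidirectional-sensitivity idea from the metric grid: an identity fixture must satisfy the gate and a synthetic regression must violate it. It is written in Python for brevity; the function names and fixture values are hypothetical and do not reproduce the Rust test in crates/ccpa-differ.

```python
THRESHOLD = 0.5  # POC-tier bound stated for FALSIFY-CCPA-016


def outcome_parity(results):
    """Fraction of fixtures where both sides pass; results is a list of (teacher_pass, student_pass)."""
    if not results:
        raise ValueError("outcome parity is undefined on an empty corpus")
    both = sum(1 for teacher_ok, student_ok in results if teacher_ok and student_ok)
    return both / len(results)


def gate_holds(results, threshold=THRESHOLD):
    """True when the measured parity meets or exceeds the threshold."""
    return outcome_parity(results) >= threshold


# Identity fixture: both sides pass everywhere, so the gate must hold.
identity = [(True, True)] * 5

# Synthetic regression: the student fails 4 of 5, so the gate must trip.
regression = [(True, True)] + [(True, False)] * 4

# The measured POC corpus: 5/5 BOTH_PASS.
measured = [(True, True)] * 5

print(gate_holds(identity), gate_holds(regression), gate_holds(measured))
```

Checking both directions guards against a gate that passes vacuously: a check that never fails on a known regression would give no information when it passes on real data.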

P3.5 contract bump v1.25.0 → v1.26.0 SHIPPED at M164 (2026-05-13) via the M22 5-step ritual mirroring aprender#1665 squash 9cbac28b5. Gate count: 14 → 16.

Cross-refs