Top spec: claude-code-parity-apr-poc.md | Design audit (M192) | Phase 5 falsification evidence (M224)
Soft-deprecated, not removed. FALSIFY-CCPA-008 (parity_score_bound) remains an active gate enforcing aggregate parity_score ≥ 0.95 on the AUTHORED canonical fixture corpus. The change is one of interpretation, not enforcement: the 1.0000 score is now framed as meter validation (the differ correctly recognizes equivalent traces), NOT as system-level parity validation (apr code and claude produce equivalent behavior on real tasks).
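To make the gate's arithmetic concrete, here is a minimal Rust sketch of a parity_score_bound-style check. It is not the project's implementation: the function name `check_parity_score_bound`, the `GateOutcome` enum, and the choice of an arithmetic mean for the aggregate are all assumptions made for illustration. The 0.95 aggregate threshold comes from the text above; the 0.80 per-fixture floor is the one this document lists for CCPA-008 in its before/after comparison.

```rust
/// Thresholds as stated in this document for CCPA-008.
const AGGREGATE_MIN: f64 = 0.95;
const PER_FIXTURE_MIN: f64 = 0.80;

/// Hypothetical outcome type; the real gate's types are not shown in this document.
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    Pass,
    /// No fixtures were scored, so there is nothing to bound.
    EmptyCorpus,
    /// The fixture at `index` scored below the per-fixture floor (or was NaN).
    FixtureBelowFloor { index: usize },
    /// Every fixture cleared the floor, but the mean fell below the aggregate threshold.
    AggregateBelowThreshold,
}

/// Assumption: the aggregate is the arithmetic mean of the per-fixture scores.
pub fn check_parity_score_bound(scores: &[f64]) -> GateOutcome {
    if scores.is_empty() {
        return GateOutcome::EmptyCorpus;
    }
    // Report the first fixture that misses the floor; NaN counts as a miss.
    if let Some(index) = scores
        .iter()
        .position(|&s| s.is_nan() || s < PER_FIXTURE_MIN)
    {
        return GateOutcome::FixtureBelowFloor { index };
    }
    let aggregate = scores.iter().sum::<f64>() / scores.len() as f64;
    if aggregate < AGGREGATE_MIN {
        GateOutcome::AggregateBelowThreshold
    } else {
        GateOutcome::Pass
    }
}
```

Thirty fixtures that each score 1.0 pass this check. Under the reframing, that result says the differ scored AUTHORED pairs correctly; it says nothing about live behavior.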
The "1.0 on 30/30 fixtures aggregate=1.0000" headline number, repeated across the project's lifetime, carried an implicit claim:
The CCPA harness validates that `apr code` produces equivalent behavior to `claude` on real engineering tasks.
The M224 operator-dispatched Phase 5 Arena bench falsified this implicit claim empirically. Static fixtures (M150 MultiPL-E-Rust HumanEval/0..4, agreement = 1.0000) over-predicted live-Arena results (M224 bench on the M182 project-scale corpus, oracle_passed_rate = 0.0000 for BOTH claude AND apr code) by the largest possible margin: a predicted 1.0 against an observed 0.0. Per design-audit.md §5, the static-fixture approach is FALSIFIED as a convergence predictor.
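The rule quoted from design-audit.md ("if static ≥ 0.95 AND arena ~ 0, static approach is FALSIFIED") can be sketched as a small deterministic comparator. This is an illustrative sketch, not the code in crates/ccpa-arena/src/falsifier.rs: the function name `sketch_static_vs_arena`, the `NotFalsified` variant, and the 0.05 cutoff used to interpret "arena ~ 0" are assumptions. Only the 0.95 static threshold and the `StaticFalsified` verdict name come from this document.

```rust
/// Verdict of the static-vs-arena comparison.
/// `StaticFalsified` is named in this document; `NotFalsified` is a hypothetical complement.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    StaticFalsified,
    NotFalsified,
}

/// Static threshold quoted from the design-audit rule.
const STATIC_MIN: f64 = 0.95;
/// Assumption: "arena ~ 0" is read as "at most 0.05". The real cutoff is not given here.
const ARENA_NEAR_ZERO: f64 = 0.05;

/// Returns `StaticFalsified` when static fixtures predict parity
/// but the live Arena pass rate is effectively zero.
pub fn sketch_static_vs_arena(static_agreement: f64, arena_oracle_passed_rate: f64) -> Verdict {
    if static_agreement >= STATIC_MIN && arena_oracle_passed_rate <= ARENA_NEAR_ZERO {
        Verdict::StaticFalsified
    } else {
        Verdict::NotFalsified
    }
}
```

With the M224 inputs (static agreement 1.0000, oracle_passed_rate 0.0000) this sketch returns `StaticFalsified`, matching the verdict the document reports.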
The 1.0000 score is still meaningful — but as a different claim than the headline suggested:
- The differ logic is correct. Given two paired trace files describing equivalent actions, the differ recognizes them as equivalent.
- The scoring algorithm is correct. When two traces should produce a perfect score, they do.
- The per-tool equivalence rules are correct. Edit + Write + Bash + Read all map to comparable surface forms across both systems.
- The drift-category taxonomy is correct. When traces SHOULD diverge, the closed-enum `DriftCategory` catches the divergence.
In short: the meter works. What the meter measures, when pointed at AUTHORED inputs, is a closed-loop consistency check on the harness itself. That is a real engineering claim — it's just not the claim the headline implied.
The 1.0000 score does NOT mean:
- `apr code` and `claude` produce equivalent behavior in production. The M224 evidence shows both fail 5/5 on real GitHub issues under the Arena harness.
- The harness can predict user-facing parity from fixtures alone. In the M224 case the static fixtures predicted 1.0 where the Arena measured 0.0.
- The 30 AUTHORED canonical fixtures cover the behavior space. They cover the harness's own meter surface, not the system's behavior surface.
These are the claims that the M224 bench falsified, and the claims this soft-deprecation removes from the project's user-facing narrative.
| Aspect | Before M230 | After M230 |
|---|---|---|
| CCPA-008 enforcement | Active (aggregate ≥ 0.95, per-fixture ≥ 0.80) | Active (unchanged) |
| CCPA-008 interpretation | "System produces parity" | "Meter recognizes parity correctly" |
| CCPA-008 status in contract | Active gate (no formal status field — it's M6-era) | ADVISORY at v1.30.0 (operator-coordinated upstream) |
| Foreground parity claim | "1.0 on 30/30 fixtures" | "0/5 on M182 project-scale (M224) + 1.0 on 5/5 MultiPL-E-Rust HumanEval (M150)" |
| Foreground parity gates | CCPA-008 (the M6 flagship) | CCPA-016 (outcome) + CCPA-017 (project-scale) + CCPA-018 (Arena recovery) |
| Top-spec Axis 2 score | ~90% (over-claimed) | ~55% (M224 honest re-assessment) |
| The headline answer | "YES, we are at parity" | "YES on function-scale; NO on project-scale; unknown on most real-world workloads" |
- The gate is not deleted. CCPA-008 stays in the contract registry, the test still runs, the threshold still enforces.
- The 30-fixture corpus is not deleted. `fixtures/canonical/` remains.
- The MultiPL-E-Rust 5-fixture result is not retracted. It is a real measurement of function-scale outcome parity.
- CCPA-013 is not affected. Its assertion is `≥5 fixtures, aggregate ≥ 0.95`; that's still satisfied at the meter-validation interpretation.
- CCPA-001..007 (trace schema, replay determinism, mock completeness, tool-call equivalence, file-mutation equivalence, sovereignty, corpus coverage) are not affected. These are independent meter-validation gates and continue to operate.
- Top-spec headline prose (claude-code-parity-apr-poc.md § Completeness summary) — already updated at M224. The "Are we at parity?" answer is now bifurcated by scale (YES function-scale / NO project-scale) and the Axis 2 score is ~55% not ~90%.
- CCPA-008 description in falsification-conditions.md — annotated post-M230 to state "validates METER not SYSTEM" and cross-reference this doc.
- CCPA-017 + CCPA-018 promoted to foreground — these now carry the user-facing parity claims (with the acknowledgement that current data shows 0/5 for both systems, and the gates remain PROPOSED pending operator-dispatched bench evidence that lifts oracle_passed_rate or recovery_rate above the thresholds).
- Upstream contract amendment (deferred to v1.30.0 PR on aprender, operator-coordinated) — adds an explicit status field to CCPA-008 with value `ADVISORY` (or equivalent lifecycle annotation) and a `semantic_change_log` entry recording the M230 reframe.
| Milestone | Action | Outcome |
|---|---|---|
| M0–M6 | Original POC scope: action-stream parity over AUTHORED canonical fixtures | CCPA-008 authored as M6 flagship |
| M11 | First measured parity ≥ 5 fixtures, aggregate ≥ 0.95 | CCPA-013 OPEN (later DISCHARGED at 30 fixtures, 1.0000) |
| M111 | First honest 3-axis breakdown — Axis 2 ~30% | Operator-prompted framework: meter vs. system |
| M118 | deepclaude prior-art DISCHARGES R2 technical premise | HTTPS proxy is technically feasible — but operationally rescoped OOS |
| M136-M141 | Axis-2-closure-plan idea (2) ships — CLI subprocess instrumentation + OS-event parity (CCPA-014) | Different validation surface than CCPA-008 |
| M150-M154 | Phase 3 outcome-parity SHIPPED — outcome 1.0000 + structural 0.5201 + test-survival 1.0000 on MultiPL-E-Rust 5-fixture | First real-binary evidence; function-scale parity confirmed |
| M159 | ProgramBench prior-art integrated (0%/200 SOTA-model baseline) | Validates the "function-scale 1.0 does not extrapolate to project-scale" caveat, which anticipated the M224 result |
| M180-M190 | Phase 4 project-scale plan + corpus + runner + scorer + CCPA-017 gate (PROPOSED at v1.28.0) | Machinery for project-scale measurement built; no evidence yet |
| M192 | Operator authors design-audit.md — Popperian falsifier explicitly states: "if static ≥ 0.95 AND arena ~ 0, static approach is FALSIFIED" | The framework that M224 would empirically test |
| M194-M210 | Phase 5 Arena runner + bench + CCPA-018 gate + Popperian comparator + coverage closure | Machinery for live-Arena measurement built |
| M224 | First operator-dispatched Phase 5 Arena bench | 0/5 oracle_passed_rate for BOTH claude AND apr code. design-audit.md §5 Popperian verdict: StaticFalsified. |
| M226 | aprender#1712 filed (apr-serve leak) + `pkill apr serve` workaround | Reduces but doesn't eliminate methodology confounds |
| M228 | Second operator-dispatched Arena bench (post-workaround) | Same verdict 0/5; cleaner student data on 2 of 5 fixtures; the leaked-process confound was NOT the cause |
| M230 (this doc) | Soft-deprecate CCPA-008 — reframe meter-vs-system | Audit trail complete; the user-facing parity narrative now matches the measured reality |
- evidence/phase-5/static-fixture-falsification.md — measured inputs + per-fixture verdict that triggers this M-row
- design-audit.md § 5 — operator-authored Popperian falsifier
- falsification-conditions.md § CCPA-008 — gate registry entry (annotated post-M230)
- crates/ccpa-arena/src/falsifier.rs — `evaluate_static_vs_arena` deterministic comparator
- aprender#1712 — apr serve leak upstream
- Future: `paiml/aprender` v1.30.0 PR (operator-coordinated) — contract amendment formalizing CCPA-008 status: ADVISORY + v1.29.0 → v1.30.0 status_history entry citing M224 + M230
The CCPA harness was always two things:
- A meter — code that recognizes when two trace files describe equivalent actions.
- A test — that meter, pointed at `apr code`'s actual output, would catch divergence from `claude`'s output.
From M0 through M223 we shipped the meter and assumed the test was implicitly working because the fixtures passed. M224 was the first time the test was pointed at REAL system output on REAL tasks. The answer it gave: the static-fixture corpus had been validating the meter, not the system.
That's not a failure of the meter — the meter works. It's a recognition that meter-validation is not system-validation, and the project's user-facing narrative needs to reflect that distinction.
Going forward:
- The meter (CCPA-001..008) is ADVISORY for system claims, ENFORCING for meter claims.
- The system test (CCPA-016 / CCPA-017 / CCPA-018) is the foreground parity claim — currently 0/5 on project-scale, and that's the honest number.
- Closing the gap requires either (a) fixing apr-side bugs that block clean student-side data (aprender#1712 + likely others), (b) reviewing fixture difficulty and oracle strictness, or (c) acknowledging that under this harness, neither SOTA agent solves these tasks zero-shot — that is itself a legitimate user-facing finding.
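The first two bullets above amount to a lookup from a gate and the kind of claim being made to the role that gate plays. A minimal sketch, assuming gates are identified by their CCPA number: the `ClaimKind` and `GateRole` enums and the `role_for` function are hypothetical names, not part of the contract registry, and any gate the bullets do not mention is left unclassified rather than guessed.

```rust
/// The two kinds of claim this document distinguishes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClaimKind {
    /// "The differ recognizes equivalent traces correctly."
    Meter,
    /// "`apr code` and `claude` behave equivalently on real tasks."
    System,
}

/// Hypothetical role labels derived from the "Going forward" bullets.
#[derive(Debug, PartialEq)]
pub enum GateRole {
    Enforcing,
    Advisory,
    Foreground,
    /// The bullets do not assign this gate a role for this kind of claim.
    Unclassified,
}

/// Maps a CCPA gate number and a claim kind to the role stated in the bullets.
pub fn role_for(gate_number: u32, claim: ClaimKind) -> GateRole {
    match (gate_number, claim) {
        // CCPA-001..008: the meter gates.
        (1..=8, ClaimKind::Meter) => GateRole::Enforcing,
        (1..=8, ClaimKind::System) => GateRole::Advisory,
        // CCPA-016..018: the system test that carries the foreground parity claim.
        (16..=18, ClaimKind::System) => GateRole::Foreground,
        _ => GateRole::Unclassified,
    }
}
```

The point of the two-argument lookup is that CCPA-008 does not have a single status: the same gate enforces a meter claim and only advises on a system claim.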