Static-fixture soft-deprecation (M230)

Top spec: claude-code-parity-apr-poc.md | Design audit (M192) | Phase 5 falsification evidence (M224)

Status

Soft-deprecated, not removed. FALSIFY-CCPA-008 (parity_score_bound) remains an active gate enforcing aggregate parity_score ≥ 0.95 on the AUTHORED canonical fixture corpus. The change is one of interpretation, not enforcement: the 1.0000 score is now framed as meter validation (the differ correctly recognizes equivalent traces), NOT as system-level parity validation (apr code and claude produce equivalent behavior on real tasks).

What changed and why

The over-extrapolation that the M224 Arena bench falsified

The "1.0 on 30/30 fixtures aggregate=1.0000" headline number, repeated across the project's lifetime, carried an implicit claim:

The CCPA harness validates that apr code produces equivalent behavior to claude on real engineering tasks.

The M224 operator-dispatched Phase 5 Arena bench falsified this implicit claim empirically. Static fixtures (M150 MultiPL-E-Rust HumanEval/0..4, agreement = 1.0000) over-predicted live-Arena results (M224 bench on the M182 project-scale corpus, oracle_passed_rate = 0.0000 for BOTH claude AND apr code) by the largest margin the scale allows: 1.0 → 0.0, an absolute gap of 1.0 and an unbounded ratio. Per design-audit.md §5, the static-fixture approach is FALSIFIED as a convergence predictor.
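The falsifier rule quoted in the audit trail ("if static ≥ 0.95 AND arena ~ 0, static approach is FALSIFIED") can be sketched as a small comparator. This is an illustrative sketch, not the project's Popperian comparator: the function names, the `NotFalsified` label, and the `arena_epsilon` tolerance used to read "~ 0" are assumptions; only the 0.95 floor, the 0/5 result, and the `StaticFalsified` verdict name come from this document.

```python
from enum import Enum


class Verdict(Enum):
    STATIC_FALSIFIED = "StaticFalsified"  # verdict name recorded at M224
    NOT_FALSIFIED = "NotFalsified"        # hypothetical label for every other case


def oracle_passed_rate(outcomes: list[bool]) -> float:
    """Fraction of Arena fixtures whose oracle passed."""
    if not outcomes:
        raise ValueError("need at least one fixture outcome")
    return sum(outcomes) / len(outcomes)


def popperian_verdict(
    static_score: float,
    arena_rate: float,
    static_floor: float = 0.95,   # from the quoted falsifier rule
    arena_epsilon: float = 0.05,  # assumed reading of "arena ~ 0"
) -> Verdict:
    """Static approach is falsified when static is high and arena is near zero."""
    if static_score >= static_floor and arena_rate <= arena_epsilon:
        return Verdict.STATIC_FALSIFIED
    return Verdict.NOT_FALSIFIED


# M224 inputs: static agreement 1.0000, Arena 0/5 for both systems.
m224_rate = oracle_passed_rate([False] * 5)
verdict = popperian_verdict(1.0, m224_rate)
gap = 1.0 - m224_rate  # absolute over-prediction

print(m224_rate, verdict.value, gap)
```

With the M224 numbers the absolute over-prediction is 1.0, the full width of the scale, which is why the verdict does not depend on the exact tolerance chosen for "~ 0".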

What CCPA-008 actually validates (the surviving claim)

The 1.0000 score is still meaningful — but as a different claim than the headline suggested:

  1. The differ logic is correct. Given two paired trace files describing equivalent actions, the differ recognizes them as equivalent.
  2. The scoring algorithm is correct. When two traces should produce a perfect score, they do.
  3. The per-tool equivalence rules are correct. Edit + Write + Bash + Read all map to comparable surface forms across both systems.
  4. The drift-category taxonomy is correct. When traces SHOULD diverge, the closed-enum DriftCategory catches the divergence.

In short: the meter works. What the meter measures, when pointed at AUTHORED inputs, is a closed-loop consistency check on the harness itself. That is a real engineering claim — it's just not the claim the headline implied.
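To make the closed-loop point concrete, here is a toy differ, assuming a trace is just an ordered list of (tool, target) actions. The `Action` type, the `parity_score` function, the teacher/student naming, and the positional matching rule are hypothetical simplifications, not the harness's actual differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str    # e.g. "Edit", "Write", "Bash", "Read"
    target: str  # hypothetical normalized surface form (path or command)


def parity_score(teacher: list[Action], student: list[Action]) -> float:
    """Toy score: fraction of aligned positions where the two traces match."""
    length = max(len(teacher), len(student))
    if length == 0:
        return 1.0  # two empty traces are trivially equivalent
    matches = sum(1 for a, b in zip(teacher, student) if a == b)
    return matches / length


# AUTHORED pair: both sides were written by hand to describe the same actions.
authored_teacher = [Action("Read", "src/lib.rs"), Action("Edit", "src/lib.rs")]
authored_student = [Action("Read", "src/lib.rs"), Action("Edit", "src/lib.rs")]

# A pair authored to diverge on the second action.
divergent_student = [Action("Read", "src/lib.rs"), Action("Bash", "cargo test")]

equivalent = parity_score(authored_teacher, authored_student)
divergent = parity_score(authored_teacher, divergent_student)

print(equivalent, divergent)
```

Both inputs to the score are authored, so a perfect score shows only that the differ agrees with the author's intent. Nothing in this loop ever runs either system.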

What CCPA-008 does NOT validate (the dropped claim)

The 1.0000 score does NOT mean:

  1. apr code and claude produce equivalent behavior in production. The M224 evidence shows both fail 5/5 on real GitHub issues under the Arena harness.
  2. The harness can predict user-facing parity from fixtures alone. In the M224 case, static fixtures predicted 1.0 where the live Arena measured 0.0.
  3. The 30 AUTHORED canonical fixtures cover the behavior space. They cover the harness's own meter surface, not the system's behavior surface.

These are the claims that the M224 bench falsified, and the claims this soft-deprecation removes from the project's user-facing narrative.

The lifecycle move

| Aspect | Before M230 | After M230 |
| --- | --- | --- |
| CCPA-008 enforcement | Active (aggregate ≥ 0.95, per-fixture ≥ 0.80) | Active (unchanged) |
| CCPA-008 interpretation | "System produces parity" | "Meter recognizes parity correctly" |
| CCPA-008 status in contract | Active gate (no formal status field; it is M6-era) | ADVISORY at v1.30.0 (operator-coordinated upstream) |
| Foreground parity claim | "1.0 on 30/30 fixtures" | "0/5 on M182 project-scale (M224) + 1.0 on 5/5 MultiPL-E-Rust HumanEval (M150)" |
| Foreground parity gates | CCPA-008 (the M6 flagship) | CCPA-016 (outcome) + CCPA-017 (project-scale) + CCPA-018 (Arena recovery) |
| Top-spec Axis 2 score | ~90% (over-claimed) | ~55% (M224 honest re-assessment) |
| The headline answer | "YES, we are at parity" | "YES on function-scale; NO on project-scale; unknown on most real-world workloads" |
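The unchanged enforcement rule (aggregate ≥ 0.95, per-fixture ≥ 0.80) can be sketched as a simple predicate. The function name, the fixture ids, and the choice of an arithmetic mean for the aggregate are assumptions for illustration; the real gate may aggregate differently.

```python
def ccpa_008_passes(
    per_fixture: dict[str, float],
    aggregate_floor: float = 0.95,  # threshold stated for CCPA-008
    fixture_floor: float = 0.80,    # per-fixture threshold stated for CCPA-008
) -> bool:
    """Return True when both the aggregate and every fixture clear their floors."""
    if not per_fixture:
        return False  # an empty corpus cannot satisfy the gate
    scores = list(per_fixture.values())
    aggregate = sum(scores) / len(scores)  # assumed: aggregate is the mean
    return aggregate >= aggregate_floor and all(s >= fixture_floor for s in scores)


# 30 fixtures at 1.0, mirroring the "30/30, aggregate=1.0000" result.
perfect = {f"fixture_{i:02d}": 1.0 for i in range(30)}

# One weak fixture: the mean is still 0.99, but the per-fixture floor fails.
one_weak = dict(perfect, fixture_00=0.70)

# Uniformly mediocre: every fixture clears 0.80, but the mean is below 0.95.
mediocre = {f"fixture_{i:02d}": 0.90 for i in range(30)}

print(ccpa_008_passes(perfect), ccpa_008_passes(one_weak), ccpa_008_passes(mediocre))
```

The predicate is identical before and after M230. Only the meaning assigned to a `True` result changes, from "the system is at parity" to "the meter recognizes parity correctly".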

What is NOT being changed

  • The gate is not deleted. CCPA-008 stays in the contract registry, the test still runs, the threshold still enforces.
  • The 30-fixture corpus is not deleted. fixtures/canonical/ remains.
  • The MultiPL-E-Rust 5-fixture result is not retracted. It is a real measurement of function-scale outcome parity.
  • CCPA-013 is not affected. Its assertion is ≥5 fixtures, aggregate ≥ 0.95; that's still satisfied at the meter-validation interpretation.
  • CCPA-001..007 (trace schema, replay determinism, mock completeness, tool-call equivalence, file-mutation equivalence, sovereignty, corpus coverage) are not affected. These are independent meter-validation gates and continue to operate.

What IS being changed

  1. Top-spec headline prose (claude-code-parity-apr-poc.md § Completeness summary) — already updated at M224. The "Are we at parity?" answer is now bifurcated by scale (YES function-scale / NO project-scale) and the Axis 2 score is ~55% not ~90%.
  2. CCPA-008 description in falsification-conditions.md — annotated post-M230 to state "validates METER not SYSTEM" and cross-reference this doc.
  3. CCPA-017 + CCPA-018 promoted to foreground — these now carry the user-facing parity claims (with the acknowledgement that current data shows 0/5 for both systems, and the gates remain PROPOSED pending operator-dispatched bench evidence that lifts oracle_passed_rate or recovery_rate above the thresholds).
  4. Upstream contract amendment (deferred to v1.30.0 PR on aprender, operator-coordinated) — adds an explicit status field to CCPA-008 with value ADVISORY (or equivalent lifecycle annotation) and a semantic_change_log entry recording the M230 reframe.
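As a rough picture of item 4, the sketch below applies the lifecycle annotation to an in-memory gate record. Only the `status` field, the `ADVISORY` value, the `semantic_change_log` name, and the gate id and name come from this document. The dict layout, the `threshold`, `milestone`, and `change` keys, and the helper name are hypothetical, and the real contract format on aprender may differ.

```python
import copy


def apply_m230_amendment(gate: dict) -> dict:
    """Return a copy of the gate record with the M230 lifecycle annotation."""
    amended = copy.deepcopy(gate)  # leave the caller's record untouched
    amended["status"] = "ADVISORY"
    amended.setdefault("semantic_change_log", []).append(
        {
            "milestone": "M230",                     # hypothetical key
            "change": "validates METER not SYSTEM",  # hypothetical key
        }
    )
    return amended


before = {
    "id": "FALSIFY-CCPA-008",
    "name": "parity_score_bound",
    # Hypothetical layout for the thresholds stated in this document.
    "threshold": {"aggregate_min": 0.95, "per_fixture_min": 0.80},
}
after = apply_m230_amendment(before)

print(after["status"], after["threshold"] == before["threshold"])
```

The amendment touches only lifecycle metadata. The thresholds are carried over unchanged, which matches the "soft-deprecated, not removed" status.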

Audit trail — how we got here

| Milestone | Action | Outcome |
| --- | --- | --- |
| M0–M6 | Original POC scope: action-stream parity over AUTHORED canonical fixtures | CCPA-008 authored as M6 flagship |
| M11 | First measured parity ≥ 5 fixtures, aggregate ≥ 0.95 | CCPA-013 OPEN (later DISCHARGED at 30 fixtures, 1.0000) |
| M111 | First honest 3-axis breakdown: Axis 2 ~30% | Operator-prompted framework: meter vs. system |
| M118 | deepclaude prior-art DISCHARGES R2 technical premise | HTTPS proxy is technically feasible, but operationally rescoped OOS |
| M136–M141 | Axis-2-closure-plan idea (2) ships: CLI subprocess instrumentation + OS-event parity (CCPA-014) | Different validation surface than CCPA-008 |
| M150–M154 | Phase 3 outcome-parity SHIPPED: outcome 1.0000 + structural 0.5201 + test-survival 1.0000 on MultiPL-E-Rust 5-fixture | First real-binary evidence; function-scale parity confirmed |
| M159 | ProgramBench prior-art integrated (0%/200 SOTA-model baseline) | Validates the "function-scale 1.0 does not extrapolate to project-scale" caveat; foreshadowed M224 |
| M180–M190 | Phase 4 project-scale plan + corpus + runner + scorer + CCPA-017 gate (PROPOSED at v1.28.0) | Machinery for project-scale measurement built; no evidence yet |
| M192 | Operator authors design-audit.md; the Popperian falsifier explicitly states: "if static ≥ 0.95 AND arena ~ 0, static approach is FALSIFIED" | The framework that M224 would empirically test |
| M194–M210 | Phase 5 Arena runner + bench + CCPA-018 gate + Popperian comparator + coverage closure | Machinery for live-Arena measurement built |
| M224 | First operator-dispatched Phase 5 Arena bench | 0/5 oracle_passed_rate for BOTH claude AND apr code. design-audit.md §5 Popperian verdict: StaticFalsified. |
| M226 | aprender#1712 filed (apr-serve leak) + pkill apr serve workaround | Reduces but doesn't eliminate methodology confounds |
| M228 | Second operator-dispatched Arena bench (post-workaround) | Same verdict 0/5; cleaner student data on 2 of 5 fixtures; the leaked-process confound was NOT the cause |
| M230 (this doc) | Soft-deprecate CCPA-008: reframe meter-vs-system | Audit trail complete; the user-facing parity narrative now matches the measured reality |

Cross-references

Bottom line

The CCPA harness was always two things:

  1. A meter — code that recognizes when two trace files describe equivalent actions.
  2. A test — that meter, pointed at apr code's actual output, would catch divergence from claude's output.

From M0 through M223, we shipped the meter and assumed the test was implicitly working because the fixtures passed. M224 was the first time the test was pointed at REAL system output on REAL tasks, and it showed that the static-fixture corpus had been validating the meter, not the system.

That's not a failure of the meter — the meter works. It's a recognition that meter-validation is not system-validation, and the project's user-facing narrative needs to reflect that distinction.

Going forward:

  • The meter (CCPA-001..008) is ADVISORY for system claims, ENFORCING for meter claims.
  • The system test (CCPA-016 / CCPA-017 / CCPA-018) is the foreground parity claim — currently 0/5 on project-scale, and that's the honest number.
  • Closing the gap requires either (a) fixing apr-side bugs that block clean student-side data (aprender#1712 + likely others), (b) reviewing fixture difficulty and oracle strictness, or (c) acknowledging that under this harness, neither SOTA agent solves these tasks zero-shot — that is itself a legitimate user-facing finding.