A record-replay-distill harness measuring apr code against Claude Code at the action-stream level. It combines paired teacher/student traces, per-tool semantic-equivalence rules, a falsifiable parity score, and a live multi-turn Arena bench.
The full technical companion is hosted at paiml.github.io/claude-code-parity-apr — methodology, falsifier gates, the empirical V1_004 chain, CLI reference, and the academic basis.
# build the book locally
mdbook build book/ && open book/book/index.html

| Dimension | Result | n | Source |
|---|---|---|---|
| Function-scale outcome parity (HumanEval) | 1.0000 | 5 | M150 — evidence/phase-3/multipl-e-rust-scores.json |
| Function-scale test-survival (cross-swap) | 1.0000 | 10 | M154 — evidence/phase-3/test-survival.json |
| Project-scale Arena (claude) | 0.20 (1/5) | 5 | M234 — evidence/phase-5/arena-scores.json |
| Project-scale Arena (apr code) | 0.00 (0/5) | 5 | M234 — same |
| Static-vs-Arena Popperian verdict | StaticFalsified | — | M234 design-audit.md §5 |
| Canonical corpus aggregate | 1.0000 | 30 | M150 — fixtures/canonical/measured-parity.json |
| Contract version | v1.32.0 | — | aprender-side authoritative |
| Falsification gates | 20/20 registered | — | 16 ACTIVE_RUNTIME + 4 PROPOSED |
| Sub-milestones shipped | M0–M296 all SHIPPED | — | continuous since 2026-04-26 |
Honest framing: at function-scale, the two systems are functionally interchangeable. At project-scale, the static-fixture approach is Popperian-falsified as a project-scale predictor. Contract at v1.32.0 (aprender-side authoritative). The full 3-axis breakdown is in docs/specifications/completeness-assessment.md.
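
As a rough illustration of how the Arena rows relate to the StaticFalsified verdict, the sketch below computes pass rates and compares the static prediction with the observed Arena gap. The decision rule, the `tolerance` parameter, and the `StaticCorroborated` variant are assumptions made for illustration only; the authoritative rule is the one in design-audit.md §5 and may differ.

```rust
/// Outcome of comparing the static path's prediction with the Arena result.
/// `StaticCorroborated` is a hypothetical name; only `StaticFalsified`
/// appears in the results table.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    StaticCorroborated,
    StaticFalsified,
}

/// Fraction of fixtures solved, e.g. 1 of 5 gives 0.2.
pub fn pass_rate(passed: u32, total: u32) -> f64 {
    if total == 0 {
        0.0
    } else {
        f64::from(passed) / f64::from(total)
    }
}

/// Assumed rule: the static path "predicts" equivalence when its parity
/// score is within `tolerance` of 1.0, and the Arena "observes" equivalence
/// when the two pass rates differ by at most `tolerance`. A prediction that
/// disagrees with the observation counts as falsified.
pub fn static_vs_arena(
    static_parity: f64,
    teacher_rate: f64,
    student_rate: f64,
    tolerance: f64,
) -> Verdict {
    let predicted_equivalent = static_parity >= 1.0 - tolerance;
    let observed_equivalent = (teacher_rate - student_rate).abs() <= tolerance;
    if predicted_equivalent == observed_equivalent {
        Verdict::StaticCorroborated
    } else {
        Verdict::StaticFalsified
    }
}
```

With the table's numbers (static parity 1.0000, Arena 1/5 for claude and 0/5 for apr code) and an illustrative tolerance of 0.05, this rule returns `Verdict::StaticFalsified`.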
CCPA treats Claude Code as the teacher and apr code as the student. Two complementary measurement paths cross-falsify each other:
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌──────────── STATIC PATH ────────────┐ │
│ │ AUTHORED teacher.ccpa-trace.jsonl │ validates THE METER │
│ │ + │ (does the differ │
│ │ AUTHORED student.ccpa-trace.jsonl │ recognize equivalence?) │
│ │ │ │ │
│ │ ▼ │ │
│ │ ccpa-differ::compute_parity_score │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ParityReport { score, drifts } │ │
│ └─────────────────────────────────────┘ │
│ │
│ ┌──────────── ARENA PATH ─────────────┐ │
│ │ live claude + live apr code │ validates THE SYSTEM │
│ │ multi-turn loop (max_turns=20) │ (does apr code solve │
│ │ per-fixture test-shaped oracle │ real tasks like claude?) │
│ │ │ │ │
│ │ ▼ │ │
│ │ ArenaOutcome { ... } │ │
│ │ evidence/phase-{5,6}/ │ │
│ └─────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
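
To make the static path concrete, here is a minimal sketch of a positional parity score over two action streams. Only the names `compute_parity_score` and `ParityReport { score, drifts }` come from the diagram; the `ToolCall` and `Drift` shapes, the `file_edit` tool name, and the path-normalizing equivalence rule are hypothetical stand-ins, not the actual ccpa-differ implementation.

```rust
/// One recorded tool invocation (hypothetical, simplified shape).
#[derive(Debug, Clone, PartialEq)]
pub struct ToolCall {
    pub tool: String,
    pub target: String,
}

/// A position where teacher and student actions were not equivalent.
#[derive(Debug, Clone, PartialEq)]
pub struct Drift {
    pub index: usize,
    pub reason: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ParityReport {
    pub score: f64,
    pub drifts: Vec<Drift>,
}

/// Strip one leading "./" so `./README.md` and `README.md` compare equal.
fn normalize(path: &str) -> &str {
    path.strip_prefix("./").unwrap_or(path)
}

/// Toy semantic-equivalence rule: same tool, same normalized target.
fn equivalent(teacher: &ToolCall, student: &ToolCall) -> bool {
    teacher.tool == student.tool && normalize(&teacher.target) == normalize(&student.target)
}

/// Compare the streams position by position; the score is the fraction of
/// positions (over the longer stream) where the actions are equivalent.
pub fn compute_parity_score(teacher: &[ToolCall], student: &[ToolCall]) -> ParityReport {
    let len = teacher.len().max(student.len());
    if len == 0 {
        // Two empty streams are trivially in parity.
        return ParityReport { score: 1.0, drifts: Vec::new() };
    }
    let mut drifts = Vec::new();
    for index in 0..len {
        let reason = match (teacher.get(index), student.get(index)) {
            (Some(t), Some(s)) if equivalent(t, s) => continue,
            (Some(t), Some(s)) => {
                format!("{} {} vs {} {}", t.tool, t.target, s.tool, s.target)
            }
            (Some(_), None) => "student stream ended early".to_string(),
            (None, Some(_)) => "student emitted an extra action".to_string(),
            (None, None) => unreachable!("index is below the longer stream's length"),
        };
        drifts.push(Drift { index, reason });
    }
    let matched = len - drifts.len();
    ParityReport { score: matched as f64 / len as f64, drifts }
}
```

A positional comparison is the simplest possible choice; a real differ would more plausibly apply a distinct rule per tool and tolerate reordering where the tools commute.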
Per FALSIFY-CCPA-019, every Arena verdict requires a fresh bidirectional-sensitivity calibration on file. This codifies the M196-M224 four-bug-stack lesson as a permanent contract gate.
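
The idea behind bidirectional sensitivity is that a meter must accept known-good cases and reject known-bad ones before its verdicts can be trusted. The sketch below shows one way to express that check; the `Calibration` enum, the function name, and the `pass_threshold` parameter are illustrative assumptions rather than the contract's actual definition.

```rust
/// Result of a calibration run (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Calibration {
    /// Every known-good case passed and every known-bad case failed.
    Sensitive,
    /// The meter rejected the known-good case at this index.
    RejectsKnownGood(usize),
    /// The meter accepted the known-bad case at this index.
    AcceptsKnownBad(usize),
    /// One side has no cases, so sensitivity in that direction is untested.
    Uncalibrated,
}

/// Scores at or above `pass_threshold` count as a pass.
pub fn check_bidirectional_sensitivity(
    known_good: &[f64],
    known_bad: &[f64],
    pass_threshold: f64,
) -> Calibration {
    if known_good.is_empty() || known_bad.is_empty() {
        return Calibration::Uncalibrated;
    }
    if let Some(i) = known_good.iter().position(|&s| s < pass_threshold) {
        return Calibration::RejectsKnownGood(i);
    }
    if let Some(i) = known_bad.iter().position(|&s| s >= pass_threshold) {
        return Calibration::AcceptsKnownBad(i);
    }
    Calibration::Sensitive
}
```

Treating an empty side as `Uncalibrated` rather than vacuously passing is a deliberate choice in this sketch: a meter that has never been shown a failing case has not demonstrated that it can fail.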
→ Full architecture: book / Architecture at a glance
The most empirically interesting work in CCPA's history is the V1_004 chain: through M286-M294 we isolated the load-bearing variable behind 0% tool_call emission on Qwen-Coder models.
| Hypothesis | Tested via | Outcome |
|---|---|---|
| Inference stack quality | M286 KV cache + 3-knob + EOS + clean_chat_output | Necessary fix; not sufficient |
| Active params count | 3B (30B-A3B-MoE) vs 7B (dense 7B-Coder) | Both 0 tool_calls — refuted |
| MoE vs dense | qwen3_moe vs qwen2 | Same pattern — refuted |
| Few-shot prompt examples | 3 concrete <tool_call> examples | No pattern shift — refuted |
| Qwen-Coder finetune family | Smoke-tested non-Coder Qwen3-30B-A3B-Instruct-2507 | Emitted {"name":"file_read","input":{...}} in 20 tokens — confirmed |
→ Full empirical narrative: book / The V1_004 chain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # install the Rust toolchain
bash # start a new shell so cargo from the install above is on your PATH
make install-tools # local tools matching CI exactly
make install-hooks # FALSIFY-CCPA-012 pre-commit hook
make tier3 # full local gate sweep (fmt + clippy + tests + coverage + comply + pv)

# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl
# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/ # canonical MUST PASS
ccpa corpus fixtures/regression/ # regression MUST FAIL
# Walk the parity-matrix coverage gate
ccpa coverage \
--apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
--fixtures-dir fixtures/canonical/
# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl

The live benches require a claude login, apr code, and a local GGUF model.
bash scripts/phase-3-bench.sh # function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-5-arena-bench.sh # project-scale Arena (5 real GitHub-issue fixtures)
bash scripts/phase-5-calibration-bench.sh # calibration-and-scale (15 deterministic fixtures, M242)
bash scripts/phase-6-bench.sh # under-contract dispatch (20 fixtures, pmat comply per turn)

→ Full CLI reference: book / CLI
20 gates, all mechanically asserted by pv validate on every PR, per CLAUDE.md § "DOGFOOD pv, NEVER bash".
Source-of-truth invariants (M0+):
| ID | Name | Mechanism |
|---|---|---|
| FALSIFY-CCPA-009 | ci_main_branch_green | branch protection requires ci/gate |
| FALSIFY-CCPA-010 | pmat_comply_100pct | pmat comply check: is_compliant=true ∧ 0 Fail-status checks |
| FALSIFY-CCPA-011 | line_coverage_100pct | cargo llvm-cov: 100% functions ∧ ≥99% lines (refined v0.4.0) |
| FALSIFY-CCPA-012 | pv_contract_gate_on_commit | pre-commit hook + CI run pv validate + pin-check |
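
The mechanisms for FALSIFY-CCPA-010 and FALSIFY-CCPA-011 are simple conjunctions, which the following sketch spells out as predicates. The struct and enum shapes (including the `Pass` and `Warn` statuses) are assumptions for illustration and are not the real output formats of pmat or cargo llvm-cov.

```rust
/// Coverage percentages as they might be extracted from a coverage run
/// (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct CoverageReport {
    pub function_pct: f64,
    pub line_pct: f64,
}

/// FALSIFY-CCPA-011 as stated: 100% functions AND at least 99% lines.
pub fn coverage_gate_passes(report: &CoverageReport) -> bool {
    report.function_pct >= 100.0 && report.line_pct >= 99.0
}

/// Status of one compliance check; only `Fail` is named in the gate text,
/// the other variants are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum CheckStatus {
    Pass,
    Warn,
    Fail,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ComplyReport {
    pub is_compliant: bool,
    pub checks: Vec<CheckStatus>,
}

/// FALSIFY-CCPA-010 as stated: is_compliant=true AND zero Fail-status checks.
pub fn comply_gate_passes(report: &ComplyReport) -> bool {
    report.is_compliant && !report.checks.iter().any(|c| *c == CheckStatus::Fail)
}
```

Both conditions are checked independently on purpose: under the stated rule, a report that claims `is_compliant = true` while still listing a `Fail` check does not pass.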
Behavioral parity gates (full list in book / Gates):
CCPA-001 (trace_schema_roundtrip) · CCPA-002 (replay_determinism) · CCPA-004 (tool_call_equivalence) · CCPA-005 (file_mutation_equivalence) · CCPA-006 (sovereignty_on_replay) · CCPA-007 (corpus_coverage, HARD-BLOCKING) · CCPA-013 (first_recorded_parity_score, DISCHARGED) · CCPA-014 (os_event_parity_bound) · CCPA-015 (os_trace_output_purity) · CCPA-016 (outcome_parity_bound) · CCPA-017–020 (PROPOSED)
| Concern | Lives in |
|---|---|
| Contract TEXT | paiml/aprender/contracts/claude-code-parity-apr-v1.yaml (canonical) — pinned here via contracts/pin.lock |
| Spec | docs/specifications/claude-code-parity-apr-poc.md (canonical here since M1) |
| Implementation, fixtures, CI, coverage, pmat-comply | this repo (canonical) |
This split follows aprender's monorepo single-source-of-truth policy: aprender stays canonical for contract TEXT (where every paiml contract lives), while this repo is canonical for runtime ENFORCEMENT.
mkdir fixtures/canonical/00XX-my-scenario
cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"] # or hooks, skills, slash-commands, ...
description = "What this fixture exercises and why."
EOF
# Author the paired teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl
ccpa corpus fixtures/canonical/ # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ... # MUST exit 0
make tier3 # full local gate sweep

→ Full fixture reference: book / Fixtures
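
The `covers` field in meta.toml is what ties a fixture to the coverage gate. As a sketch of the underlying idea, assuming the gate simply requires every parity-matrix row to be covered by at least one fixture, the check could look like the following. `FixtureMeta` and `uncovered_rows` are hypothetical names, and the real `ccpa coverage` logic (including its handling of out-of-scope rows) is not reproduced here.

```rust
use std::collections::BTreeSet;

/// The subset of a fixture's meta.toml relevant to coverage
/// (hypothetical in-memory shape; TOML parsing is omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct FixtureMeta {
    pub id: String,
    pub covers: Vec<String>,
}

/// Return the required rows that no fixture claims to cover, in the order
/// they were required. An empty result means the gate would pass.
pub fn uncovered_rows(required: &[&str], fixtures: &[FixtureMeta]) -> Vec<String> {
    let covered: BTreeSet<&str> = fixtures
        .iter()
        .flat_map(|f| f.covers.iter().map(String::as_str))
        .collect();
    required
        .iter()
        .filter(|row| !covered.contains(**row))
        .map(|row| (*row).to_string())
        .collect()
}
```

For example, a corpus containing only a fixture with `covers = ["builtin-tools-rwegs"]` would leave `hooks` uncovered if both rows were required.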
crates/
├── ccpa-trace/ # JSONL trace schema, types, validators
├── ccpa-differ/ # per-tool equivalence rules, parity score
├── ccpa-recorder/ # stream-json parser (claude side)
├── ccpa-subproc/ # subprocess driver (deterministic stdout/stderr capture)
├── ccpa-replayer/ # mock harness for replay determinism
├── ccpa-arena/ # multi-turn live runner + bench binary
└── ccpa-cli/ # `ccpa` user-facing binary
docs/specifications/ # 25 spec files (all <500 LOC, doc-drift gated)
evidence/ # per-phase measured-output snapshots
fixtures/ # canonical, regression, project-scale, calibration-and-scale, under-contract
book/ # mdBook source for paiml.github.io/claude-code-parity-apr
| Paper | Informs |
|---|---|
| Hinton et al., 1503.02531 — Distilling the Knowledge in a Neural Network | action-stream distillation framing |
| Segura et al., 2208.08227 — MultiPL-E | function-scale outcome-parity benchmark |
| Jimenez et al., 2310.06770 — SWE-bench | project-scale Arena corpus design |
| METTLE, LLMORPH | metamorphic relations on action streams |
| 2207.11976 — Differential Testing of DL Frameworks | differential-testing scoring function |
| 2505.03096 — Chaos Engineering for LLM Systems | regression-corpus design |
→ Full per-gate mapping: book / Academic basis · spec section in docs/specifications/academic-basis.md
Apache-2.0 OR MIT.