claude-code-parity-apr

A record-replay-distill harness that measures apr code against Claude Code at the action-stream level. It provides paired teacher/student traces, per-tool semantic-equivalence rules, a falsifiable parity score, and a live multi-turn Arena bench.

📖 Read the book

The full technical companion is hosted at paiml.github.io/claude-code-parity-apr — methodology, falsifier gates, the empirical V1_004 chain, CLI reference, and the academic basis.

# build the book locally
mdbook build book/ && open book/book/index.html

At a glance

Dimension	Result	n	Source
Function-scale outcome parity (HumanEval)	1.0000	5	M150 — `evidence/phase-3/multipl-e-rust-scores.json`
Function-scale test-survival (cross-swap)	1.0000	10	M154 — `evidence/phase-3/test-survival.json`
Project-scale Arena (claude)	0.20 (1/5)	5	M234 — `evidence/phase-5/arena-scores.json`
Project-scale Arena (apr code)	0.00 (0/5)	5	M234 — same
Static-vs-Arena Popperian verdict	StaticFalsified	—	M234 design-audit.md §5
Canonical corpus aggregate	1.0000	30	M150 — `fixtures/canonical/measured-parity.json`
Contract version	v1.32.0	—	aprender-side authoritative
Falsification gates	20/20 registered	—	16 ACTIVE_RUNTIME + 4 PROPOSED
Sub-milestones shipped	M0–M296 all SHIPPED	—	continuous since 2026-04-26

Honest framing: at function scale, the two systems are functionally interchangeable (HumanEval outcome parity 1.0000, n=5; cross-swap test-survival 1.0000, n=10). At project scale, the Arena results (claude 1/5, apr code 0/5) falsify the static-fixture score as a predictor of project-scale behavior. The full 3-axis breakdown is in docs/specifications/completeness-assessment.md.
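The Arena rows are plain pass rates over fixtures (1/5 = 0.20, 0/5 = 0.00). The sketch below recomputes them and shows one possible way to compare a static parity score against live results. The `Verdict` enum, the `static_vs_arena` function, and the `tolerance` parameter are hypothetical; the actual verdict rule is the one in design-audit.md §5, which is not reproduced here.

```rust
/// Fraction of fixtures that passed; `None` when no fixtures ran.
fn pass_rate(passed: u32, total: u32) -> Option<f64> {
    if total == 0 {
        None
    } else {
        Some(f64::from(passed) / f64::from(total))
    }
}

/// Hypothetical verdict type; only the name `StaticFalsified` comes from the table.
#[derive(Debug, PartialEq)]
enum Verdict {
    StaticCorroborated,
    StaticFalsified,
}

/// ASSUMED rule, for illustration only: read the static parity score as a
/// prediction that the student's Arena pass rate tracks the teacher's, and
/// call the prediction falsified when the observed gap exceeds `tolerance`.
fn static_vs_arena(static_parity: f64, teacher: f64, student: f64, tolerance: f64) -> Verdict {
    let predicted_student = static_parity * teacher;
    if (predicted_student - student).abs() > tolerance {
        Verdict::StaticFalsified
    } else {
        Verdict::StaticCorroborated
    }
}

fn main() {
    let claude = pass_rate(1, 5).expect("non-empty bench");
    let apr = pass_rate(0, 5).expect("non-empty bench");
    println!("claude = {claude:.2}, apr code = {apr:.2}");
    // The tolerance of 0.1 is an arbitrary illustrative value.
    println!("{:?}", static_vs_arena(1.0, claude, apr, 0.1));
}
```

With n = 5 per system, a single fixture moves the pass rate by 0.20, so any tolerance-based rule of this kind is coarse at this sample size.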

How it works

CCPA treats Claude Code as the teacher and apr code as the student. Two complementary measurement paths cross-falsify each other:

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│    ┌──────────── STATIC PATH ────────────┐                              │
│    │  AUTHORED teacher.ccpa-trace.jsonl  │   validates THE METER        │
│    │            +                        │   (does the differ           │
│    │  AUTHORED student.ccpa-trace.jsonl  │    recognize equivalence?)   │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │   ccpa-differ::compute_parity_score │                              │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │     ParityReport { score, drifts }  │                              │
│    └─────────────────────────────────────┘                              │
│                                                                         │
│    ┌──────────── ARENA PATH ─────────────┐                              │
│    │  live claude + live apr code        │   validates THE SYSTEM       │
│    │  multi-turn loop (max_turns=20)     │   (does apr code solve       │
│    │  per-fixture test-shaped oracle     │    real tasks like claude?)  │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │      ArenaOutcome { ... }           │                              │
│    │      evidence/phase-{5,6}/          │                              │
│    └─────────────────────────────────────┘                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Per FALSIFY-CCPA-019, every Arena verdict requires a fresh bidirectional-sensitivity calibration on file. This gate codifies the M196-M224 four-bug-stack lesson as a permanent contract requirement.
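To make the static path concrete, here is a minimal sketch of scoring a paired action stream. It is not the `ccpa-differ` implementation: the `Action` type, the field types of `ParityReport`, the path-normalization rule, the positional alignment, and the tool names other than `file_read` are all assumptions chosen for illustration.

```rust
/// Hypothetical, simplified action record; the real schema lives in ccpa-trace.
#[derive(Debug, Clone, PartialEq)]
struct Action {
    tool: String,
    target: String,
}

/// Mirrors the shape `ParityReport { score, drifts }` from the diagram;
/// the field types are assumptions.
#[derive(Debug)]
struct ParityReport {
    score: f64,
    drifts: Vec<usize>,
}

/// Assumed per-tool rule: tools must match, and targets are compared after
/// stripping a leading "./" so `./README.md` is equivalent to `README.md`.
fn equivalent(teacher: &Action, student: &Action) -> bool {
    fn norm(path: &str) -> &str {
        path.strip_prefix("./").unwrap_or(path)
    }
    teacher.tool == student.tool && norm(&teacher.target) == norm(&student.target)
}

/// Positional comparison: score = equivalent steps / length of the longer stream.
/// A step missing from either stream counts as a drift.
fn compute_parity_score(teacher: &[Action], student: &[Action]) -> ParityReport {
    let len = teacher.len().max(student.len());
    if len == 0 {
        return ParityReport { score: 1.0, drifts: Vec::new() };
    }
    let drifts: Vec<usize> = (0..len)
        .filter(|&i| match (teacher.get(i), student.get(i)) {
            (Some(t), Some(s)) => !equivalent(t, s),
            _ => true,
        })
        .collect();
    let score = (len - drifts.len()) as f64 / len as f64;
    ParityReport { score, drifts }
}

fn act(tool: &str, target: &str) -> Action {
    Action { tool: tool.to_string(), target: target.to_string() }
}

fn main() {
    let teacher = vec![act("file_read", "README.md"), act("file_edit", "README.md")];
    let student = vec![act("file_read", "./README.md"), act("shell", "sed -i ...")];
    let report = compute_parity_score(&teacher, &student);
    println!("score = {:.2}, drifts at {:?}", report.score, report.drifts);
}
```

Positional alignment is the simplest choice and penalizes a student that reaches the same result in a different order; an edit-distance alignment would be more forgiving at the cost of a less direct score.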

→ Full architecture: book / Architecture at a glance

Empirical highlight — the V1_004 chain (M286–M294)

This is the most empirically interesting work in CCPA's history. Across M286-M294 we isolated the load-bearing variable behind 0% tool_call emission on Qwen-Coder models:

Hypothesis	Tested via	Outcome
Inference stack quality	M286 KV cache + 3-knob + EOS + clean_chat_output	Necessary fix; not sufficient
Active params count	3B (30B-A3B-MoE) vs 7B (dense 7B-Coder)	Both 0 tool_calls — refuted
MoE vs dense	qwen3_moe vs qwen2	Same pattern — refuted
Few-shot prompt examples	3 concrete `<tool_call>` examples	No pattern shift — refuted
Qwen-Coder finetune family	Smoke-tested non-Coder `Qwen3-30B-A3B-Instruct-2507`	Emitted `{"name":"file_read","input":{...}}` in 20 tokens — confirmed
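A rough sketch of how a tool_call emission rate like the one above could be measured follows. The substring heuristic, the function names, and the sample strings are assumptions for illustration; the README does not show the harness's actual parser, and a real implementation would parse the JSON rather than match substrings.

```rust
/// Heuristic only: an output "emits a tool call" if it contains a `<tool_call>`
/// tag or a JSON-looking object with both a "name" and an "input" key.
fn emits_tool_call(output: &str) -> bool {
    // Drop whitespace so `{ "name": ... }` and `{"name":...}` match alike.
    let compact: String = output.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("<tool_call>")
        || (compact.contains("{\"name\":\"") && compact.contains("\"input\":{"))
}

/// Share of model outputs that contain at least one tool call.
fn emission_rate(outputs: &[&str]) -> f64 {
    if outputs.is_empty() {
        return 0.0;
    }
    let hits = outputs.iter().filter(|o| emits_tool_call(o)).count();
    hits as f64 / outputs.len() as f64
}

fn main() {
    // Hypothetical sample outputs; the `input` object is left empty on purpose.
    let prose_only = ["I would read the file first.", "Here is my plan."];
    let with_call = [r#"{"name":"file_read","input":{}}"#];
    println!("prose-only rate: {:.2}", emission_rate(&prose_only));
    println!("tool-call rate:  {:.2}", emission_rate(&with_call));
}
```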

→ Full empirical narrative: book / The V1_004 chain

Quick start

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # install the Rust toolchain via rustup
bash                    # start a new shell so cargo from the install above is on PATH
make install-tools      # local tools matching CI exactly
make install-hooks      # FALSIFY-CCPA-012 pre-commit hook
make tier3              # full local gate sweep (fmt + clippy + tests + coverage + comply + pv)

CLI (static path)

# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
          fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl

# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/             # canonical MUST PASS
ccpa corpus fixtures/regression/            # regression MUST FAIL

# Walk the parity-matrix coverage gate
ccpa coverage \
  --apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
  --fixtures-dir fixtures/canonical/

# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl
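The "canonical MUST PASS, regression MUST FAIL" pairing is what makes the meter falsifiable in both directions: it must accept known-equivalent pairs and reject known-drifting ones. A minimal sketch of that check over per-fixture scores is below. The `Calibration` enum, the function name, the `threshold` parameter, and the sample scores are hypothetical and do not reflect how `ccpa corpus` is implemented.

```rust
/// Outcome of the two-sided check. All names here are hypothetical.
#[derive(Debug, PartialEq)]
enum Calibration {
    /// Every canonical pair passed and every regression pair failed.
    Sensitive,
    /// The meter rejected a pair that should be equivalent (false alarm).
    CanonicalFailed(usize),
    /// The meter accepted a pair that should drift (blind spot).
    RegressionPassed(usize),
}

/// Reports the index of the first fixture that violates its expected direction.
fn check_bidirectional(canonical: &[f64], regression: &[f64], threshold: f64) -> Calibration {
    if let Some(i) = canonical.iter().position(|&s| s < threshold) {
        return Calibration::CanonicalFailed(i);
    }
    if let Some(i) = regression.iter().position(|&s| s >= threshold) {
        return Calibration::RegressionPassed(i);
    }
    Calibration::Sensitive
}

fn main() {
    // Illustrative scores only, not measured values.
    let canonical = [1.0, 1.0, 1.0];
    let regression = [0.5, 0.0];
    println!("{:?}", check_bidirectional(&canonical, &regression, 0.95));
}
```

A meter that only ever sees passing fixtures cannot be distinguished from one that always returns 1.0, which is why the regression corpus is checked alongside the canonical one.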

Live benches (operator-dispatched)

These benches require a claude login, apr code, and a local GGUF model.

bash scripts/phase-3-bench.sh              # function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-5-arena-bench.sh        # project-scale Arena (5 real GitHub-issue fixtures)
bash scripts/phase-5-calibration-bench.sh  # calibration-and-scale (15 deterministic fixtures, M242)
bash scripts/phase-6-bench.sh              # under-contract dispatch (20 fixtures, pmat comply per turn)

→ Full CLI reference: book / CLI

Falsification gates

20 gates, all asserted mechanically by `pv validate` on every PR, per CLAUDE.md § "DOGFOOD pv, NEVER bash".

Source-of-truth invariants (M0+):

ID	Name	Mechanism
`FALSIFY-CCPA-009`	`ci_main_branch_green`	branch protection requires `ci/gate`
`FALSIFY-CCPA-010`	`pmat_comply_100pct`	`pmat comply check`: `is_compliant=true` ∧ 0 Fail-status checks
`FALSIFY-CCPA-011`	`line_coverage_100pct`	`cargo llvm-cov`: 100% functions ∧ ≥99% lines (refined v0.4.0)
`FALSIFY-CCPA-012`	`pv_contract_gate_on_commit`	pre-commit hook + CI run `pv validate` + `pin-check`
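The conjunctions in the FALSIFY-CCPA-010 and FALSIFY-CCPA-011 rows can be read as simple boolean predicates. The sketch below spells them out; the struct, field names, and function names are assumptions, and the real gates consume the output of `pmat comply check` and `cargo llvm-cov` rather than hand-built values.

```rust
/// Coverage inputs for FALSIFY-CCPA-011; field names are assumptions.
struct Coverage {
    function_pct: f64,
    line_pct: f64,
}

/// FALSIFY-CCPA-011 as stated in the table: 100% functions AND at least 99% lines.
fn line_coverage_gate(c: &Coverage) -> bool {
    c.function_pct >= 100.0 && c.line_pct >= 99.0
}

/// FALSIFY-CCPA-010 as stated in the table: `is_compliant=true` AND zero
/// Fail-status checks.
fn pmat_comply_gate(is_compliant: bool, fail_status_checks: usize) -> bool {
    is_compliant && fail_status_checks == 0
}

fn main() {
    // Illustrative values only.
    let cov = Coverage { function_pct: 100.0, line_pct: 99.4 };
    println!("coverage gate: {}", line_coverage_gate(&cov));
    println!("comply gate:   {}", pmat_comply_gate(true, 0));
}
```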

Behavioral parity gates (full list in book / Gates):

CCPA-001 (trace_schema_roundtrip) · CCPA-002 (replay_determinism) · CCPA-004 (tool_call_equivalence) · CCPA-005 (file_mutation_equivalence) · CCPA-006 (sovereignty_on_replay) · CCPA-007 (corpus_coverage, HARD-BLOCKING) · CCPA-013 (first_recorded_parity_score, DISCHARGED) · CCPA-014 (os_event_parity_bound) · CCPA-015 (os_trace_output_purity) · CCPA-016 (outcome_parity_bound) · CCPA-017–020 (PROPOSED)

Source-of-truth split

Concern	Lives in
Contract TEXT	`paiml/aprender/contracts/claude-code-parity-apr-v1.yaml` (canonical) — pinned here via `contracts/pin.lock`
Spec	`docs/specifications/claude-code-parity-apr-poc.md` (canonical here since M1)
Implementation, fixtures, CI, coverage, pmat-comply	this repo (canonical)

This split follows aprender's monorepo single-source-of-truth policy: aprender stays canonical for contract TEXT (where every paiml contract lives), while this repo is canonical for runtime ENFORCEMENT.

Adding a fixture

mkdir fixtures/canonical/00XX-my-scenario

cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"]   # or hooks, skills, slash-commands, ...
description = "What this fixture exercises and why."
EOF

# Author the paired teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl

ccpa corpus fixtures/canonical/                            # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ...    # MUST exit 0
make tier3                                                 # full local gate sweep
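Before running the gates, it can help to confirm the new directory has the three files the steps above call for. The following is a small standalone sketch using only the standard library; the `missing_fixture_files` helper is hypothetical and is not part of the `ccpa` CLI.

```rust
use std::path::Path;

/// Files a canonical fixture directory is expected to contain, per the steps above.
const REQUIRED: [&str; 3] = [
    "meta.toml",
    "teacher.ccpa-trace.jsonl",
    "student.ccpa-trace.jsonl",
];

/// Returns the required file names that are absent from `dir`.
fn missing_fixture_files(dir: &Path) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}

fn main() {
    let dir = Path::new("fixtures/canonical/00XX-my-scenario");
    let missing = missing_fixture_files(dir);
    if missing.is_empty() {
        println!("fixture layout looks complete");
    } else {
        println!("missing: {missing:?}");
    }
}
```

This only checks that the files exist; schema validity is still the job of `ccpa validate`, and equivalence is the job of `ccpa corpus`.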

→ Full fixture reference: book / Fixtures

Workspace

crates/
├── ccpa-trace/       # JSONL trace schema, types, validators
├── ccpa-differ/      # per-tool equivalence rules, parity score
├── ccpa-recorder/    # stream-json parser (claude side)
├── ccpa-subproc/     # subprocess driver (deterministic stdout/stderr capture)
├── ccpa-replayer/    # mock harness for replay determinism
├── ccpa-arena/       # multi-turn live runner + bench binary
└── ccpa-cli/         # `ccpa` user-facing binary

docs/specifications/  # 25 spec files (all <500 LOC, doc-drift gated)
evidence/             # per-phase measured-output snapshots
fixtures/             # canonical, regression, project-scale, calibration-and-scale, under-contract
book/                 # mdBook source for paiml.github.io/claude-code-parity-apr

Academic basis

Paper	Informs
Hinton et al., 1503.02531 — Distilling the Knowledge in a Neural Network	action-stream distillation framing
Cassano et al., 2208.08227 — MultiPL-E	function-scale outcome-parity benchmark
Jimenez et al., 2310.06770 — SWE-bench	project-scale Arena corpus design
METTLE, LLMORPH	metamorphic relations on action streams
2207.11976 — Differential Testing of DL Frameworks	differential-testing scoring function
2505.03096 — Chaos Engineering for LLM Systems	regression-corpus design

→ Full per-gate mapping: book / Academic basis · spec section in docs/specifications/academic-basis.md

License

Apache-2.0 OR MIT.
