paiml/claude-code-parity-apr

claude-code-parity-apr

CCPA — record-replay-distill harness measuring claude vs apr code

A record-replay-distill harness measuring apr code against Claude Code at the action-stream level. Paired teacher/student traces, per-tool semantic-equivalence rules, a falsifiable parity score, and a live multi-turn Arena bench.


📖 Read the book

The full technical companion is hosted at paiml.github.io/claude-code-parity-apr — methodology, falsifier gates, the empirical V1_004 chain, CLI reference, and the academic basis.

# build the book locally
mdbook build book/ && open book/book/index.html

At a glance

| Dimension | Result | n | Source |
|---|---|---|---|
| Function-scale outcome parity (HumanEval) | 1.0000 | 5 | M150, evidence/phase-3/multipl-e-rust-scores.json |
| Function-scale test-survival (cross-swap) | 1.0000 | 10 | M154, evidence/phase-3/test-survival.json |
| Project-scale Arena (claude) | 0.20 (1/5) | 5 | M234, evidence/phase-5/arena-scores.json |
| Project-scale Arena (apr code) | 0.00 (0/5) | 5 | M234, same |
| Static-vs-Arena Popperian verdict | StaticFalsified | | M234 design-audit.md §5 |
| Canonical corpus aggregate | 1.0000 | 30 | M150, fixtures/canonical/measured-parity.json |
| Contract version | v1.32.0 | | aprender-side authoritative |
| Falsification gates | 20/20 registered | | 16 ACTIVE_RUNTIME + 4 PROPOSED |
| Sub-milestones shipped | M0–M296 all SHIPPED | | continuous since 2026-04-26 |

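Honest framing: at function scale, the two systems are functionally interchangeable on the measured samples (1.0000 on both function-scale rows). At project scale, the static-fixture approach is Popperian-falsified as a project-scale predictor: the static path reports 1.0000 on the canonical corpus, while in live Arena runs claude solved 1 of 5 fixtures and apr code solved 0 of 5. The contract is at v1.32.0 (aprender-side authoritative). The full 3-axis breakdown is in docs/specifications/completeness-assessment.md.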
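The arithmetic behind the Arena rows and the verdict can be sketched in a few lines of Rust. This is a minimal illustration, not the contract's definition: the `Verdict` enum, the `pass_rate` and `verdict` functions, the decision rule, and the 0.95 threshold are all assumptions made for the example. Only the name `StaticFalsified` and the 1/5 and 0/5 counts come from the table.

```rust
/// Hypothetical verdict type; only the name `StaticFalsified` appears in this README.
#[derive(Debug, PartialEq)]
enum Verdict {
    StaticCorroborated,
    StaticFalsified,
}

/// Fraction of fixtures solved, e.g. 1 of 5 gives 0.20.
fn pass_rate(solved: u32, attempted: u32) -> f64 {
    if attempted == 0 {
        return 0.0;
    }
    f64::from(solved) / f64::from(attempted)
}

/// Assumed rule: the static score predicts parity when it is at or above
/// `parity_threshold`; that prediction is falsified when the live student
/// pass rate is below the live teacher pass rate.
fn verdict(static_score: f64, teacher_rate: f64, student_rate: f64, parity_threshold: f64) -> Verdict {
    let predicts_parity = static_score >= parity_threshold;
    if predicts_parity && student_rate < teacher_rate {
        Verdict::StaticFalsified
    } else {
        Verdict::StaticCorroborated
    }
}

fn main() {
    let claude = pass_rate(1, 5);
    let apr = pass_rate(0, 5);
    println!("claude = {claude:.2}, apr code = {apr:.2}");
    // 0.95 is an illustrative threshold, not a value from the contract.
    println!("{:?}", verdict(1.0, claude, apr, 0.95));
}
```

Under these assumptions a static score of 1.0000 combined with the measured 0.20 and 0.00 Arena rates yields `StaticFalsified`; the authoritative rule is the one in design-audit.md §5.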

How it works

CCPA treats Claude Code as the teacher and apr code as the student. Two complementary measurement paths cross-falsify each other:

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                         │
│    ┌──────────── STATIC PATH ────────────┐                              │
│    │  AUTHORED teacher.ccpa-trace.jsonl  │   validates THE METER        │
│    │            +                        │   (does the differ           │
│    │  AUTHORED student.ccpa-trace.jsonl  │    recognize equivalence?)   │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │   ccpa-differ::compute_parity_score │                              │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │     ParityReport { score, drifts }  │                              │
│    └─────────────────────────────────────┘                              │
│                                                                         │
│    ┌──────────── ARENA PATH ─────────────┐                              │
│    │  live claude + live apr code        │   validates THE SYSTEM       │
│    │  multi-turn loop (max_turns=20)     │   (does apr code solve       │
│    │  per-fixture test-shaped oracle     │    real tasks like claude?)  │
│    │            │                        │                              │
│    │            ▼                        │                              │
│    │      ArenaOutcome { ... }           │                              │
│    │      evidence/phase-{5,6}/          │                              │
│    └─────────────────────────────────────┘                              │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
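To make the static path concrete, here is a minimal sketch of how a parity score over two action streams could be computed. It is not the `ccpa-differ` implementation: the `Action` struct, the position-by-position alignment, the `equivalent` rule (same tool, same path after stripping a leading `./`), and the signature of `compute_parity_score` are assumptions. Only the names `compute_parity_score` and `ParityReport { score, drifts }` come from the diagram.

```rust
/// Hypothetical, simplified action record; the real schema lives in ccpa-trace.
#[derive(Debug, Clone, PartialEq)]
struct Action {
    tool: String,
    target: String,
}

/// Mirrors the shape shown in the diagram; the field types are assumptions.
#[derive(Debug)]
struct ParityReport {
    score: f64,
    drifts: Vec<usize>,
}

/// Strips any leading "./" so that "./README.md" and "README.md" compare equal.
fn norm(path: &str) -> &str {
    path.trim_start_matches("./")
}

/// Assumed per-tool rule: same tool name and same normalized target.
fn equivalent(teacher: &Action, student: &Action) -> bool {
    teacher.tool == student.tool && norm(&teacher.target) == norm(&student.target)
}

/// Score = equivalent positions / length of the longer stream.
/// Positions that differ, or exist in only one stream, are recorded as drifts.
fn compute_parity_score(teacher: &[Action], student: &[Action]) -> ParityReport {
    let len = teacher.len().max(student.len());
    if len == 0 {
        return ParityReport { score: 1.0, drifts: Vec::new() };
    }
    let drifts: Vec<usize> = (0..len)
        .filter(|&i| match (teacher.get(i), student.get(i)) {
            (Some(t), Some(s)) => !equivalent(t, s),
            _ => true,
        })
        .collect();
    let score = (len - drifts.len()) as f64 / len as f64;
    ParityReport { score, drifts }
}

fn act(tool: &str, target: &str) -> Action {
    Action { tool: tool.to_string(), target: target.to_string() }
}

fn main() {
    let teacher = vec![act("file_read", "README.md"), act("file_edit", "README.md")];
    let student = vec![act("file_read", "./README.md"), act("shell", "ls")];
    let report = compute_parity_score(&teacher, &student);
    println!("score = {:.4}, drifts = {:?}", report.score, report.drifts);
}
```

The point of the sketch is the separation of concerns: the equivalence rule decides what counts as "the same action", and the score is only a ratio over that decision, which is why the static path validates the meter rather than the system.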

Per FALSIFY-CCPA-019, every Arena verdict requires a fresh bidirectional-sensitivity calibration on file. This codifies the M196–M224 four-bug-stack lesson as a permanent contract gate.
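The Arena path in the diagram is a bounded multi-turn loop with an oracle check. A minimal sketch of that control flow follows, assuming the oracle is evaluated after every turn and the run stops at the first pass. The `run_arena` function, the fields of `ArenaOutcome`, and the toy workspace are hypothetical; only `max_turns=20` and the `ArenaOutcome` name come from the diagram.

```rust
/// Hypothetical outcome record; the README only shows `ArenaOutcome { ... }`.
#[derive(Debug, PartialEq)]
struct ArenaOutcome {
    solved: bool,
    turns_used: u32,
}

/// Runs `step` once per turn against a shared workspace and checks `oracle`
/// after each turn. Stops at the first passing check or after `max_turns`.
fn run_arena<T>(
    workspace: &mut T,
    max_turns: u32,
    mut step: impl FnMut(&mut T, u32),
    oracle: impl Fn(&T) -> bool,
) -> ArenaOutcome {
    for turn in 1..=max_turns {
        step(&mut *workspace, turn);
        if oracle(&*workspace) {
            return ArenaOutcome { solved: true, turns_used: turn };
        }
    }
    ArenaOutcome { solved: false, turns_used: max_turns }
}

fn main() {
    // Toy agent: appends one line per turn; the toy oracle wants three lines.
    let mut file: Vec<String> = Vec::new();
    let outcome = run_arena(
        &mut file,
        20,
        |f, turn| f.push(format!("line {turn}")),
        |f| f.len() >= 3,
    );
    println!("{outcome:?}");
}
```

In a real run the step would dispatch one agent turn and the oracle would be the per-fixture test-shaped check; the sketch only shows why a turn budget and an external oracle are enough to produce a pass/fail outcome per fixture.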

→ Full architecture: book / Architecture at a glance

Empirical highlight — the V1_004 chain (M286–M294)

This is the most empirically interesting work in CCPA's history. Across M286–M294 we isolated the load-bearing variable behind 0% tool_call emission on Qwen-Coder models. Of the hypotheses below, only the last was confirmed:

| Hypothesis | Tested via | Outcome |
|---|---|---|
| Inference stack quality | M286 KV cache + 3-knob + EOS + clean_chat_output | Necessary fix; not sufficient |
| Active params count | 3B (30B-A3B-MoE) vs 7B (dense 7B-Coder) | Both 0 tool_calls; refuted |
| MoE vs dense | qwen3_moe vs qwen2 | Same pattern; refuted |
| Few-shot prompt examples | 3 concrete <tool_call> examples | No pattern shift; refuted |
| Qwen-Coder finetune family | Smoke-tested non-Coder Qwen3-30B-A3B-Instruct-2507 | Emitted {"name":"file_read","input":{...}} in 20 tokens; confirmed |
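The quantity tracked across these rows is whether a model output contains a tool call at all. A rough sketch of such an emission-rate measurement is below. The detection is a deliberately naive string heuristic, not a JSON parser and not the harness's actual detector; `looks_like_tool_call`, `emission_rate`, and the `path` field inside `input` are hypothetical.

```rust
/// Heuristic only: treats an output as a tool call when, ignoring whitespace,
/// it contains an object opening with a "name" key plus an "input" object.
fn looks_like_tool_call(output: &str) -> bool {
    let compact: String = output.chars().filter(|c| !c.is_whitespace()).collect();
    compact.contains("{\"name\":\"") && compact.contains("\"input\":{")
}

/// Fraction of model outputs that contain something shaped like a tool call.
fn emission_rate(outputs: &[&str]) -> f64 {
    if outputs.is_empty() {
        return 0.0;
    }
    let hits = outputs.iter().filter(|o| looks_like_tool_call(**o)).count();
    hits as f64 / outputs.len() as f64
}

fn main() {
    let outputs = [
        "Sure, here is the code you asked for.",
        // The "path" argument is illustrative; the README elides the input body.
        r#"{"name": "file_read", "input": {"path": "README.md"}}"#,
    ];
    println!("emission rate = {:.2}", emission_rate(&outputs));
}
```

A 0% rate in this sense means no sampled output matched the tool-call shape, which is the symptom the refuted hypotheses failed to change.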

→ Full empirical narrative: book / The V1_004 chain

Quick start

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # install the Rust toolchain via rustup
bash                    # start a new shell so cargo from the install above is on PATH
make install-tools      # local tools matching CI exactly
make install-hooks      # FALSIFY-CCPA-012 pre-commit hook
make tier3              # full local gate sweep (fmt + clippy + tests + coverage + comply + pv)

CLI (static path)

# Score a single teacher/student pair
ccpa diff fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl \
          fixtures/canonical/0001-edit-readme/student.ccpa-trace.jsonl

# Score the whole corpus + bidirectional-sensitivity check
ccpa corpus fixtures/canonical/             # canonical MUST PASS
ccpa corpus fixtures/regression/            # regression MUST FAIL

# Walk the parity-matrix coverage gate
ccpa coverage \
  --apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
  --fixtures-dir fixtures/canonical/

# Validate a JSONL trace against the schema
ccpa validate fixtures/canonical/0001-edit-readme/teacher.ccpa-trace.jsonl
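The "canonical MUST PASS, regression MUST FAIL" pairing above is what makes the meter bidirectionally sensitive. A minimal sketch of that check follows, assuming one parity score per fixture and a single pass threshold. `CorpusRun`, `bidirectionally_sensitive`, the per-fixture interpretation of "MUST FAIL", and the 0.95 threshold are hypothetical and are not the `ccpa corpus` implementation.

```rust
/// Hypothetical summary of one corpus run: one parity score per fixture.
struct CorpusRun<'a> {
    name: &'a str,
    scores: &'a [f64],
}

/// Bidirectional sensitivity: the meter must accept every canonical pair
/// AND reject every regression pair. A meter that passes everything (or
/// fails everything) is caught by one of the two directions.
fn bidirectionally_sensitive(
    canonical: &CorpusRun,
    regression: &CorpusRun,
    threshold: f64,
) -> Result<(), String> {
    if let Some(i) = canonical.scores.iter().position(|&s| s < threshold) {
        return Err(format!("{}: fixture {i} scored below threshold", canonical.name));
    }
    if let Some(i) = regression.scores.iter().position(|&s| s >= threshold) {
        return Err(format!("{}: fixture {i} was not rejected", regression.name));
    }
    Ok(())
}

fn main() {
    let canonical = CorpusRun { name: "canonical", scores: &[1.0, 1.0, 1.0] };
    let regression = CorpusRun { name: "regression", scores: &[0.5, 0.0] };
    // 0.95 is an illustrative threshold, not a value from the contract.
    println!("{:?}", bidirectionally_sensitive(&canonical, &regression, 0.95));
}
```

Checking both directions matters because a canonical-only check cannot distinguish a correct differ from one that returns 1.0 for every input.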

Live benches (operator-dispatched)

These require a claude login, apr code, and a local GGUF model.

bash scripts/phase-3-bench.sh              # function-scale MultiPL-E-Rust HumanEval
bash scripts/phase-5-arena-bench.sh        # project-scale Arena (5 real GitHub-issue fixtures)
bash scripts/phase-5-calibration-bench.sh  # calibration-and-scale (15 deterministic fixtures, M242)
bash scripts/phase-6-bench.sh              # under-contract dispatch (20 fixtures, pmat comply per turn)

→ Full CLI reference: book / CLI

Falsification gates

20 gates, all mechanically asserted by pv validate on every PR, per CLAUDE.md § "DOGFOOD pv, NEVER bash".

Source-of-truth invariants (M0+):

| ID | Name | Mechanism |
|---|---|---|
| FALSIFY-CCPA-009 | ci_main_branch_green | branch protection requires ci/gate |
| FALSIFY-CCPA-010 | pmat_comply_100pct | pmat comply check: is_compliant=true ∧ 0 Fail-status checks |
| FALSIFY-CCPA-011 | line_coverage_100pct | cargo llvm-cov: 100% functions ∧ ≥99% lines (refined v0.4.0) |
| FALSIFY-CCPA-012 | pv_contract_gate_on_commit | pre-commit hook + CI run pv validate + pin-check |
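Two of these mechanisms are plain conjunctions and can be written as predicates. The sketch below restates FALSIFY-CCPA-010 and FALSIFY-CCPA-011 as boolean functions, assuming the inputs have already been extracted from the pmat and cargo llvm-cov reports. The `Coverage` and `ComplyReport` structs and their field names are hypothetical; the thresholds are the ones stated in the table.

```rust
/// Hypothetical, pre-parsed coverage summary (percentages in 0.0..=100.0).
struct Coverage {
    function_pct: f64,
    line_pct: f64,
}

/// Hypothetical, pre-parsed compliance summary.
struct ComplyReport {
    is_compliant: bool,
    fail_status_checks: u32,
}

/// FALSIFY-CCPA-011 as stated: 100% functions AND at least 99% lines.
fn coverage_gate(c: &Coverage) -> bool {
    c.function_pct >= 100.0 && c.line_pct >= 99.0
}

/// FALSIFY-CCPA-010 as stated: is_compliant = true AND zero Fail-status checks.
fn comply_gate(r: &ComplyReport) -> bool {
    r.is_compliant && r.fail_status_checks == 0
}

fn main() {
    let cov = Coverage { function_pct: 100.0, line_pct: 99.4 };
    let comply = ComplyReport { is_compliant: true, fail_status_checks: 0 };
    println!("coverage gate: {}", coverage_gate(&cov));
    println!("comply gate:   {}", comply_gate(&comply));
}
```

Each gate is falsifiable in the sense that a single counterexample (one uncovered function, one Fail-status check) flips the predicate to false.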

Behavioral parity gates (full list in book / Gates):

CCPA-001 (trace_schema_roundtrip) · CCPA-002 (replay_determinism) · CCPA-004 (tool_call_equivalence) · CCPA-005 (file_mutation_equivalence) · CCPA-006 (sovereignty_on_replay) · CCPA-007 (corpus_coverage, HARD-BLOCKING) · CCPA-013 (first_recorded_parity_score, DISCHARGED) · CCPA-014 (os_event_parity_bound) · CCPA-015 (os_trace_output_purity) · CCPA-016 (outcome_parity_bound) · CCPA-017–020 (PROPOSED)

Source-of-truth split

| Concern | Lives in |
|---|---|
| Contract TEXT | paiml/aprender/contracts/claude-code-parity-apr-v1.yaml (canonical), pinned here via contracts/pin.lock |
| Spec | docs/specifications/claude-code-parity-apr-poc.md (canonical here since M1) |
| Implementation, fixtures, CI, coverage, pmat-comply | this repo (canonical) |

This split follows aprender's monorepo single-source-of-truth policy: aprender stays canonical for contract TEXT (where every paiml contract lives), while this repo is canonical for runtime ENFORCEMENT.

Adding a fixture

mkdir fixtures/canonical/00XX-my-scenario

cat > fixtures/canonical/00XX-my-scenario/meta.toml <<EOF
[fixture]
id = "00XX-my-scenario"
covers = ["builtin-tools-rwegs"]   # or hooks, skills, slash-commands, ...
description = "What this fixture exercises and why."
EOF

# Author the paired teacher.ccpa-trace.jsonl + student.ccpa-trace.jsonl

ccpa corpus fixtures/canonical/                            # MUST exit 0
ccpa coverage --apr-code-parity-yaml ... --oos-rows ...    # MUST exit 0
make tier3                                                 # full local gate sweep
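Before running the gates, it can help to confirm that a new fixture directory has all of its parts. The sketch below checks for the three files named in the steps above. It is a convenience illustration, not part of the `ccpa` CLI: `missing_fixture_files` and the assumption that these three files are the complete required set are hypothetical.

```rust
use std::path::Path;

/// Files the steps above create for each fixture (assumed to be the full set).
const REQUIRED: [&str; 3] = [
    "meta.toml",
    "teacher.ccpa-trace.jsonl",
    "student.ccpa-trace.jsonl",
];

/// Returns the required file names that are absent from `dir`, in REQUIRED order.
fn missing_fixture_files(dir: &Path) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}

fn main() {
    // Takes the fixture directory as the first argument, with a placeholder default.
    let dir = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "fixtures/canonical/00XX-my-scenario".to_string());
    let missing = missing_fixture_files(Path::new(&dir));
    if missing.is_empty() {
        println!("{dir}: complete");
    } else {
        println!("{dir}: missing {missing:?}");
    }
}
```

This only checks presence; schema validity is still the job of `ccpa validate`, and parity is still the job of `ccpa corpus`.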

→ Full fixture reference: book / Fixtures

Workspace

crates/
├── ccpa-trace/       # JSONL trace schema, types, validators
├── ccpa-differ/      # per-tool equivalence rules, parity score
├── ccpa-recorder/    # stream-json parser (claude side)
├── ccpa-subproc/     # subprocess driver (deterministic stdout/stderr capture)
├── ccpa-replayer/    # mock harness for replay determinism
├── ccpa-arena/       # multi-turn live runner + bench binary
└── ccpa-cli/         # `ccpa` user-facing binary

docs/specifications/  # 25 spec files (all <500 LOC, doc-drift gated)
evidence/             # per-phase measured-output snapshots
fixtures/             # canonical, regression, project-scale, calibration-and-scale, under-contract
book/                 # mdBook source for paiml.github.io/claude-code-parity-apr

Academic basis

| Paper | Informs |
|---|---|
| Hinton et al., arXiv:1503.02531, Distilling the Knowledge in a Neural Network | action-stream distillation framing |
| Segura et al., arXiv:2208.08227, MultiPL-E | function-scale outcome-parity benchmark |
| Jimenez et al., arXiv:2310.06770, SWE-bench | project-scale Arena corpus design |
| METTLE, LLMORPH | metamorphic relations on action streams |
| arXiv:2207.11976, Differential Testing of DL Frameworks | differential-testing scoring function |
| arXiv:2505.03096, Chaos Engineering for LLM Systems | regression-corpus design |

→ Full per-gate mapping: book / Academic basis · spec section in docs/specifications/academic-basis.md

License

Apache-2.0 OR MIT.

About

Record-replay-distill harness proving `apr code` parity with Claude Code. Contract ACTIVE_RUNTIME v1.2.0 — 13 falsifiable gates, all pv-validated, 199 tests, 100% function coverage. Spec: paiml/aprender#1078
