Top spec: claude-code-parity-apr-poc.md | Completeness assessment | Axis-2 closure plan (Phase 1) | Risks
Operator directive (2026-05-12): "update spec for this phase and focus all future tasks on it".
Phase shift: M0–M141 shipped Phase 1 (Machinery) of axis-2-closure-plan idea (2). M142+ opens Phase 2 (Execution): running the machinery against real claude-code and apr code binaries to produce the first runtime, evidence-based answer to "are we at parity?".
| Sub-milestone | M | PR | Deliverable |
|---|---|---|---|
| M115.1 | M136 | #122 | crates/ccpa-subproc/ capture binary (strace wrapper, OS-event JSONL) |
| M115.3 | M137 | #123 | ccpa-differ::os_event_parity differ (macro-averaged Jaccard) |
| M115.2 + M115.4 | M139 | #125 | fixtures/os-canonical/ + fixtures/os-regression/ corpus + FALSIFY-CCPA-014 gate test (3/3 GREEN against AUTHORED inputs) |
| M115.5 | M141 | #127 (aprender #1624) | Contract v1.24.0 → v1.25.0 — CCPA-014 ACTIVE_RUNTIME in gate registry |
Current state (post-M141): 14/14 gates ACTIVE_RUNTIME. 30/30 API-level fixtures + 4 OS-level fixtures all pass against AUTHORED inputs. The machinery to test parity at OS granularity is complete. The machinery has NEVER been pointed at real claude-code or apr code binaries. (M-history note as of M214: gate count has since advanced to 18/18 registered — 16 ACTIVE_RUNTIME + 2 PROPOSED at v1.29.0 (CCPA-017 project-scale + CCPA-018 arena recovery-rate) — across Phase 3 P3.5 (M164/v1.26.0), Phase 3 M167 (v1.27.0), Phase 4 P4.5 (M190/v1.28.0), and Phase 5 M208 (v1.29.0). The "14/14 ACTIVE_RUNTIME" snapshot in this paragraph is the historical state at the Phase 2 opening and is preserved verbatim as archaeology.)
Goal: produce the first runtime evidence-based parity measurement between claude-code and apr code. Drive Axis 2 score from ~45% → ~70-80% by replacing AUTHORED gate inputs with MEASURED.
What: Verify what's locally accessible.
- `which claude-code` → currently this CLI tool (Claude Code's CLI distribution).
- `which apr` or `apr-cli` build → built from the aprender workspace.
Operator dispatch needed: this companion repo's CI environment does NOT have claude-code installed. Auth model (operator-directive M222): claude uses its own session-based auth via claude login on the operator's host — CCPA does NOT use ANTHROPIC_API_KEY and does NOT call the Anthropic API directly. All benches drive the claude CLI as a subprocess; the CLI handles auth internally. apr-cli requires the aprender workspace at a specific build configuration. P2.1 is a status check and operator-dispatch readiness gate.
Deliverable: a scripts/phase-2-binary-check.sh script that probes both binaries + emits a YAML manifest evidence/phase-2/binaries.yaml recording version + path + ANTHROPIC env-var state. Output goes into evidence/phase-2/ (new directory) for audit-trail.
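The probe described above could look roughly like the following. This is a minimal sketch, not the shipped script: the YAML field names (`binaries`, `found`, `anthropic_api_key`) and the helper names `probe`, `anthropic_key_state`, and `emit_manifest` are hypothetical, and it assumes each binary accepts `--version`.

```bash
# Sketch of the probing logic for scripts/phase-2-binary-check.sh.
# All function names and YAML keys here are illustrative assumptions.

# probe NAME: print one YAML list entry with the path and version of a binary.
probe() {
  name=$1
  if path=$(command -v "$name" 2>/dev/null); then
    # Assumes the binary supports --version; falls back to "unknown".
    version=$("$name" --version </dev/null 2>/dev/null | head -n 1)
    printf '  - name: %s\n    found: true\n    path: %s\n    version: "%s"\n' \
      "$name" "$path" "${version:-unknown}"
  else
    printf '  - name: %s\n    found: false\n' "$name"
  fi
}

# anthropic_key_state: report "set" or "unset" without leaking the value.
anthropic_key_state() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then echo set; else echo unset; fi
}

# emit_manifest NAME...: print the whole manifest on stdout.
emit_manifest() {
  printf 'binaries:\n'
  for bin in "$@"; do probe "$bin"; done
  printf 'anthropic_api_key: %s\n' "$(anthropic_key_state)"
}
```

A caller would then write the manifest with something like `mkdir -p evidence/phase-2 && emit_manifest claude-code apr > evidence/phase-2/binaries.yaml`. Recording only whether the env var is set keeps secrets out of the audit trail.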
What: Author 5-10 representative prompts covering the CCPA scenario taxonomy (edit-readme, fix-failing-test, create-new-file, multi-tool, etc.).
Where: fixtures/phase-2-prompts/<id>/prompt.txt + meta.toml per prompt. Mirrors the structure of fixtures/canonical/ but each fixture holds the SOURCE prompt only (capture runs populate the trace JSONL during P2.3).
Constraint: prompts must be self-contained (not require external state) so they're reproducible across captures. Test-fixture trees should be checked in alongside prompt.txt as fixtures/phase-2-prompts/<id>/cwd-tree/ (the cwd the binaries operate against).
Deliverable: 5 initial prompts authored. Each is a single-paragraph instruction simulating a real Claude Code session.
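The layout constraint above lends itself to a small pre-capture check. The sketch below assumes only the three entries named in this section (`prompt.txt`, `meta.toml`, `cwd-tree/`); the function names `check_prompt_fixture` and `check_corpus` are hypothetical, and the contents of `meta.toml` are not validated because its schema is not specified here.

```bash
# check_prompt_fixture DIR: succeed only if DIR holds the three expected entries.
check_prompt_fixture() {
  dir=${1%/}
  status=0
  [ -s "$dir/prompt.txt" ] || { echo "missing or empty: $dir/prompt.txt" >&2; status=1; }
  [ -f "$dir/meta.toml" ]  || { echo "missing: $dir/meta.toml" >&2; status=1; }
  [ -d "$dir/cwd-tree" ]   || { echo "missing: $dir/cwd-tree/" >&2; status=1; }
  return "$status"
}

# check_corpus ROOT: check every prompt directory under ROOT, report all failures.
check_corpus() {
  root=${1%/}
  failed=0
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    check_prompt_fixture "$d" || failed=1
  done
  return "$failed"
}
```

Running `check_corpus fixtures/phase-2-prompts` before dispatching a capture would catch an incomplete fixture without spending an operator session on it.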
What: For each prompt × system, run the capture binary and store the JSONL.
```bash
for p in fixtures/phase-2-prompts/*/; do
  id=$(basename "$p")
  src=$(cd "$p" && pwd)   # absolute path, so it stays valid after the cd below

  # Start from a clean scratch directory with one cwd copy per system.
  rm -rf "/tmp/run-$id" && mkdir -p "/tmp/run-$id"
  cp -r "$src/cwd-tree" "/tmp/run-$id/teacher"
  cp -r "$src/cwd-tree" "/tmp/run-$id/student"

  # Teacher run: real Claude Code (subshell keeps the cd local to this run)
  (cd "/tmp/run-$id/teacher" && \
    ccpa-trace-subproc claude-code -p "$(cat "$src/prompt.txt")" \
      > "$src/teacher.ccpa-os-trace.jsonl")

  # Student run: real apr code
  (cd "/tmp/run-$id/student" && \
    ccpa-trace-subproc apr code -p "$(cat "$src/prompt.txt")" \
      > "$src/student.ccpa-os-trace.jsonl")
done
```

Operator dispatch: requires the binaries from P2.1. M146 auth-model amendment: the claude CLI uses its own session-based auth (claude login), NOT ANTHROPIC_API_KEY. The operator does NOT need to set the env-var; instead, run claude login once if not already logged in, then dispatch. Wall time: ~30-60 seconds per prompt × 2 systems = ~5-10 minutes total for a 5-prompt corpus.
Deliverable: paired .ccpa-os-trace.jsonl files alongside each prompt's meta.toml, plus a scripts/phase-2-capture.sh runner that automates the loop.
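Because scoring needs both traces for every prompt, a quick pairing check after capture may be useful. This is a sketch under the file names used in this section; `list_unpaired` is a hypothetical helper, and an empty trace file is treated as missing.

```bash
# list_unpaired ROOT: print the id of every prompt directory whose teacher or
# student trace is missing or empty.
list_unpaired() {
  root=${1%/}
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    d=${d%/}
    if [ ! -s "$d/teacher.ccpa-os-trace.jsonl" ] || \
       [ ! -s "$d/student.ccpa-os-trace.jsonl" ]; then
      basename "$d"
    fi
  done
}
```

An empty result from `list_unpaired fixtures/phase-2-prompts` suggests the corpus is ready for P2.4 scoring.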
What: Run ccpa-differ's os_event_parity() on each pair, aggregate, classify drifts.
```bash
ccpa os-corpus fixtures/phase-2-prompts/ --json > evidence/phase-2/measured-os-parity.json
```

Add a new subcommand to ccpa-cli: `ccpa os-corpus <dir> [--json]`, which walks the phase-2 corpus and emits per-prompt and aggregate scores in the same shape as fixtures/canonical/measured-parity.json. The JSON has:

```json
{
  "fixture_corpus_path": "fixtures/phase-2-prompts/",
  "fixture_count": 5,
  "aggregate_score": 0.72,
  "per_fixture": [
    { "id": "0001-edit-readme", "score": 0.83, "drift_count": 3 },
    ...
  ],
  "teacher_source": "real Claude Code @ <claude-code version> @ <date>",
  "student_source": "real apr code @ <apr-cli sha>"
}
```

Deliverable: ccpa os-corpus subcommand + first measured-os-parity.json evidence file. This is the first runtime evidence-based parity number: the answer to "are we at parity?".
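To make the per-prompt score concrete, here is a generic macro-averaged Jaccard sketch. It is not the os_event_parity() implementation, whose internals are not shown in this plan. It assumes each trace has already been flattened to lines of `<category><TAB><key>` (a hypothetical intermediate format), computes the Jaccard index of teacher and student keys per category, and averages across categories. Treating two empty traces as a score of 1.0 is also an assumption.

```bash
# macro_jaccard TEACHER STUDENT
# Inputs: files with one "<category><TAB><key>" line per observed OS event
# (an assumed flattening; the real JSONL schema is not reproduced here).
macro_jaccard() {
  {
    awk '{ print "T\t" $0 }' "$1"   # tag teacher lines
    awk '{ print "S\t" $0 }' "$2"   # tag student lines
  } | awk -F '\t' '
    NF < 3 { next }                 # skip blank or malformed lines
    {
      pair = $2 FS $3
      category[pair] = $2
      if ($1 == "T") teacher[pair] = 1; else student[pair] = 1
    }
    END {
      # Per category: |intersection| and |union| of distinct (category, key) pairs.
      for (pair in category) {
        c = category[pair]
        union[c]++
        if ((pair in teacher) && (pair in student)) inter[c]++
      }
      # Macro average: every category counts equally, whatever its size.
      n = 0; sum = 0
      for (c in union) { n++; sum += inter[c] / union[c] }
      if (n == 0) print "1.0000"; else printf "%.4f\n", sum / n
    }'
}
```

For example, if the teacher opens `/a` and `/b`, the student opens `/a` and `/c`, and both exec `/bin/ls`, the `open` category scores 1/3 and `exec` scores 1/1, so the macro average is 0.6667. Macro averaging keeps a high-volume category from drowning out a small one.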
What: For every OsDriftCategory record where teacher and student diverge, classify the cause:
- Environmental: different libc paths, different ld.so.cache lookups, different locale-archive accesses. Filter out via path-prefix allowlist; not bugs.
- Behavioral: different tmp-file naming, different exec patterns, different file-creation order. Real bugs in `apr code` to fix.
- Sovereignty: drifts involving paths the teacher accesses but the student doesn't (or vice versa), suggesting the systems are doing different work on the same prompt.
Per behavioral drift: file an aprender issue with the drift record + reproducer prompt. Track in evidence/phase-2/drift-backlog.md.
This is the value Phase 2 delivers: ground-truth bug-fix material — exactly what apr code would need to change to converge with claude-code at the OS level.
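The environmental filter in the triage list above can be sketched as a path-prefix match. The prefixes below are illustrative assumptions, not the project's allowlist, and `classify_drift_path` is a hypothetical helper. Anything that does not match is only marked for manual triage, since telling behavioral from sovereignty drift needs human judgment.

```bash
# Assumed allowlist of environmental path prefixes (illustrative only).
ENV_PREFIXES='/lib/ /lib64/ /usr/lib/ /etc/ld.so.cache /usr/lib/locale/'

# classify_drift_path PATH: print "environmental" if PATH starts with an
# allowlisted prefix, otherwise "needs-triage".
classify_drift_path() {
  path=$1
  for prefix in $ENV_PREFIXES; do
    case $path in
      "$prefix"*) echo environmental; return 0 ;;
    esac
  done
  echo needs-triage
}
```

With this list, `/usr/lib/locale/locale-archive` is classified as environmental, while a path under the run's cwd copy falls through to `needs-triage` and would be a candidate for evidence/phase-2/drift-backlog.md.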
Phase 2 milestones (M142+) replace the kaizen-treadmill maintenance loop as the primary focus. Per-sub-deliverable expected wall time:
| Sub | Wall time | Operator-dispatched? |
|---|---|---|
| P2.1 binary check | ~1 hr (script + verify) | partially (binary install) |
| P2.2 prompt corpus | ~2-3 hrs (5 prompts + cwd trees + metas) | NO |
| P2.3 real capture | ~10 min wall + operator wall-clock for review | YES (requires installed binaries + claude login session — no API key per M222 operator-directive) |
| P2.4 scoring + tooling | ~3-4 hrs (ccpa os-corpus subcommand + first measured-parity.json) | NO |
| P2.5 drift triage | indefinite (per-drift, ongoing) | partially (filing aprender issues) |
Total time for first measured score: ~1 day of companion-repo work + 1 operator-dispatch session for P2.3 capture.
Maintenance-cadence work (M-row refresh + counter bumps) continues — the M116 detector pattern still fires on every PR.
Substantive kaizen focus shifts: future PRs should prioritize Phase 2 deliverables over deepclaude-class spec sweeps or other content drift. The deepclaude integration converged at M123 (8 surfaces grep-relative); the contract bump completed at M141; the next high-value direction is producing real measurement evidence per this Phase 2 plan.
- completeness-assessment.md § Are we at parity with Claude Code? — the operator-prompt question that authored this plan
- axis-2-closure-plan.md — the Phase 1 brainstorm (M113) that selected idea (2)
- risks.md R11 — the Axis 2 risk this plan progressively discharges
- milestones-m101-m111.md — M115.1-M115.5 (Phase 1, M136-M141) + M142+ (Phase 2)
- aprender PR #1624 squash `29ce2ea3c`: v1.25.0 contract with CCPA-014