Phase 2 execution plan (M142, 2026-05-12)

Top spec: claude-code-parity-apr-poc.md | Completeness assessment | Axis-2 closure plan (Phase 1) | Risks

Operator directive (2026-05-12): "update spec for this phase and focus all future tasks on it".

Phase shift: M0–M141 SHIPPED Phase 1 (Machinery), which implements idea (2) of the axis-2 closure plan. M142+ opens Phase 2 (Execution): running that machinery against the real claude-code and apr code binaries to produce the first answer to "are we at parity?" that rests on runtime evidence.

Phase 1 recap — what's done

| Sub-milestone | Milestone | PR | Deliverable |
|---|---|---|---|
| M115.1 | M136 | #122 | `crates/ccpa-subproc/` capture binary (strace wrapper, OS-event JSONL) |
| M115.3 | M137 | #123 | `ccpa-differ::os_event_parity` differ (macro-averaged Jaccard) |
| M115.2 + M115.4 | M139 | #125 | `fixtures/os-canonical/` + `fixtures/os-regression/` corpus + FALSIFY-CCPA-014 gate test (3/3 GREEN against AUTHORED inputs) |
| M115.5 | M141 | #127 (aprender #1624) | Contract v1.24.0 → v1.25.0; CCPA-014 ACTIVE_RUNTIME in gate registry |

Current state (post-M141): 14/14 gates ACTIVE_RUNTIME. 30/30 API-level fixtures + 4 OS-level fixtures all pass against AUTHORED inputs. The machinery to test parity at OS granularity is complete. The machinery has NEVER been pointed at real claude-code or apr code binaries.

(M-history note as of M214: the gate count has since advanced to 18/18 registered at v1.29.0: 16 ACTIVE_RUNTIME + 2 PROPOSED (CCPA-017 project-scale + CCPA-018 arena recovery-rate). The advance came across Phase 3 P3.5 (M164/v1.26.0), Phase 3 M167 (v1.27.0), Phase 4 P4.5 (M190/v1.28.0), and Phase 5 M208 (v1.29.0). The "14/14 ACTIVE_RUNTIME" snapshot in the paragraph above is the historical state at the Phase 2 opening and is preserved verbatim as archaeology.)

Phase 2 plan — five sub-deliverables (P2.1-P2.5)

Goal: produce the first parity measurement between claude-code and apr code that is based on runtime evidence. Drive the Axis 2 score from ~45% to ~70-80% by replacing AUTHORED gate inputs with MEASURED ones.

P2.1 — Binary availability check (M142)

What: Verify which of the two binaries are locally accessible.

which claude-code → currently this CLI tool (Claude Code's CLI distribution).
which apr or apr-cli build → built from aprender workspace.

Operator dispatch needed: this companion repo's CI environment does NOT have claude-code installed. Auth model (operator-directive M222): claude uses its own session-based auth via claude login on the operator's host — CCPA does NOT use ANTHROPIC_API_KEY and does NOT call the Anthropic API directly. All benches drive the claude CLI as a subprocess; the CLI handles auth internally. apr-cli requires the aprender workspace at a specific build configuration. P2.1 is a status check and operator-dispatch readiness gate.

Deliverable: a scripts/phase-2-binary-check.sh script that probes both binaries and emits a YAML manifest, evidence/phase-2/binaries.yaml, recording each binary's version and path plus the ANTHROPIC env-var state. Output goes into the new evidence/phase-2/ directory for the audit trail.
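The probe logic behind that manifest could look roughly like the sketch below. It is written in Python only for illustration; the deliverable itself is a shell script. The `--version` flag, the manifest field names, and the `anthropic_api_key_set` key are assumptions, not something this plan specifies.

```python
import json
import shutil
import subprocess


def probe(binary: str) -> dict:
    """Return path and version for one binary, or mark it as missing."""
    path = shutil.which(binary)
    if path is None:
        return {"name": binary, "found": False, "path": None, "version": None}
    try:
        # Assumption: the binary answers `--version`; adjust per tool.
        out = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
        text = (out.stdout or out.stderr).strip()
        version = text.splitlines()[0] if text else None
    except (OSError, subprocess.TimeoutExpired):
        version = None
    return {"name": binary, "found": True, "path": path, "version": version}


def render_manifest(binaries: list, env: dict) -> str:
    """Render a small YAML document by hand (no third-party YAML dependency)."""
    lines = ["binaries:"]
    for b in binaries:
        lines.append(f"  - name: {b['name']}")
        lines.append(f"    found: {str(b['found']).lower()}")
        # json.dumps yields `null` or a double-quoted string, both valid YAML.
        lines.append(f"    path: {json.dumps(b['path'])}")
        lines.append(f"    version: {json.dumps(b['version'])}")
    # Record only whether the variable is set, never its value.
    is_set = "ANTHROPIC_API_KEY" in env
    lines.append(f"anthropic_api_key_set: {str(is_set).lower()}")
    return "\n".join(lines) + "\n"
```

Calling `render_manifest([probe("claude-code"), probe("apr")], dict(os.environ))` (after `import os`) and writing the result to evidence/phase-2/binaries.yaml would give the audit-trail file. Recording only whether the env-var is set keeps any secret out of the evidence directory.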

P2.2 — Prompt corpus authoring (M143)

What: Author 5-10 representative prompts covering the CCPA scenario taxonomy (edit-readme, fix-failing-test, create-new-file, multi-tool, etc.).

Where: fixtures/phase-2-prompts/<id>/prompt.txt + meta.toml per prompt. Mirrors the structure of fixtures/canonical/ but each fixture holds the SOURCE prompt only (capture runs populate the trace JSONL during P2.3).

Constraint: prompts must be self-contained (not require external state) so they're reproducible across captures. Test-fixture trees should be checked in alongside prompt.txt as fixtures/phase-2-prompts/<id>/cwd-tree/ (the cwd the binaries operate against).

Deliverable: 5 initial prompts authored. Each is a single-paragraph instruction simulating a real Claude Code session.
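A layout check for the corpus could be sketched as follows. The required entries (prompt.txt, meta.toml, cwd-tree/) come from the text above; the function names and the wording of the problem messages are illustrative assumptions.

```python
from pathlib import Path

REQUIRED_FILES = ("prompt.txt", "meta.toml")
REQUIRED_DIRS = ("cwd-tree",)


def check_fixture(fixture: Path) -> list:
    """Return human-readable problems for one fixture directory."""
    problems = []
    for name in REQUIRED_FILES:
        if not (fixture / name).is_file():
            problems.append(f"{fixture.name}: missing {name}")
    for name in REQUIRED_DIRS:
        if not (fixture / name).is_dir():
            problems.append(f"{fixture.name}: missing {name}/")
    prompt = fixture / "prompt.txt"
    if prompt.is_file() and not prompt.read_text(encoding="utf-8").strip():
        problems.append(f"{fixture.name}: prompt.txt is empty")
    return problems


def check_corpus(root: Path) -> list:
    """Check every fixture directory under the corpus root, in sorted order."""
    problems = []
    for fixture in sorted(p for p in root.iterdir() if p.is_dir()):
        problems.extend(check_fixture(fixture))
    return problems
```

An empty list from `check_corpus(Path("fixtures/phase-2-prompts"))` would mean every fixture has the three entries the capture run expects.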

P2.3 — Real capture run (M144+, operator-dispatched)

What: For each prompt × system, run the capture binary and store the JSONL.

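# Resolve the corpus to an absolute path so "$p" stays valid after each cd.
corpus="$(cd fixtures/phase-2-prompts && pwd)"

for p in "$corpus"/*/; do
    p="${p%/}"
    id=$(basename "$p")
    # Start from a clean scratch dir; cp -r needs the parent to exist.
    rm -rf "/tmp/run-$id" && mkdir -p "/tmp/run-$id"
    cp -r "$p/cwd-tree" "/tmp/run-$id/teacher"
    cp -r "$p/cwd-tree" "/tmp/run-$id/student"

    # Teacher run: real Claude Code (the subshell keeps the cd local)
    (cd "/tmp/run-$id/teacher" && \
      ccpa-trace-subproc claude-code -p "$(cat "$p/prompt.txt")" \
        > "$p/teacher.ccpa-os-trace.jsonl")

    # Student run: real apr code
    (cd "/tmp/run-$id/student" && \
      ccpa-trace-subproc apr code -p "$(cat "$p/prompt.txt")" \
        > "$p/student.ccpa-os-trace.jsonl")
done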

Operator dispatch: requires the binaries from P2.1. M146 auth-model amendment: claude CLI uses its own session-based auth (claude login), NOT ANTHROPIC_API_KEY. The operator does NOT need to set the env-var; instead, run claude login once if not already logged in, then dispatch. Wall time: ~30-60 seconds per prompt × 2 systems = ~5-10 minutes total for a 5-prompt corpus.

Deliverable: paired .ccpa-os-trace.jsonl files alongside each prompt's meta.toml, plus a scripts/phase-2-capture.sh runner that automates the loop.
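Before scoring, it may be worth confirming that every fixture really ended up with both traces. The sketch below assumes only that each trace line is a standalone JSON value; it makes no assumption about the event schema, and the helper names are hypothetical.

```python
import json
from pathlib import Path

TRACE_NAMES = ("teacher.ccpa-os-trace.jsonl", "student.ccpa-os-trace.jsonl")


def count_events(trace: Path) -> int:
    """Count JSONL records; json.loads raises ValueError on a malformed line."""
    count = 0
    with trace.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)
                count += 1
    return count


def unpaired_fixtures(root: Path) -> list:
    """Return ids of fixtures that lack either trace or hold an empty one."""
    bad = []
    for fixture in sorted(p for p in root.iterdir() if p.is_dir()):
        traces = [fixture / name for name in TRACE_NAMES]
        if not all(t.is_file() and count_events(t) > 0 for t in traces):
            bad.append(fixture.name)
    return bad
```

A non-empty result points at prompts whose capture should be re-dispatched before P2.4 runs.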

P2.4 — Differential scoring + drift analysis (M145)

What: Run ccpa-differ's os_event_parity() on each pair, aggregate, classify drifts.

ccpa os-corpus fixtures/phase-2-prompts/ --json > evidence/phase-2/measured-os-parity.json

Add a new subcommand to ccpa-cli: ccpa os-corpus <dir> [--json] that walks the phase-2 corpus + emits per-prompt + aggregate scores in the same shape as fixtures/canonical/measured-parity.json. The JSON has:

{
  "fixture_corpus_path": "fixtures/phase-2-prompts/",
  "fixture_count": 5,
  "aggregate_score": 0.72,
  "per_fixture": [
    { "id": "0001-edit-readme", "score": 0.83, "drift_count": 3 },
    ...
  ],
  "teacher_source": "real Claude Code @ <claude-code version> @ <date>",
  "student_source": "real apr code @ <apr-cli sha>"
}

Deliverable: ccpa os-corpus subcommand + first measured-os-parity.json evidence file. This is the first runtime evidence-based parity number — the answer to "are we at parity?".
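To make the scoring concrete, here is a minimal sketch of a macro-averaged Jaccard score and the aggregate report. It is not the ccpa-differ implementation: the event representation as `(category, key)` pairs, the choice of an unweighted mean for `aggregate_score`, and the helper names are all assumptions, and `drift_count` and the source fields are omitted.

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 1.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def macro_jaccard(teacher, student) -> float:
    """Mean of per-category Jaccard scores over (category, key) event pairs."""
    t_by_cat, s_by_cat = defaultdict(set), defaultdict(set)
    for cat, key in teacher:
        t_by_cat[cat].add(key)
    for cat, key in student:
        s_by_cat[cat].add(key)
    cats = set(t_by_cat) | set(s_by_cat)
    if not cats:
        return 1.0
    # Each category counts equally, however many events it holds.
    return sum(jaccard(t_by_cat[c], s_by_cat[c]) for c in cats) / len(cats)


def corpus_report(path: str, pairs: dict) -> dict:
    """Build a report with a subset of the fields shown in the JSON above."""
    raw = {fid: macro_jaccard(t, s) for fid, (t, s) in sorted(pairs.items())}
    aggregate = sum(raw.values()) / len(raw) if raw else None
    return {
        "fixture_corpus_path": path,
        "fixture_count": len(raw),
        "aggregate_score": None if aggregate is None else round(aggregate, 2),
        "per_fixture": [
            {"id": fid, "score": round(score, 2)} for fid, score in raw.items()
        ],
    }
```

Macro-averaging keeps a high-volume category (for example, file opens) from drowning out a low-volume one (for example, process execs), which is presumably why the differ uses it.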

P2.5 — Drift triage + bug-fix backlog (M146+, indefinite)

What: For every OsDriftCategory record where teacher and student diverge, classify the cause:

Environmental: different libc paths, different ld.so.cache lookups, different locale-archive accesses. Filter out via path-prefix allowlist; not bugs.
Behavioral: different tmp-file naming, different exec patterns, different file-creation order. Real bugs in apr code to fix.
Sovereignty: drifts involving paths the teacher accesses but student doesn't (or vice versa) that suggest the systems are doing different work on the same prompt.

Per behavioral drift: file an aprender issue with the drift record + reproducer prompt. Track in evidence/phase-2/drift-backlog.md.
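The environmental filter can be sketched as a path-prefix check. The prefixes below are assumed typical glibc/Linux locations for the libc, ld.so.cache, and locale-archive accesses named above, not a list taken from the CCPA codebase. Anything that does not match is left for manual triage, since telling behavioral from sovereignty drifts needs human judgment.

```python
# Assumption: typical Linux locations. On glibc systems locale-archive
# usually lives under /usr/lib/locale/, which the /usr/lib/ prefix covers.
ENVIRONMENTAL_PREFIXES = (
    "/lib/",
    "/lib64/",
    "/usr/lib/",
    "/usr/lib64/",
    "/etc/ld.so.cache",
)


def classify(path: str) -> str:
    """Label one drifted path as environmental noise or as needing triage."""
    if path.startswith(ENVIRONMENTAL_PREFIXES):
        return "environmental"
    return "needs-triage"


def triage(drift_paths) -> dict:
    """Bucket drifted paths so only the needs-triage ones reach the backlog."""
    buckets = {"environmental": [], "needs-triage": []}
    for path in drift_paths:
        buckets[classify(path)].append(path)
    return buckets
```

Only the `needs-triage` bucket would feed evidence/phase-2/drift-backlog.md. A prefix as broad as `/usr/lib/` may hide real differences, so the list is worth reviewing against the first measured run.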

This is the value Phase 2 delivers: ground-truth bug-fix material — exactly what apr code would need to change to converge with claude-code at the OS level.

Status / progress tracking

Phase 2 milestones (M142+) replace the kaizen-treadmill maintenance loop as the primary focus. Per-sub-deliverable expected wall time:

| Sub-deliverable | Wall time | Operator-dispatched? |
|---|---|---|
| P2.1 binary check | ~1 hr (script + verify) | partially (binary install) |
| P2.2 prompt corpus | ~2-3 hrs (5 prompts + cwd trees + metas) | NO |
| P2.3 real capture | ~10 min wall + operator wall-clock for review | YES (requires installed binaries + `claude login` session; no API key per M222 operator-directive) |
| P2.4 scoring + tooling | ~3-4 hrs (`ccpa os-corpus` subcommand + first measured-parity.json) | NO |
| P2.5 drift triage | indefinite (per-drift, ongoing) | partially (filing aprender issues) |

Total time for first measured score: ~1 day of companion-repo work + 1 operator-dispatch session for P2.3 capture.

What changes for future kaizen sweeps

Maintenance-cadence work (M-row refresh + counter bumps) continues — the M116 detector pattern still fires on every PR.

Substantive kaizen focus shifts: future PRs should prioritize Phase 2 deliverables over deepclaude-class spec sweeps or other content drift. The deepclaude integration converged at M123 (8 surfaces grep-relative); the contract bump completed at M141; the next high-value direction is producing real measurement evidence per this Phase 2 plan.

Cross-refs

completeness-assessment.md § Are we at parity with Claude Code? — the operator-prompt question that authored this plan
axis-2-closure-plan.md — the Phase 1 brainstorm (M113) that selected idea (2)
risks.md R11 — the Axis 2 risk this plan progressively discharges
milestones-m101-m111.md — M115.1-M115.5 (Phase 1, M136-M141) + M142+ (Phase 2)
aprender PR #1624 squash 29ce2ea3c — v1.25.0 contract with CCPA-014
