Axis-2 closure plan (M113, 2026-05-10)

Top spec: claude-code-parity-apr-poc.md | Completeness assessment | Risks (R11)

The completeness assessment puts Axis 2 (real differential test against Claude Code) at ~30%. M113 records the operator-prompted brainstorm of 5 closure paths and selects (2) → (3) as the recommended sequence. This is a planning amendment; concrete implementation work lands in subsequent milestones.

Why Axis 2 stalled at ~30%

The original M0 vision (Phase 1 RECORD via HTTPS proxy at ANTHROPIC_BASE_URL) was rescoped OOS at M2.3 ("we will not call api, we will assume claude code"). Since then, the harness validates the meter (the differ + scorer) against AUTHORED canonical fixtures, but the system under test (does apr code really match Claude Code on a never-before-seen prompt?) has no live evidence. M111 raised this as R11; M113 proposes concrete closure paths.

Five candidate paths

(1) HTTPS-proxy reinstatement — the M0 gold standard (DE-PRIORITIZED at M222)

Resurrect Phase 1 RECORD at ANTHROPIC_BASE_URL. Run Claude Code against a curated prompt corpus → mitm-style proxy captures API trace + tool round-trips at message granularity → produces real teacher.ccpa-trace.jsonl. The existing M3 RecordedDriver replays against apr code; the existing differ scores it.

M222 operator-directive: this path is DE-PRIORITIZED. The operator has clarified that CCPA should drive claude via session-based auth (claude login) ONLY — no ANTHROPIC_API_KEY, no direct API calls, no per-call dollar cost. Idea (1) requires an API key + budget by construction (the proxy intercepts and re-issues /v1/messages requests), which conflicts with the directive. Idea (2) (CLI subprocess instrumentation, SHIPPED via M136-M141) is the canonical CCPA path; the Phase 3 outcome bench (M150+) and Phase 5 Arena (M194-M210) both run on top of the same claude CLI subprocess pattern with zero API-key dependency. Idea (1) is preserved here for archaeology + future-optional consideration if a use case ever arises that ONLY a proxy can serve (e.g. live API-trace inspection at the wire level), but is not on any active roadmap.

Aspect	Detail
Proof level	Highest — same surface as the 13 gates currently use, but with real teacher input.
Cost	~3-7 days aprender-side (proxy authoring; deepclaude provides working reference implementation at M118) + Anthropic API key + budget. ~~+ needs PMAT-CODE-LLM-DRIVER-PUBLIC-001 to land for the real student side.~~ (M150 finding: M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 was about `LlmDriver` visibility — already satisfied. The real blocker was a feature-flag config in `apr-cli/Cargo.toml`, addressed in aprender#1638. Locally workaroundable; not gating on the upstream ticket.)
Score impact	Closes Axis 2 to ~70-80%.
Blockers	(a) ~~operator decision to revisit M2.3 rescope~~ — at M118, deepclaude positively DISCHARGES the technical-feasibility doubt: `ANTHROPIC_BASE_URL` IS overridable in production. The rescope was operational (API budget), not technical (auth-pin). For CCPA's pure passthrough-and-log use case (no transformation to a different backend), the cost is reduced to log + passthrough only; (b) ~~`LlmDriver` `pub(crate)` → `pub` upstream~~ — M150 empirically demonstrated this WAS NOT the actual blocker. The real upstream surface is aprender#1638 (feature-flag removal).
Prior art	deepclaude: an open-source proxy on `localhost:3200` that intercepts `/v1/messages` from Claude Code, passes through everything else, exposes `/_proxy/cost` for token-stream tracking, and supports a mid-session backend switch via slash command. CCPA's `ccpa-recorder` crate (currently scaffolding) can adopt this pattern verbatim. Gotchas inherited: MCP server tools and image/vision input do not survive transformation through Anthropic-compatible layers; for CCPA's pure-passthrough use case, those are non-issues. Remote-control sessions (hardcoded `bridge.claudeusercontent.com` WebSocket) are NOT interceptable via `ANTHROPIC_BASE_URL`, so they are out of scope for any RECORD path.

(2) CLI subprocess instrumentation — no API needed

Run both Claude Code and apr code as subprocesses on the same prompt + same git checkout. Wrap with strace / inotify / file-mutation + shell-exec interceptor. Compare action streams at the OS-event level.

Aspect	Detail
Proof level	Lower granularity than (1) — we lose `tool_use_id` correlation; we gain "what actually happened to the filesystem".
Cost	~3-5 days; subprocess wrappers + trace post-processor.
Score impact	~50-60% — narrower lens than (1) but immediately actionable without upstream blockers.
Blockers	Claude Code CLI binary access (just needs the user to have it installed).
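
A minimal sketch of the trace post-processing step, assuming the wrapper captures plain `strace` text output. The event names (`file_open`, `file_write`, `file_unlink`, `exec`) follow the M115.1 proposal; the function name `parse_strace`, the regexes, and the `kind`/`path` field names are hypothetical and only illustrate the idea of normalizing syscalls into an action stream.

```python
import re

# Matches e.g.: openat(AT_FDCWD, "a.txt", O_WRONLY|O_CREAT, 0644) = 3
# An optional "[pid 1234] " prefix (as printed when following forks) is skipped.
SYSCALL_RE = re.compile(r'^(?:\[pid\s+\d+\]\s+)?(\w+)\((.*)\)\s+=\s+(-?\d+)')
QUOTED_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_strace(lines):
    """Turn strace-style lines into a list of normalized OS-level events."""
    fd_paths = {}  # fd -> path; close() and fd reuse are not tracked in this sketch
    events = []
    for line in lines:
        m = SYSCALL_RE.match(line.strip())
        if not m:
            continue
        name, args, ret = m.group(1), m.group(2), int(m.group(3))
        if ret < 0:
            continue  # failed syscalls changed nothing, so they are dropped
        quoted = QUOTED_RE.search(args)
        path = quoted.group(1) if quoted else None
        if name in ("open", "openat") and path is not None:
            fd_paths[ret] = path
            events.append({"kind": "file_open", "path": path})
        elif name == "write":
            fd = int(args.split(",", 1)[0])
            if fd in fd_paths:  # writes to stdout/stderr/pipes are ignored
                events.append({"kind": "file_write", "path": fd_paths[fd]})
        elif name in ("unlink", "unlinkat") and path is not None:
            events.append({"kind": "file_unlink", "path": path})
        elif name == "execve" and path is not None:
            events.append({"kind": "exec", "path": path})
    return events
```

Each event can then be serialized with `json.dumps` as one line of a `.ccpa-trace.jsonl` file. Dropping failed syscalls and writes to untracked descriptors is a design choice of this sketch; the real post-processor may want to keep them.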

(3) SWE-bench differential evaluation — utility-grade

Feed both systems a curated corpus of real GitHub issues (the full SWE-bench test set has 2,294 task instances). For each issue, collect (a) Claude Code's solution diff and (b) apr code's solution diff, then (c) compare them via files-touched Jaccard + tests-passing-after-patch + semantic-patch-equivalence (per arXiv:2310.06770). Score: % of issues both solve identically + % both pass the hidden test suite.

Aspect	Detail
Proof level	End-to-end utility — proves apr code can fix real bugs Claude Code can also fix. The strongest "the user got the same result" claim.
Cost	1-2 weeks; SWE-bench harness exists, adapting both systems takes time. Each issue: 5-30 min wall clock.
Score impact	60-70%; complements (1)/(2) on a capability axis rather than action-equivalence.
Blockers	Disk + GPU time for 2,294 × 2 runs (filterable to a subset, e.g., SWE-bench Lite at 300).
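
The files-touched Jaccard in step (c) can be sketched as follows, assuming both solutions are available as unified diffs with `a/` and `b/` path prefixes. The function names are hypothetical, and the header detection is deliberately naive.

```python
def files_touched(diff_text):
    """Collect the set of file paths named in unified-diff file headers."""
    files = set()
    for line in diff_text.splitlines():
        # Naive: a removed content line that itself starts with "-- " would
        # also match; a real harness should track hunk boundaries.
        if line.startswith("--- ") or line.startswith("+++ "):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":
                continue  # added or deleted file: the other header names it
            if path.startswith(("a/", "b/")):
                path = path[2:]
            files.add(path)
    return files


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, if the teacher diff touches `a.py` and `b.py` while the student diff touches `b.py` and `c.py`, the intersection has one file and the union three, so the score is 1/3. This measures only where the patches land, which is why the plan pairs it with tests-passing-after-patch.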

(4) Metamorphic relations — principled invariant survival

Define ~10 metamorphic relations per METTLE / LLMORPH:

Same prompt twice → same action multiset (replay determinism)
Prompt with renamed identifiers → same actions modulo rename
Reordered tool-call dependencies → same final state
Prompt + extra context that doesn't change the task → same actions
Permuted file-read order → same edit sequence
Equivalent natural-language paraphrases → same patches

Run both systems on a held-out corpus of 100 prompts; assert that both satisfy the same relations. PASS criterion: relation-survival-rate matches between systems within ε.

Aspect	Detail
Proof level	Principled and doesn't require ground truth; weaker than (1) but captures "are they two implementations of the same algorithm?"
Cost	~1 week; existing arXiv basis already cited at academic-basis.md.
Score impact	~50%; complements (2) by validating behavioral invariants.
Blockers	Defining the relations precisely; choosing ε.
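
One way to read the PASS criterion is as a per-relation comparison of survival rates. The sketch below assumes each system's results are already recorded as a mapping from relation name to a list of booleans (one per prompt, `True` when the relation held); that data shape and the function names are assumptions, not part of the plan.

```python
def survival_rates(outcomes):
    """Fraction of prompts on which each relation held."""
    return {rel: sum(results) / len(results) for rel, results in outcomes.items()}


def relations_match(teacher_outcomes, student_outcomes, eps):
    """For each relation, True when the two survival rates differ by at most eps."""
    teacher = survival_rates(teacher_outcomes)
    student = survival_rates(student_outcomes)
    return {rel: abs(teacher[rel] - student[rel]) <= eps for rel in teacher}
```

With a teacher surviving replay determinism on 3 of 4 prompts and a student on 2 of 4, the rates are 0.75 and 0.5; at ε = 0.1 that relation fails, while a relation where both score 0.25 passes. Note that matching rates do not require the relation to hold: two systems that both violate a relation equally often still match, which is the intended "same algorithm" reading.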

(5) Statistical behavior-fingerprint divergence

Don't try to compare per-prompt action equivalence. Instead, over a corpus of N prompts (N ≥ 100), capture distributions: tool-name histogram, session length, tool-calls-per-prompt mean+std, time-to-first-action, error-recovery rate. Compute Wasserstein-1 / KL-divergence between Claude Code's distribution and apr code's. PASS = each metric below a threshold.

Aspect	Detail
Proof level	Population-level "feels similar" claim. Doesn't catch per-prompt divergence but catches systematic skew.
Cost	~3 days; trivially parallelizable; cheap.
Score impact	~30-40% on its own; useful as a cheap continuous health check paired with (1) or (3).
Blockers	None.
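
Both divergence measures are small enough to sketch without dependencies. The version below makes two simplifying assumptions that the plan does not specify: Wasserstein-1 is computed only for equal-sized samples (where it reduces to the mean absolute difference of the sorted values), and KL divergence uses add-alpha smoothing so that a tool seen by only one system does not produce an infinite value.

```python
import math
from collections import Counter


def wasserstein_1(xs, ys):
    """W1 between two equal-sized 1-D samples (e.g. tool calls per prompt)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch needs two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over tool-name histograms, with add-alpha smoothing."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(keys)
    q_total = sum(q_counts.values()) + alpha * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + alpha) / p_total
        q = (q_counts.get(key, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl
```

A usage sketch: `kl_divergence(Counter(teacher_tools), Counter(student_tools))` for the tool-name histogram, and `wasserstein_1(teacher_lengths, student_lengths)` for session length. KL is asymmetric, so the direction (teacher as P or as Q) needs to be fixed when thresholds are calibrated.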

Recommended sequence

(2) → (3): ship the CLI subprocess trace harness first (cheap, ~3-5 days, immediately usable on any prompt without API access), THEN layer on SWE-bench (1-2 weeks, ground-truth utility). That gets Axis 2 to ~60-70% without needing the upstream LlmDriver-public ticket OR an Anthropic API budget.

(1) stays the gold standard for proof level but, at M113, was gated on those two upstream concerns (it was later de-prioritized at M222; see (1) above). (4) and (5) are good to add LATER as cheap continuous-health checks once a real teacher source exists.

Concrete M115 deliverable (proposed; renumbered from M114 at M114-kaizen-sweep)

Idea (2) decomposes into:

Sub-milestone	Deliverable	Estimate
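M115.1	New crate `crates/ccpa-subproc/` with binary `ccpa-trace-subproc <cmd> [args...]` that runs `cmd` under `strace -e trace=open,openat,write,unlink,unlinkat,execve` + `inotifywait` on $CWD; emits a `.ccpa-trace.jsonl` of OS-level actions (file_open, file_write, file_unlink, exec). (The syscall list uses `execve` because strace has no `exec` syscall name, and includes `openat`/`unlinkat` because current libc versions issue those instead of `open`/`unlink`.)	~2 days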
M115.2	`ccpa-trace-subproc claude -p "<prompt>" > teacher.jsonl` and `ccpa-trace-subproc apr code -p "<prompt>" > student.jsonl` smoke-test on a tiny corpus (5 fixtures).	~1 day
M115.3	Extend `ccpa-differ` with a new `OS-level differ` mode that operates on the OS-event trace (vs the API-level trace today). New `DriftCategory::OsLevelMismatch` variants.	~2 days
M115.4	Falsifier `FALSIFY-CCPA-014` (NEW gate): `ccpa-trace-subproc-parity-on-curated-corpus`. Asserts that for a curated `os-fixtures/` corpus (initially 5 prompts), the OS-level action streams of Claude Code + `apr code` stay within a divergence threshold T. T to be calibrated empirically (probably expressed as a similarity floor, tool-name multiset Jaccard ≥ 0.6 initially; tighten as we learn).	~2 days
M115.5	Companion contract bump `claude-code-parity-apr-v1` v1.23.0 → v1.24.0 adding FALSIFY-CCPA-014 to the gate registry. M22 paired-mirror push to aprender.	~half-day

Total: ~7-8 days for M115.1-M115.5 (2 + 1 + 2 + 2 + 0.5 = 7.5); moves Axis 2 from ~30% to ~50% (CLI subprocess instrumentation working end-to-end on a small corpus). Milestones after M115 extend to SWE-bench differential evaluation per (3), lifting Axis 2 to ~60-70%.
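
The M115.4 gate check could look roughly like the sketch below, assuming each trace line is a JSON object with a `kind` field naming the action. That field name, the function names, and the use of action kinds as the multiset elements are illustrative assumptions; the 0.6 default mirrors the initial threshold proposed above.

```python
import json
from collections import Counter


def action_kinds(jsonl_text):
    """Count action kinds in a .ccpa-trace.jsonl payload (one JSON object per line)."""
    kinds = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            kinds[json.loads(line)["kind"]] += 1
    return kinds


def multiset_jaccard(a, b):
    """Sum of per-key minima over sum of per-key maxima; empty vs empty is 1.0."""
    union = sum((a | b).values())  # Counter | Counter keeps the max count per key
    if union == 0:
        return 1.0
    return sum((a & b).values()) / union  # Counter & Counter keeps the min count


def gate_passes(teacher_jsonl, student_jsonl, threshold=0.6):
    """True when the two traces are at least `threshold` similar."""
    score = multiset_jaccard(action_kinds(teacher_jsonl), action_kinds(student_jsonl))
    return score >= threshold
```

As a worked case: a teacher trace with 3 opens, 2 writes, and 1 exec against a student trace with 2 opens, 2 writes, and 2 execs gives minima 2 + 2 + 1 = 5 and maxima 3 + 2 + 2 = 7, so the score is 5/7 ≈ 0.71 and the gate passes at 0.6. The multiset form ignores ordering and paths, so it is a coarse first threshold rather than a full differ.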

What this does NOT do

M113 is a planning amendment only. No code changes; no fixture changes; no contract bump. Future M115 (renumbered from M114) implements idea (2).
The existing 13 gates remain valid as meter validation against AUTHORED fixtures. M115's FALSIFY-CCPA-014 is a NEW gate at a NEW granularity (OS-level events, not API messages). The existing gates don't get downgraded; they stay the source of truth for the meter.
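Choosing (2) over (1) is a pragmatic ordering, not a permanent rejection of (1). At M113, idea (1) (HTTPS proxy) was still the gold standard and was expected to land once LlmDriver-public did; per the M222 directive recorded under (1), it is now de-prioritized and kept for future-optional consideration only.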

Cross-refs

Top spec: claude-code-parity-apr-poc.md
Honest 3-axis breakdown: completeness-assessment.md
R11 risk row (raised at M111): risks.md
Architecture (the rescoped Phase 1): architecture.md § "Original Phase 1 rationale — now historical"
Academic basis for ideas (2)+(3)+(4): arXiv:1807.10453 (METTLE), 2207.11976 (differential testing), 2310.06770 (SWE-bench), 2603.23611 (LLMORPH); see academic-basis.md.
