Axis-2 closure plan (M113, 2026-05-10)

Top spec: claude-code-parity-apr-poc.md | Completeness assessment | Risks (R11)

The completeness assessment puts Axis 2 (real differential test against Claude Code) at ~30%. M113 records the operator-prompted brainstorm of 5 closure paths and selects (2) → (3) as the recommended sequence. This is a planning amendment; concrete implementation work lands in subsequent milestones.

Why Axis 2 stalled at ~30%

The original M0 vision (Phase 1 RECORD via HTTPS proxy at ANTHROPIC_BASE_URL) was rescoped as out of scope (OOS) at M2.3 ("we will not call api, we will assume claude code"). Since then, the harness has validated the meter (the differ + scorer) against AUTHORED canonical fixtures, but the question about the system under test (does apr code really match Claude Code on a never-before-seen prompt?) has no live evidence. M111 raised this as R11; M113 proposes concrete closure paths.

Five candidate paths

(1) HTTPS-proxy reinstatement — the M0 gold standard (DE-PRIORITIZED at M222)

Resurrect Phase 1 RECORD at ANTHROPIC_BASE_URL. Run Claude Code against a curated prompt corpus → mitm-style proxy captures API trace + tool round-trips at message granularity → produces real teacher.ccpa-trace.jsonl. The existing M3 RecordedDriver replays against apr code; the existing differ scores it.

M222 operator-directive: this path is DE-PRIORITIZED. The operator has clarified that CCPA should drive claude via session-based auth (claude login) ONLY — no ANTHROPIC_API_KEY, no direct API calls, no per-call dollar cost. Idea (1) requires an API key + budget by construction (the proxy intercepts and re-issues /v1/messages requests), which conflicts with the directive. Idea (2) (CLI subprocess instrumentation, SHIPPED via M136-M141) is the canonical CCPA path; the Phase 3 outcome bench (M150+) and Phase 5 Arena (M194-M210) both run on top of the same claude CLI subprocess pattern with zero API-key dependency. Idea (1) is preserved here for archaeology + future-optional consideration if a use case ever arises that ONLY a proxy can serve (e.g. live API-trace inspection at the wire level), but is not on any active roadmap.

| Aspect | Detail |
| --- | --- |
| Proof level | Highest: same surface as the 13 gates currently use, but with real teacher input. |
| Cost | ~3-7 days aprender-side (proxy authoring; deepclaude provides a working reference implementation at M118), plus an Anthropic API key and budget. Also needs PMAT-CODE-LLM-DRIVER-PUBLIC-001 to land for the real student side. (M150 finding: M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 was about LlmDriver visibility, which was already satisfied. The real blocker was a feature-flag config in apr-cli/Cargo.toml, addressed in aprender#1638. It can be worked around locally and does not gate on the upstream ticket.) |
| Score impact | Closes Axis 2 to ~70-80%. |
| Blockers | (a) Operator decision to revisit the M2.3 rescope. At M118, deepclaude positively DISCHARGES the technical-feasibility doubt: ANTHROPIC_BASE_URL IS overridable in production. The rescope was operational (API budget), not technical (auth-pin). For CCPA's pure passthrough-and-log use case (no transformation to a different backend), the cost is reduced to log + passthrough only. (b) LlmDriver pub(crate) → pub upstream. M150 empirically demonstrated this WAS NOT the actual blocker; the real upstream surface is aprender#1638 (feature-flag removal). |
| Prior art | deepclaude: an open-source proxy on localhost:3200 that intercepts /v1/messages from Claude Code, passes through everything else, exposes /_proxy/cost for token-stream tracking, and supports a mid-session backend switch via slash command. CCPA's ccpa-recorder crate (currently scaffolding) can adopt this pattern verbatim. Gotchas inherited: MCP server tools and image/vision input do not survive transformation through Anthropic-compatible layers; for CCPA's pure-passthrough use case, those are non-issues. Remote-control sessions (hardcoded bridge.claudeusercontent.com WebSocket) are NOT interceptable by ANTHROPIC_BASE_URL and are out of scope for any RECORD path. |

(2) CLI subprocess instrumentation — no API needed

Run both Claude Code and apr code as subprocesses on the same prompt + same git checkout. Wrap with strace / inotify / file-mutation + shell-exec interceptor. Compare action streams at the OS-event level.

| Aspect | Detail |
| --- | --- |
| Proof level | Lower granularity than (1): we lose tool_use_id correlation; we gain "what actually happened to the filesystem". |
| Cost | ~3-5 days; subprocess wrappers + trace post-processor. |
| Score impact | ~50-60%; a narrower lens than (1) but immediately actionable without upstream blockers. |
| Blockers | Claude Code CLI binary access (just needs the user to have it installed). |
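The "trace post-processor" in the cost row can be illustrated with a minimal sketch that turns strace-style lines into OS-level action records. It assumes strace's usual `name(args) = ret` line shape; the record fields (`kind`, `path`) and the rule that an open with write flags counts as `file_write` are illustrative assumptions, not the plan's schema. The action kinds themselves are the ones named in M115.1.

```python
import json
import re

# Syscall -> action kind. Kinds come from the plan; the mapping is an assumption.
SYSCALL_KIND = {
    "open": "file_open",
    "openat": "file_open",
    "unlink": "file_unlink",
    "unlinkat": "file_unlink",
    "execve": "exec",
}

# Matches e.g.: [pid 4242] openat(AT_FDCWD, "src/lib.rs", O_RDONLY) = 3
LINE_RE = re.compile(
    r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)'
)
# First double-quoted argument, allowing backslash escapes.
PATH_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_line(line):
    """Return an action record for one strace line, or None if it is not relevant."""
    m = LINE_RE.match(line.strip())
    if m is None or m.group("name") not in SYSCALL_KIND:
        return None
    if int(m.group("ret")) < 0:
        return None  # failed syscalls did not change anything
    path = PATH_RE.search(m.group("args"))
    if path is None:
        return None
    kind = SYSCALL_KIND[m.group("name")]
    # Assumption: an open with write flags is recorded as a write, since
    # write(2) itself only carries a file descriptor, not a path.
    if kind == "file_open" and re.search(r"O_(WRONLY|RDWR)", m.group("args")):
        kind = "file_write"
    return {"kind": kind, "path": path.group(1)}


def trace_to_jsonl(lines):
    """Serialize all relevant actions as one JSON object per line."""
    actions = (parse_line(line) for line in lines)
    return "\n".join(json.dumps(a, sort_keys=True) for a in actions if a is not None)
```

A real post-processor would also need to follow forked children and map file descriptors back to paths; this sketch deliberately skips both.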

(3) SWE-bench differential evaluation — utility-grade

Feed both systems a curated corpus of real GitHub issues (the full SWE-bench test set has 2,294 task instances). For each issue, collect (a) Claude Code's solution diff and (b) apr code's solution diff, then (c) compare them via files-touched Jaccard + tests-passing-after-patch + semantic-patch-equivalence (per arXiv:2310.06770). Score: % of issues both solve identically + % of issues where both pass the hidden test suite.

| Aspect | Detail |
| --- | --- |
| Proof level | End-to-end utility: proves apr code can fix real bugs Claude Code can also fix. The strongest "the user got the same result" claim. |
| Cost | 1-2 weeks; the SWE-bench harness exists, but adapting both systems takes time. Each issue: 5-30 min wall clock. |
| Score impact | 60-70%; complements (1)/(2) on a capability axis rather than action-equivalence. |
| Blockers | Disk + GPU time for 2,294 × 2 runs (filterable to a subset, e.g., SWE-bench-Lite at 300). |
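The files-touched Jaccard comparison from step (c) could look roughly like the sketch below. It assumes both solutions are available as git-style unified diffs with `diff --git a/... b/...` headers; the function names are hypothetical, and paths containing spaces are not handled.

```python
import re

# Header line emitted by `git diff` for each changed file.
DIFF_HEADER_RE = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def files_touched(diff_text):
    """Return the set of post-change paths named in a git-style unified diff."""
    paths = set()
    for line in diff_text.splitlines():
        m = DIFF_HEADER_RE.match(line)
        if m:
            paths.add(m.group(2))
    return paths


def jaccard(a, b):
    """Set Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


teacher_diff = "diff --git a/a.py b/a.py\n+x = 1\ndiff --git a/b.py b/b.py\n+y = 2\n"
student_diff = "diff --git a/b.py b/b.py\n+y = 2\ndiff --git a/c.py b/c.py\n+z = 3\n"
score = jaccard(files_touched(teacher_diff), files_touched(student_diff))
```

Here the two patches share one file out of three distinct files touched, so `score` is one third. Files-touched overlap says nothing about whether the edits agree, which is why the plan pairs it with the test-suite and patch-equivalence checks.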

(4) Metamorphic relations — principled invariant survival

Define ~10 metamorphic relations per METTLE / LLMORPH:

  • Same prompt twice → same action multiset (replay determinism)
  • Prompt with renamed identifiers → same actions modulo rename
  • Reordered tool-call dependencies → same final state
  • Prompt + extra context that doesn't change the task → same actions
  • Permuted file-read order → same edit sequence
  • Equivalent natural-language paraphrases → same patches

Run both systems on a held-out corpus of 100 prompts; assert that both satisfy the same relations. PASS criterion: relation-survival-rate matches between systems within ε.

| Aspect | Detail |
| --- | --- |
| Proof level | Principled and doesn't require ground truth; weaker than (1) but captures "are they two implementations of the same algorithm?" |
| Cost | ~1 week; existing arXiv basis already cited at academic-basis.md. |
| Score impact | ~50%; complements (2) by validating behavioral invariants. |
| Blockers | Defining the relations precisely; choosing ε. |
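The PASS criterion (relation-survival-rate matching within ε) can be sketched for a single relation as follows. This is an illustration under stated assumptions: the relation checked is "same action multiset" between an original run and a transformed run, actions are represented as plain strings, and the function names are hypothetical.

```python
from collections import Counter


def survival_rate(pairs):
    """Fraction of (original, transformed) action lists with equal multisets."""
    if not pairs:
        raise ValueError("need at least one prompt pair")
    held = sum(1 for original, transformed in pairs
               if Counter(original) == Counter(transformed))
    return held / len(pairs)


def rates_match(teacher_rate, student_rate, epsilon):
    """PASS when both systems satisfy the relation at a similar rate."""
    return abs(teacher_rate - student_rate) <= epsilon


# Toy data: each tuple is (actions on prompt, actions on transformed prompt).
teacher_pairs = [
    (["read", "edit"], ["edit", "read"]),   # holds (order differs only)
    (["read"], ["read"]),                   # holds
    (["read", "edit"], ["read", "edit"]),   # holds
    (["read"], ["read", "exec"]),           # broken
]
student_pairs = [
    (["read", "edit"], ["edit", "read"]),   # holds
    (["read"], ["exec"]),                   # broken
    (["read", "edit"], ["read", "edit"]),   # holds
    (["read"], ["read", "exec"]),           # broken
]
teacher_rate = survival_rate(teacher_pairs)
student_rate = survival_rate(student_pairs)
```

With this toy data the teacher rate is 0.75 and the student rate is 0.5, so the check passes at ε = 0.25 and fails at ε = 0.1, which shows how much the verdict depends on the choice of ε listed under Blockers.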

(5) Statistical behavior-fingerprint divergence

Don't try to compare per-prompt action equivalence. Instead, over a corpus of N prompts (N ≥ 100), capture distributions: tool-name histogram, session length, tool-calls-per-prompt mean+std, time-to-first-action, error-recovery rate. Compute Wasserstein-1 / KL-divergence between Claude Code's distribution and apr code's. PASS = each metric below a threshold.

| Aspect | Detail |
| --- | --- |
| Proof level | Population-level "feels similar" claim. Doesn't catch per-prompt divergence but catches systematic skew. |
| Cost | ~3 days; trivially parallelizable; cheap. |
| Score impact | ~30-40% on its own; useful as a cheap continuous health check paired with (1) or (3). |
| Blockers | None. |
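The two divergence measures named above can be sketched with the standard library alone. The assumptions here are mine: Wasserstein-1 is computed for two equal-size one-dimensional samples (for example, tool calls per prompt), and KL divergence is computed over tool-name histograms with add-alpha smoothing so that a tool seen by only one system does not produce an infinite value.

```python
import math
from collections import Counter


def wasserstein_1(xs, ys):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def kl_divergence(p_counts, q_counts, alpha=1.0):
    """KL(P || Q) over histogram keys, with add-alpha smoothing (an assumption)."""
    keys = sorted(set(p_counts) | set(q_counts))
    p_total = sum(p_counts.values()) + alpha * len(keys)
    q_total = sum(q_counts.values()) + alpha * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + alpha) / p_total
        q = (q_counts.get(key, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy inputs: tool calls per prompt, and tool-name histograms.
teacher_calls = [1, 2, 3]
student_calls = [2, 3, 4]
teacher_tools = Counter({"Read": 10, "Edit": 5, "Bash": 2})
student_tools = Counter({"Read": 4, "Edit": 5, "Bash": 8})

w1 = wasserstein_1(teacher_calls, student_calls)
kl = kl_divergence(teacher_tools, student_tools)
```

KL divergence is asymmetric, so a real gate would have to fix which system plays the reference distribution, or use a symmetric variant; the plan does not specify either.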

Recommended sequence

(2) → (3): ship the CLI subprocess trace harness first (cheap, ~3-5 days, immediately usable on any prompt without API access), THEN layer on SWE-bench (1-2 weeks, ground-truth utility). That gets Axis 2 to ~60-70% without needing the upstream LlmDriver-public ticket OR an Anthropic API budget.

(1) stays the gold standard but is gated on those two upstream concerns. (4) and (5) are good to add LATER as cheap continuous-health checks once a real teacher source exists.

Concrete M115 deliverable (proposed; renumbered from M114 at M114-kaizen-sweep)

Idea (2) decomposes into:

| Sub-milestone | Deliverable | Estimate |
| --- | --- | --- |
| M115.1 | New crate crates/ccpa-subproc/ with binary ccpa-trace-subproc <cmd> [args...] that runs cmd under strace -e trace=open,openat,write,unlink,unlinkat,execve + inotifywait on $CWD; emits a .ccpa-trace.jsonl of OS-level actions (file_open, file_write, file_unlink, exec). | ~2 days |
| M115.2 | ccpa-trace-subproc claude -p "<prompt>" > teacher.jsonl and ccpa-trace-subproc apr code -p "<prompt>" > student.jsonl smoke-test on a tiny corpus (5 fixtures). | ~1 day |
| M115.3 | Extend ccpa-differ with a new OS-level differ mode that operates on the OS-event trace (vs the API-level trace today). New DriftCategory::OsLevelMismatch variants. | ~2 days |
| M115.4 | Falsifier FALSIFY-CCPA-014 (NEW gate): ccpa-trace-subproc-parity-on-curated-corpus. Asserts that for a curated os-fixtures/ corpus (initially 5 prompts), the OS-level action streams of Claude Code and apr code agree to at least a similarity threshold T. T is to be calibrated empirically (probably tool-name multiset Jaccard ≥ 0.6 initially; tighten as we learn). | ~2 days |
| M115.5 | Companion contract bump claude-code-parity-apr-v1 v1.23.0 → v1.24.0 adding FALSIFY-CCPA-014 to the gate registry. M22 paired-mirror push to aprender. | ~half-day |

Total: ~7-8 days for M115.1-M115.5; moves Axis 2 from ~30% to ~50% (CLI subprocess instrumentation working end-to-end on a small corpus). Work after M115 extends to SWE-bench differential evaluation per (3), raising Axis 2 to ~60-70%.
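The M115.4 gate check (multiset Jaccard against an initial threshold of 0.6) might be sketched as below. The action streams are reduced to lists of action-kind strings, and the function and constant names are hypothetical; only the 0.6 starting value comes from the plan.

```python
from collections import Counter

# Initial threshold from the plan; expected to be tightened after calibration.
THRESHOLD_T = 0.6


def multiset_jaccard(a, b):
    """Jaccard over multisets: sum of min counts / sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # per-key minimum
    union = sum((ca | cb).values())         # per-key maximum
    return 1.0 if union == 0 else intersection / union


def gate_passes(teacher_actions, student_actions, threshold=THRESHOLD_T):
    """True when the two action streams are at least `threshold` similar."""
    return multiset_jaccard(teacher_actions, student_actions) >= threshold


teacher = ["file_open", "file_open", "file_write", "exec"]
student = ["file_open", "file_write", "file_write", "exec"]
similarity = multiset_jaccard(teacher, student)
```

For this pair the per-kind minimum counts sum to 3 and the maximum counts sum to 5, giving a similarity of 0.6, which sits exactly on the initial threshold. A multiset comparison ignores ordering, so two streams that perform the same actions in a different sequence score 1.0.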

What this does NOT do

  • M113 is a planning amendment only. No code changes; no fixture changes; no contract bump. Future M115 (renumbered from M114) implements idea (2).
  • The existing 13 gates remain valid as meter validation against AUTHORED fixtures. M115's FALSIFY-CCPA-014 is a NEW gate at a NEW granularity (OS-level events, not API messages). The existing gates don't get downgraded; they stay the source of truth for the meter.
  • Choosing (2) over (1) is a pragmatic ordering, not a permanent rejection of (1). Idea (1) (HTTPS proxy) is still the gold standard and should land when LlmDriver-public does.

Cross-refs