The completeness assessment puts Axis 2 (real differential test against Claude Code) at ~30%. M113 records the operator-prompted brainstorm of 5 closure paths and selects (2) → (3) as the recommended sequence. This is a planning amendment; concrete implementation work lands in subsequent milestones.
The original M0 vision (Phase 1 RECORD via HTTPS proxy at ANTHROPIC_BASE_URL) was rescoped OOS at M2.3 ("we will not call api, we will assume claude code"). Since then, the harness validates the meter (the differ + scorer) against AUTHORED canonical fixtures, but the system under test (does apr code really match Claude Code on a never-before-seen prompt?) has no live evidence. M111 raised this as R11; M113 proposes concrete closure paths.
Resurrect Phase 1 RECORD at ANTHROPIC_BASE_URL. Run Claude Code against a curated prompt corpus → mitm-style proxy captures API trace + tool round-trips at message granularity → produces real teacher.ccpa-trace.jsonl. The existing M3 RecordedDriver replays against apr code; the existing differ scores it.
M222 operator-directive: this path is DE-PRIORITIZED. The operator has clarified that CCPA should drive claude via session-based auth (claude login) ONLY — no ANTHROPIC_API_KEY, no direct API calls, no per-call dollar cost. Idea (1) requires an API key + budget by construction (the proxy intercepts and re-issues /v1/messages requests), which conflicts with the directive. Idea (2) (CLI subprocess instrumentation, SHIPPED via M136-M141) is the canonical CCPA path; the Phase 3 outcome bench (M150+) and Phase 5 Arena (M194-M210) both run on top of the same claude CLI subprocess pattern with zero API-key dependency. Idea (1) is preserved here for archaeology + future-optional consideration if a use case ever arises that ONLY a proxy can serve (e.g. live API-trace inspection at the wire level), but is not on any active roadmap.
| Aspect | Detail |
|---|---|
| Proof level | Highest — same surface as the 13 gates currently use, but with real teacher input. |
| Cost | ~3-7 days aprender-side (proxy authoring; deepclaude provides a working reference implementation at M118) + Anthropic API key + budget. LlmDriver visibility is already satisfied: the real blocker was a feature-flag config in apr-cli/Cargo.toml, addressed in aprender#1638 (locally workaroundable; not gating on the upstream ticket). |
| Score impact | Closes Axis 2 to ~70-80%. |
| Blockers | (a) ANTHROPIC_BASE_URL IS overridable in production. The rescope was operational (API budget), not technical (auth-pin). For CCPA's pure passthrough-and-log use case (no transformation to a different backend), the cost is reduced to log + passthrough only; (b) LlmDriver pub(crate) → pub upstream |
| Prior art | deepclaude: an open-source proxy on localhost:3200 that intercepts /v1/messages from Claude Code, passes through everything else, exposes /_proxy/cost for token-stream tracking, and supports mid-session backend switching via slash command. CCPA's ccpa-recorder crate (currently scaffolding) can adopt this pattern verbatim. Inherited gotchas: MCP server tools and image/vision input do not survive transformation through Anthropic-compatibility layers; for CCPA's pure-passthrough use case these are non-issues. Remote-control sessions (hardcoded bridge.claudeusercontent.com WebSocket) are NOT interceptable via ANTHROPIC_BASE_URL and are out of scope for any RECORD path. |
Run both Claude Code and apr code as subprocesses on the same prompt + same git checkout. Wrap with strace / inotify / file-mutation + shell-exec interceptor. Compare action streams at the OS-event level.
| Aspect | Detail |
|---|---|
| Proof level | Lower granularity than (1) — we lose tool_use_id correlation; we gain "what actually happened to the filesystem". |
| Cost | ~3-5 days; subprocess wrappers + trace post-processor. |
| Score impact | ~50-60% — narrower lens than (1) but immediately actionable without upstream blockers. |
| Blockers | Claude Code CLI binary access (just needs the user to have it installed). |
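As a rough illustration of the trace post-processor, the sketch below turns one line of strace-style output into a normalized OS-level event. The event kinds mirror the ones named in the M115.1 deliverable (file_open, file_write, file_unlink, exec); the `EventKind` and `OsEvent` types, the function names, and the exact syscall list are assumptions made for this example, not the shipped design. Escaped quotes inside arguments are not handled.

```rust
/// OS-level event kinds (names follow the M115.1 deliverable; the enum itself is hypothetical).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum EventKind {
    FileOpen,
    FileWrite,
    FileUnlink,
    Exec,
}

/// One normalized OS-level action (hypothetical record shape).
#[derive(Debug, PartialEq, Eq, Clone)]
struct OsEvent {
    kind: EventKind,
    /// First quoted argument, kept only for syscalls that take a path.
    path: Option<String>,
}

/// Map a syscall name to an event kind; syscalls outside the list are ignored.
fn classify(syscall: &str) -> Option<EventKind> {
    match syscall {
        "open" | "openat" | "creat" => Some(EventKind::FileOpen),
        "write" | "pwrite64" | "writev" => Some(EventKind::FileWrite),
        "unlink" | "unlinkat" => Some(EventKind::FileUnlink),
        "execve" | "execveat" => Some(EventKind::Exec),
        _ => None,
    }
}

/// Return the contents of the first double-quoted argument, if any.
fn first_quoted(args: &str) -> Option<String> {
    let start = args.find('"')? + 1;
    let len = args[start..].find('"')?;
    Some(args[start..start + len].to_string())
}

/// Parse a line such as `openat(AT_FDCWD, "src/lib.rs", O_RDONLY) = 3`.
/// A leading PID column (`1234 ` or `[pid 1234] `) is tolerated.
fn parse_line(line: &str) -> Option<OsEvent> {
    let paren = line.find('(')?;
    // The syscall name is the identifier immediately before the first '('.
    let name = line[..paren]
        .rsplit(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .next()?;
    let kind = classify(name)?;
    let path = if kind == EventKind::FileWrite {
        None // write(2) takes a file descriptor, so its quoted argument is data
    } else {
        first_quoted(&line[paren..])
    };
    Some(OsEvent { kind, path })
}
```

Resolving `write` events back to a path would need a file-descriptor table built from the preceding open events; that step is left out of the sketch.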
Feed both systems a curated corpus of real GitHub issues (the full SWE-bench test set has 2,294 task instances). For each issue, collect (a) Claude Code's solution diff and (b) apr code's solution diff, then (c) compare them via files-touched Jaccard + tests-passing-after-patch + semantic-patch-equivalence (per arXiv:2310.06770). Score: % of issues both solve identically + % of issues where both pass the hidden test suite.
| Aspect | Detail |
|---|---|
| Proof level | End-to-end utility — proves apr code can fix real bugs Claude Code can also fix. The strongest "the user got the same result" claim. |
| Cost | 1-2 weeks; SWE-bench harness exists, adapting both systems takes time. Each issue: 5-30 min wall clock. |
| Score impact | 60-70%; complements (1)/(2) on a capability axis rather than action-equivalence. |
| Blockers | Disk + GPU time for 2294 × 2 runs (filterable to a subset, e.g., SWE-bench-Lite at 300). |
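The scoring step for this idea could look roughly like the sketch below, which covers two of the three comparison signals: files-touched Jaccard per issue and the both-pass rate over the corpus. Semantic patch equivalence is not modelled. The `Outcome` and `Summary` shapes and all function names are hypothetical, chosen only for this example.

```rust
use std::collections::BTreeSet;

/// Result of one system on one issue (hypothetical record shape).
struct Outcome {
    files_touched: BTreeSet<String>,
    tests_pass: bool,
}

/// Corpus-level summary over paired (teacher, student) outcomes.
struct Summary {
    mean_jaccard: f64,
    both_pass_rate: f64,
}

/// Convenience constructor for a set of file paths.
fn paths(xs: &[&str]) -> BTreeSet<String> {
    xs.iter().map(|s| s.to_string()).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical.
fn jaccard(a: &BTreeSet<String>, b: &BTreeSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Average the per-issue Jaccard and count issues where both systems pass.
/// Returns None for an empty corpus.
fn summarize(pairs: &[(Outcome, Outcome)]) -> Option<Summary> {
    if pairs.is_empty() {
        return None;
    }
    let n = pairs.len() as f64;
    let mean_jaccard = pairs
        .iter()
        .map(|(t, s)| jaccard(&t.files_touched, &s.files_touched))
        .sum::<f64>()
        / n;
    let both_pass = pairs
        .iter()
        .filter(|(t, s)| t.tests_pass && s.tests_pass)
        .count() as f64;
    Some(Summary {
        mean_jaccard,
        both_pass_rate: both_pass / n,
    })
}
```

Keeping the two numbers separate, rather than folding them into one score, makes it visible when the systems touch the same files but only one of them actually fixes the bug.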
Define ~10 metamorphic relations per METTLE / LLMORPH:
- Same prompt twice → same action multiset (replay determinism)
- Prompt with renamed identifiers → same actions modulo rename
- Reordered tool-call dependencies → same final state
- Prompt + extra context that doesn't change the task → same actions
- Permuted file-read order → same edit sequence
- Equivalent natural-language paraphrases → same patches
Run both systems on a held-out corpus of 100 prompts; assert that both satisfy the same relations. PASS criterion: relation-survival-rate matches between systems within ε.
| Aspect | Detail |
|---|---|
| Proof level | Principled and doesn't require ground truth; weaker than (1) but captures "are they two implementations of the same algorithm?" |
| Cost | ~1 week; existing arXiv basis already cited at academic-basis.md. |
| Score impact | ~50%; complements (2) by validating behavioral invariants. |
| Blockers | Defining the relations precisely; choosing ε. |
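One possible reading of the PASS criterion, sketched under stated assumptions: each system gets a tally of how many prompts satisfied each relation, and a relation is flagged when the two survival rates differ by more than ε. The `Tally` alias, the relation names used as keys, and the decision to flag relations that are missing or untested on either side are all assumptions of this example.

```rust
use std::collections::BTreeMap;

/// Relation name -> (prompts where the relation held, prompts tried).
type Tally = BTreeMap<String, (u32, u32)>;

/// Fraction of prompts on which a relation held; None if it was never tried.
fn survival_rate(held: u32, tried: u32) -> Option<f64> {
    if tried == 0 {
        None
    } else {
        Some(held as f64 / tried as f64)
    }
}

/// List relations whose survival rates differ by more than `epsilon`,
/// plus relations that are missing or untested on either side.
fn divergent_relations(teacher: &Tally, student: &Tally, epsilon: f64) -> Vec<String> {
    let mut out = Vec::new();
    for (name, &(t_held, t_tried)) in teacher {
        let t_rate = survival_rate(t_held, t_tried);
        let s_rate = student
            .get(name)
            .and_then(|&(held, tried)| survival_rate(held, tried));
        match (t_rate, s_rate) {
            (Some(t), Some(s)) if (t - s).abs() <= epsilon => {}
            _ => out.push(name.clone()),
        }
    }
    // Relations measured only for the student are also reported.
    for name in student.keys() {
        if !teacher.contains_key(name) {
            out.push(name.clone());
        }
    }
    out
}
```

An empty result would correspond to PASS. With only 100 prompts per relation, sampling noise alone can move a rate by several points, which is worth keeping in mind when choosing ε.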
Don't try to compare per-prompt action equivalence. Instead, over a corpus of N prompts (N ≥ 100), capture distributions: tool-name histogram, session length, tool-calls-per-prompt mean+std, time-to-first-action, error-recovery rate. Compute Wasserstein-1 / KL-divergence between Claude Code's distribution and apr code's. PASS = each metric below a threshold.
| Aspect | Detail |
|---|---|
| Proof level | Population-level "feels similar" claim. Doesn't catch per-prompt divergence but catches systematic skew. |
| Cost | ~3 days; trivially parallelizable; cheap. |
| Score impact | ~30-40% on its own; useful as a cheap continuous health check paired with (1) or (3). |
| Blockers | None. |
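The two distances named above can be sketched with the standard library alone. The KL function works on tool-name histograms and uses additive smoothing (an assumption of this example, with `alpha` expected to be positive) so that a tool used by only one system does not produce an infinite divergence. The Wasserstein-1 function covers only the simple case of two equal-size one-dimensional samples, such as session lengths over the same N prompts. Function names and signatures are illustrative.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// KL(P || Q) in nats over tool-name histograms, with add-`alpha` smoothing.
/// `alpha` should be positive; with alpha = 0 a zero count can yield NaN or infinity.
fn kl_divergence(p: &BTreeMap<String, u64>, q: &BTreeMap<String, u64>, alpha: f64) -> f64 {
    // Smooth over the union of tool names seen by either system.
    let keys: BTreeSet<&String> = p.keys().chain(q.keys()).collect();
    let k = keys.len() as f64;
    let p_total = p.values().sum::<u64>() as f64 + alpha * k;
    let q_total = q.values().sum::<u64>() as f64 + alpha * k;
    keys.iter()
        .map(|key| {
            let pi = (*p.get(*key).unwrap_or(&0) as f64 + alpha) / p_total;
            let qi = (*q.get(*key).unwrap_or(&0) as f64 + alpha) / q_total;
            pi * (pi / qi).ln()
        })
        .sum()
}

/// Wasserstein-1 distance between two equal-size 1-D samples:
/// the mean absolute difference between the sorted values.
/// Returns None for empty or unequal-length inputs.
fn wasserstein_1(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    let total: f64 = a.iter().zip(&b).map(|(x, y)| (x - y).abs()).sum();
    Some(total / a.len() as f64)
}
```

KL is asymmetric, so a health check built on it would need to fix which system plays P; Wasserstein-1 is symmetric and is expressed in the metric's own units (for example, tool calls per prompt), which may make thresholds easier to interpret.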
(2) → (3): ship the CLI subprocess trace harness first (cheap, ~3-5 days, immediately usable on any prompt without API access), THEN layer on SWE-bench (1-2 weeks, ground-truth utility). That gets Axis 2 to ~60-70% without needing the upstream LlmDriver-public ticket OR an Anthropic API budget.
(1) stays the gold standard but is gated on those two upstream concerns. (4) and (5) are good to add LATER as cheap continuous-health checks once a real teacher source exists.
Idea (2) decomposes into:
| Sub-milestone | Deliverable | Estimate |
|---|---|---|
| M115.1 | New crate crates/ccpa-subproc/ with binary ccpa-trace-subproc <cmd> [args...] that runs cmd under strace -e trace=open*,write,unlink,exec* + inotifywait on $CWD; emits a .ccpa-trace.jsonl of OS-level actions (file_open, file_write, file_unlink, exec). | ~2 days |
| M115.2 | ccpa-trace-subproc claude-code -p "<prompt>" > teacher.jsonl and ccpa-trace-subproc apr code -p "<prompt>" > student.jsonl smoke-test on a tiny corpus (5 fixtures). | ~1 day |
| M115.3 | Extend ccpa-differ with a new OS-level differ mode that operates on the OS-event trace (vs the API-level trace today). New DriftCategory::OsLevelMismatch variants. | ~2 days |
| M115.4 | Falsifier FALSIFY-CCPA-014 (NEW gate): ccpa-trace-subproc-parity-on-curated-corpus. Asserts that for a curated os-fixtures/ corpus (initially 5 prompts), the OS-level action streams of Claude Code + apr code diverge by less than threshold T. T to be calibrated empirically (probably tool-name multiset Jaccard ≥ 0.6 initially; tighten as we learn). | ~2 days |
| M115.5 | Companion contract bump claude-code-parity-apr-v1 v1.23.0 → v1.24.0 adding FALSIFY-CCPA-014 to the gate registry. M22 paired-mirror push to aprender. | ~half-day |
Total: ~7-8 days for M115.1-M115.5; moves Axis 2 from ~30% to ~50% (CLI subprocess instrumentation working end-to-end on a small corpus). M115+ extends to SWE-bench differential evaluation per (3), bringing Axis 2 to ~60-70%.
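For the M115.4 threshold check, a minimal sketch of a multiset Jaccard gate follows. Multiset Jaccard is taken here as the sum of per-name minimum counts divided by the sum of per-name maximum counts, which is one common definition; whether FALSIFY-CCPA-014 uses exactly this form is an assumption, as are the function names. The 0.6 value in the usage note is the provisional threshold from the table.

```rust
use std::collections::BTreeMap;

/// Count occurrences of each action name in a stream.
fn multiset(actions: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for a in actions {
        *counts.entry(a.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Multiset Jaccard: sum of per-name minima over sum of per-name maxima.
/// Two empty streams count as identical (1.0).
fn multiset_jaccard(a: &BTreeMap<String, usize>, b: &BTreeMap<String, usize>) -> f64 {
    let mut min_sum = 0usize;
    let mut max_sum = 0usize;
    for (name, &ca) in a {
        let cb = b.get(name).copied().unwrap_or(0);
        min_sum += ca.min(cb);
        max_sum += ca.max(cb);
    }
    // Names that appear only in `b` contribute to the maximum alone.
    for (name, &cb) in b {
        if !a.contains_key(name) {
            max_sum += cb;
        }
    }
    if max_sum == 0 {
        1.0
    } else {
        min_sum as f64 / max_sum as f64
    }
}

/// Gate check: similarity between teacher and student streams meets the threshold.
fn passes_gate(teacher: &[&str], student: &[&str], threshold: f64) -> bool {
    multiset_jaccard(&multiset(teacher), &multiset(student)) >= threshold
}
```

For example, `passes_gate(&teacher_actions, &student_actions, 0.6)` would apply the provisional threshold. Because a multiset ignores ordering, two streams that perform the same actions in a different order score 1.0; catching reordering would need a sequence-aware comparison on top.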
- M113 is a planning amendment only. No code changes; no fixture changes; no contract bump. Future M114 implements idea (2).
- The existing 13 gates remain valid as meter validation against AUTHORED fixtures. M114's FALSIFY-CCPA-014 is a NEW gate at a NEW granularity (OS-level events, not API messages). The existing gates don't get downgraded; they stay the source of truth for the meter.
- Choosing (2) over (1) is a pragmatic ordering, not a permanent rejection of (1). Idea (1) (HTTPS proxy) is still the gold standard and should land when LlmDriver-public does.
- Top spec: claude-code-parity-apr-poc.md
- Honest 3-axis breakdown: completeness-assessment.md
- R11 risk row (raised at M111): risks.md
- Architecture (the rescoped Phase 1): architecture.md § "Original Phase 1 rationale — now historical"
- Academic basis for ideas (2)+(3)+(4): arXiv:1807.10453 (METTLE), 2207.11976 (differential testing), 2310.06770 (SWE-bench), 2603.23611 (LLMORPH); see academic-basis.md.