Risks & open questions

Top spec: claude-code-parity-apr-poc.md | Completeness assessment | Axis-2 closure plan | Design audit

Status of R1–R11: R1 and R2 are OBSOLETE post-M2.3 rescope; R9 was FULLY DISCHARGED at M109; R11 was raised at M111 to foreground the M2.3 gap. R3–R8, R10, and R11 remain live.

M113 amendment: the R11 closure path is now scoped in axis-2-closure-plan.md, a 5-idea brainstorm with a recommended (2)→(3) sequence (CLI subprocess instrumentation, then SWE-bench differential evaluation) to move Axis 2 from ~30% to ~70%. Idea (1), the HTTPS proxy, stays the gold standard but is upstream-blocked.

M118 amendment: R2's technical premise ("Claude Code may pin its own Anthropic auth, refuse ANTHROPIC_BASE_URL override") is independently DISCHARGED by deepclaude, an open-source intercepting proxy at localhost:3200 that routes Claude Code traffic to DeepSeek/OpenRouter/Fireworks via the env var. Idea (1)'s historical blockers, "(a) revisit M2.3 rescope; (b) LlmDriver-public upstream", reduce to just (b) on the technical-feasibility axis; the rescope was operational, not technical.

M192 amendment (operator-authored design-audit.md): introduces a meta-risk, namely that the static, mock-driven RecordedDriver replay infrastructure may not predict live apr code performance on real multi-step engineering tasks. Popperian falsifier: if static fixtures score ≥0.95 (FALSIFY-CCPA-008) AND live ProgramBench scores ~0 (FALSIFY-CCPA-017), the static-fixture approach is FALSIFIED as a convergence predictor. Three tactical shifts are proposed: soft-deprecate FALSIFY-CCPA-014 (OS-event parity); pivot to a live Arena runner; prioritize error recovery over zero-shot determinism.
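The M192 falsifier is a conjunction of two gate outcomes, so it can be written as a small predicate. The sketch below is illustrative only: the `Scores` type, the function name, and the `LIVE_NEAR_ZERO` cutoff are hypothetical, and the spec says only "~0" for the live score, so the 0.05 bound is an assumption rather than a documented threshold.

```python
from dataclasses import dataclass

# 0.95 is the static-fixture bar quoted for FALSIFY-CCPA-008.
STATIC_PASS = 0.95
# ASSUMPTION: the spec says live ProgramBench scores "~0" without a number;
# 0.05 is a placeholder cutoff chosen only for this sketch.
LIVE_NEAR_ZERO = 0.05


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the two gate scores, each in [0, 1]."""
    static_fixture: float     # static RecordedDriver fixture score (FALSIFY-CCPA-008)
    live_programbench: float  # live ProgramBench score (FALSIFY-CCPA-017)


def static_approach_falsified(s: Scores) -> bool:
    """True only when static fixtures pass AND live performance is near zero."""
    return s.static_fixture >= STATIC_PASS and s.live_programbench <= LIVE_NEAR_ZERO


if __name__ == "__main__":
    # High static score with no live success: static fixtures did not predict live behavior.
    print(static_approach_falsified(Scores(0.97, 0.0)))
    # High static score with real live success: no falsification.
    print(static_approach_falsified(Scores(0.97, 0.6)))
```

Both conditions must hold: a low static score with a low live score says nothing about predictive power, which is why the predicate is a strict AND rather than a comparison of the two numbers.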

Risks & open questions

| # | Risk / question | Mitigation | Falsifiable by |
|---|---|---|---|
| R1 | Recording the live Anthropic API costs $$ per fixture | OBSOLETE post-M2.3 rescope ("we will not call api, we will assume claude code"). Fixtures are now AUTHORED canonical references in fixtures/canonical/. n/a (risk no longer applies) | n/a |
| R2 | Claude Code may pin its own Anthropic auth, refuse ANTHROPIC_BASE_URL override | OBSOLETE post-M2.3 rescope: recording proxy is out of scope (OOS). M118 prior-art DISCHARGE: deepclaude is a working open-source ANTHROPIC_BASE_URL-intercepting proxy that routes Claude Code traffic to alternate backends (DeepSeek/OpenRouter/Fireworks), concrete proof that Claude Code does not pin Anthropic auth. Documented overridable env-vars: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_DEFAULT_{OPUS,SONNET,HAIKU}_MODEL, CLAUDE_CODE_SUBAGENT_MODEL. Known non-overridable: remote-control bridge (bridge.claudeusercontent.com), a hardcoded WebSocket. n/a (risk no longer applies; technical premise positively DISCHARGED) | n/a |
| R3 | Tool-call equivalence for Edit/Write is non-trivial | Per-tool equivalence rules in ccpa-differ, contracted in YAML | FALSIFY-CCPA-004 (directly) |
| R4 | Claude Code roadmap may add tools we don't have in apr code | New tools surface as OrchestrationDrift::UnknownToolName | FALSIFY-CCPA-004 (directly; gate FAILs until apr-code-parity-v1.yaml flips a row) |
| R5 | New repo conflicts with monorepo single-source-of-truth | Companion repo is canonical for enforcement; aprender stays canonical for contract text. pin.lock pins authoritative commit hash | FALSIFY-CCPA-012: pre-commit hook rejects stale pins |
| R6 | apr code's LlmDriver trait may not be public-stable enough for an external repo | FULLY DISCHARGED at M162 (2026-05-13): aprender#1638 MERGED (squash b61b76b4); cargo install apr-cli ships apr code in default build. Empirically discharged earlier at M150 via bilateral bench (agreement = 1.0000 on 5/5 MultiPL-E-Rust HumanEval) using locally-built apr. PMAT-CODE-LLM-DRIVER-PUBLIC-001 ticket (LlmDriver visibility) was a red herring; LlmDriver was already pub. PMAT-CODE-LLM-DRIVER-PUBLIC-001 (turned out to not gate the work); aprender#1638 MERGED 2026-05-13 | M3.1 functional equivalent achieved at M150; aprender#1638 formalized shipping at M162 |
| R7 | 100% line coverage may produce test-for-coverage's-sake noise on a tiny POC | Tradeoff accepted: POC is small (~5 crates), 100% is achievable. If a function genuinely cannot be covered, the function is unjustified, so delete it. | FALSIFY-CCPA-011 (directly) |
| R8 | pmat comply check --strict may reject patterns aprender itself uses | Companion repo is greenfield; we author to comply. If we hit a genuine pmat comply bug, the fix is upstream pmat, not a --allow flag | FALSIFY-CCPA-010 (directly) |
| R9 | M32d numerical-correctness blocker | FULLY DISCHARGED 2026-05-09 at M109: formal cosine ≥ 0.99 vs HF FP16 PASSED at cos_sim 0.995384 (lambda-vector RTX 4090; apr forward 555ms; apr_argmax = hf_argmax = 3555 " What"). M32d FUNCTIONALLY DISCHARGED 2026-05-02 via aprender PR #1228 squash 5235aaeb9 (Step 5 + 5b + 6 + 7 fix bundle: per-head Q/K RMSNorm + rope_theta default 1M + chat template no-think + traced sync); output transition %%%%%%%% → 2 + 2 = 4 + multi-domain coherent answers. M34 FAST PATH plan delivered at lucky-case bound (5 PRs / ~6 hours). M109 closed the remaining "formal cosine flip" gap by discovering the FP16 weights had been on disk at /mnt/nvme-raid0/models/Qwen3-Coder-30B-A3B-Instruct/ (57 GB) for ~7 days; the spec's "60 GB HF download" claim was stale. qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL → v1.5.0 ACTIVE_RUNTIME amendment is empirically valid; aprender-side PR follows from this discharge. M34 plan executed; M35 audit-trail recorded the discharge; M108 filed aprender#1584; M109 LIVE-DISCHARGED aprender#1584 on 2026-05-09 (issue CLOSED 2026-05-09T21:19:41Z once aprender PR #1597 squash 3fb04ef86 landed flipping qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL → v1.5.0 ACTIVE_RUNTIME). | FALSIFY-QW3-MOE-PARITY-001 (HF FP16 cosine ≥ 0.99) DISCHARGED at M109 (cos 0.9954); FALSIFY-QW3-MOE-PARITY-002 (llama.cpp argmax sanity) deferred: transitive sibling, no longer load-bearing because PARITY-001 directly proved apr_argmax = hf_argmax |