Top spec: claude-code-parity-apr-poc.md
Three sub-extensions are formally in scope per the M31 monorepo clarification. The CCPA POC's harness machinery (Axis 1) is action-stream parity over fixtures; these extensions extend scope to the underlying LLM's numerical correctness (Axis 3 production validation). See completeness-assessment.md for the honest 3-axis breakdown.
The original M0–M6 POC scope above was limited to action-stream parity over a static teacher fixture: the LLM output itself was treated as opaque. M32 (added 2026-04-29) extended scope to include numerical correctness of the underlying inference engine because the AUTHORED-fixture rescope (M2.3) requires apr code to drive a real local LLM rather than a recorded API trace, and a broken local LLM produces gibberish that no parity-score can repair.
Three sub-extensions, prioritized as of M49 (2026-05-04). Each has its own arXiv basis.
Why this is now P0 (M49 priority elevation, 2026-05-04): M32d's CPU-only LAZY-FUSED-MATVEC produces correct output but at ~30 tok/s, roughly 7–15× slower than the dense GPU path (Qwen2.5-Coder-7B Q4_K_M at ~225–440 tok/s on RTX 4090 cuBLAS). Every dogfood prompt currently takes 30–60 seconds; make smoke-m32d takes ~3 minutes; full corpus parity runs against this model are infeasible at production scale. The CPU-only constraint is the only thing keeping this POC from drop-in production use of the recommended Qwen3-Coder-30B-A3B-Instruct-Q4_K_M default model. With CPU-only MoE, the formal cosine ≥0.99 measurement (sub-extension 1) is throughput-bounded; with GPU MoE, both correctness and production throughput hold.
Status (2026-05-04, post-M52) — historical, superseded by M85-M87; preserved for archaeology: IN PROGRESS, integration code MERGED. The CPU dispatch at OwnedQuantizedModel::forward_qwen3_moe (LAZY-FUSED-MATVEC over per-expert Q4_K/Q6_K row slices in moe_ffn_forward_layer) is now joined by OwnedQuantizedModelCuda::forward_qwen3_moe_cuda (M51, aprender PR #1477 squash dc6f94d3b MERGED on aprender main 2026-05-04) which mirrors the CPU sibling line-for-line and routes per-expert matmuls through CudaExecutor::q4k_matvec/q6k_gemv via the moe_ffn_forward_layer_cuda → expert_swiglu_cuda helpers (M51, aprender PR #1469). The cosine-vs-CPU parity test scaffold is authored at crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs (M52, aprender PR #1484 OPEN, M-GPU-MOE-1.2). The wgpu sibling type OwnedQuantizedModelWgpu is stubbed at crates/aprender-serve/src/gguf/wgpu_backend/mod.rs per the v1.2.0 option I amendment (M52, aprender PR #1485 + #1487 OPEN cascade).
Status (2026-05-06, post-M87) — M-GPU-MOE-1.x CASCADE CLOSED at the algorithm-level surface: the M50-M87 cascade closed all of M-GPU-MOE-1.x on aprender main. Highlights: (a) qwen3-moe-forward-gpu-v1 flipped DRAFT → ACTIVE_ALGORITHM_LEVEL at v1.7.0 (M86, PR #1530 squash 65bc42577); (b) the L6 moe_ffn_out GPU NaN root-caused to mixed Q4_K_M quantization + qtype-unaware GPU dispatch and FIXED at M85 (PR #1529 squash 89cb26af7) via qtype-aware expert_swiglu_cuda mirroring CPU matvec_for_qtype; LIVE-verified ZERO NaN across all 48 layers on gx10 Blackwell GB10 (sm_120) AND lambda-vector RTX 4090 (sm_89); (c) FALSIFY-MOE-SUB-001/002/003/004 ALL DISCHARGED in trace-moe-gpu-sub-stages-v1 v1.6.0 (M87, PR #1531 squash 0bfcbc1ad); (d) FALSIFY-QW3-MOE-GPU-PARITY-001 cosine ≥ 0.99 vs CPU is at ALGORITHM_LEVEL_DISCHARGED — ~85% of layers cos > 0.99; remaining ~7-8 sub-threshold layers (cos 0.94-0.987) are fp-accumulator-order drift between Rust SIMD/rayon (CPU fused_q6k_parallel_matvec) and CUDA warp-shuffle (q6k_gemv), separate M-GPU-MOE-3 territory. Still PENDING: M-GPU-MOE-2.1/2.2/2.3 (wgpu helpers + integration + parity test, blocked on trueno-gpu wgpu surface authoring) and M-GPU-MOE-3 (throughput ≥ 150 tok/s + VRAM ≤ 95% + the kernel-level fp-order alignment that lifts the ~7-8 layers above 0.99). See milestone-table rows M50-M87 for the full audit trail.
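The root cause in (b) can be pictured with a small sketch. Under mixed Q4_K_M quantization the expert tensors are a mix of Q4_K and Q6_K, so a dispatcher that always calls the Q4_K kernel walks Q6_K data with the wrong block layout. The code below is illustrative only: `QType`, `block_bytes`, `row_bytes`, and `kernel_for` are hypothetical names, not the aprender API. Only the kernel names `q4k_matvec` and `q6k_gemv` come from this spec, and the block sizes assume the standard GGML K-quant layouts (144 and 210 bytes per 256-element super-block).

```rust
/// Quantization type of one expert weight tensor.
/// Only the two types named in this spec are modeled (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QType {
    Q4K,
    Q6K,
}

/// Elements per super-block in the GGML K-quant formats.
pub const QK_K: usize = 256;

impl QType {
    /// Bytes per 256-element super-block (assumed GGML layouts).
    pub fn block_bytes(self) -> usize {
        match self {
            QType::Q4K => 144,
            QType::Q6K => 210,
        }
    }

    /// Bytes occupied by one weight row of `cols` elements.
    pub fn row_bytes(self, cols: usize) -> usize {
        assert!(cols % QK_K == 0, "K-quant rows must be a multiple of 256 elements");
        (cols / QK_K) * self.block_bytes()
    }
}

/// Stand-in for the kernel choice: returns the kernel name instead of
/// launching anything. A qtype-unaware dispatcher would ignore `qtype`
/// and always return "q4k_matvec".
pub fn kernel_for(qtype: QType) -> &'static str {
    match qtype {
        QType::Q4K => "q4k_matvec",
        QType::Q6K => "q6k_gemv",
    }
}
```

For an illustrative 2048-column row, `QType::Q4K.row_bytes(2048)` is 1152 bytes and `QType::Q6K.row_bytes(2048)` is 1680 bytes, so a Q4_K-only kernel applied to a Q6_K tensor would step through rows at the wrong stride and decode the wrong bits.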
Required deliverables (status updated 2026-05-06 post-M87):
- aprender-contracts/contracts/qwen3-moe-forward-gpu-v1.yaml: kernel contract (analogous to qwen3-moe-forward-v1 but GPU-targeted; FALSIFY-QW3-MOE-GPU-001..00N gate set). SHIPPED v1.0.0 DRAFT (M50, aprender PR #1453) → MERGED through v1.7.0 ACTIVE_ALGORITHM_LEVEL (M86, PR #1530 squash 65bc42577). Versioned through the cascade: v1.0 scaffold → v1.1 option D (M51 #1462) → v1.2 option I (M54 squash 5a27bb892) → v1.3 (M55 #1490) → v1.4 (M57 #1492) → v1.5 (M84 #1528) → v1.6 (M85 #1529 fix) → v1.7 (M86 #1530 status flip). M-GPU-MOE-1 umbrella stage flipped PENDING → SHIPPED at M86.
- CUDA kernel forward_qwen3_moe_cuda on OwnedQuantizedModelCuda in aprender-serve: sparse expert dispatch + per-expert SwiGLU via Q4K/Q6K matmuls, mirroring DeepSpeed-MoE expert-parallel scheduling (arXiv:2201.05596). FULL FORWARD INTEGRATION SHIPPED 2026-05-04 (M51, aprender PR #1477 squash dc6f94d3b); L6 moe_ffn_out NaN FIXED 2026-05-06 (M85, PR #1529 squash 89cb26af7) via qtype-aware dispatch in expert_swiglu_cuda; LIVE ZERO NaN on gx10 + RTX 4090.
- wgpu fallback for non-CUDA hardware (per CLAUDE.md backend-agnostic mandate). STUB MERGED on aprender main (M54, PR #1485 squash 5a27bb892, a 3-commit bundle including PR #1487 a5827f60c and PR #1488 10cc7ad41); helpers + full integration + parity test still PENDING M-GPU-MOE-2.1/2.2/2.3 (blocked on trueno-gpu wgpu surface authoring).
- Cosine-equivalence gate vs the CPU LAZY-FUSED-MATVEC path (FALSIFY-QW3-MOE-GPU-PARITY-001; same ≥0.99 threshold as the HF FP16 gate from sub-extension 1). TEST SCAFFOLD MERGED 2026-05-04 (M52, PR #1484 squash 8cbb7b51e); heavy harness LIVE-RUN on gx10 Blackwell GB10, DISCHARGED at ALGORITHM_LEVEL (M85): ~85% of layers cos > 0.99; ~7-8 layers (L7, L9, L12, L20, L23, L29, L46) sit at cos 0.94-0.987 due to fp-accumulator-order drift, separate M-GPU-MOE-3 territory (not a step-c bug).
- Throughput target: ≥150 tok/s for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M on RTX 4090 (≥5× CPU baseline of ~30 tok/s; allows headroom below the dense Q4_K target of ~440 tok/s since MoE has expert-dispatch overhead). STILL PENDING M-GPU-MOE-3: fused dequant+matmul + sparse expert batching + the kernel-level fp-order alignment to lift the ~7-8 sub-threshold layers above 0.99.
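The fp-accumulator-order drift behind the sub-threshold layers comes from floating-point addition not being associative: two reductions over the same values can round differently depending on grouping. The sketch below shows the effect in isolation with a left-to-right sum and a pairwise (tree) sum. Both function names are hypothetical, and neither reproduces the actual reduction shape of fused_q6k_parallel_matvec or q6k_gemv; it is only a minimal demonstration of the mechanism.

```rust
/// Left-to-right accumulation, the order a plain scalar loop would use.
pub fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, &x| acc + x)
}

/// Pairwise (tree) accumulation, the general shape of a parallel reduction.
pub fn sum_pairwise(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            // Split in half, reduce each half, then combine the partial sums.
            let (left, right) = xs.split_at(n / 2);
            sum_pairwise(left) + sum_pairwise(right)
        }
    }
}
```

With the input `[1.0e8, 1.0, -1.0e8, 1.0]`, `sum_sequential` returns 1.0 while `sum_pairwise` returns 0.0, because each 1.0 is absorbed by rounding when it is added directly to a value of magnitude 1e8 in f32. Across thousands of accumulations per matvec, differences of this kind are enough to move a per-layer cosine without any logic bug in either path.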
Rough estimate: Per CLAUDE.md performance targets and the DeepSpeed-MoE precedent, this is 2–4 weeks of aprender-side engineering for the CUDA path; wgpu can defer.
Why this MUST be P0: Without it, the discharge that the POC just achieved at M32d is not consumable by the very orchestration layer (apr code) the POC exists to validate. Anyone trying to run apr code against the spec-prescribed Qwen3-Coder-30B-A3B will hit the 30 tok/s wall. The action-stream parity machinery (CCPA-001..013, all DISCHARGED) is correct but cannot be exercised at production cadence without GPU MoE.
Academic basis: arXiv:2307.08691 (Dao, FlashAttention-2: fused-kernel parity discipline), arXiv:2201.05596 (Rajbhandari et al., DeepSpeed-MoE: sparse-MoE GPU dispatch / expert-parallel scheduling), arXiv:2101.03961 (Fedus et al., Switch Transformers: modern MoE forward conventions).
qwen3-moe-forward-v1 contract, M32a–M32c.2.2.2.1.4 SHIPPED, M32d FUNCTIONALLY DISCHARGED 2026-05-02. Asserts that apr run produces coherent output on the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M GGUF. Output transition (%%%%%%%% → 2 + 2 = 4 + multi-domain coherent answers) verified live; the formal cosine ≥ 0.99 vs HF FP16 reference flips qwen3-moe-forward-v1 v1.4.0 ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME.
Status (2026-05-09, post-M109): FULLY DISCHARGED. The cosine measurement passed at cos_sim = 0.995384 on lambda-vector RTX 4090, with apr_argmax = hf_argmax = 3555 (" What") for the canonical "What is 2+2?" prompt. APR forward took 555ms (single-shot, 7-token prefill); HF FP16 fixture generated in 52s. The previously cited "60 GB HF FP16 download" blocker turned out to be stale: the FP16 weights had been on lambda-vector at /mnt/nvme-raid0/models/Qwen3-Coder-30B-A3B-Instruct/ (57 GB across 16 safetensors shards) for at least 7 days. The aprender-side qwen3-moe-forward-v1 v1.4.0 → v1.5.0 ACTIVE_RUNTIME amendment is now empirically valid; aprender PR pending.
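The measurement reported here compares two logit vectors for the same prompt: one from apr and one from the HF FP16 reference. A minimal sketch of that comparison follows. The function names are hypothetical and this is not the aprender harness; treating argmax agreement as part of the pass condition is also an assumption, since the spec states the gate as cosine ≥ 0.99 and reports the matching argmax alongside it.

```rust
/// Cosine similarity between two equally sized logit vectors, accumulated
/// in f64. Returns None for mismatched, empty, or all-zero inputs.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Index of the largest logit (first index wins on ties).
/// Any NaN logit yields None so that it fails the gate outright.
pub fn argmax(xs: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &x) in xs.iter().enumerate() {
        if x.is_nan() {
            return None;
        }
        if best.map_or(true, |(_, b)| x > b) {
            best = Some((i, x));
        }
    }
    best.map(|(i, _)| i)
}

/// Pass when the cosine meets the threshold and both argmax ids agree.
pub fn passes_gate(apr: &[f32], hf: &[f32], threshold: f64) -> bool {
    match (cosine_similarity(apr, hf), argmax(apr), argmax(hf)) {
        (Some(cos), Some(i), Some(j)) => cos >= threshold && i == j,
        _ => false,
    }
}
```

A call such as `passes_gate(&apr_logits, &hf_logits, 0.99)` would then stand in for the check, where `apr_logits` and `hf_logits` are the last-position logits over the full vocabulary.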
Academic basis: arXiv:1701.06538 (Shazeer et al., MoE), arXiv:2101.03961 (Fedus et al., Switch Transformers), arXiv:2210.17323 (Frantar et al., GPTQ — quantization-aware reference comparison), arXiv:1910.07467 (Zhang & Sennrich, RMSNorm — empirically load-bearing in M32d Step 5), arXiv:2104.09864 (Su et al., RoPE — load-bearing in M32d Step 5b).
M32d Step 4, bypassed because the rank-3+rank-4 fix discharged the symptom; remains useful for future MoE numerical-correctness investigations (and will be load-bearing for sub-extension 2 GPU-vs-CPU bisection). Extends apr trace --json --payload to emit per-token router output (softmax over experts), top-k expert ids, and per-expert L2 contribution norms.
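The three quantities named above can be sketched for a single token as follows. This is an assumption-laden illustration, not the apr trace implementation: `RouterTrace`, `trace_router`, and the `expert_out` callback are hypothetical, the payload's real field names are not specified here, and whether the norm is taken before or after gate weighting is left to whatever vector the callback returns.

```rust
/// One token's router record (hypothetical field names).
#[derive(Debug, Clone, PartialEq)]
pub struct RouterTrace {
    pub probs: Vec<f32>,       // softmax over all experts
    pub top_k_ids: Vec<usize>, // selected expert ids, highest probability first
    pub l2_norms: Vec<f32>,    // L2 norm of each selected expert's output
}

/// Numerically stable softmax: subtract the max before exponentiating.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Indices of the k largest probabilities, highest first.
/// The sort is stable, so ties keep the lower expert id first.
pub fn top_k(probs: &[f32], k: usize) -> Vec<usize> {
    let mut ids: Vec<usize> = (0..probs.len()).collect();
    ids.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    ids.truncate(k);
    ids
}

/// Euclidean norm of a vector.
pub fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Build the record for one token. `expert_out(id)` is a hypothetical
/// callback returning expert `id`'s contribution to the layer output.
pub fn trace_router<F>(router_logits: &[f32], k: usize, expert_out: F) -> RouterTrace
where
    F: Fn(usize) -> Vec<f32>,
{
    let probs = softmax(router_logits);
    let top_k_ids = top_k(&probs, k);
    let l2_norms = top_k_ids.iter().map(|&id| l2_norm(&expert_out(id))).collect();
    RouterTrace { probs, top_k_ids, l2_norms }
}
```

Comparing such records between the CPU and GPU paths would show whether a divergence starts at the router (different `top_k_ids`) or inside an expert (same ids, different `l2_norms`), which is the bisection use this sub-extension anticipates.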
Academic basis: arXiv:1701.06538 (Shazeer et al., original gated MoE router), arXiv:2202.08906 (Zoph et al., ST-MoE: router-stability instrumentation precedent).