Top spec: claude-code-parity-apr-poc.md | Falsification conditions | Scope extensions
Each paper below is mapped to the gates (CCPA-001..013) and sub-extensions (numerical-parity work) that it grounds. Hinton 2015 (1503.02531) provides the distillation framing; SWE-bench (2310.06770) provides the corpus methodology; METTLE (1807.10453) and LLMORPH (2603.23611) provide the metamorphic-relations framework for tool-call equivalence.
| arXiv | Title (short) | Applies to gate(s) | Why |
|---|---|---|---|
| 1503.02531 | Hinton et al., Distilling the Knowledge in a Neural Network | CCPA-008 | The parity_score is a distillation loss on the action stream. The distillation framing sets the convergence criterion. |
| 1807.10453 | Xie et al., METTLE: Metamorphic Testing of ML Systems | CCPA-004, CCPA-005 | Tool-call equivalence and post-state equivalence are metamorphic relations under the operational semantics of each tool, which is the canonical way to test ML systems without ground truth. |
| 2207.11976 | Differential Testing of DL Frameworks | CCPA-002, CCPA-004 | Teacher↔student parity is the textbook differential-testing setup; replay determinism is the precondition. |
| 2310.06770 | Jimenez et al., SWE-bench | CCPA-007 | Justifies fixture-corpus methodology: ≥1 task per capability row, recorded once, replayed forever, no live network. |
| 2505.03096 | Chaos Engineering for LLM Systems | CCPA-006 | Sovereignty enforcement under fault injection — "what if the env leaks an Anthropic key on a replay run" is exactly a chaos-test class. |
| 2603.23611 | LLMORPH — Cataloged Metamorphic Relations for NLP | CCPA-004 | 191 catalogued metamorphic relations directly populate per-tool equality rules in tool_equivalence_rules. |
| 2102.05351 (referenced in apr-cli-qa-spec) | Coverage-completeness invariants | CCPA-010, CCPA-011 | Background for 100 %-coverage / 100 %-comply invariants. |
| 2605.03546 | Yang et al., ProgramBench: Can Language Models Rebuild Programs From Scratch? | CCPA-016 (M152 outcome-parity gate); future CCPA-017 project-scale outcome parity | Validates the M154 test-survival methodology at full-project scale: 200 real-world programs (e.g. FFmpeg, SQLite, the PHP interpreter) where LMs must reconstruct codebases from an executable plus docs, scored via agent-generated behavioral-equivalence tests. Headline empirical finding: 0% of tasks fully resolved; the best model reaches a 95% test-pass rate on only 3% of tasks. This supports the "what this does NOT prove" caveats in Phase 3 outcome-parity-results.md (project-scale architecture reconstruction is out of scope for the function-level POC of M150-M154). The authoring methodology, coverage-guided fuzzing for behavioral test generation, is the natural M158+ extension of our phase-3-test-survival.sh swap pattern. |
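
The CCPA-004 rows above treat tool-call equivalence as a metamorphic relation: two calls that differ only in ways the tool's semantics ignore should compare equal. The following is a minimal Python sketch of that idea, not the project's implementation. The rule table `EXAMPLE_RULES`, the two normalizers, and the `{"tool": ..., "args": ...}` call shape are all hypothetical and chosen only for illustration; the real tool_equivalence_rules may look quite different.

```python
import posixpath
import shlex


def normalize_read(args: dict) -> tuple:
    # Hypothetical rule: two reads are equivalent when they target
    # the same path after lexical normalization.
    return ("read", posixpath.normpath(args["path"]))


def normalize_shell(args: dict) -> tuple:
    # Hypothetical rule: the amount of whitespace between shell
    # tokens is not significant.
    return ("shell", tuple(shlex.split(args["command"])))


# Illustrative stand-in for a per-tool rule table (assumed structure).
EXAMPLE_RULES = {"read": normalize_read, "shell": normalize_shell}


def tool_calls_equivalent(teacher: dict, student: dict) -> bool:
    """Return True when both calls normalize to the same canonical form."""
    if teacher["tool"] != student["tool"]:
        return False
    rule = EXAMPLE_RULES.get(teacher["tool"])
    if rule is None:
        # No rule registered for this tool: fall back to strict equality.
        return teacher["args"] == student["args"]
    return rule(teacher["args"]) == rule(student["args"])
```

A caller would pass one recorded teacher call and one replayed student call. Falling back to strict equality for unregistered tools keeps the check conservative: an unknown tool can only fail the relation, never pass it by accident.
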

| arXiv | Title (short) | Applies to | Why |
|---|---|---|---|
| 1701.06538 | Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | qwen3-moe-forward-v1 (Sub-extension 1+3) | Original gated-router top-k MoE formulation. Defines the softmax-then-top-k semantics the moe-router-v1 and moe-expert-dispatch-v1 contracts compose, and the per-expert SwiGLU dispatch the M32c.2.2.* chain implements. |
| 2101.03961 | Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | qwen3-moe-forward-v1 (Sub-extension 1+2) | Modern MoE forward-path conventions (per-token routing, per-expert FFN, weighted aggregation) — the algorithmic shape that forward_qwen3_moe mirrors for Qwen3-Coder-30B-A3B. |
| 2202.08906 | Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models | qwen3-moe-forward-v1 (Sub-extension 3) | Router-stability instrumentation precedent: per-token router-output logging, top-k expert ids, and per-expert contribution norms. This is the diagnostic surface M32d Step 4 was scoped to add. |
| 1910.07467 | Zhang & Sennrich, Root Mean Square Layer Normalization | qwen3-moe-forward-v1 (Sub-extension 1, M32d Step 5) | Per-head Q/K RMSNorm was empirically load-bearing in M32d (rank-3 prior, 15 % of FAST PATH cost). The contract's qkv_q_norm + qkv_k_norm equations follow this paper's RMSNorm definition. |
| 2104.09864 | Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding | qwen3-moe-forward-v1 (Sub-extension 1, M32d Step 5b) | Defines Rotary Position Embedding (RoPE) and its base frequency θ (10,000 in the paper). M32d Step 5b's change of the rope_theta default from 10K to 1M for qwen3_moe/qwen3 arches adjusts this base for long-context positional encoding (rank-4 prior, 10 % of FAST PATH cost). |
| 2210.17323 | Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | qwen3-moe-forward-v1 (Sub-extension 1, formal cosine gate) | Quantization-aware reference-comparison framework: cosine ≥ 0.99 vs FP16 ground truth at the LM-head logits is the cross-implementation parity criterion the formal flip of v1.4.0 ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME relied on. DISCHARGED 2026-05-09 at M109: cos_sim 0.995384 ≥ 0.99 measured on lambda-vector RTX 4090; aprender contract flipped to v1.5.0 ACTIVE_RUNTIME via aprender PR #1597 squash 3fb04ef86. |
| 2307.08691 | Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | qwen3-moe-forward-v1 (Sub-extension 2) | Fused-kernel parity discipline: the GPU MoE kernel for forward_qwen3_moe_gpu should preserve numerical equivalence with the CPU LAZY-FUSED-MATVEC path, validated via the same cosine ≥ 0.99 gate (Sub-extension 1). |
| 2201.05596 | Rajbhandari et al., DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | qwen3-moe-forward-v1 (Sub-extension 2) | Sparse-MoE GPU dispatch precedent. Production-grade forward_qwen3_moe_gpu will need expert-parallel scheduling analogous to DeepSpeed-MoE's expert-slot routing. |
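
Two ideas recur in the table above: softmax-then-top-k routing and the cosine ≥ 0.99 parity gate on LM-head logits. The sketch below illustrates both in plain Python under stated assumptions. The function names, the `renormalize` flag, and the list-of-floats representation are hypothetical and do not reflect the moe-router-v1 contract or the aprender code; only the 0.99 threshold comes from the table.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_then_top_k(router_logits, k, renormalize=False):
    """Softmax over all experts first, then keep the k most probable."""
    probs = softmax(router_logits)
    ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in ids]
    if renormalize:
        # Assumption: whether the selected weights are rescaled to sum
        # to 1 depends on the model; this sketch leaves it optional.
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return ids, weights


def cosine_similarity(a, b):
    # Undefined for all-zero vectors; this sketch does not guard that case.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_cosine_gate(reference_logits, candidate_logits, threshold=0.99):
    """True when the candidate logits are within the cosine threshold."""
    return cosine_similarity(reference_logits, candidate_logits) >= threshold
```

In this framing, `reference_logits` would come from the FP16 ground-truth run and `candidate_logits` from the implementation under test. Cosine similarity ignores overall scale, so a gate like this checks direction agreement of the logit vectors rather than exact magnitudes.
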