Academic basis (arXiv → gate mapping)

Top spec: claude-code-parity-apr-poc.md | Falsification conditions | Scope extensions

Each paper below is mapped to the gates (CCPA-001..013) and the sub-extensions (numerical-parity work) that it grounds. Hinton et al. 2015 (1503.02531) provides the distillation framing; SWE-bench (2310.06770) provides the corpus methodology; METTLE (1807.10453) and LLMORPH (2603.23611) provide the metamorphic-relations framework for tool-call equivalence.

Academic basis (arXiv → gate mapping)

| arXiv | Title (short) | Applies to gate(s) | Why |
|---|---|---|---|
| 1503.02531 | Hinton et al., Distilling the Knowledge in a Neural Network | CCPA-008 | The parity_score is a distillation loss on the action stream. This framing governs the convergence criterion. |
| 1807.10453 | Xie et al., METTLE: Metamorphic Testing of ML Systems | CCPA-004, CCPA-005 | Tool-call equivalence and post-state equivalence are metamorphic relations under the operational semantics of each tool, the canonical way to test ML systems without ground truth. |
| 2207.11976 | Differential Testing of DL Frameworks | CCPA-002, CCPA-004 | Teacher↔student parity is the textbook differential-testing setup; replay determinism is the precondition. |
| 2310.06770 | Jimenez et al., SWE-bench | CCPA-007 | Justifies the fixture-corpus methodology: ≥1 task per capability row, recorded once, replayed forever, no live network. |
| 2505.03096 | Chaos Engineering for LLM Systems | CCPA-006 | Sovereignty enforcement under fault injection: "what if the env leaks an Anthropic key on a replay run" is exactly a chaos-test class. |
| 2603.23611 | LLMORPH: Cataloged Metamorphic Relations for NLP | CCPA-004 | 191 cataloged metamorphic relations directly populate the per-tool equality rules in tool_equivalence_rules. |
| 2102.05351 | (referenced in apr-cli-qa-spec) Coverage-completeness invariants | CCPA-010, CCPA-011 | Background for the 100 %-coverage / 100 %-comply invariants. |
| 2605.03546 | Yang et al., ProgramBench: Can Language Models Rebuild Programs From Scratch? | CCPA-016 (M152 outcome-parity gate); future CCPA-017 project-scale outcome parity | Validates the M154 test-survival methodology at full-project scale: 200 real-world programs (FFmpeg / SQLite / PHP interpreter) where LMs must reconstruct codebases from executable + docs, scored via agent-generated behavioral-equivalence tests. Headline empirical finding: 0% of tasks fully resolved; the best model reaches a 95% test-pass rate on only 3% of tasks. This supports the "what this does NOT prove" caveats in the Phase 3 outcome-parity-results.md (project-scale architecture reconstruction is out of scope for the function-level POC of M150-M154). The authoring methodology, coverage-guided fuzzing for behavioral test generation, is the natural M158+ extension of our phase-3-test-survival.sh swap pattern. |
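The tool-call equivalence idea behind CCPA-004 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `ToolCall` type, the `read_file` and `bash` tool names, the path-normalization rule, and the match-fraction score are hypothetical and are not taken from tool_equivalence_rules or from the project's actual parity_score (which the table describes as a distillation loss).

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one tool invocation in an action stream."""
    tool: str
    args: dict


# Hypothetical rule: for these tools the "path" argument is compared
# after lexical normalization, so "./a/../a/x" and "a/x" are equal.
PATH_TOOLS = {"read_file", "write_file"}


def normalize(call: ToolCall):
    """Map a call to a canonical form so equivalent calls compare equal."""
    args = dict(call.args)
    if call.tool in PATH_TOOLS and "path" in args:
        args["path"] = posixpath.normpath(args["path"])
    # Sorting by key makes the comparison independent of argument order.
    return call.tool, tuple(sorted(args.items()))


def equivalent(teacher: ToolCall, student: ToolCall) -> bool:
    """Metamorphic relation: calls are equal up to the normalization above."""
    return normalize(teacher) == normalize(student)


def match_fraction(teacher_stream, student_stream) -> float:
    """Fraction of aligned positions with equivalent calls.

    A simple stand-in score; a length mismatch counts as misses.
    """
    longest = max(len(teacher_stream), len(student_stream))
    if longest == 0:
        return 1.0
    hits = sum(equivalent(t, s) for t, s in zip(teacher_stream, student_stream))
    return hits / longest


teacher = [
    ToolCall("read_file", {"path": "./src/../src/main.rs"}),
    ToolCall("bash", {"cmd": "ls"}),
]
student = [
    ToolCall("read_file", {"path": "src/main.rs"}),
    ToolCall("bash", {"cmd": "ls -la"}),
]
print(match_fraction(teacher, student))
```

The first pair matches because the two paths normalize to the same string; the second pair differs in its command text, so under this deliberately strict rule it counts as a miss. A real rule set would need one such relation per tool.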

Scope-extension citations (M32 numerical-parity work)

| arXiv | Title (short) | Applies to | Why |
|---|---|---|---|
| 1701.06538 | Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer | qwen3-moe-forward-v1 (Sub-extension 1+3) | Original gated-router top-k MoE formulation. Defines the softmax-then-top-k semantics the moe-router-v1 and moe-expert-dispatch-v1 contracts compose, and the per-expert SwiGLU dispatch the M32c.2.2.* chain implements. |
| 2101.03961 | Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | qwen3-moe-forward-v1 (Sub-extension 1+2) | Modern MoE forward-path conventions (per-token routing, per-expert FFN, weighted aggregation): the algorithmic shape that forward_qwen3_moe mirrors for Qwen3-Coder-30B-A3B. |
| 2202.08906 | Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models | qwen3-moe-forward-v1 (Sub-extension 3) | Router-stability instrumentation precedent: per-token router-output logging, top-k expert ids, and per-expert contribution norms. This is the diagnostic surface M32d Step 4 was scoped to add. |
| 1910.07467 | Zhang & Sennrich, Root Mean Square Layer Normalization | qwen3-moe-forward-v1 (Sub-extension 1, M32d Step 5) | Per-head Q/K RMSNorm was empirically load-bearing in M32d (rank-3 prior, 15 % of FAST PATH cost). The contract's qkv_q_norm + qkv_k_norm equations follow this paper's RMSNorm definition. |
| 2104.09864 | Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding | qwen3-moe-forward-v1 (Sub-extension 1, M32d Step 5b) | Rotary Position Embedding (RoPE). M32d Step 5b's rope_theta 10K → 1M default for qwen3_moe/qwen3 arches adjusts the base θ of this paper's rotary frequencies for long-context positional encoding (rank-4 prior, 10 % of FAST PATH cost). |
| 2210.17323 | Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | qwen3-moe-forward-v1 (Sub-extension 1, formal cosine gate) | Quantization-aware reference-comparison framework: cosine ≥ 0.99 vs FP16 ground truth at the LM-head logits is the cross-implementation parity criterion that the formal flip of v1.4.0 ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME relied on. DISCHARGED 2026-05-09 at M109: cos_sim 0.995384 ≥ 0.99 measured on lambda-vector RTX 4090; aprender contract flipped to v1.5.0 ACTIVE_RUNTIME via aprender PR #1597 squash 3fb04ef86. |
| 2307.08691 | Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | qwen3-moe-forward-v1 (Sub-extension 2) | Fused-kernel parity discipline: the GPU MoE kernel for forward_qwen3_moe_gpu should preserve numerical equivalence with the CPU LAZY-FUSED-MATVEC path, validated via the same cosine ≥ 0.99 gate (Sub-extension 1). |
| 2201.05596 | Rajbhandari et al., DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | qwen3-moe-forward-v1 (Sub-extension 2) | Sparse-MoE GPU dispatch precedent. A production-grade forward_qwen3_moe_gpu will need expert-parallel scheduling analogous to DeepSpeed-MoE's expert-slot routing. |
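Three of the numerical ideas cited above are small enough to sketch directly: the cosine ≥ 0.99 parity gate, softmax-then-top-k routing, and RMSNorm. This is a minimal pure-Python illustration, not the code of the moe-router-v1 contract or of forward_qwen3_moe. All function names are hypothetical, the vectors are toy data, and leaving the selected router weights un-renormalized is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_cosine_gate(candidate, reference, threshold=0.99):
    """Parity check: candidate logits vs. reference logits."""
    return cosine_similarity(candidate, reference) >= threshold


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Softmax over all experts, then keep the k most probable.

    Returns (expert_index, weight) pairs, highest weight first.
    Assumption: the kept weights are not renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square of x, then scale by gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


reference_logits = [1.0, 2.0, 3.0]
candidate_logits = [1.0, 2.0, 3.001]
print(passes_cosine_gate(candidate_logits, reference_logits))
print(route_top_k([2.0, 0.0, 1.0, -1.0], k=2))
```

Whether the kept weights are renormalized after top-k selection changes the weighted aggregation, so a real parity comparison would have to match the reference implementation on that point.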