
Commit 086facd

noahgift and claude authored
docs(M79): record forward_qwen3_moe_cuda_traced — M-MOE-SUB-2 cascade COMPLETE (#65)
aprender PR #1523 MERGED on aprender main 2026-05-06 as squash 690a835c4. Companion-only spec record.

**M-MOE-SUB-2 step (b) — GPU traced sibling, completing the M-MOE-SUB-2 cascade end-to-end.** Mirrors the CPU traced sibling forward_qwen3_moe_traced_with_plan (M74) but routes per-layer MoE FFN through the GPU dispatch (moe_ffn_forward_layer_cuda_with_router from M78) so `apr trace --gpu --json --payload --save-tensor` can run the same SaveTensorPlan against both CPU and GPU forward paths, capture per-stage activations at MoeRouter + MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf source.

Production forward_qwen3_moe_cuda hot path byte-unchanged (additive-purity invariant pinned in v1.1.0).

M-MOE-SUB-2 cascade now 5/5 COMPLETE:
- CPU helper (M68) ✓
- CPU traced (M74) ✓
- CLI dispatch (M77) ✓
- GPU helper (M78) ✓
- GPU traced (M79) ✓

Operationally unblocks M-MOE-SUB-3 live bisection on lambda-vector RTX 4090: an operator can now run

    apr trace --gpu --save-tensor moe_router,moe_ffn_out \
      --save-tensor-layers 0..48 --save-tensor-dir <gpu_dir> \
      <qwen3_moe_gguf>
    apr trace --save-tensor moe_router,moe_ffn_out \
      --save-tensor-layers 0..48 --save-tensor-dir <cpu_dir> \
      <qwen3_moe_gguf>
    apr diff --values <cpu_dir>/layer-N/moe_ffn_out.bin \
      <gpu_dir>/layer-N/moe_ffn_out.bin

per layer to find the first stage where GPU produces NaN/Inf. Once bisected, M-GPU-MOE-1.4 fix lands at the bisected stage.

Cross-reference bumps:
- README status block: M0–M78 → M0–M79
- CONTRIBUTING status footer: M0–M78 → M0–M79
- Spec status header (line 5): M0–M78 → M0–M79
- Spec status snapshot (line 311): M0–M78 → M0–M79
- Run history Run 1 end-M (line 751): M1–M78 → M1–M79

Drift detector PASS — sub-milestones tail M79, gate count 13, contract v1.23.0, corpus 30/30.

Refs: aprender PR #1523 (squash 690a835c4), M68 CPU helper, M74 CPU traced, M77 CLI, M78 GPU helper

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
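The per-layer comparison described in the commit message can be sketched as a small script. This is a minimal illustration and not part of `apr`: it assumes each saved tensor is a headerless file of little-endian float32 values (the on-disk format written by `--save-tensor` is not stated here), and the names `load_f32`, `first_nonfinite_layer`, and `max_abs_diff` are hypothetical. Only the `layer-N/moe_ffn_out.bin` layout is taken from the commands above.

```python
import math
import struct
from pathlib import Path


def load_f32(path):
    """Read a file as a flat list of little-endian float32 values.

    ASSUMPTION: raw headerless f32 payload; adjust if the real format differs.
    """
    data = Path(path).read_bytes()
    count = len(data) // 4
    return list(struct.unpack(f"<{count}f", data[: count * 4]))


def first_nonfinite_layer(gpu_dir, layers, stage="moe_ffn_out"):
    """Return the first layer index whose tensor holds NaN/Inf, else None."""
    for layer in layers:
        path = Path(gpu_dir) / f"layer-{layer}" / f"{stage}.bin"
        if not path.exists():
            continue  # layer was not captured; skip rather than guess
        if any(not math.isfinite(v) for v in load_f32(path)):
            return layer
    return None


def max_abs_diff(cpu_path, gpu_path):
    """Largest absolute element-wise difference between two tensors."""
    cpu, gpu = load_f32(cpu_path), load_f32(gpu_path)
    if len(cpu) != len(gpu):
        raise ValueError("tensor lengths differ")
    return max((abs(a - b) for a, b in zip(cpu, gpu)), default=0.0)
```

Under these assumptions, calling `first_nonfinite_layer` on the GPU output directory with the captured layer indices gives a starting point for the bisection, and `max_abs_diff` on the layers before it shows how far CPU and GPU had already diverged.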
1 parent 81d1df8 commit 086facd

3 files changed

Lines changed: 6 additions & 5 deletions


CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -255,7 +255,7 @@ is the project blueprint. Major behavioral changes update both:
255255
1. The contract's `status_history` (factual, machine-readable record).
256256
2. The spec markdown (narrative + milestone roll-up).
257257

258-
Status as of v1.23.0 (2026-05-06): M0–M78 all SHIPPED; corpus complete
258+
Status as of v1.23.0 (2026-05-06): M0–M79 all SHIPPED; corpus complete
259259
(30/30); 13/13 gates green; companion ↔ aprender round-trip
260260
mechanically guarded. **M32d numerical-parity FUNCTIONALLY DISCHARGED**
261261
2026-05-02 (aprender PR #1228 squash 5235aaeb9): output transition

README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ walks the two traces, applies per-tool semantic-equivalence rules,
2121
and emits a falsifiable parity score plus a closed-enum drift
2222
category for any mismatch.
2323

24-
**Status (2026-05-06)**: M0–M78 all SHIPPED. Contract at v1.23.0
24+
**Status (2026-05-06)**: M0–M79 all SHIPPED. Contract at v1.23.0
2525
ACTIVE_RUNTIME. Corpus complete at the spec-prescribed 30 fixtures
2626
(all score 1.0). Parity-matrix coverage 15/15 reachable
2727
(2 OOS at trace boundary). FALSIFY-CCPA-007 hard-blocking on every PR

docs/specifications/claude-code-parity-apr-poc.md

Lines changed: 4 additions & 3 deletions
@@ -2,7 +2,7 @@
22

33
**Version**: 1.23.0
44
**Date**: 2026-05-02
5-
**Status**: ACTIVE_RUNTIME — M0–M78 SHIPPED; **GPU MoE forward path P0 / HIGHEST PRIORITY** (M49 elevation, R10) with M-GPU-MOE-0 contract scaffold SHIPPED 2026-05-04 (M50, aprender PR #1453 squash `cf08e910f`); M-GPU-MOE-1.0 → 1.1.2 cascade ALL MERGED 2026-05-04 (M51, aprender PRs #1460 + #1462 + #1464 + #1469 + #1477 squash `dc6f94d3b`); M-GPU-MOE-1.2 SHIPPED + 2.0 + 2.3 stacked OPEN (M52 + M53, aprender PR #1484 squash 8cbb7b51e + #1485 OPEN 3-commit + #1487 + #1488 stacked-merged); M-MOE-SUB-1 SaveTensorStage extension SHIPPED 2026-05-05 (M60, aprender PR #1499 squash `b51986641`); trace-moe-gpu-sub-stages-v1 v1.0.0 → v1.1.0 traced-target clarification SHIPPED 2026-05-05 (M64, aprender PR #1503 squash `8c4c6d5c7`); **M-MOE-SUB-2 step (c) helper `moe_ffn_forward_layer_with_router` SHIPPED 2026-05-05 (M68, aprender PR #1507 squash `0f22c7841`)**; **M-MOE-SUB-2 step (a) `forward_qwen3_moe_traced_with_plan` SHIPPED 2026-05-05 (M74, aprender PR #1516 squash `3138d134d`) — wires MoeRouter + MoeFfnOut emit into CPU traced**; **`pv lint --strict-test-binding` mechanical drift-class prevention SHIPPED 2026-05-05 (M71, aprender PR #1511 squash `ff2e0b634`) — closes the entire M65/M66/M67/M70 manual-fix class via PV-VER-002**; §50.4 cascade §55 polymorphic preflight relaxation + §56 5g.1 LIVE smoke + ship-two-models §57 drift sweep + apr-pretrain-arch-polymorphic-v1 v1.4 + v1.5 + v1.6 + apr-pretrain-from-init-v1 v1.2 + apr-cli-tokenize-import-hf-v1 v1.1 SHIPPED 2026-05-05 (M61–M63 + M65–M67 + M69 + M70, aprender PRs #1500/#1501/#1502/#1504/#1505/#1506/#1508/#1509); M32d numerical-parity FUNCTIONALLY DISCHARGED 2026-05-02 (aprender PR #1228 squash 5235aaeb9); qwen3-moe-forward-v1 v1.3.0 DRAFT → v1.4.0 ACTIVE_ALGORITHM_LEVEL flipped on aprender main 2026-05-02T14:57Z (PR #1409 squash 3a2f2705b)
5+
**Status**: ACTIVE_RUNTIME — M0–M79 SHIPPED; **GPU MoE forward path P0 / HIGHEST PRIORITY** (M49 elevation, R10) with M-GPU-MOE-0 contract scaffold SHIPPED 2026-05-04 (M50, aprender PR #1453 squash `cf08e910f`); M-GPU-MOE-1.0 → 1.1.2 cascade ALL MERGED 2026-05-04 (M51, aprender PRs #1460 + #1462 + #1464 + #1469 + #1477 squash `dc6f94d3b`); M-GPU-MOE-1.2 SHIPPED + 2.0 + 2.3 stacked OPEN (M52 + M53, aprender PR #1484 squash 8cbb7b51e + #1485 OPEN 3-commit + #1487 + #1488 stacked-merged); M-MOE-SUB-1 SaveTensorStage extension SHIPPED 2026-05-05 (M60, aprender PR #1499 squash `b51986641`); trace-moe-gpu-sub-stages-v1 v1.0.0 → v1.1.0 traced-target clarification SHIPPED 2026-05-05 (M64, aprender PR #1503 squash `8c4c6d5c7`); **M-MOE-SUB-2 step (c) helper `moe_ffn_forward_layer_with_router` SHIPPED 2026-05-05 (M68, aprender PR #1507 squash `0f22c7841`)**; **M-MOE-SUB-2 step (a) `forward_qwen3_moe_traced_with_plan` SHIPPED 2026-05-05 (M74, aprender PR #1516 squash `3138d134d`) — wires MoeRouter + MoeFfnOut emit into CPU traced**; **`pv lint --strict-test-binding` mechanical drift-class prevention SHIPPED 2026-05-05 (M71, aprender PR #1511 squash `ff2e0b634`) — closes the entire M65/M66/M67/M70 manual-fix class via PV-VER-002**; §50.4 cascade §55 polymorphic preflight relaxation + §56 5g.1 LIVE smoke + ship-two-models §57 drift sweep + apr-pretrain-arch-polymorphic-v1 v1.4 + v1.5 + v1.6 + apr-pretrain-from-init-v1 v1.2 + apr-cli-tokenize-import-hf-v1 v1.1 SHIPPED 2026-05-05 (M61–M63 + M65–M67 + M69 + M70, aprender PRs #1500/#1501/#1502/#1504/#1505/#1506/#1508/#1509); M32d numerical-parity FUNCTIONALLY DISCHARGED 2026-05-02 (aprender PR #1228 squash 5235aaeb9); qwen3-moe-forward-v1 v1.3.0 DRAFT → v1.4.0 ACTIVE_ALGORITHM_LEVEL flipped on aprender main 2026-05-02T14:57Z (PR #1409 squash 3a2f2705b)
66
**Source of truth**: https://github.com/paiml/claude-code-parity-apr (canonical for enforcement; aprender mirrors only the contract YAML byte-for-byte via `pin.lock`)
77
**Companion-repo invariants** (must be green on every PR — see § Companion-repo source-of-truth invariants):
88
1. GitHub Actions `ci/gate` green (required status check) → **FALSIFY-CCPA-009**
@@ -308,7 +308,7 @@ The teacher's *fixtures* are immutable per-revision; the student (`apr code` orc
308308

309309
## Phases / Milestones
310310

311-
> **Status snapshot (2026-05-06)**: M0–M78 SHIPPED. M32d
311+
> **Status snapshot (2026-05-06)**: M0–M79 SHIPPED. M32d
312312
> **FUNCTIONALLY DISCHARGED** 2026-05-02 via aprender PR #1228 squash
313313
> 5235aaeb9 (Step 5 + 5b + 6 + 7 fix bundle). Output transition on
314314
> lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder-30B-
@@ -493,6 +493,7 @@ in `contracts/claude-code-parity-apr-v1.yaml § status_history`:
493493
| **M76** | Cross-repo **v0.32.0 publish-cascade SHIPPED on aprender main** as 2 squashes: `0bb94d5d3` (2026-05-05, aprender PR #1518) + `cb20a3648` (2026-05-05, aprender PR #1519). Combined record covering the final two PRs of the v0.32.0 publish cascade (aprender#1514). **#1518 fix**: `apr-cli/src/commands/aliases.rs:13` had `include_str!("../../../../configs/aliases.yaml")` referencing the workspace-root file; `cargo publish` excludes files outside the crate dir, breaking the publish step. Fix copies `aliases.yaml` into the crate dir + updates `include_str!` path. **#1519 chore**: CHANGELOG.md gains a `## [0.32.0] - 2026-05-05` section under `## [Unreleased]` documenting the breaking aprender-rag lib rename (#1510, #1512), the cascade publish (#1514) at v0.32.0 across 15 user-facing crates, and the mechanical-drift kaizen wins (M71 + M73 publish-hygiene cascade). Together M72 + #1515 + M75 + M76 (covering #1518 + #1519) close the v0.32.0 release cycle: lib-rename → dep-cycle break → clean-room compat → publish-include-path → CHANGELOG. **Publish-cascade lessons** (kaizen): (1) cargo publish excludes files outside crate-root → require all `include_str!` paths under `crates/<name>/`; (2) clean-room sed strips `path` but needs a `version` fallback → use `{ version = "*", path = "..." }` dev-dep form; (3) APR-MONO consolidation gaps (lib-name harmonization swept most crates but missed aprender-rag) ripple at publish time only — would benefit from a CI gate that runs `cargo publish --dry-run` for every workspace member. | aprender [#1518](https://github.com/paiml/aprender/pull/1518) `0bb94d5d3` + [#1519](https://github.com/paiml/aprender/pull/1519) `cb20a3648` | this PR |
494494
| **M77** | Cross-repo **§58 ship-two-models v3.02.0 → v3.03.0 + M-MOE-SUB-2 step (a) CLI completion SHIPPED on aprender main** as 2 squashes: `8525008f6` (2026-05-06, aprender PR #1520) + `c63a8dd61` (2026-05-06, aprender PR #1521). **#1520 §58**: ship-two-models spec v3.02.0 → v3.03.0 records the v0.32.0 cascade publish (Issue #1514 CLOSED) and the four release-engineering defects the cascade surfaced+closed (publish-include-path, dep-cycle, clean-room sed, CHANGELOG) — third hygiene amendment in ship-two-models, mirroring M65/M66/M67/M70 pretrain-contract drift fixes on the spec side. **#1521 M-MOE-SUB-2 step (a) CLI completion**: connects the `--save-tensor` / `--save-tensor-layers` / `--save-tensor-dir` clap surface (PR-A #1405) through to `forward_qwen3_moe_traced_with_plan` (M74) for `.gguf` qwen3_moe models. New pub fn `run_save_tensor_gguf_moe(path, stages, dir, layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs` mirrors the existing `run_save_tensor_apr` for APR models — loads via `MappedGGUFModel`/`OwnedQuantizedModel`, validates qwen3_moe arch, reads MoE config from GGUF metadata, dispatches to `forward_qwen3_moe_traced_with_plan` with the plan derived from CLI args. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands` routes `.gguf` to the new function (`.apr` continues to use existing dense path; `.safetensors` still stub). **Operationally unblocks M-MOE-SUB-3 live bisection on lambda-vector RTX 4090** (CPU-traced side): `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <dir> <qwen3_moe_gguf>` now produces per-layer MoeRouter + MoeFfnOut tensor files on disk, ready for diff vs the GPU sibling output once M-MOE-SUB-2 step (b) ships. Production hot paths (`forward_qwen3_moe`, `forward_qwen3_moe_cuda`) byte-unchanged; `forward_qwen3_moe_traced` (no-plan) public API unchanged via the M74 delegate pattern. | aprender [#1520](https://github.com/paiml/aprender/pull/1520) `8525008f6` + [#1521](https://github.com/paiml/aprender/pull/1521) `c63a8dd61` | this PR |
495495
| **M78** | Cross-repo **`moe_ffn_forward_layer_cuda_with_router` GPU helper SHIPPED on aprender main** as squash `7e2091967` (2026-05-06, aprender PR #1522). **GPU parallel of M-MOE-SUB-2 step (c)** — adds the sibling helper that returns both the FFN output AND the post-renormalize top-k router weights from `OwnedQuantizedModelCuda::moe_ffn_forward_layer_cuda`. Where M68 added `moe_ffn_forward_layer_with_router` for the CPU side (used by M74's `forward_qwen3_moe_traced_with_plan`), M78 adds the GPU mirror needed by the upcoming `forward_qwen3_moe_cuda_traced` (M-MOE-SUB-2 step (b), next PR). The helper enables capturing `MoeRouter` for the last token without (a) recomputing the router from scratch (drift risk between production and traced), or (b) modifying the production `moe_ffn_forward_layer_cuda` hot path (additive-purity invariant pinned in v1.1.0). With M78 + M68 + M77 (CLI wireup), the M-MOE-SUB-2 cascade now has: CPU helper (M68) ✓ + CPU traced wireup (M74) ✓ + CLI dispatch (M77) ✓ + GPU helper (M78) ✓ — only step (b) `forward_qwen3_moe_cuda_traced` GPU sibling remains before M-MOE-SUB-3 live bisection on RTX 4090 can compare CPU vs GPU traced outputs to find the first NaN-emitting stage. | aprender [#1522 MERGED](https://github.com/paiml/aprender/pull/1522) `7e2091967` | this PR |
496+
| **M79** | Cross-repo **`forward_qwen3_moe_cuda_traced` SHIPPED on aprender main** as squash `690a835c4` (2026-05-06, aprender PR #1523). **M-MOE-SUB-2 step (b) — GPU traced sibling, completing the M-MOE-SUB-2 cascade end-to-end.** Mirrors the CPU traced sibling `forward_qwen3_moe_traced_with_plan` (M74) but routes per-layer MoE FFN through the GPU dispatch (`moe_ffn_forward_layer_cuda_with_router` from M78) so `apr trace --gpu --json --payload --save-tensor` can run **the same SaveTensorPlan** against both CPU and GPU forward paths, capture per-stage activations at MoeRouter + MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf source. Production `forward_qwen3_moe_cuda` hot path byte-unchanged (additive-purity invariant pinned in v1.1.0). **M-MOE-SUB-2 cascade now 5/5 COMPLETE**: CPU helper (M68) + CPU traced (M74) + CLI dispatch (M77) + GPU helper (M78) + GPU traced (M79). **Operationally unblocks M-MOE-SUB-3 live bisection on lambda-vector RTX 4090**: an operator can now run `apr trace --gpu --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <gpu_dir> <qwen3_moe_gguf>` (GPU side) + `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <cpu_dir> <qwen3_moe_gguf>` (CPU side) on the cached 17.3 GB Qwen3-Coder GGUF, then `apr diff --values <cpu_dir>/layer-N/moe_ffn_out.bin <gpu_dir>/layer-N/moe_ffn_out.bin` per layer to find the first stage where GPU produces NaN/Inf. Once bisected, M-GPU-MOE-1.4 fix lands at the bisected stage. | aprender [#1523 MERGED](https://github.com/paiml/aprender/pull/1523) `690a835c4` | this PR |
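
The sibling-helper arrangement described in the M68, M78, and M79 rows, where a traced path obtains the post-renormalize top-k router weights from a `*_with_router` helper while the production entry point is left untouched, can be illustrated in a language-neutral way. The sketch below is Python, not aprender's Rust. The toy router (softmax, top-k, renormalize), the toy expert, and every function name are assumptions chosen only to show the shape of the pattern.

```python
import math


def _route(logits, k):
    """Softmax over expert logits, keep top-k, renormalize to sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def _expert(i, x):
    """Toy stand-in for an expert FFN: scale the input by (i + 1)."""
    return [(i + 1) * v for v in x]


def moe_ffn(x, logits, k=2):
    """'Production' path: returns only the FFN output. Left untouched."""
    out = [0.0] * len(x)
    for i, w in _route(logits, k):
        for j, v in enumerate(_expert(i, x)):
            out[j] += w * v
    return out


def moe_ffn_with_router(x, logits, k=2):
    """Traced sibling: same arithmetic, but also returns the router weights."""
    weights = _route(logits, k)
    out = [0.0] * len(x)
    for i, w in weights:
        for j, v in enumerate(_expert(i, x)):
            out[j] += w * v
    return out, weights
```

Because the sibling shares the routing step with the production function instead of recomputing it separately, the captured weights are the ones that actually produced the output, which is the drift risk the M78 row calls out.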
496497

497498
## Falsification conditions (13 gates total)
498499

@@ -747,7 +748,7 @@ inverts the schedule for everything after.
747748
| Run | Date | Revision | Verdict | Notes |
748749
|-----|------|----------|---------|-------|
749750
| Run 0 | 2026-04-26 | original spec PR | **NOT YET RUN** (historical) | Spec authored; companion repo not yet scaffolded; gates 009–012 not yet wired. |
750-
| Run 1 | 2026-04-26 → 2026-05-06 | M1–M78 (every merge to companion main) | **PASS** on every commit | Gates 009–012 (ci/gate green, pmat comply 100%, line coverage ≥99%, pv validate clean) have been hard-blocking on every PR since M1's empty-scaffold landed (FALSIFY-CCPA-009 enforces branch protection from that PR forward). M32d FUNCTIONALLY DISCHARGED 2026-05-02 (M35 audit-trail bump records aprender PR #1228 squash 5235aaeb9); subsequent days (5-03 through 5-06) show ongoing post-discharge verification — `make smoke-m32d` PASS on every check, drift detector + meta-test green at M47. Per-run audit trail lives in `contracts/claude-code-parity-apr-v1.yaml § status_history` (one entry per minor-version bump). |
751+
| Run 1 | 2026-04-26 → 2026-05-06 | M1–M79 (every merge to companion main) | **PASS** on every commit | Gates 009–012 (ci/gate green, pmat comply 100%, line coverage ≥99%, pv validate clean) have been hard-blocking on every PR since M1's empty-scaffold landed (FALSIFY-CCPA-009 enforces branch protection from that PR forward). M32d FUNCTIONALLY DISCHARGED 2026-05-02 (M35 audit-trail bump records aprender PR #1228 squash 5235aaeb9); subsequent days (5-03 through 5-06) show ongoing post-discharge verification — `make smoke-m32d` PASS on every check, drift detector + meta-test green at M47. Per-run audit trail lives in `contracts/claude-code-parity-apr-v1.yaml § status_history` (one entry per minor-version bump). |
751752

752753
(Subsequent runs append below in the apr-cli-qa-spec.md format: gate / status / evidence per row. The status_history block in the contract YAML is the byte-precise audit; this table is the human roll-up.)
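
The cross-reference bumps in this commit keep several status lines on the same milestone tail. A check in that spirit can be sketched as follows. This is an illustration only and not the repository's actual drift detector: the regular expression, the choice to match only ranges that start at M0 or M1, and the function names `milestone_tails` and `is_consistent` are all assumptions.

```python
import re

# Matches roll-up ranges such as "M0–M79" or "M1-M79" (en dash or hyphen).
# Ranges that start elsewhere (e.g. "M61–M63") are deliberately ignored.
RANGE = re.compile(r"M[01][–-]M(\d+)")


def milestone_tails(texts):
    """Map each document name to the set of roll-up tails it mentions."""
    return {
        name: {int(m) for m in RANGE.findall(body)}
        for name, body in texts.items()
    }


def is_consistent(texts, expected):
    """True when every document mentions exactly the expected tail."""
    return all(tails == {expected} for tails in milestone_tails(texts).values())
```

For example, passing the README, CONTRIBUTING, and spec contents with `expected=79` would return `False` if any one of them still read `M0–M78`.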
753754
