revert iter 95d edit + correct iter 117b-3 misreading

mzhong4 · claude · mzhong4 · commit 4a4f25391b22 · 2026-05-02T15:41:16.000-05:00
User flagged that the iter 117b-3 step_avg high values were likely compile-init not yet amortized. They were right. step_avg is cumulative (total/N), not per-step. Per-step deltas at s2-s10 were 21-28s (mean ~24s), only ~5% slower than iter 95 baseline 23.5s. 111s s1 compile amortizes to <0.1s/step over 1000 steps. Also corrected throughput-economics math: at C=8 with E=15, sparse dispatch handles C*N tokens (8N for balanced routing), not 8*E*N. That's FEWER than dense 15N -- should be faster, not slower. Reverting: - train_gpt.py: deq_bptt_k 4 -> 3 (restore iter 95 baseline) - hypotheses.md: replace NOT-THROUGHPUT-DELIVERING block with KILLED-PREMATURELY block, marking relaunch pending - task openai#118: in_progress (relaunch pending) - task openai#122 (117b-3b): un-skipped, re-queued Relaunching iter 117b-3 next with corrected understanding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/experiments/hypotheses.md b/experiments/hypotheses.md
@@ -2305,33 +2305,17 @@ k_sweep_table:  128    1.5005   0.4007   0.2008   0.3169    0.0267    0.0380
 
 **Status:** PROMOTED ★. Baseline updated. 7,447,773-byte artifact rotated to `experiments/weights/baseline/`. Plots regenerated. Continuing autonomous Tier 1: iter 117b-2 (Triton entmax) next per Tier 1 reorder.
 
-### Iter 117b-3 NOT THROUGHPUT-DELIVERING (2026-05-02): sparse MoE dispatch + eager entmax slower than fused dense bmm
+### Iter 117b-3 KILLED PREMATURELY ✗ then RELAUNCH-PENDING (2026-05-02): step_avg analysis was wrong — compile-init dominated the cumulative average
 
-**Outcome:** KILLED ✗ at s10 (~11min after launch). NOT a correctness failure; a throughput-economics finding.
+**Outcome:** Initial KILL at s10 was ERRONEOUS. The "step_avg 33s" reading was a CUMULATIVE AVERAGE including the 111s s1 compile-init, not steady-state per-step time. Per-step deltas at s2-s10 were 21-28s (bouncing on K-jitter), only ~5% slower than iter 95 baseline 23.5s — well within "verification at C=8" tolerance.
 
-**Launch attempt:** 2026-05-02 15:25, run_id `8b92feab`, flags `--use-entmax-routing=1 --use-sparse-dispatch=1 --sparse-dispatch-capacity-factor=8`. Required CLI fix `d9bce1c` to register `--sparse-dispatch-capacity-factor` in `_CLI_TUNABLE_KNOBS`.
+**Original analysis errors:**
+1. Misread cumulative `step_avg` as per-step time. Cumulative averages high values (s1 compile init) over few steps → looks 1.4× slower at s10. By s30+ asymptotic should be visible.
+2. Throughput-economics math was wrong: at C=8 with E=15, sparse dispatch handles **C × N tokens total** (= 8N for balanced routing), not 8×E×N. That's FEWER than dense 15N — sparse should be faster, not slower.
 
-**step_avg trajectory (s1–s10): 111s → 66 → 54 → 46 → 41 → 39 → 37 → 35 → 34 → 33.2s**. Slowing but converging to ~28-30s steady-state — **42% slower than dense iter 95 baseline (23.5s)**. Estimated 1000-step run = 8.3h (vs iter 95 6.5h). NOT ACCEPTABLE for what was supposed to be a "≈ dense bit-identical" verification step.
+**Corrected reading**: per-step time at s2-s10 was 21-28s (mean ~24s), only ~3-5% slower than iter 95 dense baseline. The 111s s1 compile cost amortizes to <0.1s/step over 1000 steps — negligible.
 
-**Throughput economics analysis (the principled finding):**
-- Dense MoE: total compute = E × N tokens via fused `bmm` over (E, N, D) — single kernel, highly optimized.
-- Sparse dispatch at capacity factor C: total compute = C × N tokens via gather/scatter + grouped GEMM.
-  - At C=8 with E=15: 8N total dispatched (vs 15N dense) — fewer tokens but...
-  - Gather/scatter ops have O(N) overhead INDEPENDENT OF C (the routing always happens)
-  - Grouped GEMM has worse memory access patterns than fused dense bmm
-  - Net: 8N tokens × per-token + gather/scatter overhead > 15N × per-token-fused-bmm
-- **For sparse dispatch to win on throughput**: requires C ≤ 1 AND/OR Triton-fused dispatch kernel. Eager-mode sparse dispatch at any reasonable C is throughput-neutral or negative versus fused dense bmm.
-
-**Implication for queue:**
-- iter 117b-3 itself is not promotable as-is (eager-mode sparse-dispatch-on-MLP doesn't deliver throughput at any C).
-- The sparse-MoE-dispatch axis only delivers throughput if **fused via Triton** (= H88 iter 118 territory).
-- iter 117b-2-fix (Triton entmax with E=30→32 padding) is now the prerequisite for any sparsity throughput win — promote its priority.
-
-**Distinguishes from iter 117b-2 NOT-VIABLE**: that was a kernel input-shape constraint (fixable). This is an algorithmic/architectural finding (eager sparse dispatch is structurally throughput-negative).
-
-**Status:** NOT-PROMOTED ✗ at C=8 (would-be verification too expensive). Capacity sweep down to C=2/C=1 abandoned because the eager-mode overhead floor dominates regardless of C. Sparse dispatch path PARKED until iter 117b-2-fix (Triton entmax) lands and a fused-dispatch kernel can be designed alongside.
-
-**Implication for Tier 1**: skipping iter 117b-3b (sparse-Q attention) — same code path class, will hit the same eager-overhead ceiling. Proceeding to iter 117b-2-fix (Triton kernel padding) as the throughput-delivering iter.
+**Status:** Iter 117b-3 RELAUNCH PENDING with corrected understanding. Will run full 1000 steps and measure true asymptotic step_avg + final val_bpb. The C=8 verification could deliver bit-identical val_bpb to iter 95 (within ±0.01) at modest wallclock overhead.
 
 ### Iter 117b-2 NOT VIABLE (2026-05-02): Triton entmax kernel rejects E=30 (non-power-of-2)
 
diff --git a/train_gpt.py b/train_gpt.py
@@ -513,7 +513,7 @@ class Hyperparameters:
     # per-iter VJP magnitudes decay geometrically toward x0, so the last few
     # iters should dominate the total param gradient.  If the hypothesis
     # holds, throughput scales ~ K_fwd / (K_fwd + K_bwd) improvement.
-    deq_bptt_k = 3  # iter 95 (2026-05-02): TBPTT=2 → 3 under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Expected ~+10% wallclock cost. Iter 85 (H63) PROMOTED stochastic {2,3,4} earlier but at WD=0.30 / K-jitter (4,6,10) era — different regime; this is fixed=3 retest under WD=0.01 / K-jitter (16,24).
+    deq_bptt_k = 3  # iter 95 (2026-05-02): TBPTT=2 → 3 PROMOTED ★ under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Cost ~+10% wallclock; val_bpb int6 1.5001 (Δ-0.0164 vs iter 112+122) and K-sweep tightening confirmed.
     # Iter 85 enabled stochastic TBPTT {2,3,4} as a K-jitter analog (H63
     # PROMOTED ★ narrow margin). 2026-04-28 PROFILE-driven revert: H63 itself
     # noted +0.0054 val_bpb regression vs fixed k=2 AND +21% throughput cost,