Commit 4a4f253

mzhong4claude
andcommitted
revert iter 95d edit + correct iter 117b-3 misreading
User flagged that the iter 117b-3 step_avg high values were likely compile-init not yet amortized. They were right. step_avg is cumulative (total/N), not per-step. Per-step deltas at s2-s10 were 21-28s (mean ~24s), only ~5% slower than iter 95 baseline 23.5s. 111s s1 compile amortizes to <0.1s/step over 1000 steps.

Also corrected throughput-economics math: at C=8 with E=15, sparse dispatch handles C*N tokens (8N for balanced routing), not 8*E*N. That's FEWER than dense 15N -- should be faster, not slower.

Reverting:
- train_gpt.py: deq_bptt_k 4 -> 3 (restore iter 95 baseline)
- hypotheses.md: replace NOT-THROUGHPUT-DELIVERING block with KILLED-PREMATURELY block, marking relaunch pending
- task openai#118: in_progress (relaunch pending)
- task openai#122 (117b-3b): un-skipped, re-queued

Relaunching iter 117b-3 next with corrected understanding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1a033b2 commit 4a4f253
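
The cumulative-versus-per-step distinction in the commit message can be checked with a few lines of Python. This is a minimal sketch, not code from the repository: the helper name `per_step_times` is hypothetical, and it assumes `step_avg` at step n is total elapsed time divided by n. The input values are the rounded s1 to s10 averages recorded in the removed `hypotheses.md` block, so the recovered deltas are approximate.

```python
def per_step_times(step_avg):
    """Recover per-step times from a cumulative-average series.

    Assumes step_avg[i] == total_elapsed / (i + 1) after step i + 1.
    """
    totals = [avg * (i + 1) for i, avg in enumerate(step_avg)]
    return [totals[0]] + [b - a for a, b in zip(totals, totals[1:])]


# Rounded step_avg values (seconds) for s1..s10 from the original log entry.
STEP_AVG = [111, 66, 54, 46, 41, 39, 37, 35, 34, 33.2]
BASELINE_STEP = 23.5   # iter 95 dense baseline, seconds per step
TOTAL_STEPS = 1000     # planned run length

steps = per_step_times(STEP_AVG)
steady = steps[1:]                                # s2..s10, compile step excluded
steady_mean = sum(steady) / len(steady)
slowdown = steady_mean / BASELINE_STEP - 1.0      # relative to the baseline
compile_excess = steps[0] - steady_mean           # extra cost of s1 over steady state
amortized = compile_excess / TOTAL_STEPS          # seconds per step over the full run

print(f"per-step s2-s10: {[round(t, 1) for t in steady]}")
print(f"steady mean: {steady_mean:.1f}s, slowdown vs baseline: {slowdown:.1%}")
print(f"compile excess amortized over {TOTAL_STEPS} steps: {amortized:.3f}s/step")
```

With the rounded inputs, the recovered deltas fall between roughly 21 and 30 seconds, and the s1 cost in excess of steady state comes out below 0.1 s/step once spread over 1000 steps.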

2 files changed

Lines changed: 8 additions & 24 deletions

experiments/hypotheses.md

Lines changed: 7 additions & 23 deletions
@@ -2305,33 +2305,17 @@ k_sweep_table: 128 1.5005 0.4007 0.2008 0.3169 0.0267 0.0380
 
 **Status:** PROMOTED ★. Baseline updated. 7,447,773-byte artifact rotated to `experiments/weights/baseline/`. Plots regenerated. Continuing autonomous Tier 1: iter 117b-2 (Triton entmax) next per Tier 1 reorder.
 
-### Iter 117b-3 NOT THROUGHPUT-DELIVERING (2026-05-02): sparse MoE dispatch + eager entmax slower than fused dense bmm
+### Iter 117b-3 KILLED PREMATURELY ✗ then RELAUNCH-PENDING (2026-05-02): step_avg analysis was wrong — compile-init dominated the cumulative average
 
-**Outcome:** KILLED ✗ at s10 (~11min after launch). NOT a correctness failure; a throughput-economics finding.
+**Outcome:** Initial KILL at s10 was ERRONEOUS. The "step_avg 33s" reading was a CUMULATIVE AVERAGE including the 111s s1 compile-init, not steady-state per-step time. Per-step deltas at s2-s10 were 21-28s (bouncing on K-jitter), only ~5% slower than iter 95 baseline 23.5s — well within "verification at C=8" tolerance.
 
-**Launch attempt:** 2026-05-02 15:25, run_id `8b92feab`, flags `--use-entmax-routing=1 --use-sparse-dispatch=1 --sparse-dispatch-capacity-factor=8`. Required CLI fix `d9bce1c` to register `--sparse-dispatch-capacity-factor` in `_CLI_TUNABLE_KNOBS`.
+**Original analysis errors:**
+1. Misread cumulative `step_avg` as per-step time. Cumulative averages high values (s1 compile init) over few steps → looks 1.4× slower at s10. By s30+ asymptotic should be visible.
+2. Throughput-economics math was wrong: at C=8 with E=15, sparse dispatch handles **C × N tokens total** (= 8N for balanced routing), not 8×E×N. That's FEWER than dense 15N — sparse should be faster, not slower.
 
-**step_avg trajectory (s1–s10): 111s → 66 → 54 → 46 → 41 → 39 → 37 → 35 → 34 → 33.2s**. Slowing but converging to ~28-30s steady-state — **42% slower than dense iter 95 baseline (23.5s)**. Estimated 1000-step run = 8.3h (vs iter 95 6.5h). NOT ACCEPTABLE for what was supposed to be a "≈ dense bit-identical" verification step.
+**Corrected reading**: per-step time at s2-s10 was 21-28s (mean ~24s), only ~3-5% slower than iter 95 dense baseline. The 111s s1 compile cost amortizes to <0.1s/step over 1000 steps — negligible.
 
-**Throughput economics analysis (the principled finding):**
-- Dense MoE: total compute = E × N tokens via fused `bmm` over (E, N, D) — single kernel, highly optimized.
-- Sparse dispatch at capacity factor C: total compute = C × N tokens via gather/scatter + grouped GEMM.
-- At C=8 with E=15: 8N total dispatched (vs 15N dense) — fewer tokens but...
-- Gather/scatter ops have O(N) overhead INDEPENDENT OF C (the routing always happens)
-- Grouped GEMM has worse memory access patterns than fused dense bmm
-- Net: 8N tokens × per-token + gather/scatter overhead > 15N × per-token-fused-bmm
-- **For sparse dispatch to win on throughput**: requires C ≤ 1 AND/OR Triton-fused dispatch kernel. Eager-mode sparse dispatch at any reasonable C is throughput-neutral or negative versus fused dense bmm.
-
-**Implication for queue:**
-- iter 117b-3 itself is not promotable as-is (eager-mode sparse-dispatch-on-MLP doesn't deliver throughput at any C).
-- The sparse-MoE-dispatch axis only delivers throughput if **fused via Triton** (= H88 iter 118 territory).
-- iter 117b-2-fix (Triton entmax with E=30→32 padding) is now the prerequisite for any sparsity throughput win — promote its priority.
-
-**Distinguishes from iter 117b-2 NOT-VIABLE**: that was a kernel input-shape constraint (fixable). This is an algorithmic/architectural finding (eager sparse dispatch is structurally throughput-negative).
-
-**Status:** NOT-PROMOTED ✗ at C=8 (would-be verification too expensive). Capacity sweep down to C=2/C=1 abandoned because the eager-mode overhead floor dominates regardless of C. Sparse dispatch path PARKED until iter 117b-2-fix (Triton entmax) lands and a fused-dispatch kernel can be designed alongside.
-
-**Implication for Tier 1**: skipping iter 117b-3b (sparse-Q attention) — same code path class, will hit the same eager-overhead ceiling. Proceeding to iter 117b-2-fix (Triton kernel padding) as the throughput-delivering iter.
+**Status:** Iter 117b-3 RELAUNCH PENDING with corrected understanding. Will run full 1000 steps and measure true asymptotic step_avg + final val_bpb. The C=8 verification could deliver bit-identical val_bpb to iter 95 (within ±0.01) at modest wallclock overhead.
 
 ### Iter 117b-2 NOT VIABLE (2026-05-02): Triton entmax kernel rejects E=30 (non-power-of-2)
 
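
The token-count comparison in the second analysis error can be written out as a short calculation. This is a sketch of the accounting as the notes state it (dense processes E × N expert-token pairs, sparse dispatch at capacity factor C processes C × N), not the repository's dispatch code. The function name `expert_token_work` and the batch size `N` are hypothetical, and the count ignores gather/scatter overhead and kernel efficiency.

```python
def expert_token_work(n_tokens, n_experts, capacity_factor=None):
    """Count expert-token pairs processed per MoE layer.

    Dense (capacity_factor is None): every expert sees every token, E * N.
    Sparse: per the notes' model, C * N pairs in total under balanced routing.
    """
    if capacity_factor is None:
        return n_experts * n_tokens
    return capacity_factor * n_tokens


N = 4096  # hypothetical tokens per step, chosen only for illustration
E = 15    # number of experts, from the notes
C = 8     # capacity factor, from the launch flags

dense = expert_token_work(N, E)
sparse = expert_token_work(N, E, capacity_factor=C)
misread = C * E * N  # the count the original analysis implicitly used

print(f"dense:   {dense} pairs ({E}N)")
print(f"sparse:  {sparse} pairs ({C}N), {sparse / dense:.0%} of dense")
print(f"misread: {misread} pairs ({C * E}N)")
```

Under this accounting the sparse path handles 8/15 of the dense work, so any measured slowdown would have to come from overheads outside the token count.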

train_gpt.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -513,7 +513,7 @@ class Hyperparameters:
 # per-iter VJP magnitudes decay geometrically toward x0, so the last few
 # iters should dominate the total param gradient. If the hypothesis
 # holds, throughput scales ~ K_fwd / (K_fwd + K_bwd) improvement.
-deq_bptt_k = 3 # iter 95 (2026-05-02): TBPTT=2 → 3 under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Expected ~+10% wallclock cost. Iter 85 (H63) PROMOTED stochastic {2,3,4} earlier but at WD=0.30 / K-jitter (4,6,10) era — different regime; this is fixed=3 retest under WD=0.01 / K-jitter (16,24).
+deq_bptt_k = 3 # iter 95 (2026-05-02): TBPTT=2 → 3 PROMOTED ★ under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Cost ~+10% wallclock; val_bpb int6 1.5001 (Δ-0.0164 vs iter 112+122) and K-sweep tightening confirmed.
 # Iter 85 enabled stochastic TBPTT {2,3,4} as a K-jitter analog (H63
 # PROMOTED ★ narrow margin). 2026-04-28 PROFILE-driven revert: H63 itself
 # noted +0.0054 val_bpb regression vs fixed k=2 AND +21% throughput cost,
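
The "backward coverage" figures in the `deq_bptt_k` comment follow from a simple ratio. The sketch below assumes coverage means the number of truncated-BPTT iterations divided by the number of forward fixed-point iterations K. That reading is inferred from the quoted percentages rather than stated in the source, and `backward_coverage` is a hypothetical helper.

```python
def backward_coverage(bptt_k, k_fwd):
    """Fraction of forward iterations that receive a backward pass.

    Assumes truncated BPTT differentiates only the last bptt_k of k_fwd iterations.
    """
    if not 0 < bptt_k <= k_fwd:
        raise ValueError("bptt_k must be in (0, k_fwd]")
    return bptt_k / k_fwd


K_FWD = 16  # lower end of the K-jitter range (16, 24) named in the comment

before = backward_coverage(2, K_FWD)  # TBPTT=2
after = backward_coverage(3, K_FWD)   # TBPTT=3

print(f"TBPTT=2: {before:.1%}, TBPTT=3: {after:.1%}")
```

At K=16 this gives 12.5% and 18.75%, matching the "12.5%→19%" in the comment after rounding.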
