revert iter 95d edit + correct iter 117b-3 misreading
User flagged that the iter 117b-3 step_avg high values were likely
compile-init not yet amortized. They were right.
step_avg is cumulative (total/N), not per-step. Per-step deltas at
s2-s10 were 21-28s (mean ~24s), only ~5% slower than iter 95 baseline
23.5s. 111s s1 compile amortizes to <0.1s/step over 1000 steps.
Also corrected throughput-economics math: at C=8 with E=15, sparse
dispatch handles C*N tokens (8N for balanced routing), not 8*E*N.
That's FEWER than dense 15N -- should be faster, not slower.
Reverting:
- train_gpt.py: deq_bptt_k 4 -> 3 (restore iter 95 baseline)
- hypotheses.md: replace NOT-THROUGHPUT-DELIVERING block with
KILLED-PREMATURELY block, marking relaunch pending
- task openai#118: in_progress (relaunch pending)
- task openai#122 (117b-3b): un-skipped, re-queued
Relaunching iter 117b-3 next with corrected understanding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Status:** PROMOTED ★. Baseline updated. 7,447,773-byte artifact rotated to `experiments/weights/baseline/`. Plots regenerated. Continuing autonomous Tier 1: iter 117b-2 (Triton entmax) next per Tier 1 reorder.
-
### Iter 117b-3 NOT THROUGHPUT-DELIVERING (2026-05-02): sparse MoE dispatch + eager entmax slower than fused dense bmm
+
### Iter 117b-3 KILLED PREMATURELY ✗ then RELAUNCH-PENDING (2026-05-02): step_avg analysis was wrong — compile-init dominated the cumulative average
-
**Outcome:** KILLED ✗ at s10 (~11min after launch). NOT a correctness failure; a throughput-economics finding.
+
**Outcome:** Initial KILL at s10 was ERRONEOUS. The "step_avg 33s" reading was a CUMULATIVE AVERAGE including the 111s s1 compile-init, not steady-state per-step time. Per-step deltas at s2-s10 were 21-28s (bouncing on K-jitter), only ~5% slower than iter 95 baseline 23.5s, well within "verification at C=8" tolerance.
-
**Launch attempt:** 2026-05-02 15:25, run_id `8b92feab`, flags `--use-entmax-routing=1 --use-sparse-dispatch=1 --sparse-dispatch-capacity-factor=8`. Required CLI fix `d9bce1c` to register `--sparse-dispatch-capacity-factor` in `_CLI_TUNABLE_KNOBS`.
+
**Original analysis errors:**
+
1. Misread cumulative `step_avg` as per-step time. A cumulative average over only a few steps is dominated by the high early value (the s1 compile-init), so the run looks 1.4× slower at s10. By s30+ the asymptotic per-step time should be visible.
+
2. Throughput-economics math was wrong: at C=8 with E=15, sparse dispatch handles **C × N tokens total** (= 8N for balanced routing), not 8×E×N. That's FEWER than dense 15N — sparse should be faster, not slower.
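The token accounting in item 2 can be sanity-checked with a few lines of Python. This is a minimal sketch that follows the counting stated above (dense processes E × N expert-token pairs, sparse dispatch processes C × N). The function names and the value of `N` are hypothetical, chosen for illustration only, and are not taken from `train_gpt.py`. The sketch counts tokens and deliberately ignores gather/scatter overhead.

```python
def dense_expert_tokens(num_experts: int, num_tokens: int) -> int:
    """Dense MoE: every expert processes every token (E * N)."""
    return num_experts * num_tokens


def sparse_expert_tokens(capacity_factor: int, num_tokens: int) -> int:
    """Sparse dispatch, per the accounting in the text: C * N tokens in total."""
    return capacity_factor * num_tokens


E = 15      # number of experts (from the text)
C = 8       # capacity factor (from the text)
N = 1024    # token count: an arbitrary, assumed value for illustration

dense = dense_expert_tokens(E, N)
sparse = sparse_expert_tokens(C, N)
ratio = sparse / dense

print(f"dense:  {dense} expert-token pairs")
print(f"sparse: {sparse} expert-token pairs")
print(f"sparse / dense = {ratio:.3f}")
```

The ratio is independent of `N`, so under this accounting sparse dispatch at C=8 touches fewer tokens than dense at E=15 whatever the batch size. Whether that translates into lower wallclock depends on dispatch overhead, which this sketch does not model.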
-
**step_avg trajectory (s1–s10): 111s → 66 → 54 → 46 → 41 → 39 → 37 → 35 → 34 → 33.2s**. Slowing but converging to ~28-30s steady-state — **42% slower than dense iter 95 baseline (23.5s)**. Estimated 1000-step run = 8.3h (vs iter 95 6.5h). NOT ACCEPTABLE for what was supposed to be a "≈ dense bit-identical" verification step.
+
**Corrected reading**: per-step time at s2-s10 was 21-28s (mean ~24s), only ~3-5% slower than iter 95 dense baseline. The 111s s1 compile cost amortizes to <0.1s/step over 1000 steps — negligible.
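As a rough check of the corrected reading, per-step times can be recovered from a cumulative mean via t_n = n·avg_n − (n−1)·avg_{n−1}. The sketch below applies this to the s1–s10 `step_avg` trajectory quoted in this entry. Those logged values are rounded, so the recovered deltas are approximate and individual steps may land slightly outside the 21-28s range. The helper name and the 1000-step horizon variable are assumptions for illustration, not code from `train_gpt.py`.

```python
# Cumulative step_avg (seconds) at s1..s10, as quoted in the trajectory line.
step_avg = [111, 66, 54, 46, 41, 39, 37, 35, 34, 33.2]


def per_step_times(cumulative_avg):
    """Recover per-step times from a running mean: t_n = n*avg_n - (n-1)*avg_{n-1}."""
    times = []
    prev_total = 0.0
    for n, avg in enumerate(cumulative_avg, start=1):
        total = n * avg              # total elapsed time after n steps
        times.append(total - prev_total)
        prev_total = total
    return times


times = per_step_times(step_avg)

# Steady-state estimate: mean over s2..s10, excluding the compile-heavy first step.
steady = sum(times[1:]) / len(times[1:])

# Extra cost of step 1 over a typical step, spread across the planned run length.
planned_steps = 1000                 # run length mentioned in the text
compile_excess = times[0] - steady
amortized = compile_excess / planned_steps

print("per-step times:", [round(t, 1) for t in times])
print(f"steady-state mean (s2-s10): {steady:.1f}s")
print(f"amortized compile excess: {amortized:.3f}s/step")
```

The cumulative average at s10 is still pulled up by the first step, while the mean over s2-s10 sits close to the iter 95 baseline. That gap is what the original analysis misread.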
-
**Throughput economics analysis (the principled finding):**
-
- Dense MoE: total compute = E × N tokens via fused `bmm` over (E, N, D) — single kernel, highly optimized.
-
- Sparse dispatch at capacity factor C: total compute = C × N tokens via gather/scatter + grouped GEMM.
-
- At C=8 with E=15: 8N total dispatched (vs 15N dense) — fewer tokens but...
-
- Gather/scatter ops have O(N) overhead INDEPENDENT OF C (the routing always happens)
-
- Grouped GEMM has worse memory access patterns than fused dense bmm
-
**For sparse dispatch to win on throughput**: requires C ≤ 1 AND/OR Triton-fused dispatch kernel. Eager-mode sparse dispatch at any reasonable C is throughput-neutral or negative versus fused dense bmm.
-
-
**Implication for queue:**
-
- iter 117b-3 itself is not promotable as-is (eager-mode sparse-dispatch-on-MLP doesn't deliver throughput at any C).
-
- The sparse-MoE-dispatch axis only delivers throughput if **fused via Triton** (= H88 iter 118 territory).
-
- iter 117b-2-fix (Triton entmax with E=30→32 padding) is now the prerequisite for any sparsity throughput win — promote its priority.
-
-
**Distinguishes from iter 117b-2 NOT-VIABLE**: that was a kernel input-shape constraint (fixable). This is an algorithmic/architectural finding (eager sparse dispatch is structurally throughput-negative).
-
-
**Status:** NOT-PROMOTED ✗ at C=8 (would-be verification too expensive). Capacity sweep down to C=2/C=1 abandoned because the eager-mode overhead floor dominates regardless of C. Sparse dispatch path PARKED until iter 117b-2-fix (Triton entmax) lands and a fused-dispatch kernel can be designed alongside.
-
-
**Implication for Tier 1**: skipping iter 117b-3b (sparse-Q attention) — same code path class, will hit the same eager-overhead ceiling. Proceeding to iter 117b-2-fix (Triton kernel padding) as the throughput-delivering iter.
+
**Status:** Iter 117b-3 RELAUNCH PENDING with corrected understanding. Will run full 1000 steps and measure true asymptotic step_avg + final val_bpb. The C=8 verification could deliver bit-identical val_bpb to iter 95 (within ±0.01) at modest wallclock overhead.
### Iter 117b-2 NOT VIABLE (2026-05-02): Triton entmax kernel rejects E=30 (non-power-of-2)
deq_bptt_k=3 # iter 95 (2026-05-02): TBPTT=2 → 3 under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Expected ~+10% wallclock cost. Iter 85 (H63) PROMOTED stochastic {2,3,4} earlier but at WD=0.30 / K-jitter (4,6,10) era — different regime; this is fixed=3 retest under WD=0.01 / K-jitter (16,24).
+
deq_bptt_k=3 # iter 95 (2026-05-02): TBPTT=2 → 3 PROMOTED ★ under iter 112+122 baseline. Triggered by grad_norm=0.07 in mid-flight of iter 112+122 (well below clip=1.0 → headroom for deeper backward). Backward coverage at K=16 increases 12.5%→19%. Cost ~+10% wallclock; val_bpb int6 1.5001 (Δ-0.0164 vs iter 112+122) and K-sweep tightening confirmed.
# Iter 85 enabled stochastic TBPTT {2,3,4} as a K-jitter analog (H63