Commit 1a033b2

mzhong4 and claude committed
iter 117b-3 NOT-THROUGHPUT-DELIVERING ✗: sparse-dispatch eager-mode overhead dominates dense fused bmm
KILLED at s10 (~11min after launch). step_avg trajectory 111s -> 33s, converging to ~28-30s steady-state; the s10 step_avg is about 41% slower than the dense iter 95 baseline (23.5s).

Throughput economics analysis (the principled finding):
- Dense MoE: E*N tokens via fused bmm (single highly-optimized kernel)
- Sparse C=8: 8N tokens via gather/scatter + grouped GEMM
- Gather/scatter overhead is O(N) INDEPENDENT of C (always happens)
- Grouped GEMM has worse memory access patterns than dense bmm
- Net: 8N tokens + scatter overhead > 15N fused-dense

For sparse to win: requires C <= 1 AND Triton-fused dispatch kernel. Eager-mode sparse dispatch at any C is throughput-neutral or negative.

Implication for queue:
- iter 117b-3 not promotable as-is at any capacity factor
- iter 117b-3b (sparse-Q attention) SKIPPED -- same code path class, will hit same overhead ceiling
- sparsity throughput axis PARKED until iter 117b-2-fix lands (Triton entmax with E=30->32 padding, then design fused dispatch)
- autonomous protocol jumps to iter 117b-2-fix priority

Distinct from iter 117b-2 NOT-VIABLE: that was a kernel input-shape bug (fixable via padding); this is an algorithmic finding (eager sparse dispatch is structurally throughput-negative vs fused dense bmm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d9bce1c commit 1a033b2

1 file changed

Lines changed: 28 additions & 0 deletions

experiments/hypotheses.md

@@ -2305,6 +2305,34 @@ k_sweep_table: 128 1.5005 0.4007 0.2008 0.3169 0.0267 0.0380

**Status:** PROMOTED ★. Baseline updated. 7,447,773-byte artifact rotated to `experiments/weights/baseline/`. Plots regenerated. Continuing autonomous Tier 1: iter 117b-2 (Triton entmax) next per Tier 1 reorder.

### Iter 117b-3 NOT THROUGHPUT-DELIVERING (2026-05-02): sparse MoE dispatch + eager entmax slower than fused dense bmm

**Outcome:** KILLED ✗ at s10 (~11min after launch). NOT a correctness failure; a throughput-economics finding.

**Launch attempt:** 2026-05-02 15:25, run_id `8b92feab`, flags `--use-entmax-routing=1 --use-sparse-dispatch=1 --sparse-dispatch-capacity-factor=8`. Required CLI fix `d9bce1c` to register `--sparse-dispatch-capacity-factor` in `_CLI_TUNABLE_KNOBS`.

**step_avg trajectory (s1–s10): 111s → 66 → 54 → 46 → 41 → 39 → 37 → 35 → 34 → 33.2s**. Still falling, but with shrinking gains, and converging to ~28-30s steady-state. The s10 step_avg is **about 41% slower than the dense iter 95 baseline (23.5s)**, and the projected steady state would still be roughly 19-28% slower. Estimated 1000-step run = 8.3h at ~30s/step (vs iter 95 6.5h). NOT ACCEPTABLE for what was supposed to be a "≈ dense bit-identical" verification step.

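The slowdown and run-time figures in this entry can be re-derived from the quoted step times. The sketch below is only that arithmetic: the constants are copied from the entry, and the function names are illustrative, not part of the training code.

```python
# Figures copied from the log entry; the helper names are hypothetical.
STEP_AVG_S10 = 33.2          # step_avg at s10, seconds
DENSE_BASELINE = 23.5        # dense iter 95 baseline, seconds per step
STEADY_STATE = (28.0, 30.0)  # projected steady-state range, seconds per step
TOTAL_STEPS = 1000


def slowdown_pct(step_s, baseline_s=DENSE_BASELINE):
    """Percent slowdown of step_s relative to baseline_s."""
    return 100.0 * (step_s / baseline_s - 1.0)


def run_hours(step_s, steps=TOTAL_STEPS):
    """Wall-clock hours for `steps` steps at a constant step time."""
    return step_s * steps / 3600.0


if __name__ == "__main__":
    print(f"s10 slowdown vs dense: {slowdown_pct(STEP_AVG_S10):.1f}%")
    for s in STEADY_STATE:
        print(f"steady state {s:.0f}s: {slowdown_pct(s):.1f}% slower, "
              f"{run_hours(s):.1f}h per {TOTAL_STEPS} steps")
    print(f"dense baseline: {run_hours(DENSE_BASELINE):.1f}h per {TOTAL_STEPS} steps")
```

The 8.3h estimate corresponds to the upper end of the projected steady-state range (30s per step), not to the s10 value.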
**Throughput economics analysis (the principled finding):**
- Dense MoE: total compute = E × N tokens via fused `bmm` over (E, N, D): a single, highly optimized kernel.
- Sparse dispatch at capacity factor C: total compute = C × N tokens via gather/scatter + grouped GEMM.
- At C=8 with E=15: 8N total dispatched (vs 15N dense), so fewer tokens, but:
- Gather/scatter ops have O(N) overhead INDEPENDENT OF C (the routing always happens).
- Grouped GEMM has worse memory access patterns than fused dense bmm.
- Net: 8N × per-token grouped-GEMM cost + gather/scatter overhead > 15N × per-token fused-bmm cost.
- **For sparse dispatch to win on throughput**: requires C ≤ 1 AND/OR Triton-fused dispatch kernel. Eager-mode sparse dispatch at any reasonable C is throughput-neutral or negative versus fused dense bmm.

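The inequality in the bullets above can be written as a small cost model. This is a sketch under stated assumptions, not a measurement: `t_fused`, `t_grouped`, and `t_dispatch` are hypothetical per-token cost units, and the values used in the example are placeholders chosen only to exercise the functions.

```python
def dense_cost(E, N, t_fused):
    """Dense MoE: every expert processes every token in one fused bmm."""
    return E * N * t_fused


def sparse_cost(C, N, t_grouped, t_dispatch):
    """Sparse dispatch: C*N expert-token pairs plus O(N) gather/scatter."""
    return C * N * t_grouped + N * t_dispatch


def break_even_capacity(E, t_fused, t_grouped, t_dispatch):
    """Capacity factor at which sparse cost equals dense cost (N cancels)."""
    return (E * t_fused - t_dispatch) / t_grouped


if __name__ == "__main__":
    # Placeholder costs (NOT measured): grouped GEMM is 1.5x the fused
    # per-token cost, and gather/scatter costs 12 units per token.
    E, N = 15, 1024
    t_fused, t_grouped, t_dispatch = 1.0, 1.5, 12.0

    print("dense      :", dense_cost(E, N, t_fused))
    print("sparse C=8 :", sparse_cost(8, N, t_grouped, t_dispatch))
    print("sparse C=1 :", sparse_cost(1, N, t_grouped, t_dispatch))
    print("break-even C:", break_even_capacity(E, t_fused, t_grouped, t_dispatch))
```

The useful property of the model is that the dispatch term does not shrink with C, so lowering the capacity factor alone cannot recover the gap once that term is comparable to the dense cost. Plugging in profiled per-token costs would turn this into an actual estimate.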
**Implication for queue:**
- iter 117b-3 itself is not promotable as-is (eager-mode sparse-dispatch-on-MLP doesn't deliver throughput at any C).
- The sparse-MoE-dispatch axis only delivers throughput if **fused via Triton** (= H88 iter 118 territory).
- iter 117b-2-fix (Triton entmax with E=30→32 padding) is now the prerequisite for any sparsity throughput win, so its priority is promoted.

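The E=30→32 padding mentioned for iter 117b-2-fix amounts to rounding the expert count up to the next power of two. A minimal sketch of that step follows; the function names are hypothetical, and filling the padded slots with negative infinity (so that padded experts receive zero routing weight) is an assumption, not something the log specifies.

```python
import math


def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be a positive integer)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


def pad_router_logits(logits, pad_value=-math.inf):
    """Right-pad one row of router logits to a power-of-two length.

    pad_value is an assumed fill; a real kernel may need a large finite
    negative number instead to avoid NaNs.
    """
    padded_len = next_power_of_two(len(logits))
    return list(logits) + [pad_value] * (padded_len - len(logits))


if __name__ == "__main__":
    row = [0.0] * 30                      # E=30 experts, as in iter 117b-2
    padded = pad_router_logits(row)
    print(len(row), "->", len(padded))
```

After routing, the padded columns would be sliced off again so downstream code still sees E=30 experts.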
**Distinct from iter 117b-2 NOT-VIABLE**: that was a kernel input-shape constraint (fixable). This is an algorithmic/architectural finding (eager sparse dispatch is structurally throughput-negative).

**Status:** NOT-PROMOTED ✗ at C=8 (would-be verification too expensive). Capacity sweep down to C=2/C=1 abandoned because the eager-mode overhead floor dominates regardless of C. Sparse dispatch path PARKED until iter 117b-2-fix (Triton entmax) lands and a fused-dispatch kernel can be designed alongside.

**Implication for Tier 1**: iter 117b-3b (sparse-Q attention) is skipped; it is the same code path class and will hit the same eager-overhead ceiling. Proceeding to iter 117b-2-fix (Triton kernel padding) as the throughput-delivering iter.

### Iter 117b-2 NOT VIABLE (2026-05-02): Triton entmax kernel rejects E=30 (non-power-of-2)

**Outcome:** NOT VIABLE ✗ (kernel design constraint, not a stack-compatibility failure).
