Summary
This issue started by asking whether MoE capacity padding costs enough throughput to justify measuring it and possibly lowering it in the good-enough 10T recipe. PR #4052 added capacity-overflow metrics. Follow-up analysis found that the default capacity factor of 1.25 costs roughly 11% throughput at 1e21 scale, while lower capacity factors only slightly worsen loss. A later 1e20 EP=4 comparison showed cap=1.0 was 8.3% faster with only a +0.001 BPB hit. Current conclusion: move the default capacity factor to 1.0, because the speedup appears to outweigh the quality loss on the tested runs.
Helpful links
Description
TL;DR: Measure capacity overflow on the current good-enough 10T candidate so routing overflow is quantified instead of guessed.
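For readers unfamiliar with the quantities involved, here is a minimal sketch of how overflow and padding could be computed for a single routed batch. It is not the metric code from PR #4052. The function names are hypothetical, and it assumes the common convention that per-expert capacity is `ceil(capacity_factor * tokens / experts)` with top-1 routing.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, capacity_factor):
    # Assumed convention: every expert gets the same number of slots.
    return math.ceil(capacity_factor * num_tokens / num_experts)


def overflow_fraction(assignments, num_experts, capacity_factor):
    # Share of tokens routed to an expert that is already full (dropped).
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    counts = Counter(assignments)
    dropped = sum(max(0, counts[e] - capacity) for e in range(num_experts))
    return dropped / len(assignments)


def padding_fraction(assignments, num_experts, capacity_factor):
    # Share of allocated expert slots left empty (compute spent on padding).
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    counts = Counter(assignments)
    used = sum(min(counts[e], capacity) for e in range(num_experts))
    return 1 - used / (capacity * num_experts)


# Toy batch: 8 tokens, 4 experts, deliberately skewed toward expert 0.
toy_assignments = [0, 0, 0, 0, 0, 1, 1, 2]
for cf in (1.0, 1.25):
    print(cf,
          overflow_fraction(toy_assignments, 4, cf),
          padding_fraction(toy_assignments, 4, cf))
```

On this toy batch, raising the capacity factor from 1.0 to 1.25 lowers the overflow fraction but raises the padding fraction, which is the trade-off this issue sets out to measure on real runs.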
Hypothesis or Goal
We want to know whether overflow is materially affecting quality, efficiency, or both on the path we are currently considering.
Links
Results