Summary
This issue asked whether the current MoE capacity factor was causing avoidable slowdown on the good-enough 10T path. The answer was yes: follow-up results in #4016 showed that padding expert buffers at the default capacity factor of 1.25 cost about 8% throughput at 1e20 scale and about 11% at 1e21, while lower settings had only a negligible effect on loss in the tested regimes. The practical outcome was to move the default capacity factor to 1.0, with 1.1 noted as a more conservative option if future higher-EP or smaller-batch runs show overflow risk.
Helpful links
Description
TL;DR: Sweep capacity factor on the current good-enough 10T candidate and verify that it is not a hidden source of problems.
Hypothesis or Goal
We want to know whether the chosen capacity factor is already safe or whether it is masking avoidable overflow or throughput loss.
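As a rough illustration of the trade-off, the sketch below assumes the common convention that each expert gets a fixed buffer of `ceil(capacity_factor * tokens * top_k / num_experts)` slots; tokens routed beyond that buffer overflow, and unused slots are padding that costs throughput. The function names, the `top_k` parameter, and the toy routing are hypothetical and are not taken from this codebase or from #4016.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots reserved per expert under the assumed capacity convention."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def overflow_and_padding(assignments, num_experts, capacity):
    """Count overflowed tokens and padded (unused) slots.

    `assignments` is a flat list of expert indices, one per routed token copy.
    """
    counts = Counter(assignments)
    dropped = sum(max(0, counts.get(e, 0) - capacity) for e in range(num_experts))
    padded = sum(max(0, capacity - counts.get(e, 0)) for e in range(num_experts))
    return dropped, padded


if __name__ == "__main__":
    # Hypothetical shapes, chosen only to show how the buffer size scales.
    for cf in (1.0, 1.1, 1.25):
        print(cf, expert_capacity(num_tokens=1024, num_experts=8, top_k=2, capacity_factor=cf))

    # Toy routing: expert 0 receives 3 tokens, expert 1 receives 1, capacity is 2.
    print(overflow_and_padding([0, 0, 0, 1], num_experts=2, capacity=2))
```

Under this convention, a capacity factor of 1.0 reserves exactly the balanced-routing share per expert, so any imbalance shows up as overflow, while 1.25 reserves 25% extra slots that are mostly padding when routing is well balanced.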
Links
Results
Capacity-factor sweep results were reported in #4016 (not in this issue) by @ClassicLarry.