[moe] Good 10T: measure capacity overflow #4016

@dlwh

Description

Summary

This issue began by asking whether MoE capacity padding costs enough throughput to justify measuring it, and possibly lowering it, in the good-enough 10T recipe. PR #4052 added capacity-overflow metrics. Follow-up analysis found that the default capacity factor of 1.25 costs roughly 11% throughput at 1e21 scale, while lower caps only slightly worsen loss. A later 1e20 EP=4 comparison showed cap=1.0 was 8.3% faster with only a +0.001 BPB hit. Current conclusion: move the default capacity factor to 1.0, because the speedup appears to outweigh the quality loss on the tested runs.

Helpful links

Description

TL;DR: Measure capacity overflow on the current good-enough 10T candidate so routing overflow is quantified instead of guessed.

Hypothesis or Goal

We want to know whether overflow is materially affecting quality, efficiency, or both on the good-enough 10T candidate we are currently considering.
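To make "capacity overflow" concrete, here is a minimal sketch of one way an overflow fraction could be computed. It is not the metric added in PR #4052, whose exact definition is not given in this issue. It assumes the common convention that each expert gets `ceil(capacity_factor * num_tokens * top_k / num_experts)` slots and that assignments beyond that limit are dropped. The names `expert_capacity` and `overflow_fraction` are hypothetical.

```python
import math
from collections import Counter


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Slots per expert, assuming capacity = ceil(cf * tokens * top_k / experts)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def overflow_fraction(assignments, num_experts, top_k, capacity_factor):
    """Fraction of (token, expert) assignments that exceed expert capacity.

    assignments: one list of expert ids per token, each of length top_k.
    This is an illustrative definition, not the one used in the repository.
    """
    num_tokens = len(assignments)
    total = num_tokens * top_k
    if total == 0:
        return 0.0
    cap = expert_capacity(num_tokens, num_experts, top_k, capacity_factor)
    # Count how many assignments each expert receives.
    load = Counter(e for token in assignments for e in token)
    # Anything above the per-expert cap counts as overflow.
    dropped = sum(max(0, load[e] - cap) for e in range(num_experts))
    return dropped / total


# Toy routing: 8 tokens, 4 experts, top-1, with expert 0 overloaded.
toy = [[0], [0], [0], [0], [1], [1], [2], [3]]
for cf in (1.0, 1.25, 2.0):
    print(cf, overflow_fraction(toy, num_experts=4, top_k=1, capacity_factor=cf))
```

In this toy case the overflow fraction falls from 0.25 at a capacity factor of 1.0 to 0.125 at 1.25 and 0.0 at 2.0. A higher factor drops fewer assignments but pads every expert with more slots, which is the quality-versus-throughput trade-off this issue sets out to measure.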

Links

Results

Metadata

Labels

agent-generated (Created by automation/agent), experiment, moe, p1 (Do right now), tldr (Issue has a community-friendly TL;DR summary)
