[moe] Good 10T: measure capacity overflow

## Summary

This issue started by asking whether MoE capacity padding costs enough throughput to justify measuring it and possibly lowering it in the good-enough 10T recipe. PR #4052 added capacity-overflow metrics. Follow-up analysis found that the default capacity factor of 1.25 costs roughly 11% throughput at 1e21 scale, while lower capacity factors only slightly worsen loss. A later 1e20 EP=4 comparison showed that cap=1.0 was 8.3% faster with only a +0.001 BPB penalty. Current conclusion: move the default capacity factor to 1.0, because the speedup appears to outweigh the quality loss on the tested runs.

### Helpful links
- [Instrumentation PR #4052](https://github.com/marin-community/marin/pull/4052)
- [Overflow analysis and recommendation](https://github.com/marin-community/marin/issues/4016#issuecomment-4132027737)
- [1e20 cap=1.0 vs 1.25 comparison](https://github.com/marin-community/marin/issues/4016#issuecomment-4144517642)
- [Decision to update the default](https://github.com/marin-community/marin/issues/4016#issuecomment-4144525023)


## Description
TL;DR: Measure capacity overflow on the current good-enough 10T candidate so routing overflow is quantified instead of guessed.

## Hypothesis or Goal
We want to know whether overflow is materially affecting quality, efficiency, or both on the path we are currently considering.
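
To make the two sides of that question concrete, here is a minimal sketch of how overflow (assignments dropped because an expert's buffer is full, the quality side) and padding (buffer slots left empty, the efficiency side) could be computed from router assignments. It is not the metric definition from PR #4052. The function name `capacity_stats`, the returned keys, and the choice of `ceil` for the per-expert capacity are assumptions made for illustration, as is the assumption that overflowing assignments are dropped.

```python
import math
from collections import Counter


def capacity_stats(expert_ids, num_experts, capacity_factor):
    """Hypothetical helper: overflow and padding for one routed batch.

    expert_ids: flat list with one expert index per (token, top-k slot)
    assignment produced by the router.
    """
    num_assignments = len(expert_ids)
    # Assumed convention: each expert gets a fixed buffer sized to the
    # balanced load times the capacity factor, rounded up.
    capacity = math.ceil(capacity_factor * num_assignments / num_experts)

    counts = Counter(expert_ids)
    # Assignments beyond an expert's buffer are counted as overflow.
    dropped = sum(max(0, counts.get(e, 0) - capacity) for e in range(num_experts))

    total_slots = capacity * num_experts
    used_slots = num_assignments - dropped
    return {
        "capacity": capacity,
        "overflow_fraction": dropped / num_assignments,
        "padding_fraction": (total_slots - used_slots) / total_slots,
    }


# Toy batch: 8 assignments over 4 experts, with expert 0 oversubscribed.
toy = [0, 0, 0, 0, 1, 1, 2, 3]
for cf in (1.0, 1.25):
    print(cf, capacity_stats(toy, num_experts=4, capacity_factor=cf))
```

On this toy batch, raising the capacity factor from 1.0 to 1.25 halves the overflow fraction (0.25 to 0.125) but grows the padded share of the buffers from 0.25 to 5/12, which is the trade-off this issue sets out to measure on real runs.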

### Links
* Parent sweep: #3469
* Gate: #4013

## Results
