Description
TL;DR: Sweep the expert count E over {128, 256, 512} under the great 10T gate, so that the choice of E is tested against a stricter standard than the good-enough pass.
Hypothesis or Goal
We want to know whether the expert count preferred in earlier sweeps still comes out ahead when the comparison is backed by a deeper experimental record.
Links
Results
Summary
This issue is the great-10T follow-up to earlier expert-count sweeps: it asks whether the MoE recipe should still prefer a particular expert count once the comparison is rerun at the stricter gate. PR #4075 now adds the full E={128,256,512} sweep across the great-gate isoflop matrix plus a config-generation test, and its CI checks are green. The implementation work is finished, but the actual training results and any recommendation about which E to keep have not been reported yet.
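As a rough illustration of what a sweep plus a config-generation test could look like, here is a minimal sketch. It is not the code from PR #4075: `SweepConfig`, `generate_sweep`, `check_sweep`, and the FLOP budgets are hypothetical placeholders, since the issue does not list the actual great-gate isoflop matrix. Only the expert counts {128, 256, 512} come from the issue.

```python
from dataclasses import dataclass
from itertools import product

# Expert counts named in the issue.
EXPERT_COUNTS = (128, 256, 512)

# Hypothetical placeholder budgets; the real great-gate isoflop matrix
# is not specified in the issue.
FLOP_BUDGETS = (1e18, 1e19)


@dataclass(frozen=True)
class SweepConfig:
    """One cell of the sweep: an expert count at a fixed FLOP budget."""
    num_experts: int
    flop_budget: float

    @property
    def name(self) -> str:
        # Run name derived from both axes so every cell is distinguishable.
        return f"moe-E{self.num_experts}-flops{self.flop_budget:.0e}"


def generate_sweep(expert_counts=EXPERT_COUNTS, flop_budgets=FLOP_BUDGETS):
    """Return one config per (FLOP budget, expert count) pair."""
    return [SweepConfig(e, b) for b, e in product(flop_budgets, expert_counts)]


def check_sweep(configs, expert_counts=EXPERT_COUNTS, flop_budgets=FLOP_BUDGETS):
    """Config-generation check: full coverage of the matrix, no duplicate names."""
    expected = {(e, b) for e in expert_counts for b in flop_budgets}
    actual = {(c.num_experts, c.flop_budget) for c in configs}
    names = [c.name for c in configs]
    return actual == expected and len(names) == len(set(names))


if __name__ == "__main__":
    for cfg in generate_sweep():
        print(cfg.name)
```

A check of this kind only confirms that every expert count appears at every budget; it says nothing about training outcomes, which is consistent with the status above where CI is green but results are still pending.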
Helpful links