Summary
This issue tracks scaling Marin's current MoE recipe from the smaller-scale work in #2167 up to roughly 1e21 and 1e22 non-embedding FLOPs; the immediate question is whether routing and loss stay stable at those scales. As of March 29, 2026, the 1e22 run is live on a v4-512, using quantile balancing (QB) in place of the older auxiliary-loss-based load balancing. The expectation is not that the recipe is fully tuned, but that this run will show whether it is stable at the larger scale and how it compares against prior dense Delphi and MoE baselines. The thread also notes a provisional predicted paloma/macro_loss of 2.3887 for the 1e22 run; batch-size scheduling remains static for now and is called out as future tuning work rather than part of this experiment's core claim.
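Since QB is the main recipe change called out above, here is a minimal, hypothetical sketch of quantile-style load balancing for a top-1 router. The function name, the top-1 setup, and the exact thresholding rule are assumptions for illustration, not Marin's implementation; the idea is that each expert's admission threshold is a quantile of its own router scores, so balance is enforced structurally rather than through an auxiliary loss term.

```python
# Hypothetical sketch of quantile-based balancing for a top-1 MoE router
# (illustration only, not Marin's actual code). Each expert keeps only the
# tokens whose score for it exceeds that expert's (1 - capacity_frac)
# quantile, so every expert admits roughly the same fraction of tokens.
import jax
import jax.numpy as jnp

def quantile_balanced_top1(router_logits: jnp.ndarray, capacity_frac: float) -> jnp.ndarray:
    """router_logits: [num_tokens, num_experts] -> 0/1 dispatch mask."""
    scores = jax.nn.softmax(router_logits, axis=-1)                 # [tokens, experts]
    thresholds = jnp.quantile(scores, 1.0 - capacity_frac, axis=0)  # [experts]
    top1 = jax.nn.one_hot(jnp.argmax(scores, axis=-1), scores.shape[-1])
    admitted = (scores >= thresholds[None, :]).astype(scores.dtype)
    return top1 * admitted  # tokens below their expert's threshold are dropped
```

If this reading is right, the practical upside is that there is no balance-loss coefficient to retune at the new scales, which fits the framing that these runs test stability rather than tuning.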
Description
Test whether the current scaling recipe from #2167 holds at the 1e21 and 1e22 FLOP scales without router or other instabilities.
Hypothesis or Goal
Will the loss curve or routing destabilize? Hypothesis: as in the 2048-width runs at smaller scales, routing on layer zero will look choppy during LR warmup, then stabilize.
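One hypothetical way to quantify "choppy" routing (not necessarily the metric these runs log) is the coefficient of variation of per-expert token counts under top-1 routing, tracked per layer: it should spike on layer zero during LR warmup and flatten afterward if the hypothesis holds.

```python
# Hypothetical diagnostic for routing choppiness (illustration only): the
# coefficient of variation of per-expert token counts. 0 means a perfectly
# even split across experts; larger values mean more imbalanced routing.
import jax.numpy as jnp

def expert_load_imbalance(router_logits: jnp.ndarray) -> jnp.ndarray:
    """router_logits: [num_tokens, num_experts] for one layer."""
    top1 = jnp.argmax(router_logits, axis=-1)  # [tokens]
    loads = jnp.bincount(top1, length=router_logits.shape[-1]).astype(jnp.float32)
    return jnp.std(loads) / (jnp.mean(loads) + 1e-6)
```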
Links
1e21 Run: 14B total params, 2B active, 75B tokens.
https://wandb.ai/marin-community/dial_moe/runs/moe-d2304-1e21?nw=nwuserlarrydial
1e22 Run: 35B total params, 5B active, 326B tokens.
https://wandb.ai/marin-community/dial_moe/runs/moe-v7-1e22-d3200?nw=nwuserlarrydial
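As a sanity check on the two budgets above, the standard 6ND approximation (non-embedding FLOPs ≈ 6 × active params × tokens) roughly recovers both targets. The approximation is ours, not a statement of how the budgets were actually computed:

```python
# 6ND sanity check: non-embedding FLOPs ≈ 6 * active_params * tokens.
for name, active_params, tokens in [("1e21 run", 2e9, 75e9),
                                    ("1e22 run", 5e9, 326e9)]:
    print(f"{name}: ~{6 * active_params * tokens:.2e} FLOPs")
# 1e21 run: ~9.00e+20 FLOPs
# 1e22 run: ~9.78e+21 FLOPs
```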
Results
Pending