
Test MoE Arch at 1e21 and 1e22 Flop Scales #3800

@ClassicLarry


Summary

This issue tracks pushing Marin's current MoE recipe from the earlier smaller-scale work in #2167 up to roughly 1e21 and 1e22 non-embedding FLOPs; the immediate question is whether routing and loss stay stable at those scales. As of March 29, 2026, the 1e22 run is live on v4-512, and the team is using quantile balancing (QB) in place of the older auxiliary-loss-based load balancing. The expectation is not that the recipe is fully tuned, but that this run will show whether the larger recipe is stable and how it compares against prior dense Delphi and MoE baselines. The thread also notes a provisional predicted paloma/macro_loss of 2.3887 for the 1e22 run; batch-size scheduling remains static for now and is called out as future tuning work rather than part of this experiment's core claim.
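For readers unfamiliar with the QB idea mentioned above, here is a minimal sketch of quantile-based load balancing: instead of adding an auxiliary balancing loss, each expert's acceptance threshold is set at a per-expert quantile of its router scores, so every expert receives the same fraction of tokens by construction. This is an illustrative sketch only, not the Marin implementation; the function name, tensor shapes, and `capacity_frac` parameter are assumptions.

```python
import numpy as np

def quantile_balance_route(router_logits: np.ndarray, capacity_frac: float) -> np.ndarray:
    """Sketch of quantile-based routing (not Marin's actual code).

    router_logits: (n_tokens, n_experts) raw router scores.
    capacity_frac: fraction of tokens each expert should accept.

    Each expert accepts exactly the tokens whose score for that expert
    falls in its top `capacity_frac` quantile, so per-expert load is
    balanced regardless of the raw score distribution.
    """
    # Per-expert threshold at the (1 - capacity_frac) quantile of its scores.
    thresholds = np.quantile(router_logits, 1.0 - capacity_frac, axis=0)
    # Token t is routed to expert e iff its score clears e's threshold.
    return router_logits >= thresholds  # (n_tokens, n_experts) boolean mask

# Example: 1000 tokens, 8 experts, each expert takes ~25% of tokens.
logits = np.random.default_rng(0).normal(size=(1000, 8))
mask = quantile_balance_route(logits, capacity_frac=0.25)
loads = mask.mean(axis=0)  # per-expert load, each close to 0.25
```

The point of the sketch is the construction: balancing is enforced by the thresholding itself, so no auxiliary loss term has to fight the language-modeling objective.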


Description

Test whether the current scaling recipe in #2167 scales to the 1e21 and 1e22 FLOP scales without router or other instabilities.

Hypothesis or Goal

Will the loss curve or routing destabilize? Hypothesis: as with the 2048-width runs at smaller scales, the initial routing on layer zero will look choppy during LR warmup, then stabilize.

Links

1e21 Run: 14B total params, 2B active, 75B tokens.
https://wandb.ai/marin-community/dial_moe/runs/moe-d2304-1e21?nw=nwuserlarrydial

1e22 Run: 35B total params, 5B active, 326B tokens.
https://wandb.ai/marin-community/dial_moe/runs/moe-v7-1e22-d3200?nw=nwuserlarrydial

Results

Pending

Metadata



    Labels

experiment, moe, tldr (Issue has a community-friendly TL;DR summary)
