[moe] Add Muon throughput experiment for grug MoE#4053
claude[bot] wants to merge 1 commit into main
Add experiment script that runs the same grug MoE architecture as the Adam baseline but with GrugMuonConfig, enabling direct wall-clock step time and MFU comparison between Adam and Muon on MoE.

Part of #4033

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ef3a7aac1
```python
    ),
    optimizer=versioned(
        GrugMuonConfig(
            lr=0.02,
```
Use learning_rate instead of lr in Muon config
`GrugMuonConfig` is instantiated with `lr=0.02`, but the optimizer schedule in `lib/levanter/src/levanter/optim/muon.py` is built from `self.lr_scheduler(...)`, which reads `OptimizerConfig.learning_rate` rather than `lr`. As a result, this experiment will silently run at the default `learning_rate` (6e-4) instead of the intended Muon LR, which can invalidate the Adam-vs-Muon comparison this script is meant to produce. Set `learning_rate=0.02` (or otherwise populate `learning_rate`) so the run uses the intended hyperparameters.
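The failure mode described here can be reproduced in isolation. The sketch below is not Levanter code: `BaseOptimizerConfig`, `MuonLikeConfig`, and `peak_lr` are hypothetical stand-ins, and the constant schedule is an assumption made only to keep the example short. It shows how a config that accepts both `lr` and `learning_rate` can silently ignore the former when the scheduler reads only the latter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseOptimizerConfig:
    """Hypothetical stand-in for a base optimizer config (not Levanter's class)."""

    learning_rate: float = 6e-4  # default value quoted in the review comment

    def lr_scheduler(self, num_steps: int):
        # Constant schedule for illustration; note it reads learning_rate only.
        return lambda step: self.learning_rate


@dataclass(frozen=True)
class MuonLikeConfig(BaseOptimizerConfig):
    """Hypothetical subclass that also exposes a separate `lr` field."""

    lr: float = 0.02


def peak_lr(config: BaseOptimizerConfig, num_steps: int = 100) -> float:
    """Return the learning rate the schedule produces at step 0."""
    return config.lr_scheduler(num_steps)(0)


mistaken = MuonLikeConfig(lr=0.02)             # accepted, but unused by the schedule
intended = MuonLikeConfig(learning_rate=0.02)  # the field the schedule actually reads

print(peak_lr(mistaken), peak_lr(intended))
```

Under these assumptions, the `mistaken` config constructs without error yet trains at the base default, which is why the issue would not surface until the results are compared.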
This pull request has been inactive for 23 days and is marked as stale.
Add experiment script that runs the grug MoE trial model with GrugMuonConfig instead of AdamConfig, keeping all other settings identical (model arch, data, batch size, steps). This enables direct comparison of wall-clock step time and MFU between Adam and Muon on MoE, which is the deliverable for the perf half of the Muon-on-MoE evaluation. Note that PR #3902 (Muon MoE orthogonalization improvements) should land first for best results.
Part of #4033
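For reference, the two metrics being compared can be computed from logged values roughly as follows. This is a minimal sketch using the standard definition of MFU (achieved FLOP/s over hardware peak FLOP/s); the helper names `mfu` and `relative_step_time` and all of their inputs are hypothetical and do not come from the experiment script.

```python
def mfu(flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    if step_time_s <= 0 or peak_flops_per_s <= 0:
        raise ValueError("step_time_s and peak_flops_per_s must be positive")
    return flops_per_step / (step_time_s * peak_flops_per_s)


def relative_step_time(baseline_s: float, candidate_s: float) -> float:
    """Ratio of candidate step time to baseline step time (>1 means slower)."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return candidate_s / baseline_s
```

Because the model architecture and batch size are held fixed, `flops_per_step` for the model is the same in both runs, so any MFU difference under this definition comes from step time alone.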