
[moe] Add Muon throughput experiment for grug MoE#4053

Open
claude[bot] wants to merge 1 commit into main from agent/20260323-fix-4033

Conversation

Contributor

@claude claude Bot commented Mar 23, 2026

Add an experiment script that runs the grug MoE trial model with GrugMuonConfig instead of AdamConfig, keeping all other settings identical (model architecture, data, batch size, steps). This enables a direct comparison of wall-clock step time and MFU between Adam and Muon on MoE, which is the deliverable for the performance half of the Muon-on-MoE evaluation. Note that PR #3902 (Muon MoE orthogonalization improvements) should land first for best results.

Part of #4033

Add experiment script that runs the same grug MoE architecture as the
Adam baseline but with GrugMuonConfig, enabling direct wall-clock step
time and MFU comparison between Adam and Muon on MoE.

Part of #4033

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
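Since the experiment holds the model, data, and batch size fixed, the optimizer comparison reduces to step time: FLOPs per step are identical under both optimizers, so MFU differs only through wall-clock time. A minimal sketch of that calculation, with illustrative numbers (the FLOP count, step times, and peak throughput below are assumptions, not measurements from this PR):

```python
# MFU = (model FLOPs per step / step time) / peak hardware FLOPs per second.
def mfu(flops_per_step: float, step_time_s: float, peak_flops: float) -> float:
    """Model FLOPs utilization as a fraction of hardware peak."""
    return (flops_per_step / step_time_s) / peak_flops

# Same model/batch under both optimizers, so FLOPs per step are identical;
# only wall-clock step time differs. All numbers below are illustrative.
flops_per_step = 2.8e15   # assumed: derived from model arch + batch size
peak = 989e12             # e.g. H100 BF16 dense peak, illustrative

adam_mfu = mfu(flops_per_step, step_time_s=6.1, peak_flops=peak)
muon_mfu = mfu(flops_per_step, step_time_s=6.5, peak_flops=peak)
# The optimizer with the shorter step time wins on MFU by construction here.
assert adam_mfu > muon_mfu
```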
@claude claude Bot added the agent-generated Created by automation/agent label Mar 23, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ef3a7aac1


    ),
    optimizer=versioned(
        GrugMuonConfig(
            lr=0.02,


P1: Use learning_rate instead of lr in Muon config

GrugMuonConfig is instantiated with lr=0.02, but the optimizer schedule in lib/levanter/src/levanter/optim/muon.py is built from self.lr_scheduler(...), which reads OptimizerConfig.learning_rate rather than lr. As a result, this experiment will silently run at the default learning_rate (6e-4) instead of the intended Muon LR, which can invalidate the Adam-vs-Muon comparison this script is meant to produce. Set learning_rate=0.02 (or otherwise populate learning_rate) so the run uses the intended hyperparameters.
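The failure mode described above can be reproduced in isolation. The classes below are a hypothetical, stripped-down stand-in for Levanter's actual config types (not the real API), mimicking only the behavior the review describes: the schedule is built from learning_rate, so a stray lr field is silently ignored.

```python
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    # Hypothetical stand-in for the real base config: the schedule is
    # always built from this field.
    learning_rate: float = 6e-4

@dataclass
class GrugMuonConfig(OptimizerConfig):
    # A separately-named field that nothing in the schedule path reads.
    lr: float = 0.02

    def lr_scheduler(self):
        # Mirrors the described muon.py behavior: built from
        # `learning_rate`, not `lr`.
        return lambda step: self.learning_rate

# Setting lr= has no effect on the schedule; the run silently uses 6e-4.
buggy = GrugMuonConfig(lr=0.02)
assert buggy.lr_scheduler()(0) == 6e-4

# Populating learning_rate= is what actually changes the schedule.
fixed = GrugMuonConfig(learning_rate=0.02)
assert fixed.lr_scheduler()(0) == 0.02
```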


This was referenced Mar 30, 2026
@github-actions
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions Bot added the stale label Apr 16, 2026

Labels

agent-generated (Created by automation/agent), stale
