[moe] Add Muon throughput experiment for grug MoE by claude[bot] · Pull Request #4053 · marin-community/marin

claude · 2026-03-23T21:51:30Z

Add experiment script that runs the grug MoE trial model with GrugMuonConfig instead of AdamConfig, keeping all other settings identical (model arch, data, batch size, steps). This enables direct comparison of wall-clock step time and MFU between Adam and Muon on MoE, which is the deliverable for the perf half of the Muon-on-MoE evaluation. Note that PR #3902 (Muon MoE orthogonalization improvements) should land first for best results.

Part of #4033
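
A minimal sketch of how the Adam and Muon runs could be compared once step times are measured. The helper names, the FLOPs-per-step figure, and the peak-throughput figure are assumptions for illustration only; none of them come from this PR, and the numbers are placeholders rather than measured results.

```python
def mfu(model_flops_per_step: float, step_time_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over aggregate hardware peak FLOP/s."""
    return model_flops_per_step / (step_time_s * peak_flops_per_s)


def relative_slowdown(baseline_step_s: float, candidate_step_s: float) -> float:
    """Fractional increase in wall-clock step time of the candidate over the baseline."""
    return candidate_step_s / baseline_step_s - 1.0


# Placeholder inputs (hypothetical, NOT results from this experiment).
MODEL_FLOPS_PER_STEP = 1.0e15  # assumed identical for both runs (same arch, batch size)
PEAK_FLOPS_PER_S = 1.0e15      # assumed aggregate peak across all devices
adam_step_s = 2.0              # placeholder mean step time for the Adam baseline
muon_step_s = 2.5              # placeholder mean step time for the Muon run

for name, step_s in [("adam", adam_step_s), ("muon", muon_step_s)]:
    print(f"{name}: step={step_s:.3f}s mfu={mfu(MODEL_FLOPS_PER_STEP, step_s, PEAK_FLOPS_PER_S):.3f}")
print(f"muon slowdown vs adam: {relative_slowdown(adam_step_s, muon_step_s):.1%}")
```

Because the model, data, and batch size are held fixed, the model FLOPs per step are the same in both runs, so any MFU difference comes entirely from the step-time difference.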

Add experiment script that runs the same grug MoE architecture as the Adam baseline but with GrugMuonConfig, enabling direct wall-clock step time and MFU comparison between Adam and Muon on MoE.

Part of #4033

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ef3a7aac1

chatgpt-codex-connector · 2026-03-23T21:54:10Z

+        ),
+        optimizer=versioned(
+            GrugMuonConfig(
+                lr=0.02,


Use learning_rate instead of lr in Muon config

GrugMuonConfig is instantiated with lr=0.02, but the optimizer schedule in lib/levanter/src/levanter/optim/muon.py is built from self.lr_scheduler(...), which reads OptimizerConfig.learning_rate rather than lr. As a result, this experiment will silently run at the default learning_rate (6e-4) instead of the intended Muon LR, which can invalidate the Adam-vs-Muon comparison this script is meant to produce. Set learning_rate=0.02 (or otherwise populate learning_rate) so the run uses the intended hyperparameters.
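
The failure mode this comment describes can be reproduced with a toy example. The classes below are hypothetical stand-ins, not Levanter's actual OptimizerConfig or GrugMuonConfig, and the constant schedule is a simplification; the sketch only assumes, as the review states, that the scheduler reads learning_rate and never looks at lr.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyOptimizerConfig:
    """Hypothetical base config: the scheduler reads only `learning_rate`."""
    learning_rate: float = 6e-4

    def lr_scheduler(self, num_train_steps: int) -> Callable[[int], float]:
        # Constant schedule for simplicity; real schedules add warmup and decay.
        return lambda step: self.learning_rate


@dataclass(frozen=True)
class ToyMuonConfig(ToyOptimizerConfig):
    """Hypothetical subclass with a separate `lr` field the scheduler never reads."""
    lr: float = 0.02


# Setting `lr` is accepted without error, but the schedule still uses the default.
as_written = ToyMuonConfig(lr=0.02)
# Setting `learning_rate` is what actually changes the schedule.
as_suggested = ToyMuonConfig(learning_rate=0.02)

print("lr=0.02            ->", as_written.lr_scheduler(100)(0))
print("learning_rate=0.02 ->", as_suggested.lr_scheduler(100)(0))
```

Under these assumptions no exception is raised in either case, which is why a mismatch of this kind would go unnoticed unless the effective learning rate is logged or asserted at startup.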

github-actions · 2026-04-16T01:53:23Z

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

claude Bot added the agent-generated (Created by automation/agent) label Mar 23, 2026

claude Bot mentioned this pull request Mar 23, 2026

[moe] Great 10T: Muon perf on MoE #4033

Open

chatgpt-codex-connector Bot reviewed Mar 23, 2026

This was referenced Mar 30, 2026

Modeling April 2026 #4266

Closed

MoE Scaling up to April goal #4281

Open

github-actions Bot added the stale label Apr 16, 2026

claude[bot] wants to merge 1 commit into main from agent/20260323-fix-4033
