
[moe] Add inv-sqrt LR schedule experiment for Good 10T gate #4050

Open

claude[bot] wants to merge 1 commit into main from agent/20260323-fix-4028

Conversation


claude (bot) commented Mar 23, 2026

Add experiments/grug/moe/inv_sqrt_lr.py which runs the standard MoE trial with lr_schedule=inv_sqrt instead of cosine. All other settings (model, data, resources, steps) match the baseline in launch.py for a controlled comparison. The inv-sqrt schedule decays continuously from peak LR using 1/sqrt(step/timescale), which may suit long training runs better than cosine.
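The schedule described above can be sketched as follows. This is a minimal illustration only: the linear warmup shape, default values, and exact parameterization are assumptions, not the repo's actual implementation.

```python
import math

def inv_sqrt_lr(step, peak_lr=3e-4, warmup=1000, timescale=10000):
    """Hypothetical inverse-sqrt schedule: linear warmup to peak_lr,
    then continuous 1/sqrt((step - warmup) / timescale) decay with no
    fixed endpoint (unlike cosine, which anneals to a floor at a
    predetermined total step count)."""
    if step <= warmup:
        return peak_lr * step / max(warmup, 1)  # linear warmup
    # The decay multiplier is capped at 1 so LR never exceeds peak_lr.
    return peak_lr * min(1.0, 1.0 / math.sqrt((step - warmup) / timescale))
```

Because the decay has no endpoint, the same schedule extends to arbitrarily long runs without re-tuning a total step count, which is the usual motivation for preferring it on long training runs.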

Fixes #4028

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
claude (bot) added the agent-generated (Created by automation/agent) label Mar 23, 2026

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30b49fb264


grug_moe_inv_sqrt_lr = ExecutorStep(
    name="grug/moe-inv-sqrt-lr",
    fn=run_grug_moe,
    config=GrugMoeLaunchConfig(

P1: Restore baseline trainer/eval settings

GrugMoeLaunchConfig is constructed here without grug_trainer and eval, so this run falls back to defaults from experiments/grug/moe/train.py (z_loss_weight=0.0, max_eval_batches=None, eval_ema=True) instead of the baseline settings in experiments/grug/moe/launch.py (z_loss_weight=1e-4, max_eval_batches=8, eval_ema=False). That changes optimization and evaluation behavior beyond LR schedule, so the experiment is not apples-to-apples and cannot isolate the inv-sqrt effect.
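One way to make the comparison controlled, sketched here under the assumption that GrugMoeLaunchConfig accepts the same keyword arguments as the baseline in launch.py (the field names below follow the comment above and should be verified against the repo), is to pass the baseline trainer/eval settings explicitly instead of relying on defaults:

```python
# Hypothetical sketch: mirror the baseline launch.py settings so that
# only lr_schedule differs between the two runs. The names
# baseline_trainer_config and baseline_eval_config are placeholders.
config = GrugMoeLaunchConfig(
    lr_schedule="inv_sqrt",
    grug_trainer=baseline_trainer_config,  # baseline: z_loss_weight=1e-4
    eval=baseline_eval_config,  # baseline: max_eval_batches=8, eval_ema=False
    # ... remaining fields copied verbatim from launch.py ...
)
```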


Comment on lines +35 to +37
    lr_schedule="inv_sqrt",
    min_lr_ratio=0.1,
    warmup=1000,

P1: Set an effective inv-sqrt decay timescale

Using the string lr_schedule="inv_sqrt" here routes through OptimizerConfig.lr_scheduler, which hardcodes the inv-sqrt timescale=10000 (lib/levanter/src/levanter/optim/config.py). With this experiment’s 2,000-step run and 1,000 warmup steps, min(1, 1/sqrt((count+warmup)/timescale)) never drops below 1, so the LR never decays after warmup. As a result, the experiment effectively compares a constant post-warmup LR against cosine, rather than testing inverse-sqrt decay as intended.
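The arithmetic in this comment can be checked directly with a standalone sketch of the quoted multiplier formula (not the levanter code itself):

```python
import math

def lr_multiplier(count, warmup=1000, timescale=10000):
    """The quoted formula min(1, 1/sqrt((count + warmup) / timescale)),
    with the hardcoded timescale=10000 described above."""
    return min(1.0, 1.0 / math.sqrt((count + warmup) / timescale))

# This experiment has only 1,000 post-warmup steps (2,000 total, 1,000
# warmup), so the multiplier stays pinned at 1.0 for the whole run:
# decay cannot begin until count + warmup exceeds the 10,000-step
# timescale, i.e. 9,000 steps after warmup ends.
print(lr_multiplier(1))     # 1.0
print(lr_multiplier(1000))  # 1.0 (last step of this run)
print(lr_multiplier(9001))  # just below 1.0: first step where decay begins
```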


claude (bot) mentioned this pull request Mar 31, 2026
github-actions (bot) commented:

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

github-actions (bot) added the stale label Apr 16, 2026

Labels

agent-generated (Created by automation/agent), stale


Development

Successfully merging this pull request may close these issues.

[moe] Good 10T: inv-sqrt LR schedule
