[moe] Add inv-sqrt LR schedule experiment for Good 10T gate #4050
claude[bot] wants to merge 1 commit into main from
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30b49fb264
grug_moe_inv_sqrt_lr = ExecutorStep(
    name="grug/moe-inv-sqrt-lr",
    fn=run_grug_moe,
    config=GrugMoeLaunchConfig(
Restore baseline trainer/eval settings
GrugMoeLaunchConfig is constructed here without grug_trainer and eval, so this run falls back to defaults from experiments/grug/moe/train.py (z_loss_weight=0.0, max_eval_batches=None, eval_ema=True) instead of the baseline settings in experiments/grug/moe/launch.py (z_loss_weight=1e-4, max_eval_batches=8, eval_ema=False). That changes optimization and evaluation behavior beyond LR schedule, so the experiment is not apples-to-apples and cannot isolate the inv-sqrt effect.
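One way to address this is to pass the baseline trainer and eval settings explicitly when constructing the step. A hedged sketch only (the field names `grug_trainer` and `eval` come from this comment; the exact nested config types in `experiments/grug/moe/launch.py` are assumptions):

```python
# Hypothetical sketch: mirror the baseline settings from
# experiments/grug/moe/launch.py so only the LR schedule differs.
config=GrugMoeLaunchConfig(
    lr_schedule="inv_sqrt",
    min_lr_ratio=0.1,
    warmup=1000,
    grug_trainer=dict(z_loss_weight=1e-4),          # baseline: 1e-4, not the 0.0 default
    eval=dict(max_eval_batches=8, eval_ema=False),  # baseline eval settings
)
```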
        lr_schedule="inv_sqrt",
        min_lr_ratio=0.1,
        warmup=1000,
Set an effective inv-sqrt decay timescale
Using string lr_schedule="inv_sqrt" here routes through OptimizerConfig.lr_scheduler, which hardcodes inv-sqrt timescale=10000 (lib/levanter/src/levanter/optim/config.py). With this experiment’s 2,000-step run and 1,000 warmup steps, min(1, 1/sqrt((count+warmup)/timescale)) never drops below 1, so LR never decays after warmup. This means the run is effectively constant-LR vs cosine rather than testing inverse-sqrt decay as intended.
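This is easy to check numerically. A minimal sketch of the decay factor quoted above (argument names here are illustrative, not Levanter's actual scheduler signature):

```python
import math

def inv_sqrt_factor(count, warmup=1000, timescale=10000):
    """Inverse-sqrt decay factor quoted above:
    min(1, 1/sqrt((count + warmup) / timescale))."""
    return min(1.0, 1.0 / math.sqrt((count + warmup) / timescale))

# With warmup=1000 and the hardcoded timescale=10000, the factor stays
# pinned at 1.0 for every post-warmup step of a 2,000-step run:
for count in (0, 500, 1000):  # post-warmup step counts
    assert inv_sqrt_factor(count) == 1.0

# Decay only begins once count + warmup exceeds the timescale,
# i.e. thousands of steps beyond this experiment's horizon.
```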
This pull request has been inactive for 23 days and is marked as stale.
Add experiments/grug/moe/inv_sqrt_lr.py which runs the standard MoE trial with lr_schedule=inv_sqrt instead of cosine. All other settings (model, data, resources, steps) match the baseline in launch.py for a controlled comparison. The inv-sqrt schedule decays continuously from peak LR using 1/sqrt(step/timescale), which may suit long training runs better than cosine.
Fixes #4028
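For intuition on why inv-sqrt may suit long runs, the two decay shapes can be compared directly. A hedged sketch (the cosine form below is the common min-ratio variant and the inv-sqrt form ignores warmup, so neither is a verbatim copy of the repo's scheduler):

```python
import math

def cosine_factor(step, total_steps, min_lr_ratio=0.1):
    # Cosine decay from 1.0 down to min_lr_ratio at total_steps.
    progress = step / total_steps
    return min_lr_ratio + (1.0 - min_lr_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

def inv_sqrt_factor(step, timescale=10000, min_lr_ratio=0.1):
    # Continuous 1/sqrt(step/timescale) decay, clamped to [min_lr_ratio, 1.0].
    if step <= 0:
        return 1.0
    return max(min_lr_ratio, min(1.0, 1.0 / math.sqrt(step / timescale)))

# Cosine hits its floor exactly at the end of the run, while inv-sqrt
# keeps decaying gradually and reaches the floor only much later:
print(cosine_factor(2000, 2000))  # min_lr_ratio floor at end of run
print(inv_sqrt_factor(40000))     # 1/sqrt(4) = 0.5, still decaying
```

Cosine's shape is pinned to a fixed total step count; inv-sqrt has no such horizon, which is why it is often preferred when the final step count is unknown or very large.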