[moe] Good 10T: inv-sqrt LR schedule #4028

@dlwh

Description

Summary

This issue began as a check on inverse-square-root decay for the good-enough 10T MoE recipe, but the thread’s main takeaway is that inverse-sqrt is no longer the favored schedule. After fitting Defazio-style schedule refinement to the baseline run’s gradient norms, the author found that quadratic decay (1-t)^2 matched the observed profile much better than inverse-sqrt or linear decay, and launched a sweep comparing those alternatives plus minimum-LR and diagnostic variants. PR #4050 added the schedule implementations and experiment wiring, so the current recommendation is to treat inverse-sqrt as one candidate in the sweep while quadratic decay is the leading hypothesis pending run results.
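To make the three decay shapes concrete, here is a minimal sketch of each as a plain function of the training step. It is an illustration only, not the implementation from PR #4050: the function names, the `timescale` parameter, and the way `min_lr` is applied are all assumptions, `t` is taken to be the fraction of training completed, and warmup is omitted.

```python
import math


def _progress(step, total_steps):
    """Fraction of training completed, clamped to [0, 1] (assumed meaning of t)."""
    return min(max(step / total_steps, 0.0), 1.0)


def linear_decay(step, total_steps, peak_lr, min_lr=0.0):
    """Linear decay: lr falls from peak_lr to min_lr in proportion to (1 - t)."""
    t = _progress(step, total_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - t)


def quadratic_decay(step, total_steps, peak_lr, min_lr=0.0):
    """Quadratic decay: lr falls from peak_lr to min_lr in proportion to (1 - t)^2."""
    t = _progress(step, total_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - t) ** 2


def inv_sqrt_decay(step, peak_lr, timescale=1000, min_lr=0.0):
    """Inverse-sqrt decay: hold peak_lr until `timescale` steps, then scale by
    sqrt(timescale / step). `timescale` is a hypothetical knob, and min_lr is
    applied here as a floor rather than an offset."""
    scale = math.sqrt(timescale / max(step, timescale))
    return max(min_lr, peak_lr * scale)


if __name__ == "__main__":
    total = 10_000
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(
            step,
            linear_decay(step, total, 1.0),
            quadratic_decay(step, total, 1.0),
            inv_sqrt_decay(step, 1.0, timescale=1_000),
        )
```

Under these assumptions, halfway through training the quadratic schedule is already at a quarter of the peak rate while the linear one is at half. Inverse-sqrt never reaches zero on its own, which is one reason a minimum-LR variant is a natural companion in the sweep.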

Helpful links

Description

TL;DR: Run the good-enough 10T check for an inverse-square-root learning-rate schedule rather than leaving it as an untested idea.

Hypothesis or Goal

We want to know whether the inverse-square-root schedule is strong enough to deserve early promotion in the TPU recipe.

Links

Results

Metadata

Assignees

Labels

agent-generated: Created by automation/agent
experiment
moe
p1: Do right now
tldr: Issue has a community-friendly TL;DR summary
