Summary
This issue began as a check on inverse-square-root decay for the good-enough 10T MoE recipe, but the thread's main takeaway is that inverse-sqrt is no longer the favored schedule. After fitting Defazio-style schedule refinement to the baseline run's gradient norms, the author found that quadratic decay, (1-t)^2, matched the observed profile much better than either inverse-sqrt or linear decay. The author then launched a sweep comparing those alternatives, plus minimum-LR and diagnostic variants. PR #4050 added the schedule implementations and experiment wiring. The current recommendation is to treat inverse-sqrt as one candidate in the sweep, with quadratic decay as the leading hypothesis pending run results.
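For readers comparing the three decay shapes named above, here is a minimal Python sketch of each as a multiplier on the peak learning rate. It is not the code from PR #4050. The function names, the `timescale` parameter in the inverse-sqrt form, and the way the minimum-LR floor is applied are all assumptions made for illustration, and warmup is omitted. `t` is taken to be normalized training progress in [0, 1].

```python
import math
from typing import Callable


def linear_decay(t: float) -> float:
    """Linear decay: multiplier falls from 1 to 0 as t goes from 0 to 1."""
    return 1.0 - t


def quadratic_decay(t: float) -> float:
    """Quadratic decay (1 - t)^2, the shape the thread reports as the best fit."""
    return (1.0 - t) ** 2


def inverse_sqrt_decay(t: float, timescale: float = 0.1) -> float:
    """Inverse-sqrt decay, normalized so the multiplier is 1 at t = 0.

    ASSUMPTION: `timescale` is a hypothetical knob controlling how fast the
    decay sets in; the real recipe may parameterize this differently.
    """
    return 1.0 / math.sqrt(1.0 + t / timescale)


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float,
    shape: Callable[[float], float],
    min_lr_ratio: float = 0.0,
) -> float:
    """Learning rate at `step` for a given decay shape.

    ASSUMPTION: the minimum-LR variant is modeled as a simple floor at
    `min_lr_ratio * peak_lr`; other definitions are possible.
    """
    t = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return peak_lr * max(shape(t), min_lr_ratio)


if __name__ == "__main__":
    # Print the three multipliers side by side at a few progress points.
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(
            f"t={t:.2f}  linear={linear_decay(t):.3f}  "
            f"quadratic={quadratic_decay(t):.3f}  "
            f"inv_sqrt={inverse_sqrt_decay(t):.3f}"
        )
```

Under these assumptions, quadratic decay drops faster than linear early on and flattens near the end of training, while the inverse-sqrt form never reaches zero unless a separate cooldown is added.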
Helpful links
Description
TL;DR: Run the good-enough 10T check for an inverse-square-root learning-rate schedule rather than leaving it as an untested idea.
Hypothesis or Goal
We want to know whether the inverse-square-root schedule is strong enough to deserve early promotion in the TPU recipe.
Links
Results