Description
TL;DR: Run the great-gate check for an inverse-square-root learning-rate schedule rather than leaving it as a lightly tested idea.
Hypothesis or Goal
We want to know whether the inverse-square-root schedule still holds up under the stricter bar that the promoted recipe has to meet.
Links
Results
Summary
This issue asks whether an inverse-square-root learning-rate schedule still looks attractive when compared against cosine at the full great-10T scale rather than only in a small trial. PR #4071 now adds the controlled two-arm great-gate ablation (same architecture/data/optimizer, only the LR schedule changes) and its CI checks are green. The implementation is ready, but the issue still needs actual experiment results before anyone can recommend switching schedules.
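For readers less familiar with the two arms, here is a minimal sketch of how the schedules differ in shape. It is not the code from PR #4071: the function names, the linear warmup, and the `peak_lr`, `warmup_steps`, `total_steps`, and `min_lr` parameters are assumptions made for illustration only.

```python
import math


def inverse_sqrt_lr(step, peak_lr, warmup_steps):
    """Hypothetical inverse-square-root schedule.

    Linear warmup from 0 to peak_lr, then decay proportional to
    1/sqrt(step). No total step count is needed: the decay never
    reaches zero and does not depend on the run length.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Hypothetical cosine schedule with the same linear warmup.

    After warmup, the rate follows half a cosine period from peak_lr
    down to min_lr at total_steps, so the run length must be fixed
    in advance.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only; not the settings used in the ablation.
    for step in (0, 1_000, 4_000, 10_000):
        print(
            step,
            inverse_sqrt_lr(step, peak_lr=1e-3, warmup_steps=1_000),
            cosine_lr(step, peak_lr=1e-3, warmup_steps=1_000, total_steps=10_000),
        )
```

In this sketch the two curves share the warmup and the peak, so the only difference is the decay shape, which mirrors the "only the LR schedule changes" design of the ablation.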
Helpful links