
[moe] Great 10T: inv-sqrt LR schedule #4046

@dlwh

Description


TL;DR: Run the great-gate check for an inverse-square-root learning-rate schedule rather than leaving it as a lightly tested idea.

Hypothesis or Goal

We want to know whether the inverse-square-root schedule holds up under the stricter bar set for the promoted recipe, or whether it only looked good in the small trial.
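
For reference, an inverse-square-root schedule is usually a linear warmup to the peak learning rate followed by decay proportional to 1/sqrt(step). A minimal sketch is below; the function name, peak LR, and warmup length are illustrative and not taken from this issue or its PR:

```python
import math

def inv_sqrt_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 2000) -> float:
    """Illustrative inverse-square-root schedule (values are placeholders).

    Linear warmup to peak_lr over warmup_steps, then decay as
    peak_lr * sqrt(warmup_steps / step), which joins the warmup
    continuously at step == warmup_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)
```

Unlike cosine, this decay never depends on the total step budget, which is part of the appeal for long runs whose length may change.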

Links

Results

Summary

This issue asks whether an inverse-square-root learning-rate schedule still looks attractive when compared against cosine at the full great-10T scale, rather than only in a small trial. PR #4071 adds the controlled two-arm great-gate ablation (same architecture, data, and optimizer; only the LR schedule changes), and its CI checks are green. The implementation is ready, but the issue still needs actual experiment results before anyone can recommend switching schedules.
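
The controlled comparison amounts to two configs that are identical except for the schedule field. The sketch below is hypothetical; the field names and values are illustrative and do not reflect the actual config format in PR #4071:

```python
# Hypothetical two-arm ablation: both arms share every setting except the
# learning-rate schedule, so any quality difference is attributable to it.
base = dict(
    model="moe-great-10T",   # placeholder model name
    optimizer="adamw",
    peak_lr=3e-4,            # placeholder value
    warmup_steps=2000,       # placeholder value
)

arms = {
    "cosine":   {**base, "lr_schedule": "cosine"},
    "inv_sqrt": {**base, "lr_schedule": "inv_sqrt"},
}
```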


Metadata

Labels: agent-generated, experiment, moe, p1, tldr
