[moe] Great 10T: inv-sqrt LR schedule #4046

## Description
TL;DR: Run the great-gate check for an inverse-square-root learning-rate schedule rather than leaving it as a lightly tested idea.

## Hypothesis or Goal
We want to know whether the inverse-square-root schedule holds up under the stricter bar required for the promoted recipe.

### Links
* Parent sweep: #3469
* Gate: #4014

## Results


## Summary

This issue asks whether an inverse-square-root learning-rate schedule still looks attractive when compared against cosine at the full great-10T scale rather than only in a small trial. PR #4071 now adds the controlled two-arm great-gate ablation (same architecture/data/optimizer, only the LR schedule changes) and its CI checks are green. The implementation is ready, but the issue still needs actual experiment results before anyone can recommend switching schedules.
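For readers unfamiliar with the two arms, here is a minimal sketch of the schedule shapes being compared. It assumes both arms share a linear warmup and that the inverse-square-root arm decays as `sqrt(warmup_steps / step)` afterwards. The function names, the `min_lr` floor, and all numeric values are illustrative assumptions and are not taken from PR #4071.

```python
import math


def inv_sqrt_lr(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step).

    Hypothetical helper for illustration; not the code in the PR.
    """
    if warmup_steps <= 0:
        raise ValueError("warmup_steps must be positive")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # At step == warmup_steps this equals peak_lr, so the curve is continuous.
    return peak_lr * math.sqrt(warmup_steps / step)


def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps.

    Hypothetical helper for illustration; not the code in the PR.
    """
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Clamp progress so steps past total_steps stay at min_lr.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Assumed values, chosen only to show the shapes side by side.
    peak, warmup, total = 1e-3, 1_000, 100_000
    for s in (500, 1_000, 10_000, 50_000, 100_000):
        print(s, inv_sqrt_lr(s, peak, warmup), cosine_lr(s, peak, warmup, total))
```

One practical difference this makes visible: the cosine arm needs `total_steps` fixed in advance, while the inverse-square-root arm does not, so the two curves only line up at the end of warmup.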

### Helpful links
- Earlier small-scale precursor: #4028
- Great gate tracker: #4014
- Implementation PR: #4071
