[moe] Good 10T: inv-sqrt LR schedule

## Summary

This issue began as a check on inverse-square-root (inv-sqrt) decay for the good-enough 10T MoE recipe, but the thread's main takeaway is that inv-sqrt is no longer the favored schedule. After fitting Defazio-style schedule refinement to the baseline run's gradient norms, the author found that quadratic decay `(1-t)^2` matched the observed profile much better than inv-sqrt or linear decay. The author then launched a sweep comparing those alternatives plus minimum-LR and diagnostic variants. PR #4050 added the schedule implementations and experiment wiring. The current recommendation is to treat inv-sqrt as one candidate in the sweep, with quadratic decay as the leading hypothesis pending run results.
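For readers comparing the three decay shapes named above, here is a minimal sketch of each as a multiplier on the peak learning rate. It is not the code from PR #4050. It assumes `t` is the fraction of training completed (0 to 1). The inv-sqrt form `1/sqrt(1 + step/timescale)` is one common parameterization, and `total_steps` and `timescale` are hypothetical values chosen only for illustration.

```python
import math


def linear_decay(t: float) -> float:
    """Linear decay from 1 at t=0 to 0 at t=1."""
    return 1.0 - t


def quadratic_decay(t: float) -> float:
    """Quadratic decay (1 - t)^2, the shape named in the summary."""
    return (1.0 - t) ** 2


def inv_sqrt_decay(t: float, total_steps: int = 10_000, timescale: int = 1_000) -> float:
    """Inverse-square-root decay: 1 / sqrt(1 + step / timescale).

    Assumption: total_steps and timescale are illustrative placeholders,
    not values taken from the issue or the PR.
    """
    step = t * total_steps
    return 1.0 / math.sqrt(1.0 + step / timescale)


if __name__ == "__main__":
    # Print the multiplier for each shape at a few points in training.
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(
            f"t={t:.2f}  linear={linear_decay(t):.3f}  "
            f"quadratic={quadratic_decay(t):.3f}  inv_sqrt={inv_sqrt_decay(t):.3f}"
        )
```

Under these assumptions, quadratic decay drops faster than linear early on and both reach zero at the end of training, while inv-sqrt never reaches zero on its own. That difference is one reason a minimum-LR variant is a natural companion in such a sweep.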

### Helpful links
- PR #4050
- Decisive analysis comment: https://github.com/marin-community/marin/issues/4028#issuecomment-4115031200
- Good 10T gate: #4013


## Description
TL;DR: Run the good-enough 10T check for an inverse-square-root learning-rate schedule rather than leaving it as an untested idea.

## Hypothesis or Goal
We want to know whether the inverse-square-root schedule is strong enough to deserve early promotion in the TPU recipe.

### Links
* Parent sweep: #3469
* Gate: #4013

## Results
