[moe] Great 10T: compare AdamH vs Adam

## Summary

This issue tracked whether AdamH should remain the long-run optimizer default for the Great 10T MoE gate once the rest of the recipe was fixed. A multi-scale AdamH-vs-Adam experiment PR (#4069) was opened but later closed as stale, so the follow-up subexperiments became the useful evidence. Those follow-ups found that several proposed AdamH changes were not worth promoting. The one exception, token embeddings, was resolved in #5804: plain Adam hurt d512 but slightly improved d768 and d1024, and it was chosen mainly because higher-LR AdamH embeddings showed large gradient spikes.

The current disposition is that this AdamH-vs-Adam gate is historical rather than active: the optimizer path has moved toward MuonH, with Adam used where it is safer for embeddings.
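As a rough illustration of that disposition, the split can be thought of as a per-parameter routing rule. The sketch below is only an assumption about how such a rule might look: the function `route_optimizers`, the parameter names, and the labels `"adam"` and `"muonh"` are hypothetical and are not taken from the marin codebase.

```python
from typing import Dict, Iterable


def route_optimizers(
    param_names: Iterable[str],
    embed_key: str = "token_embed",  # hypothetical substring marking embedding params
    default: str = "muonh",          # hypothetical label for the default optimizer
) -> Dict[str, str]:
    """Map each parameter name to an optimizer label.

    Parameters whose name contains `embed_key` go to plain Adam;
    everything else falls through to the default optimizer.
    """
    return {
        name: ("adam" if embed_key in name else default)
        for name in param_names
    }


# Illustrative parameter names only; real names will differ.
routing = route_optimizers(
    ["token_embed.weight", "layers.0.attn.k_proj", "layers.0.moe.expert_w1"]
)
```

A real recipe would likely route on more than a name substring (for example tensor rank or module type), so this only shows the shape of the decision, not the actual configuration.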

### Helpful links
- #4069 - proposed multi-scale AdamH-vs-Adam suite for this issue, later closed stale.
- #5804 - token_embed Adam sweep final results and decision to move embeddings to Adam.
- https://github.com/marin-community/marin/issues/4042#issuecomment-4640134851 - June 6 disposition: moved to MuonH.


## Description
TL;DR: Run the full 10T comparison between AdamH and Adam as part of the great gate.

## Hypothesis or Goal
We want to know whether AdamH remains a defensible long-run default once we hold the rest of the recipe fixed.

### Links
* Parent sweep: #3469
* Gate: #4014

## Results


## Agent MoE catalogue

| Agent issue | Verdict |
|---|---|
| #5000 | Gate 2 FAIL: Adam LR ratio shifts are inconsistent; current AdamH:Adam ratio is well calibrated. |
| #5182 | Gate 2 FAIL: global gradient normalization recovers small scale but regresses larger scale. |
| #5184 | Marginal Gate 2 PASS: AdamH on token_embed is a small fading win and preferred if keeping AdamH strategy. |
| #5719 | Routing attention K/V to AdamH regresses at d512 and is noise at larger scales. |
| #5804 | OPEN: token_embed on plain Adam regresses d512, improves d768 modestly; d1024 was still running when checked. |
