Summary
This issue tracked whether AdamH should remain the long-run optimizer default for the Great 10T MoE gate once the rest of the recipe was fixed. A multi-scale AdamH-vs-Adam experiment PR was opened, but it later went stale, and follow-up subexperiments became the useful evidence. Those follow-ups found several AdamH changes were not worth promoting; the token-embedding exception was resolved in #5804, where plain Adam hurt d512 but slightly improved d768/d1024 and was chosen mainly because higher-LR AdamH embeddings showed large gradient spikes. The current disposition is that this AdamH-vs-Adam gate is historical rather than active: the optimizer path has moved toward MuonH, with Adam used where it is safer for embeddings.
Helpful links
Description
TL;DR: Run the full 10T comparison between AdamH and Adam as part of the great gate.
Hypothesis or Goal
We want to know whether AdamH remains a defensible long-run default once we hold the rest of the recipe fixed.
Links
Results
Agent MoE catalogue
| Agent issue | Verdict |
| --- | --- |
| #5000 | Gate 2 FAIL: Adam LR ratio shifts are inconsistent; current AdamH:Adam ratio is well calibrated. |
| #5182 | Gate 2 FAIL: global gradient normalization recovers small scale but regresses larger scale. |
| #5184 | Marginal Gate 2 PASS: AdamH on token_embed is a small fading win and preferred if keeping AdamH strategy. |
| #5719 | Routing attention K/V to AdamH regresses at d512 and is noise at larger scales. |
| #5804 | OPEN: token_embed on plain Adam regresses d512, improves d768 modestly; d1024 was still running when checked. |