[moe] Great 10T: compare AdamH vs Adam #4042

Description

@dlwh

Summary

This issue tracked whether AdamH should remain the long-run optimizer default for the Great 10T MoE gate with the rest of the recipe held fixed. A multi-scale AdamH-vs-Adam experiment PR was opened but later went stale, so the follow-up subexperiments became the useful evidence. Those follow-ups found that several AdamH changes were not worth promoting. The token-embedding exception was resolved in #5804: plain Adam hurt d512 but slightly improved d768 and d1024, and it was chosen mainly because higher-LR AdamH embeddings showed large gradient spikes. The current disposition is that this AdamH-vs-Adam gate is historical rather than active. The optimizer path has moved toward MuonH, with Adam used where it is safer for embeddings.

Description

TL;DR: Run the full 10T comparison between AdamH and Adam as part of the great gate.

Hypothesis or Goal

We want to know whether AdamH remains a defensible long-run default once we hold the rest of the recipe fixed.

Results

Agent MoE catalogue

| Agent issue | Verdict |
| --- | --- |
| #5000 | Gate 2 FAIL: Adam LR ratio shifts are inconsistent; current AdamH:Adam ratio is well calibrated. |
| #5182 | Gate 2 FAIL: global gradient normalization recovers small scale but regresses larger scale. |
| #5184 | Marginal Gate 2 PASS: AdamH on token_embed is a small fading win and preferred if keeping AdamH strategy. |
| #5719 | Routing attention K/V to AdamH regresses at d512 and is noise at larger scales. |
| #5804 | OPEN: token_embed on plain Adam regresses d512, improves d768 modestly; d1024 was still running when checked. |
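
Several of these subexperiments differ only in which optimizer a given parameter group is routed to (token_embed in #5184 and #5804, attention K/V in #5719). The following is a minimal sketch of that kind of routing rule, not the project's actual implementation: the parameter paths, the `optimizer_for` and `partition` helpers, and the string labels `"adam"` and `"adamh"` are all hypothetical names chosen for illustration.

```python
# Hypothetical sketch: parameter paths, helper names, and optimizer labels
# are illustrative assumptions, not the repository's real configuration.

def optimizer_for(param_path: str, embed_optimizer: str = "adam") -> str:
    """Return the optimizer label for one dotted parameter path."""
    # Route the token embedding to its own optimizer; everything else
    # falls through to the default label.
    if "token_embed" in param_path.split("."):
        return embed_optimizer
    return "adamh"


def partition(param_paths, embed_optimizer: str = "adam"):
    """Group parameter paths by the optimizer label they are routed to."""
    groups = {}
    for path in param_paths:
        label = optimizer_for(path, embed_optimizer)
        groups.setdefault(label, []).append(path)
    return groups


# Made-up parameter paths for a small MoE block.
example_paths = [
    "token_embed.weight",
    "layers.0.attn.k_proj",
    "layers.0.attn.v_proj",
    "layers.0.moe.experts.w_in",
]

groups = partition(example_paths)
for label, paths in groups.items():
    print(label, paths)
```

Passing `embed_optimizer="adamh"` puts every path in a single group, which corresponds to the arm where token_embed stays on AdamH.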

Labels: agent-generated, experiment, moe, tldr
