Description
Test a Grug MoE AdamH variant that normalizes each module's gradients to RMS 1 before AdamH moment updates. The code path is local to experiments/grug/moe/ and compares against the compute-optimal MoE baselines in experiments/grug/moe/README.md.
Exact initiating prompt: "Read experiments/grug/moe/README.md on how to iterate MoE, implement a variant of AdamH optimizer that perform gradient normalization (scale gradient of each module to RMSNorm 1) and test it against the previous method"
TL;DR
Gate 1 is complete. The variant improved the d768 point but missed the d512 point, so it does not pass gate 1 and should not advance to gate 2 under experiments/grug/moe/agent.md.
Hypothesis or Goal
Module-wise gradient RMS normalization reduces optimizer scale mismatch across attention, shared expert, and routed expert modules, improving effective speedup without changing AdamH's projected parameter update rule.
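As a rough illustration of the intended mechanism (not the actual code in experiments/grug/moe/), the sketch below rescales each module's gradients to RMS 1 after backward() and before the optimizer step; `model`, `optimizer`, the helper name, and the hook point are assumptions for this example.

```python
import torch


@torch.no_grad()
def normalize_module_grads(model: torch.nn.Module, eps: float = 1e-8) -> None:
    """Rescale each module's own (non-recursive) gradients so their combined RMS is 1."""
    for module in model.modules():
        params = [p for p in module.parameters(recurse=False) if p.grad is not None]
        if not params:
            continue
        # RMS over every gradient element belonging directly to this module.
        sq_sum = sum(p.grad.pow(2).sum() for p in params)
        n_elem = sum(p.grad.numel() for p in params)
        rms = (sq_sum / n_elem).sqrt()
        for p in params:
            p.grad.div_(rms + eps)


# Assumed usage: normalize after backward() and before the AdamH step, so the
# moment estimates see RMS-1 gradients while the projected update rule itself
# is left unchanged.
# loss.backward()
# normalize_module_grads(model)
# optimizer.step()
```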
Links
- .agents/logbooks/moe-adamh-grad-norm.md
- experiments/grug/moe/launch_adamh_grad_norm.py
- /kaiyue/iris-run-job-20260425-184632
Results
| Scale | Baseline loss | Variant loss | Loss delta | Baseline tok/s | Variant tok/s | Tok/s delta | Effective speedup |
|-------|---------------|--------------|------------|----------------|---------------|-------------|-------------------|
| d512  | 3.8104 | 3.815110 | +0.004710 | 405,630 | 406,983 | +0.333% | 0.980893 |
| d768  | 3.4339 | 3.429193 | -0.004707 | 273,532 | 274,218 | +0.251% | 1.030269 |
Both W&B runs reached the finished state, and both Iris child jobs reached JOB_STATE_SUCCEEDED.
Decision Log
- 2026-04-25: submitted gate 1 on Iris for d512 and d768.
- 2026-04-25: d512 finished with effective speedup 0.980893, below the required threshold.
- 2026-04-25: d768 finished with effective speedup 1.030269, but gate 1 requires both small-scale points to exceed 1.0.
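For clarity, a minimal sketch of the gate-1 rule as stated above: every small-scale point must exceed an effective speedup of 1.0 for the variant to advance. The function name is hypothetical; the numbers are copied from the Results table.

```python
# Hypothetical helper illustrating the gate-1 pass rule.
def passes_gate_1(effective_speedups: dict[str, float]) -> bool:
    return all(s > 1.0 for s in effective_speedups.values())


# Values copied from the Results table above.
print(passes_gate_1({"d512": 0.980893, "d768": 1.030269}))  # False -> no gate 2 run
```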
Conclusion
Completed negative result. AdamH module gradient RMS normalization does not pass gate 1, so no gate 2 run is launched for this variant.