Description
Test a Grug MoE AdamH variant that normalizes each module's gradients to RMS 1 before AdamH moment updates. The code path is local to experiments/grug/moe/ and compares against the compute-optimal MoE baselines in experiments/grug/moe/README.md.
Exact initiating prompt: "Read experiments/grug/moe/README.md on how to iterate MoE, implement a variant of AdamH optimizer that perform gradient normalization (scale gradient of each module to RMSNorm 1) and test it against the previous method"
TL;DR
Gate 1 is complete. The variant improved the d768 point but missed the d512 point, so it does not pass gate 1 and should not advance to gate 2 under experiments/grug/moe/agent.md.
Hypothesis or Goal
Module-wise gradient RMS normalization reduces optimizer scale mismatch across attention, shared expert, and routed expert modules, improving effective speedup without changing AdamH's projected parameter update rule.
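As a rough illustration of the intended mechanism (not the actual code in experiments/grug/moe/), the sketch below rescales each module's gradients to RMS 1 after backward() and before the optimizer step; `model`, `optimizer`, the helper name, and the hook point are assumptions for this example.

```python
import torch


@torch.no_grad()
def normalize_module_grads(model: torch.nn.Module, eps: float = 1e-8) -> None:
    """Rescale each module's own (non-recursive) gradients so their combined RMS is 1."""
    for module in model.modules():
        params = [p for p in module.parameters(recurse=False) if p.grad is not None]
        if not params:
            continue
        # RMS over every gradient element belonging directly to this module.
        sq_sum = sum(p.grad.pow(2).sum() for p in params)
        n_elem = sum(p.grad.numel() for p in params)
        rms = (sq_sum / n_elem).sqrt()
        for p in params:
            p.grad.div_(rms + eps)


# Assumed usage: normalize after backward() and before the AdamH step, so the
# moment estimates see RMS-1 gradients while the projected update rule itself
# is left unchanged.
# loss.backward()
# normalize_module_grads(model)
# optimizer.step()
```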
Links
- .agents/logbooks/moe-adamh-grad-norm.md
- experiments/grug/moe/launch_adamh_grad_norm.py
- /kaiyue/iris-run-job-20260425-184632
Results
| Scale | Baseline loss | Variant loss | Loss delta | Baseline tok/s | Variant tok/s | Tok/s delta | Effective speedup |
|-------|---------------|--------------|------------|----------------|---------------|-------------|-------------------|
| d512  | 3.8104 | 3.815110 | +0.004710 | 405,630 | 406,983 | +0.333% | 0.980893 |
| d768  | 3.4339 | 3.429193 | -0.004707 | 273,532 | 274,218 | +0.251% | 1.030269 |
Both W&B runs reached the finished state, and both Iris child jobs reached JOB_STATE_SUCCEEDED.
Decision Log
- 2026-04-25: submitted gate 1 on Iris for d512 and d768.
- 2026-04-25: d512 finished with effective speedup 0.980893, below the required threshold.
- 2026-04-25: d768 finished with effective speedup 1.030269, but gate 1 requires both small-scale points to exceed 1.0.
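For clarity, a minimal sketch of the gate-1 rule as stated above: every small-scale point must exceed an effective speedup of 1.0 for the variant to advance. The function name is hypothetical; the numbers are copied from the Results table.

```python
# Hypothetical helper illustrating the gate-1 pass rule.
def passes_gate_1(effective_speedups: dict[str, float]) -> bool:
    return all(s > 1.0 for s in effective_speedups.values())


# Values copied from the Results table above.
print(passes_gate_1({"d512": 0.980893, "d768": 1.030269}))  # False -> no gate 2 run
```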
Conclusion
Completed negative result. AdamH module gradient RMS normalization does not pass gate 1, so no gate 2 run is launched for this variant.