[moe] Add AdamH vs Adam comparison experiment at 1e19 FLOPs #4059
claude[bot] wants to merge 1 commit into main from
Conversation
Add GrugAdamHConfig for raw-array grug models (routes 2D weight matrices to scale-invariant AdamH; embeddings, routers, and norms to standard Adam). Add an experiment script running both optimizers on a d=1024 MoE (E=8, K=2) at ~1e19 FLOPs for a controlled comparison.

Fixes #4024

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b666a77855
def _resolve_run_id(base: str) -> str:
    run_id = os.environ.get("GRUG_RUN_ID", base)
Keep per-step run IDs unique under GRUG_RUN_ID override
_resolve_run_id uses GRUG_RUN_ID verbatim, so both adam_step and adamh_step collapse to the same run_id whenever that env var is set. In this stack, trainer.id is used as the default W&B run id (with resume enabled), so the second run resumes/overwrites the first instead of producing an independent comparison run. That invalidates the Adam-vs-AdamH side-by-side experiment for scripted launches that set GRUG_RUN_ID.
This pull request has been inactive for 23 days and is marked as stale.
Add GrugAdamHConfig, a grug-compatible AdamH optimizer that classifies parameters by ndim and path name instead of haliax module introspection. Weight matrices (ndim >= 2) get the scale-invariant AdamH update; embeddings, router weights, and norm scalars use standard Adam. Add experiment script that launches both Adam and AdamH on the same d=1024 MoE model (E=8, K=2, shared expert, 13 layers) at ~1e19 FLOPs on Nemotron mix for a controlled optimizer comparison.
Fixes #4024
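The classification described above (ndim plus path name, no haliax introspection) can be sketched roughly as follows; `classify_param` and the path substrings are illustrative assumptions, not the PR's actual code.

```python
def classify_param(path: str, ndim: int) -> str:
    """Route a parameter to AdamH or standard Adam by ndim and path name."""
    if ndim < 2:
        return "adam"   # norm scalars and other low-rank params
    if "embed" in path or "router" in path:
        return "adam"   # embeddings and router weights stay on Adam
    return "adamh"      # weight matrices get the scale-invariant AdamH update
```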