[grug] Add MoE AdamH gradient normalization#5181

Open
WhenWen wants to merge 4 commits into main from research/moe-adamh-grad-norm

@WhenWen WhenWen commented Apr 25, 2026

Add a Grug MoE AdamH variant that normalizes each module's gradients to RMS 1 before the AdamH moment updates. Includes gate-specific launch wiring for the d512/d768 and d1024/d1280 comparison runs, plus focused optimizer tests.
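
For readers unfamiliar with the normalization step described above, here is a minimal sketch of scaling each module's gradient to RMS 1. It is not the code in this PR: the function names, the flat-list gradient representation, the `eps` guard, and the module names in the usage note are all assumptions made for illustration, and the AdamH moment update itself is not shown.

```python
import math


def normalize_to_unit_rms(grad, eps=1e-12):
    """Scale a non-empty flat list of gradient values so its RMS is ~1.

    `eps` is an assumed guard against division by zero for all-zero gradients.
    """
    rms = math.sqrt(sum(g * g for g in grad) / len(grad))
    return [g / (rms + eps) for g in grad]


def normalize_per_module(grads, eps=1e-12):
    """Apply RMS normalization independently to each module's gradient.

    `grads` maps a (hypothetical) module name to its flattened gradient.
    """
    return {name: normalize_to_unit_rms(g, eps) for name, g in grads.items()}
```

Calling `normalize_per_module({"expert_0": [3.0, 4.0], "router": [0.1, -0.1]})` rescales both entries to roughly unit RMS regardless of their original magnitudes, which is the property the optimizer would then see before updating its moments.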

Part of #5180

@WhenWen WhenWen added the agent-generated Created by automation/agent label Apr 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d97fdc3131

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".



def _resolve_run_id(label: str) -> str:
    run_id = os.environ.get("GRUG_RUN_ID", f"moe-adamh-grad-norm-{label}")
P2: Keep per-step run IDs unique for gated launches

When GRUG_RUN_ID is set, _resolve_run_id returns the same ID for every label, so gate1/all runs emit multiple steps with identical run_ids. In run_grug_moe_trial, that ID becomes the trainer/W&B run ID (and W&B defaults to resume="allow"), so subsequent steps can resume or overwrite earlier runs instead of producing separate experiment records. This breaks side-by-side ablation tracking for the very comparisons this launcher is meant to run.
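
One possible way to address the collision described above is to treat GRUG_RUN_ID as a prefix and always append the label. The following is only a sketch under that assumption: `resolve_run_id` is a hypothetical stand-in for the PR's `_resolve_run_id`, whose full body is not shown here, and the `{base}-{label}` naming scheme is illustrative rather than the project's chosen fix.

```python
import os


def resolve_run_id(label: str) -> str:
    """Return a run ID that stays unique per label, even with GRUG_RUN_ID set."""
    base = os.environ.get("GRUG_RUN_ID")
    if base is None:
        # Same default shape as the snippet under review.
        return f"moe-adamh-grad-norm-{label}"
    # Assumption: the override acts as a prefix, so each step keeps its own ID.
    return f"{base}-{label}"
```

With `GRUG_RUN_ID=exp42`, the labels `d512` and `d768` would resolve to `exp42-d512` and `exp42-d768`, so each step gets its own trainer/W&B run ID and no step resumes another's run.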

Useful? React with 👍 / 👎.
