[moe] Add num_dense_layers to Grug MoE for first-k-dense ablation #4058

Open

claude[bot] wants to merge 1 commit into main from agent/20260323-fix-4022

Conversation

Contributor

@claude claude Bot commented Mar 23, 2026

Add num_dense_layers and dense_intermediate_dim fields to GrugModelConfig so that the first K transformer layers use a dense FFN (DenseBlock) instead of MoE routing, enabling first-k-dense ablation experiments for the 10T gate. DenseBlock returns zero router stats, so Transformer aggregation is unchanged. FLOP accounting splits dense and MoE layer contributions, and tests are added. Defaults preserve current behavior (num_dense_layers=0).

Fixes #4022

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
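The layer-selection scheme the description outlines can be sketched as follows. This is a minimal illustration only: build_blocks, make_dense, and make_moe are hypothetical names, not identifiers from the actual Grug code.

```python
# Minimal sketch of first-k-dense layer selection, assuming per-layer
# factory callables. build_blocks / make_dense / make_moe are hypothetical
# stand-ins, not the real Grug implementation.
def build_blocks(num_layers, num_dense_layers, make_dense, make_moe):
    # The first num_dense_layers layers use a dense FFN; the rest use MoE.
    return [make_dense(i) if i < num_dense_layers else make_moe(i)
            for i in range(num_layers)]

blocks = build_blocks(4, 2, lambda i: ("dense", i), lambda i: ("moe", i))
# blocks -> [("dense", 0), ("dense", 1), ("moe", 2), ("moe", 3)]
```

With num_dense_layers=0 (the default), every layer is built by make_moe, preserving current behavior.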
@claude claude Bot added the agent-generated Created by automation/agent label Mar 23, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fbf80ad3a


    else:
        dense_flops_per_token = 0.0

    flops_per_token = moe_flops_per_token + dense_flops_per_token

P1: Avoid double-counting LM-head FLOPs in mixed dense/MoE runs

When num_dense_layers > 0, this function adds two separate lm_flops_per_token(...) results, but lm_flops_per_token already includes an LM-head term each time, so the sum overcounts by one extra LM-head pass. This inflates throughput/flops_per_*_analytic and can skew throughput/utilization comparisons for first-k-dense ablations.

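One way to restructure the accounting so the LM-head term is counted exactly once is sketched below. All names here are hypothetical, since the actual lm_flops_per_token signature is not shown in this diff.

```python
# Hypothetical sketch of mixed dense/MoE FLOP accounting that counts the
# LM-head contribution a single time. Per-layer and head costs are passed
# in directly; none of these names come from the actual codebase.
def mixed_flops_per_token(num_layers, num_dense_layers,
                          moe_layer_flops, dense_layer_flops,
                          lm_head_flops):
    num_moe_layers = num_layers - num_dense_layers
    # Sum per-layer costs by type, then add the LM head once, rather than
    # summing two full lm_flops_per_token(...) results (each of which
    # would include its own LM-head term).
    return (num_moe_layers * moe_layer_flops
            + num_dense_layers * dense_layer_flops
            + lm_head_flops)
```

Splitting the per-layer terms from the head term keeps throughput/flops_per_*_analytic comparable across pure-MoE and first-k-dense runs.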


def test_grug_moe_num_dense_layers_validation():
    """Validate that num_dense_layers must be <= num_layers."""
    from experiments.grug.moe.model import GrugModelConfig

P1: Move local import to module top-level per repo contract

This function-scoped import violates the repository rule in /workspace/marin/AGENTS.md (“All imports at the top of the file. No local imports except …”). Keeping it local hides dependencies and defers import failures to only this test path, so it should be moved into the module import block.

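A config-level check of the kind this test would exercise might look like the following sketch. The class is a simplified stand-in for GrugModelConfig, not its real definition, and the field set is assumed.

```python
from dataclasses import dataclass


@dataclass
class GrugModelConfigSketch:
    # Simplified stand-in for GrugModelConfig; the real config has more
    # fields. Defaults mirror the PR's "no dense layers" behavior.
    num_layers: int = 12
    num_dense_layers: int = 0

    def __post_init__(self):
        # Reject configs requesting more dense layers than exist.
        if not 0 <= self.num_dense_layers <= self.num_layers:
            raise ValueError(
                f"num_dense_layers={self.num_dense_layers} must be in "
                f"[0, num_layers={self.num_layers}]"
            )
```

Validating in __post_init__ means any construction path (tests, config files, sweeps) fails fast with a clear message instead of misbuilding the layer stack.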

@github-actions
Contributor

This pull request has been inactive for 23 days and is marked as stale.
If there is no further activity within 7 days, it will be automatically closed.
If you believe this PR should remain open, please add a comment or update the PR.

@github-actions github-actions Bot added the stale label Apr 16, 2026


Development

Successfully merging this pull request may close these issues.

[moe] Good 10T: ablate first-k dense