[moe] Add num_dense_layers to Grug MoE for first-k-dense ablation #4058
claude[bot] wants to merge 1 commit into main from
Conversation
Add num_dense_layers and dense_intermediate_dim fields to GrugModelConfig so that the first K layers use a dense FFN (DenseBlock) instead of MoE routing. This enables first-k-dense ablation experiments for the 10T gate. Updates FLOP accounting for mixed dense/MoE architectures and adds tests. Fixes #4022 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fbf80ad3a
```python
    else:
        dense_flops_per_token = 0.0

    flops_per_token = moe_flops_per_token + dense_flops_per_token
```
Avoid double-counting LM-head FLOPs in mixed dense/MoE runs
When num_dense_layers > 0, this function sums two separate lm_flops_per_token(...) results, but lm_flops_per_token already includes the LM-head term each time, so the sum overcounts by one extra LM-head pass. This inflates throughput/flops_per_*_analytic and can skew throughput/utilization comparisons for first-k-dense ablations.
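A minimal sketch of the suggested fix: accumulate per-layer FLOPs for both block types, then add the LM-head term exactly once. The function names (`lm_head_flops_per_token`, `mixed_flops_per_token`) and signatures are assumptions for illustration, not the repo's actual API.

```python
# Hypothetical sketch: avoid double-counting the LM head in mixed dense/MoE runs.

def lm_head_flops_per_token(hidden_dim: int, vocab_size: int) -> float:
    """FLOPs for the final vocabulary projection: 2 * D * V per token."""
    return 2.0 * hidden_dim * vocab_size

def mixed_flops_per_token(
    moe_layer_flops: float,    # per-layer FLOPs for one MoE block
    dense_layer_flops: float,  # per-layer FLOPs for one dense block
    num_layers: int,
    num_dense_layers: int,
    hidden_dim: int,
    vocab_size: int,
) -> float:
    """Sum layer contributions for both block types; add the LM head once."""
    num_moe_layers = num_layers - num_dense_layers
    body = num_moe_layers * moe_layer_flops + num_dense_layers * dense_layer_flops
    return body + lm_head_flops_per_token(hidden_dim, vocab_size)
```

Calling the per-type helpers separately and summing them would instead fold in two LM-head passes, which is the overcount the review flags.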
```python
def test_grug_moe_num_dense_layers_validation():
    """Validate that num_dense_layers must be <= num_layers."""
    from experiments.grug.moe.model import GrugModelConfig
```
Move local import to module top-level per repo contract
This function-scoped import violates the repository rule in /workspace/marin/AGENTS.md ("All imports at the top of the file. No local imports except …"). Keeping it local hides the dependency and defers import failures to this one test path; it should be moved into the module-level import block.
|
This pull request has been inactive for 23 days and is marked as stale.
Add num_dense_layers and dense_intermediate_dim to GrugModelConfig so the first K transformer layers use a dense FFN (DenseBlock) instead of MoE routing. DenseBlock returns zero router stats so Transformer aggregation is unchanged. FLOP accounting splits dense and MoE layer contributions. Defaults preserve current behavior (num_dense_layers=0).
Fixes #4022
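The first-k-dense behavior described above can be sketched as follows. This is a hypothetical illustration: the `GrugModelConfig` field names match the PR text, but the `__post_init__` validation and the `build_blocks` helper are assumptions, not the repo's actual implementation.

```python
# Hypothetical sketch of first-k-dense layer construction.
from dataclasses import dataclass

@dataclass
class GrugModelConfig:
    num_layers: int = 12
    num_dense_layers: int = 0           # first K layers use a dense FFN
    dense_intermediate_dim: int = 4096  # FFN width for the dense blocks

    def __post_init__(self) -> None:
        # Validation exercised by test_grug_moe_num_dense_layers_validation.
        if self.num_dense_layers > self.num_layers:
            raise ValueError("num_dense_layers must be <= num_layers")

def build_blocks(cfg: GrugModelConfig) -> list[str]:
    """First num_dense_layers blocks are dense; the rest route through MoE."""
    return [
        "DenseBlock" if i < cfg.num_dense_layers else "MoEBlock"
        for i in range(cfg.num_layers)
    ]
```

With the default `num_dense_layers=0`, every layer is an MoE block, so existing configs are unaffected.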