Add MoeAdamHHeuristic, drop dense layers, fix align_kv_heads sharding #4636
ClassicLarry wants to merge 5 commits into main.
- `experiments/grug/moe/heuristic.py`: port `CompletedAdamHHeuristic` from the `moe_isoflop_apr_2026` branch, rename to `MoeAdamHHeuristic`. Adds `compute_flops_per_token`, `compute_tokens_and_batch`, and `build_from_heuristic` helpers so `launch.py` can derive `(model, optimizer, batch, steps)` from a compute budget + hidden_dim.
- `experiments/grug/moe/launch.py`: use `build_from_heuristic` for the baseline `ExecutorStep` instead of hardcoding a `GrugModelConfig` and `GrugMoeAdamHConfig`. Manual specification is still supported by passing `GrugModelConfig` / `GrugMoeAdamHConfig` directly to `GrugMoeLaunchConfig`.
- `experiments/grug/moe/model.py`: remove the initial-dense-layer path (`_NUM_DENSE_LAYERS`, `dense_intermediate_dim`, `Block.dense_mlp`, the `dense_only` init branch) to match the isoflop architecture. Default EP capacity factor back to 1.0. Add the reshard on `aligned_v` that the isoflop branch had.
- `experiments/grug/moe/README.md`: current best recipe, scaling heuristic summary, v16 isoflop best runs per budget, projections (L∞=1.6), and promotion criteria.
- `lib/levanter/src/levanter/grug/attention.py`: replace `jnp.repeat` in `align_kv_heads` with a reshape+broadcast pattern. Fixes "Please pass sharding to jnp.repeat via out_sharding parameter" under abstract mesh contexts.
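The reshape+broadcast replacement for `jnp.repeat` can be sketched as follows. This is a minimal sketch: the real `align_kv_heads` in `lib/levanter/src/levanter/grug/attention.py` may have a different signature, and the `[batch, seq, heads, head_dim]` layout here is an assumption.

```python
import jax.numpy as jnp


def align_kv_heads(num_q_heads: int, kv: jnp.ndarray) -> jnp.ndarray:
    """Expand KV heads to match the query head count (GQA) without jnp.repeat.

    Assumes kv has shape [batch, seq, num_kv_heads, head_dim] (illustrative
    layout, not necessarily the repo's).
    """
    batch, seq, num_kv_heads, head_dim = kv.shape
    group = num_q_heads // num_kv_heads
    # Insert a singleton group axis, broadcast it, then fold it into the head
    # axis. Unlike jnp.repeat, broadcast_to + reshape does not require an
    # out_sharding argument under abstract mesh contexts.
    expanded = jnp.broadcast_to(
        kv[:, :, :, None, :], (batch, seq, num_kv_heads, group, head_dim)
    )
    return expanded.reshape(batch, seq, num_kv_heads * group, head_dim)
```

This produces the same head ordering as `jnp.repeat(kv, group, axis=2)`: each KV head appears `group` times consecutively.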
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c333125126
    model_cfg = h.build_model_config(hidden_dim)
    fpt = compute_flops_per_token(model_cfg)
Propagate seq_len through heuristic model construction
build_from_heuristic accepts a seq_len override, but it builds the model with h.build_model_config(hidden_dim) and then computes FLOPs from that config, so FLOPs are still based on the hardcoded 4096 sequence length. If a caller passes a non-default seq_len, the returned (batch_size, num_steps) is computed with the override while model shape/FLOP estimation remains at 4096, causing the compute budget calculation to drift and mis-size experiments.
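One way to address this is to thread the override through model construction so the FLOP estimate and the batch/steps sizing always see the same sequence length. The sketch below uses toy stand-ins (`ToyModelConfig` and a made-up FLOP formula are illustrative, not the actual `GrugModelConfig` / `MoeAdamHHeuristic` API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModelConfig:
    # Stand-in for the repo's model config; fields are assumptions.
    hidden_dim: int
    seq_len: int


def compute_flops_per_token(cfg: ToyModelConfig) -> float:
    # Toy cost model: an MLP term plus an attention term that scales with
    # seq_len. The real formula lives in experiments/grug/moe/heuristic.py.
    return 6.0 * cfg.hidden_dim**2 + 12.0 * cfg.hidden_dim * cfg.seq_len


def build_from_heuristic(budget: float, hidden_dim: int, seq_len: int = 4096):
    # The fix: build the model config FROM the seq_len override and compute
    # FLOPs from that same config, so the compute-budget sizing cannot drift
    # from the model shape used for estimation.
    model_cfg = ToyModelConfig(hidden_dim=hidden_dim, seq_len=seq_len)
    flops_per_token = compute_flops_per_token(model_cfg)
    total_tokens = budget / flops_per_token
    return model_cfg, total_tokens
```

With this shape, passing `seq_len=2048` both shrinks the per-token FLOP estimate and records the override on the returned config, instead of silently sizing against the hardcoded 4096.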
Layer counts in the v16 isoflop table were wrong (d768: 10→8, d1536: 14→16, d2048: 18→21). Verified against wandb run configs. Also switch URL encoding from %2B to literal + so the links resolve correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Port `CompletedAdamHHeuristic` from the `moe_isoflop_apr_2026` branch into `experiments/grug/moe/heuristic.py` as `MoeAdamHHeuristic`. Adds `compute_flops_per_token`, `compute_tokens_and_batch`, and `build_from_heuristic` so `launch.py` can derive `(model, optimizer, batch, steps)` from a compute budget + hidden_dim.
- Remove the initial-dense-layer path in `experiments/grug/moe/model.py` (`_NUM_DENSE_LAYERS`, `dense_intermediate_dim`, `Block.dense_mlp`, the `dense_only` init branch) to match the isoflop MoE architecture. Default EP capacity factor back to 1.0. Restore the `aligned_v` reshard the isoflop branch had.
- Update `experiments/grug/moe/launch.py` to use `build_from_heuristic` for the baseline `ExecutorStep` instead of hardcoding `GrugModelConfig` / `GrugMoeAdamHConfig`. Manual specification is still supported by passing configs directly.
- Add `experiments/grug/moe/README.md` covering architecture, the scaling heuristic, v16 isoflop best runs per compute budget, projections (Paloma macro L∞ pinned at 1.6), and promotion criteria.
- `lib/levanter/src/levanter/grug/attention.py::align_kv_heads`: replace `jnp.repeat` with a reshape+broadcast pattern. Resolves the `ValueError: Please pass sharding to jnp.repeat via out_sharding parameter` crash under abstract-mesh training contexts.

Test plan
- Matched the `isoflop-moe-v16-1e+18-d1024` run to within ±0.005 at matching steps on a v5p-8 iris run (`4_10_test_moe`).
- `build_from_heuristic(budget=1e18, hidden_dim=1024)` reproduces the isoflop d1024 hyperparameters exactly (lr=0.01, adam_lr=0.002308, beta2=0.999, bs=32, steps=5622).
- Pre-commit checks (`./infra/pre-commit.py`) passing.
- Verified the `align_kv_heads` reshape is equivalent to the old `jnp.repeat` under GQA.

🤖 Generated with Claude Code
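The GQA equivalence check from the test plan can be reproduced standalone. Shapes here are toy values for illustration; the actual check in the repo may use different dimensions.

```python
import jax.numpy as jnp
import numpy as np

# Toy GQA shapes: 2 KV heads expanded to 8 query heads (group size 4).
batch, seq, num_kv, head_dim, group = 2, 5, 2, 4, 4
kv = jnp.arange(batch * seq * num_kv * head_dim, dtype=jnp.float32).reshape(
    batch, seq, num_kv, head_dim
)

old = jnp.repeat(kv, group, axis=2)  # the pattern the PR replaces
new = jnp.broadcast_to(
    kv[:, :, :, None, :], (batch, seq, num_kv, group, head_dim)
).reshape(batch, seq, num_kv * group, head_dim)

# The two patterns agree element-for-element under GQA head expansion.
assert np.array_equal(np.asarray(old), np.asarray(new))
```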