[grug] Fix multi-host MoE ShardingTypeError on shared-expert residual by claude[bot] · Pull Request #6297 · marin-community/marin

claude · 2026-06-09T04:36:48Z

DenseMLP (shared expert) reshards its output on the flat [T, D] tensor via the einsum out_sharding and reshapes after, while MoEMLP (routed) reshapes first and reshards after. Splitting the fused (replica_dcn, data, expert) token axis back into (batch, seq) leaks the expert axis onto the seq dim, so the shared-expert residual add at model.py:467 disagrees with the routed path once replica_dcn > 1. On a single host the two layouts coincide, which is why single-node and TPU canary runs pass.

Reshard DenseMLP's output after the reshape so it carries the same canonical batch sharding as the routed output before the residual add. The reshard is a no-op on a trivial mesh, so single-host behavior is unchanged.
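
The "reshape first, reshard after" ordering can be sketched on a trivial one-device mesh. This is a minimal illustration, not the code in this PR: `dense_mlp_output` is a hypothetical helper, the batch layout `P(("replica_dcn", "data", "expert"), None, None)` is an assumed stand-in for the canonical batch sharding, and `jax.lax.with_sharding_constraint` stands in for whatever reshard call the model actually uses. It only demonstrates that the reshard leaves values unchanged when every mesh axis has size 1.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Trivial 1x1x1 mesh. Axis names follow the PR text; the sizes are illustrative.
devices = np.array(jax.devices()[:1]).reshape(1, 1, 1)
mesh = Mesh(devices, ("replica_dcn", "data", "expert"))

# Assumed canonical layout: batch dim over all three mesh axes,
# seq and hidden dims replicated.
batch_sharding = NamedSharding(
    mesh, P(("replica_dcn", "data", "expert"), None, None)
)


def dense_mlp_output(flat_out, batch, seq):
    """Hypothetical helper: reshape [T, D] -> [B, S, D] first, then reshard."""
    out = flat_out.reshape(batch, seq, flat_out.shape[-1])
    return jax.lax.with_sharding_constraint(out, batch_sharding)


B, S, D = 4, 8, 16
flat = jnp.arange(B * S * D, dtype=jnp.float32).reshape(B * S, D)
routed = jnp.ones((B, S, D), dtype=jnp.float32)  # stand-in for the routed output

shared = jax.jit(dense_mlp_output, static_argnums=(1, 2))(flat, B, S)
residual = routed + shared  # both operands are [B, S, D]
```

On this mesh the constraint changes nothing numerically, which matches the single-host claim above. The multi-host mismatch itself is not reproduced here, since that needs `replica_dcn > 1`.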

Adds a lowering regression test on a replica_dcn=2 abstract mesh (the canary regime: expert axis size 1, two replicas), which reproduces the exact ShardingTypeError before the fix.

Fixes #6296

DenseMLP (shared expert) reshards its output on the flat [T, D] tensor and reshapes after, while MoEMLP (routed) reshapes first and reshards after. Splitting the fused (replica_dcn, data, expert) token axis back into (batch, seq) leaks the expert axis onto the seq dim, so the shared-expert residual add disagrees with the routed path once replica_dcn > 1. On a single host the two layouts coincide, which is why single-node and TPU canary runs pass. Reshard DenseMLP's output after the reshape so it carries the same canonical batch sharding as the routed output before the residual add.

claude Bot added the agent-generated (Created by automation/agent) label Jun 9, 2026

claude Bot mentioned this pull request Jun 9, 2026

[grug] Multi-host MoE training fails with ShardingTypeError on shared-expert residual add #6296

Closed

claude[bot] wants to merge 1 commit into main from agent/20260609-fix-6296
