[grug] Fix multi-host MoE ShardingTypeError on shared-expert residual #6297

Open
claude[bot] wants to merge 1 commit into main from agent/20260609-fix-6296

Conversation

claude[bot] (Contributor) commented Jun 9, 2026

DenseMLP (shared expert) reshards its output on the flat [T, D] tensor via the einsum out_sharding and reshapes after, while MoEMLP (routed) reshapes first and reshards after. Splitting the fused (replica_dcn, data, expert) token axis back into (batch, seq) leaks the expert axis onto the seq dim, so the shared-expert residual add at model.py:467 disagrees with the routed path once replica_dcn > 1. On a single host the two layouts coincide, which is why single-node and TPU canary runs pass.

Reshard DenseMLP's output after the reshape so it carries the same canonical batch sharding as the routed output before the residual add. The reshard is a no-op on a trivial mesh, so single-host behavior is unchanged.
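The ordering issue can be sketched with a toy model of per-dimension partition specs. This is plain Python, not JAX: `residual_add`, `reshard`, `ShardingMismatch`, and the two concrete specs `CANONICAL` and `LEAKED` are hypothetical names and assumed layouts chosen for illustration, not the actual code or specs from `model.py`.

```python
from typing import Tuple, Union

# One entry per array dimension: None (replicated), a mesh axis name,
# or a tuple of mesh axis names fused onto that dimension.
DimSpec = Union[None, str, Tuple[str, ...]]
Spec = Tuple[DimSpec, ...]


class ShardingMismatch(TypeError):
    """Toy stand-in for the ShardingTypeError named in this PR."""


def residual_add(lhs: Spec, rhs: Spec) -> Spec:
    """Model an elementwise add that requires identical operand specs."""
    if lhs != rhs:
        raise ShardingMismatch(f"operand specs differ: {lhs} vs {rhs}")
    return lhs


def reshard(spec: Spec, target: Spec) -> Spec:
    """Model an explicit reshard: the result carries the target spec."""
    if len(spec) != len(target):
        raise ValueError("reshard cannot change the number of dimensions")
    return target


# ASSUMPTION: canonical [batch, seq, D] layout, token axes fused on batch.
CANONICAL: Spec = (("replica_dcn", "data", "expert"), None, None)

# ASSUMPTION: layout of the shared-expert output after splitting the flat
# [T, D] tensor, with the expert axis landing on the seq dimension.
LEAKED: Spec = (("replica_dcn", "data"), "expert", None)

routed_out: Spec = CANONICAL


def shared_expert_residual(dense_out: Spec, fix: bool) -> Spec:
    """Add the shared-expert output to the routed output.

    With fix=True the dense output is resharded after the reshape,
    mirroring the order the routed path already uses.
    """
    if fix:
        dense_out = reshard(dense_out, CANONICAL)
    return residual_add(routed_out, dense_out)
```

In this model, `shared_expert_residual(LEAKED, fix=False)` raises `ShardingMismatch`, while `fix=True` returns `CANONICAL`. Passing an already-canonical spec with `fix=True` changes nothing, which corresponds to the no-op claim for a trivial mesh.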

Adds a lowering regression test on a replica_dcn=2 abstract mesh (the canary regime: expert axis size 1, two replicas), which reproduces the exact ShardingTypeError before the fix.

Fixes #6296

DenseMLP (shared expert) reshards its output on the flat [T, D] tensor and
reshapes after, while MoEMLP (routed) reshapes first and reshards after.
Splitting the fused (replica_dcn, data, expert) token axis back into
(batch, seq) leaks the expert axis onto the seq dim, so the shared-expert
residual add disagrees with the routed path once replica_dcn > 1. On a single
host the two layouts coincide, which is why single-node and TPU canary runs
pass.

Reshard DenseMLP's output after the reshape so it carries the same canonical
batch sharding as the routed output before the residual add.

Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

[grug] Multi-host MoE training fails with ShardingTypeError on shared-expert residual add