
fix(dummy_loader): always create moe_mesh for expert-sharded weights #1028

Open

JamesBrianD wants to merge 1 commit into sgl-project:main from primatrix:fix/dummy-loader-epmoe-mesh

Conversation

@JamesBrianD
Collaborator

Summary

  • _load_dummy_weights only created a moe_mesh with ("expert", "tensor") axes when ep_size > 1, falling back to self.mesh ("data", "tensor") otherwise. As a result, NamedSharding(self.mesh, P("expert", ...)) crashed, because that mesh has no "expert" axis.
  • Always create moe_mesh whenever the sharding spec contains "expert", matching the real weight-loading paths (lines 2032 and 2145 in weight_utils.py). Also pass axis_types=Explicit, consistent with the other mesh constructions. See the sketch after this list.
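To make the failure concrete, here is a minimal, hedged sketch of both meshes. It forces four host CPU devices so it runs on a single machine; the mesh shapes and variable names are illustrative, not copied from weight_utils.py:

```python
import os

# Force four host CPU devices so the sketch runs anywhere
# (must be set before JAX initializes).
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=4"

import jax
import numpy as np
from jax.sharding import AxisType, Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices()[:4]).reshape(2, 2)

# Before the fix: with ep_size == 1 the dummy loader fell back to
# self.mesh, whose axes are ("data", "tensor").
default_mesh = Mesh(devices, ("data", "tensor"))

# An expert-sharded spec against that mesh raises at construction,
# because the mesh has no "expert" axis:
#   NamedSharding(default_mesh, P("expert", None))  # -> ValueError

# After the fix: a moe_mesh with ("expert", "tensor") axes is built
# whenever the spec mentions "expert", so the same spec resolves.
# axis_types requires a recent JAX with jax.sharding.AxisType.
moe_mesh = Mesh(
    devices,
    ("expert", "tensor"),
    axis_types=(AxisType.Explicit, AxisType.Explicit),
)
expert_sharding = NamedSharding(moe_mesh, P("expert", None))  # valid now
print(expert_sharding)
```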

Test plan

  • Verified on v7x 2x2x2 (16 devices, 2 hosts) with Qwen3-30B-A3B and --load-format dummy --tp-size=16 --nnodes=2: dummy weights load successfully and the server enters the precompile stage.

Fixes #1022


@JamesBrianD requested a review from Prayer3th on May 6, 2026 at 06:53
@Prayer3th
Collaborator

Prayer3th previously approved these changes on May 6, 2026

@Prayer3th left a comment


LGTM

The _load_dummy_weights method only created a moe_mesh with
("expert", "tensor") axes when ep_size > 1, falling back to
self.mesh ("data", "tensor") otherwise. This caused a crash because
NamedSharding with P("expert", ...) requires a mesh that has an
"expert" axis. The real weight loading paths always create moe_mesh
regardless of ep_size. Align dummy loader to match.

Fixes sgl-project#1022
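
For illustration, a hedged sketch of how a dummy expert weight could be placed once moe_mesh exists. The helper name, shapes, and dtype are hypothetical, not the loader's actual code; it assumes a moe_mesh like the one built in the sketch above:

```python
import jax
import jax.numpy as jnp
from jax.sharding import NamedSharding, PartitionSpec as P

def place_dummy_expert_weight(moe_mesh, num_experts=8, hidden=16):
    """Hypothetical helper: place a zero-filled expert weight.

    The expert dimension is split across the "expert" mesh axis and the
    remaining dimensions are replicated. This spec only resolves because
    moe_mesh is guaranteed to carry an "expert" axis after the fix.
    """
    sharding = NamedSharding(moe_mesh, P("expert", None, None))
    w = jnp.zeros((num_experts, hidden, hidden), dtype=jnp.bfloat16)
    return jax.device_put(w, sharding)
```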


Development

Successfully merging this pull request may close these issues.

[Bug] Model launch fails with tp=16

2 participants