
Commit 5710a74

[Feature] Support EP and improve step latency for Muon
- Remove the `distributed_mesh` parameter; extract the device mesh and process group directly from DTensor metadata, so heterogeneous meshes (ViT 1D + LM 2D) are supported (a sketch of the lookup follows the diff below).
- Pre-compute the `adjust_lr` ratios in `__init__` from the global (unsharded) shape, avoiding incorrect shape references inside async tasks after communication.
- Add MoE expert-parallel support with per-expert Newton-Schulz orthogonalization (see the sketches after this list); the EP dimension requires `n_experts % ep_size == 0`. On the FSDP dimension, skip communication when `n_experts % fsdp_size == 0` (each rank holds complete experts), use a sub-group all-gather when `fsdp_size % n_experts == 0`, and otherwise fall back to a batched all-to-all.
- Add an AGRS (All-Gather + Reduce-Scatter) path for remainder batches to avoid zero-padding overhead, with an even-sharding guard to prevent deadlock.
- Refactor shared utilities (`group_tensors_by_device_mesh_and_placements`, `cal_total_norm`) out of grad_norm.py into dtensor.py.
- Remove `# type: ignore` from the file head, fix lint, and add full type annotations.
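For the per-expert orthogonalization mentioned above, the sketch below shows one way it can look: the standard Muon quintic Newton-Schulz iteration operates on the last two dimensions, so a stacked expert tensor of shape (n_local_experts, d_out, d_in) is handled by batched matmuls with no per-expert Python loop. The function name, argument names, and dtype choice are illustrative assumptions, not the code in this commit.

```python
import torch


def per_expert_newton_schulz(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Orthogonalize a stacked expert update of shape (n_local_experts, d_out, d_in).

    Standard Muon quintic Newton-Schulz iteration; the batched matmuls cover the
    leading expert dimension, so all local experts are processed in one call.
    (Illustrative sketch, not the commit's implementation.)
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.to(torch.bfloat16)
    transposed = x.size(-2) > x.size(-1)
    if transposed:  # run the iteration on the wide orientation
        x = x.mT
    x = x / (x.norm(dim=(-2, -1), keepdim=True) + eps)
    for _ in range(steps):
        gram = x @ x.mT
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    if transposed:
        x = x.mT
    return x.to(g.dtype)


# Example: 4 local experts, each with a (1024, 256) weight matrix.
update = per_expert_newton_schulz(torch.randn(4, 1024, 256))
```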
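The divisibility rules for the FSDP dimension in the MoE bullet amount to a small dispatch. A minimal sketch with hypothetical names (the function and the string labels are not claimed to exist in the repo):

```python
def pick_fsdp_comm_path(n_experts: int, fsdp_size: int) -> str:
    """Choose how to reassemble full expert matrices along the FSDP dimension.

    Mirrors the rules above: if every rank already owns whole experts, no
    communication is needed; if several ranks share one expert, gather inside
    that sub-group only; otherwise fall back to a batched all-to-all.
    """
    if n_experts % fsdp_size == 0:
        return "no_comm"               # each rank holds n_experts // fsdp_size complete experts
    if fsdp_size % n_experts == 0:
        return "subgroup_all_gather"   # fsdp_size // n_experts ranks share each expert
    return "batched_all_to_all"        # irregular split: exchange shards explicitly


assert pick_fsdp_comm_path(n_experts=8, fsdp_size=4) == "no_comm"
assert pick_fsdp_comm_path(n_experts=2, fsdp_size=8) == "subgroup_all_gather"
assert pick_fsdp_comm_path(n_experts=6, fsdp_size=4) == "batched_all_to_all"
```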
1 parent c0d0476 commit 5710a74

6 files changed

Lines changed: 822 additions & 442 deletions

File tree

xtuner/v1/config/optim.py

Lines changed: 0 additions & 5 deletions
@@ -185,14 +185,9 @@ def build(self, model):
             f"Muon params: {num_muon_regular / 1e6:.2f}M, AdamW params: {num_adamw / 1e6:.2f}M (counts by numel)"
         )
         logger.info(f"Untrainable parameters names: {untrainable_names}")
-        logger.info(
-            f"using Muon optimizer distributed_mesh_size: {model.fsdp_mesh.size()}, "
-            f"distributed_mesh: {model.fsdp_mesh}"
-        )

         optimizer = Muon(
             param_groups,
-            distributed_mesh=model.language_model.fsdp_mesh,  # TODO: EP > 1 not yet supported; maybe rm device_mesh dependency?
             lr=self.lr,
             mu=self.momentum,
             betas=self.betas,
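The diff above drops the distributed_mesh argument because the optimizer can recover what it needs from each parameter's own DTensor metadata, which is what lets a 1D ViT mesh and a 2D LM mesh coexist under one optimizer. A rough sketch of that lookup, assuming a recent PyTorch where DTensor is importable from torch.distributed.tensor; the helper name and return convention are illustrative, and the fact that DTensor.shape reports the global (unsharded) shape is the property the pre-computed adjust_lr ratios rely on:

```python
import torch
from torch.distributed.tensor import DTensor


def sharding_info(p: torch.Tensor):
    """Read mesh, placements, process groups, and the global shape off a parameter.

    Illustrative sketch only; not the helper used in this commit.
    """
    if not isinstance(p, DTensor):
        return None  # plain tensor: no collectives needed for its update
    mesh = p.device_mesh                       # per-parameter mesh: 1D for ViT, 2D for the LM
    placements = p.placements                  # e.g. (Shard(0),) or (Replicate(), Shard(0))
    groups = [mesh.get_group(d) for d in range(mesh.ndim)]  # one process group per mesh dim
    global_shape = tuple(p.shape)              # logical (unsharded) shape, not the local shard's
    return mesh, placements, groups, global_shape
```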
