Commit 5710a74
[Feature] Support EP and improve step latency for Muon
- Remove the distributed_mesh parameter; extract the device mesh and process group directly from DTensor metadata to support heterogeneous meshes (ViT 1D + LM 2D). See the first sketch after this list.
- Pre-compute adjust_lr ratios in __init__ based on the global (unsharded) shape, avoiding incorrect shape references inside async tasks after communication. Sketched below.
- Add MoE expert-parallel support: per-expert Newton-Schulz orthogonalization, which requires n_experts % ep_size == 0 on the EP dimension. On the FSDP dimension, skip communication when n_experts % fsdp_size == 0 (each rank holds complete experts), use a sub-group all-gather when fsdp_size % n_experts == 0, and otherwise fall back to a batched all-to-all. Both the iteration and the dispatch are sketched below.
- Add an AGRS (All-Gather + Reduce-Scatter) path for remainder batches to avoid zero-padding overhead, with an even-sharding guard to prevent deadlock. Sketched below.
- Refactor shared utilities (group_tensors_by_device_mesh_and_placements, cal_total_norm) from grad_norm.py to dtensor.py. A sketch of the grouping helper follows.
- Remove `# type: ignore` from the file head, fix lint, and add full type annotations.
6 files changed: 822 additions, 442 deletions
File tree
- xtuner/v1
  - config
  - optim
  - utils