feat: add Muon optimizer with distributed sharding support #78335
swgu98 merged 6 commits into PaddlePaddle:develop
Conversation
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
Force-pushed from ee06d0f to 9d17862
/re-run all-failed
Force-pushed from dc310a7 to c20b962
Add Muon optimizer implementation with Newton-Schulz orthogonalization for distributed training:
- Muon optimizer (python/paddle/optimizer/muon.py):
  - Newton-Schulz iteration for orthogonal gradient updates
  - QKV split modes: per_head, qkv_sep, full
  - FFN gate_up split support
  - Multiple NS coefficient types: simple, quintic, polar_express, aol
- MuonShardingOptimizer:
  - Whole-tensor assignment for 2D parameters (Muon)
  - Element-wise sharding for non-2D parameters (AdamW)
  - Hybrid memory balancing across ranks
- Test coverage:
  - All 24 parameter combinations tested
  - 2-GPU sharding validation against single-GPU reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
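The Newton-Schulz iteration at the core of Muon can be sketched in plain NumPy. This is a hedged sketch, not the PR's Paddle implementation: the quintic coefficients below come from publicly published Muon recipes, and the PR's `simple`/`polar_express`/`aol` coefficient variants may differ.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix via a quintic Newton-Schulz iteration.

    Drives all singular values of G toward 1 while keeping its singular
    vectors, using only matmuls (no SVD), which is what makes it cheap on GPU.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (published Muon recipe)
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

np.random.seed(0)
G = np.random.randn(64, 32)           # stand-in for a 2-D gradient
O = newton_schulz_orthogonalize(G)
# After a few steps, the singular values of O are clustered near 1.
```

The iteration applies the scalar polynomial `a*x + b*x^3 + c*x^5` to each singular value, so a handful of steps squashes the spectrum of the gradient toward the orthogonal factor without an explicit SVD.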
MuonShardingOptimizer is a dygraph-only optimizer designed for the Muon optimizer with distributed sharding support. It should not be loaded by the static graph meta optimizer factory, similar to:
- HybridParallelOptimizer
- HeterParallelOptimizer
- DGCMomentumOptimizer

This fix prevents static graph tests (e.g., test_static_model_parallel, test_raw_program_optimizer) from crashing when MuonShardingOptimizer tries to access dynamic-graph-only attributes like _parameter_list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ❌ The patch status check has failed because the patch coverage (12.88%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

@@ Coverage Diff @@
##        develop   #78335   +/-   ##
=====================================
  Coverage        ?   12.88%
=====================================
  Files           ?        7
  Lines           ?      877
  Branches        ?        0
=====================================
  Hits            ?      113
  Misses          ?      764
  Partials        ?        0

View full report in Codecov by Sentry.
- Add MLAInfo dataclass and MLA split_head orthogonal update in muon.py
- Add clear_param_storage/reset_param_storage methods in MuonShardingOptimizer
- Support MoE expert param storage management via _color_to_comm_buffer_list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l parameter list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…grad
- Use MixPrecisionLayer + MixPrecisionOptimizer pattern so params have main_grad before MuonShardingOptimizer init, enabling the safe main_grad path in the refactored clear_grad (which now iterates over all parameters instead of only 2D params)
- Add paddle.amp.auto_cast in train_batch for a proper BF16 forward pass
- Use np.random.randn for weight init (zero-centered, better NS stability)
- Cast params to float32 before numpy comparison to avoid BF16 uint16 bit-pattern comparison issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
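The last bullet, casting BF16 parameters to float32 before comparing, can be illustrated in plain NumPy (a hedged sketch, not the PR's test code). BF16 is the upper 16 bits of a float32, so frameworks sometimes expose BF16 tensors as raw uint16 bit patterns; comparing those integers is not the same as comparing the values, because negative numbers have larger bit patterns than positives.

```python
import numpy as np

def float32_to_bf16_bits(x):
    """Truncate float32 to its top 16 bits: the bfloat16 bit pattern (round toward zero)."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_float32(bits):
    """Re-expand a bfloat16 bit pattern to float32 for value-level comparison."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

a = float32_to_bf16_bits([-1.0])   # bit pattern 0xBF80 (sign bit set)
b = float32_to_bf16_bits([1.0])    # bit pattern 0x3F80

# Bit-pattern comparison gets the ordering backwards for negative values:
assert a[0] > b[0]
# Casting back to float32 first compares the actual values correctly:
assert bf16_bits_to_float32(a)[0] < bf16_bits_to_float32(b)[0]
```

This is why the test casts to float32 before handing tensors to `numpy` comparisons rather than comparing the BF16 storage directly.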
GuoxiaWang left a comment
Merging this first, but with the following TODOs:
1. Extract the update rules out of the framework core.
2. Only GQA and MLA are supported; other attention variants are not, so the approach is not general enough.
3. AdamW parameter identification only uses a plain substring test (pattern.lower() in name_lower), which is not general enough.
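The substring concern in the third TODO can be made concrete with a small sketch (pattern and parameter names here are hypothetical, not taken from the PR):

```python
import re

# Hypothetical patterns marking parameters that should fall back to AdamW.
ADAMW_PATTERNS = ["norm", "bias", "embed"]

def is_adamw_param_substring(name):
    """The criticized approach: plain substring containment."""
    name_lower = name.lower()
    return any(p.lower() in name_lower for p in ADAMW_PATTERNS)

def is_adamw_param_component(name):
    """A stricter alternative: match whole name components split on '.', '_', '-'."""
    parts = re.split(r"[._\-]", name.lower())
    return any(p.lower() in parts for p in ADAMW_PATTERNS)

# "denorm_proj" contains "norm" as a substring but not as a name component:
assert is_adamw_param_substring("denorm_proj.weight")       # false positive
assert not is_adamw_param_component("denorm_proj.weight")   # correctly skipped
assert is_adamw_param_component("layer_norm.weight")        # still matched
```

Component-level matching (or explicit per-parameter tagging) avoids routing 2-D weights to AdamW just because their names happen to contain a pattern as a substring.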
sneaxiy left a comment
LGTM for coverage due to lack of BF16.
PR Category
Execute Infrastructure
PR Types
New features
Description
Add Muon optimizer implementation with Newton-Schulz orthogonalization for distributed training:
- Muon optimizer (python/paddle/optimizer/muon.py)
- MuonShardingOptimizer
- Test coverage
Does this PR change numerical accuracy?
No
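Putting the pieces of the description together, a single Muon-style update on a 2-D parameter can be sketched in plain NumPy. This is a hedged, illustrative sketch, not the PR's Paddle code: the momentum/learning-rate values and the shape-aware scale follow public Muon write-ups and may differ from this implementation.

```python
import numpy as np

def _ns_orthogonalize(M, steps=5, eps=1e-7):
    # Quintic Newton-Schulz; coefficients from published Muon recipes.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style update on a 2-D parameter (in place); buf is the momentum buffer."""
    buf *= beta
    buf += grad                              # heavy-ball momentum
    update = _ns_orthogonalize(buf)          # orthogonalized search direction
    # Shape-aware scale used in public Muon implementations (illustrative).
    scale = np.sqrt(max(1.0, param.shape[0] / param.shape[1]))
    param -= lr * scale * update
    return param

np.random.seed(0)
W = np.random.randn(16, 8) * 0.02            # toy 2-D weight
g = np.random.randn(16, 8)                   # toy gradient
m = np.zeros_like(W)                         # momentum state
W_new = muon_step(W.copy(), g, m)
```

Non-2-D parameters (biases, norms, embeddings) cannot be meaningfully orthogonalized this way, which is why the PR routes them to AdamW and shards them element-wise while assigning whole 2-D tensors per rank for Muon.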