feat: add Muon optimizer with distributed sharding support #78335
swgu98 merged 6 commits into PaddlePaddle:develop
Conversation
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
Force-pushed from ee06d0f to 9d17862
/re-run all-failed
Force-pushed from dc310a7 to c20b962
Add Muon optimizer implementation with Newton-Schulz orthogonalization for distributed training:
- Muon optimizer (python/paddle/optimizer/muon.py):
  - Newton-Schulz iteration for orthogonal gradient updates
  - QKV split modes: per_head, qkv_sep, full
  - FFN gate_up split support
  - Multiple NS coefficient types: simple, quintic, polar_express, aol
- MuonShardingOptimizer:
  - Whole-tensor assignment for 2D parameters (Muon)
  - Element-wise sharding for non-2D parameters (AdamW)
  - Hybrid memory balancing across ranks
- Test coverage:
  - All 24 parameter combinations tested
  - 2-GPU sharding validation against single-GPU reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
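The Newton-Schulz iteration at the core of Muon can be sketched in plain NumPy. This is a hedged sketch, not the PR's Paddle implementation: the quintic coefficients below come from publicly published Muon recipes, and the PR's `simple`/`polar_express`/`aol` coefficient variants may differ.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix via a quintic Newton-Schulz iteration.

    Drives all singular values of G toward 1 while keeping its singular
    vectors, using only matmuls (no SVD), which is what makes it cheap on GPU.
    """
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (published Muon recipe)
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the short side for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

np.random.seed(0)
G = np.random.randn(64, 32)           # stand-in for a 2-D gradient
O = newton_schulz_orthogonalize(G)
# After a few steps, the singular values of O are clustered near 1.
```

The iteration applies the scalar polynomial `a*x + b*x^3 + c*x^5` to each singular value, so a handful of steps squashes the spectrum of the gradient toward the orthogonal factor without an explicit SVD.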
MuonShardingOptimizer is a dygraph-only optimizer designed for the Muon optimizer with distributed sharding support. It should not be loaded by the static graph meta optimizer factory, similar to:
- HybridParallelOptimizer
- HeterParallelOptimizer
- DGCMomentumOptimizer

This fix prevents static graph tests (e.g., test_static_model_parallel, test_raw_program_optimizer) from crashing when MuonShardingOptimizer tries to access dynamic-graph-only attributes like _parameter_list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report ❌ The patch status check has failed because the patch coverage (12.88%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

@@ Coverage Diff @@
##        develop   #78335   +/-   ##
=====================================
  Coverage        ?   12.88%
=====================================
  Files           ?        7
  Lines           ?      877
  Branches        ?        0
=====================================
  Hits            ?      113
  Misses          ?      764
  Partials        ?        0

View full report in Codecov by Sentry.
- Add MLAInfo dataclass and MLA split_head orthogonal update in muon.py
- Add clear_param_storage/reset_param_storage methods in MuonShardingOptimizer
- Support MoE expert param storage management via _color_to_comm_buffer_list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l parameter list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…grad
- Use MixPrecisionLayer + MixPrecisionOptimizer pattern so params have main_grad before MuonShardingOptimizer init, enabling the safe main_grad path in the refactored clear_grad (which now iterates over all parameters instead of only 2D params)
- Add paddle.amp.auto_cast in train_batch for a proper BF16 forward pass
- Use np.random.randn for weight init (zero-centered, better NS stability)
- Cast params to float32 before numpy comparison to avoid BF16 uint16 bit-pattern comparison issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
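The last bullet, casting BF16 parameters to float32 before comparing, can be illustrated in plain NumPy (a hedged sketch, not the PR's test code). BF16 is the upper 16 bits of a float32, so frameworks sometimes expose BF16 tensors as raw uint16 bit patterns; comparing those integers is not the same as comparing the values, because negative numbers have larger bit patterns than positives.

```python
import numpy as np

def float32_to_bf16_bits(x):
    """Truncate float32 to its top 16 bits: the bfloat16 bit pattern (round toward zero)."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_float32(bits):
    """Re-expand a bfloat16 bit pattern to float32 for value-level comparison."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

a = float32_to_bf16_bits([-1.0])   # bit pattern 0xBF80 (sign bit set)
b = float32_to_bf16_bits([1.0])    # bit pattern 0x3F80

# Bit-pattern comparison gets the ordering backwards for negative values:
assert a[0] > b[0]
# Casting back to float32 first compares the actual values correctly:
assert bf16_bits_to_float32(a)[0] < bf16_bits_to_float32(b)[0]
```

This is why the test casts to float32 before handing tensors to `numpy` comparisons rather than comparing the BF16 storage directly.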
GuoxiaWang left a comment
Merging this first, but with the following TODOs:
1. Extract the update rules out of the framework core.
2. Only GQA and MLA are supported; other attention variants are not, so the approach is not general enough.
3. AdamW parameter identification only uses a plain substring test (pattern.lower() in name_lower), which is not general enough.
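The substring concern in the third TODO can be made concrete with a small sketch (pattern and parameter names here are hypothetical, not taken from the PR):

```python
import re

# Hypothetical patterns marking parameters that should fall back to AdamW.
ADAMW_PATTERNS = ["norm", "bias", "embed"]

def is_adamw_param_substring(name):
    """The criticized approach: plain substring containment."""
    name_lower = name.lower()
    return any(p.lower() in name_lower for p in ADAMW_PATTERNS)

def is_adamw_param_component(name):
    """A stricter alternative: match whole name components split on '.', '_', '-'."""
    parts = re.split(r"[._\-]", name.lower())
    return any(p.lower() in parts for p in ADAMW_PATTERNS)

# "denorm_proj" contains "norm" as a substring but not as a name component:
assert is_adamw_param_substring("denorm_proj.weight")       # false positive
assert not is_adamw_param_component("denorm_proj.weight")   # correctly skipped
assert is_adamw_param_component("layer_norm.weight")        # still matched
```

Component-level matching (or explicit per-parameter tagging) avoids routing 2-D weights to AdamW just because their names happen to contain a pattern as a substring.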
sneaxiy left a comment
LGTM for coverage due to lack of BF16.
PR Category
Execute Infrastructure
PR Types
New features
Description
Add Muon optimizer implementation with Newton-Schulz orthogonalization for distributed training:
- Muon optimizer (python/paddle/optimizer/muon.py)
- MuonShardingOptimizer
- Test coverage
Does this PR change numerical accuracy?
No
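Putting the pieces of the description together, a single Muon-style update on a 2-D parameter can be sketched in plain NumPy. This is a hedged, illustrative sketch, not the PR's Paddle code: the momentum/learning-rate values and the shape-aware scale follow public Muon write-ups and may differ from this implementation.

```python
import numpy as np

def _ns_orthogonalize(M, steps=5, eps=1e-7):
    # Quintic Newton-Schulz; coefficients from published Muon recipes.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, beta=0.95):
    """One Muon-style update on a 2-D parameter (in place); buf is the momentum buffer."""
    buf *= beta
    buf += grad                              # heavy-ball momentum
    update = _ns_orthogonalize(buf)          # orthogonalized search direction
    # Shape-aware scale used in public Muon implementations (illustrative).
    scale = np.sqrt(max(1.0, param.shape[0] / param.shape[1]))
    param -= lr * scale * update
    return param

np.random.seed(0)
W = np.random.randn(16, 8) * 0.02            # toy 2-D weight
g = np.random.randn(16, 8)                   # toy gradient
m = np.zeros_like(W)                         # momentum state
W_new = muon_step(W.copy(), g, m)
```

Non-2-D parameters (biases, norms, embeddings) cannot be meaningfully orthogonalized this way, which is why the PR routes them to AdamW and shards them element-wise while assigning whole 2-D tensors per rank for Muon.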