
feat: add Muon optimizer with distributed sharding support#78335

Merged
swgu98 merged 6 commits into PaddlePaddle:develop from xxyux:xxyux/muon-dev
Apr 15, 2026

Conversation

@xxyux
Contributor

@xxyux xxyux commented Mar 17, 2026

PR Category

Execute Infrastructure

PR Types

New features

Description

Add Muon optimizer implementation with Newton-Schulz orthogonalization for distributed training:

  • Muon optimizer (python/paddle/optimizer/muon.py):

    • Newton-Schulz iteration for orthogonal gradient updates
    • QKV split modes: per_head, qkv_sep, full
    • FFN gate_up split support
    • Multiple NS coefficient types: simple, quintic, polar_express, aol
  • MuonShardingOptimizer:

    • Whole-tensor assignment for 2D parameters (Muon)
    • Element-wise sharding for non-2D parameters (AdamW)
    • Hybrid memory balancing across ranks
  • Test coverage:

    • All 24 parameter combinations tested
    • 2-GPU sharding validation against single-GPU reference

Does this cause any precision change?
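The Newton-Schulz iteration at the core of Muon can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the PR's Paddle implementation in python/paddle/optimizer/muon.py; the function name is hypothetical, and the quintic coefficients are the ones popularized by the original Muon write-up (the PR also supports simple, polar_express, and aol coefficient types).

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D gradient matrix via Newton-Schulz.

    Illustrative sketch only. Quintic coefficients from the original Muon
    write-up; the iteration drives all singular values toward ~1 without
    computing an SVD explicitly.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.astype(np.float32)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is small
    # Frobenius normalization bounds the spectral norm by 1, which the
    # iteration's convergence region requires.
    x = x / (np.linalg.norm(x) + eps)
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

Muon then applies this (approximately) orthogonalized matrix, rather than the raw momentum buffer, as the update direction for 2D weight matrices.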

@paddle-bot

paddle-bot bot commented Mar 17, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@xxyux xxyux force-pushed the xxyux/muon-dev branch 2 times, most recently from ee06d0f to 9d17862 on March 17, 2026 at 12:18
@xxyux
Contributor Author

xxyux commented Mar 17, 2026

/re-run all-failed

1 similar comment
@xxyux
Contributor Author

xxyux commented Mar 18, 2026

/re-run all-failed

Contributor

@GuoxiaWang GuoxiaWang left a comment


Please make the requested changes first.

Comment thread python/paddle/optimizer/muon.py Outdated
Comment thread python/paddle/optimizer/muon.py Outdated
Comment thread python/paddle/optimizer/muon.py
Comment thread python/paddle/optimizer/muon.py Outdated
@xxyux xxyux force-pushed the xxyux/muon-dev branch 2 times, most recently from dc310a7 to c20b962 on March 23, 2026 at 13:59
@xxyux
Contributor Author

xxyux commented Mar 23, 2026

/re-run all-failed

3 similar comments
@xxyux
Contributor Author

xxyux commented Mar 23, 2026

/re-run all-failed

@xxyux
Contributor Author

xxyux commented Mar 24, 2026

/re-run all-failed

@xxyux
Contributor Author

xxyux commented Mar 24, 2026

/re-run all-failed

@xxyux
Contributor Author

xxyux commented Mar 24, 2026

/re-run all-failed

1 similar comment
@xxyux
Contributor Author

xxyux commented Mar 25, 2026

/re-run all-failed

Add Muon optimizer implementation with Newton-Schulz orthogonalization
for distributed training:

- Muon optimizer (python/paddle/optimizer/muon.py):
  - Newton-Schulz iteration for orthogonal gradient updates
  - QKV split modes: per_head, qkv_sep, full
  - FFN gate_up split support
  - Multiple NS coefficient types: simple, quintic, polar_express, aol

- MuonShardingOptimizer:
  - Whole-tensor assignment for 2D parameters (Muon)
  - Element-wise sharding for non-2D parameters (AdamW)
  - Hybrid memory balancing across ranks

- Test coverage:
  - All 24 parameter combinations tested
  - 2-GPU sharding validation against single-GPU reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
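The hybrid memory balancing described in the commit above, with whole-tensor assignment for 2D parameters and element-wise sharding for the rest, can be sketched as a greedy bin-packing problem. This is an illustrative stand-in, not the actual MuonShardingOptimizer API; the function name and the parameter dicts are hypothetical.

```python
import numpy as np

def assign_params_to_ranks(params, num_ranks):
    """Hypothetical sketch of the hybrid sharding idea.

    2D parameters (updated by Muon) must stay whole on one rank because
    Newton-Schulz needs the full matrix; each is greedily placed on the
    currently lightest-loaded rank, largest tensors first. Non-2D
    parameters (updated by AdamW) are sharded element-wise, so their
    load spreads evenly across all ranks.
    """
    load = [0] * num_ranks
    plan = {r: [] for r in range(num_ranks)}
    # Whole-tensor assignment for 2D params, largest first for better balance.
    two_d = sorted((p for p in params if len(p["shape"]) == 2),
                   key=lambda p: -int(np.prod(p["shape"])))
    for p in two_d:
        r = min(range(num_ranks), key=lambda i: load[i])
        plan[r].append(p["name"])
        load[r] += int(np.prod(p["shape"]))
    # Element-wise sharding for everything else: each rank owns 1/num_ranks.
    for p in params:
        if len(p["shape"]) != 2:
            per_rank = int(np.prod(p["shape"])) // num_ranks
            for r in range(num_ranks):
                plan[r].append(f"{p['name']}[shard {r}]")
                load[r] += per_rank
    return plan, load
```

With two equal-sized weight matrices and a shared bias on 2 ranks, this yields identical per-rank element counts, which is the balancing property the PR's test validates against a single-GPU reference.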
@xxyux
Contributor Author

xxyux commented Mar 25, 2026

/re-run all-failed

From00 previously approved these changes Mar 31, 2026
Contributor

@From00 From00 left a comment


LGTM

GuoxiaWang previously approved these changes Mar 31, 2026
@xxyux xxyux dismissed stale reviews from GuoxiaWang and From00 via 78788e9 April 1, 2026 07:49
MuonShardingOptimizer is a dygraph-only optimizer designed for Muon
optimizer with distributed sharding support. It should not be loaded
by the static graph meta optimizer factory, similar to:
- HybridParallelOptimizer
- HeterParallelOptimizer
- DGCMomentumOptimizer

This fix prevents static graph tests (e.g., test_static_model_parallel,
test_raw_program_optimizer) from crashing when MuonShardingOptimizer
tries to access dynamic-graph-only attributes like _parameter_list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
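The fix described in this commit amounts to filtering dygraph-only optimizers out of what a static-graph meta optimizer factory will probe. The sketch below illustrates that idea only; the names and the filtering function are hypothetical stand-ins, not Paddle's actual factory internals.

```python
# Optimizers that only work in dynamic graph (dygraph) mode and therefore
# must not be probed by a static-graph meta optimizer factory. The first
# three are the precedents named in the commit message; the last is this
# PR's addition.
DYGRAPH_ONLY_OPTIMIZERS = {
    "HybridParallelOptimizer",
    "HeterParallelOptimizer",
    "DGCMomentumOptimizer",
    "MuonShardingOptimizer",
}

def static_graph_candidates(optimizer_classes):
    """Drop dygraph-only optimizers before the factory instantiates them,
    so they never touch dynamic-graph-only attributes like _parameter_list
    from a static-graph context."""
    return [cls for cls in optimizer_classes
            if cls.__name__ not in DYGRAPH_ONLY_OPTIMIZERS]
```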
@xxyux
Contributor Author

xxyux commented Apr 1, 2026

/re-run all-failed

@codecov-commenter

codecov-commenter commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 12.88483% with 764 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@0b2356c). Learn more about missing BASE report.

Files with missing lines                                    | Patch % | Lines
...d/fleet/meta_optimizers/muon_sharding_optimizer.py       |   7.64% | 495 Missing ⚠️
python/paddle/optimizer/muon.py                             |  19.09% | 267 Missing ⚠️
...ers/dygraph_optimizer/hybrid_parallel_optimizer.py       |  71.42% |   2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (12.88%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #78335   +/-   ##
==========================================
  Coverage           ?   12.88%           
==========================================
  Files              ?        7           
  Lines              ?      877           
  Branches           ?        0           
==========================================
  Hits               ?      113           
  Misses             ?      764           
  Partials           ?        0           

GuoxiaWang previously approved these changes Apr 1, 2026
- Add MLAInfo dataclass and MLA split_head orthogonal update in muon.py
- Add clear_param_storage/reset_param_storage methods in MuonShardingOptimizer
- Support MoE expert param storage management via _color_to_comm_buffer_list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xxyux
Contributor Author

xxyux commented Apr 2, 2026

/re-run all-failed

1 similar comment
@xxyux
Contributor Author

xxyux commented Apr 2, 2026

/re-run all-failed

…l parameter list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@xxyux
Contributor Author

xxyux commented Apr 7, 2026

/re-run all-failed

1 similar comment
@xxyux
Contributor Author

xxyux commented Apr 7, 2026

/re-run all-failed

…grad

- Use MixPrecisionLayer + MixPrecisionOptimizer pattern so params have
  main_grad before MuonShardingOptimizer init, enabling the safe
  main_grad path in the refactored clear_grad (which now iterates over
  all parameters instead of only 2D params)
- Add paddle.amp.auto_cast in train_batch for proper BF16 forward pass
- Use np.random.randn for weight init (zero-centered, better NS stability)
- Cast params to float32 before numpy comparison to avoid BF16 uint16
  bit-pattern comparison issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
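The BF16 comparison pitfall mentioned in the last bullet can be demonstrated with NumPy, which has no native bfloat16 dtype, so BF16 tensors often surface as raw uint16 bit patterns. The helpers below emulate BF16 as the upper 16 bits of an IEEE float32; they are illustrative, not the test's actual code.

```python
import numpy as np

def f32_to_bf16_bits(x):
    """Truncate float32 values to their BF16 bit patterns (round-toward-zero)."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits):
    """Reinterpret BF16 bit patterns as float32 for safe numeric comparison.

    Comparing the raw uint16 patterns directly is wrong: the sign bit is the
    most significant bit, so every negative number compares *greater than*
    every positive one. Casting back to float32 first, as the test does,
    restores correct numeric semantics.
    """
    return (np.asarray(bits).astype(np.uint32) << 16).view(np.float32)
```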
@xxyux
Contributor Author

xxyux commented Apr 7, 2026

/re-run all-failed

GuoxiaWang previously approved these changes Apr 8, 2026
Contributor

@GuoxiaWang GuoxiaWang left a comment


Merging this first, but with the following TODOs:
1. Extract the update rules out of the framework.
2. Only GQA and MLA are supported; other attention variants are not, so the approach is not general enough.
3. AdamW parameter identification uses only a simple string `in` check (`pattern.lower() in name_lower`), which is not general enough.
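The over-matching risk of the `pattern.lower() in name_lower` substring check can be illustrated as follows. The patterns, parameter names, and a word-boundary regex alternative here are all hypothetical examples, not the PR's actual matching code.

```python
import re

# Illustrative patterns for parameters that should fall back to AdamW.
ADAMW_PATTERNS = ["bias", "norm", "embedding"]

def is_adamw_param_substring(name):
    """The simple check the reviewer flagged: bare substring containment."""
    name_lower = name.lower()
    return any(p in name_lower for p in ADAMW_PATTERNS)

def is_adamw_param_regex(name):
    """A sketch of a more robust alternative: require the pattern to be
    delimited by name separators ('.', '_'), a digit, or the string ends,
    so 'norm' does not match inside an unrelated token like 'abnormal'."""
    return any(re.search(rf"(^|[._]){p}($|[._\d])", name.lower())
               for p in ADAMW_PATTERNS)
```

A delimiter-aware match still needs a pattern list tuned to the model's actual naming scheme (e.g. fused names like `layernorm`), which is why the reviewer calls for a more general mechanism.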

@xxyux
Contributor Author

xxyux commented Apr 14, 2026

/re-run all-failed

1 similar comment
@xxyux
Contributor Author

xxyux commented Apr 14, 2026

/re-run all-failed

@xxyux
Contributor Author

xxyux commented Apr 14, 2026

TODOs:

  • Fetch the hybrid communication group from color instead of paddle.distributed.
  • Submit the FusionStorage support in a subsequent PR after the zcc feature is verified.
  • Extract the update rules out of the Paddle framework.

Collaborator

@sneaxiy sneaxiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for coverage due to lack of BF16.

@swgu98 swgu98 merged commit d349ab4 into PaddlePaddle:develop Apr 15, 2026
145 of 151 checks passed


7 participants