
Add Muon Optimizer [cherry-pick from dev]#78679

Merged
sneaxiy merged 7 commits into PaddlePaddle:release/3.3 from xxyux:release/3.3 on Apr 15, 2026
Conversation

@xxyux
Contributor

@xxyux xxyux commented Apr 14, 2026

PR Category

Execute Infrastructure

PR Types

New features

Description

Add Muon Optimizer
dev PR: #78335

Does this cause precision changes?

@paddle-bot

paddle-bot bot commented Apr 14, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

xxyux and others added 6 commits April 14, 2026 20:57
Add Muon optimizer implementation with Newton-Schulz orthogonalization
for distributed training:

- Muon optimizer (python/paddle/optimizer/muon.py):
  - Newton-Schulz iteration for orthogonal gradient updates
  - QKV split modes: per_head, qkv_sep, full
  - FFN gate_up split support
  - Multiple NS coefficient types: simple, quintic, polar_express, aol

- MuonShardingOptimizer:
  - Whole-tensor assignment for 2D parameters (Muon)
  - Element-wise sharding for non-2D parameters (AdamW)
  - Hybrid memory balancing across ranks

- Test coverage:
  - All 24 parameter combinations tested
  - 2-GPU sharding validation against single-GPU reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
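For reference, the Newton-Schulz orthogonalization at the core of Muon can be sketched in a few lines of numpy. This is an illustrative sketch, not the PR's implementation; the quintic coefficients are the widely cited (3.4445, -4.7750, 2.0315) tuple, and the step count is a common default:

```python
import numpy as np

def newton_schulz(g, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration: drives the singular values of g
    toward 1, approximating the orthogonal polar factor of the gradient."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)  # normalize so all singular values < 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                     # iterate on the short-fat orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T                    # small Gram matrix (rows x rows)
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```

After a few iterations the singular values oscillate in a narrow band around 1, which is why Muon-style updates behave like a whitened / orthogonalized gradient step.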
MuonShardingOptimizer is a dygraph-only optimizer designed for Muon
optimizer with distributed sharding support. It should not be loaded
by the static graph meta optimizer factory, similar to:
- HybridParallelOptimizer
- HeterParallelOptimizer
- DGCMomentumOptimizer

This fix prevents static graph tests (e.g., test_static_model_parallel,
test_raw_program_optimizer) from crashing when MuonShardingOptimizer
tries to access dynamic-graph-only attributes like _parameter_list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
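The hybrid balancing described in the first commit message (whole 2D tensors assigned to single ranks for Muon, element-wise sharding for everything else) can be sketched as a greedy size-balancing pass. All names and the plan structure here are hypothetical illustrations, not the actual MuonShardingOptimizer API:

```python
import numpy as np

def assign_params(params, num_ranks):
    """Sketch of hybrid sharding: each 2D (Muon) parameter goes whole to
    the currently lightest rank (Muon needs the full matrix for its
    orthogonal update); non-2D (AdamW) parameters are split element-wise
    evenly across all ranks. `params` is a list of (name, shape) tuples."""
    load = [0] * num_ranks                  # elements currently owned per rank
    plan = {}
    # place the big 2D tensors first, largest to smallest, to keep the balance tight
    two_d = sorted((p for p in params if len(p[1]) == 2),
                   key=lambda p: -int(np.prod(p[1])))
    for name, shape in two_d:
        rank = load.index(min(load))        # lightest rank gets the whole tensor
        plan[name] = ("whole", rank)
        load[rank] += int(np.prod(shape))
    for name, shape in params:
        if len(shape) != 2:                 # 1D biases/norms: shard element-wise
            n = int(np.prod(shape))
            plan[name] = ("elementwise", n // num_ranks)
            for r in range(num_ranks):
                load[r] += n // num_ranks
    return plan, load
```

The design point this illustrates: Muon's orthogonal update is a whole-matrix operation, so 2D parameters cannot be split the way AdamW state can, and memory balance has to come from how whole tensors are distributed.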
…grad

- Use MixPrecisionLayer + MixPrecisionOptimizer pattern so params have
  main_grad before MuonShardingOptimizer init, enabling the safe
  main_grad path in the refactored clear_grad (which now iterates over
  all parameters instead of only 2D params)
- Add paddle.amp.auto_cast in train_batch for proper BF16 forward pass
- Use np.random.randn for weight init (zero-centered, better NS stability)
- Cast params to float32 before numpy comparison to avoid BF16 uint16
  bit-pattern comparison issues

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l parameter list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add MLAInfo dataclass and MLA split_head orthogonal update in muon.py
- Add clear_param_storage/reset_param_storage methods in MuonShardingOptimizer
- Support MoE expert param storage management via _color_to_comm_buffer_list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
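The per_head / split_head modes mentioned in the commit messages apply the orthogonal update independently to each attention head's slice of the weight, rather than to the fused matrix as a whole. A minimal numpy sketch, using an exact SVD polar factor as a stand-in for Muon's Newton-Schulz approximation (function names are illustrative, not Paddle's API):

```python
import numpy as np

def orthogonalize(g):
    """Exact polar factor via SVD; Muon approximates this with Newton-Schulz."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def per_head_update(grad, num_heads):
    """grad: [hidden_in, num_heads * head_dim] fused projection gradient.
    Split along the output dimension into per-head blocks, orthogonalize
    each block independently, then re-fuse."""
    blocks = np.split(grad, num_heads, axis=1)
    return np.concatenate([orthogonalize(b) for b in blocks], axis=1)
```

The motivation for splitting: a fused QKV or multi-head weight is really several logically independent matrices, and orthogonalizing them jointly mixes their spectra.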
@xxyux
Contributor Author

xxyux commented Apr 14, 2026

/re-run all-failed

@codecov-commenter

Codecov Report

❌ Patch coverage is 12.99886% with 763 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/3.3@2115d0a). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...d/fleet/meta_optimizers/muon_sharding_optimizer.py 7.64% 495 Missing ⚠️
python/paddle/optimizer/muon.py 19.09% 267 Missing ⚠️
...ers/dygraph_optimizer/hybrid_parallel_optimizer.py 85.71% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (12.99%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/3.3   #78679   +/-   ##
==============================================
  Coverage               ?   12.99%           
==============================================
  Files                  ?        7           
  Lines                  ?      877           
  Branches               ?        0           
==============================================
  Hits                   ?      114           
  Misses                 ?      763           
  Partials               ?        0           


Collaborator

@sneaxiy sneaxiy left a comment


LGTM for coverage due to lack of BF16.

@sneaxiy sneaxiy merged commit a12dc4d into PaddlePaddle:release/3.3 Apr 15, 2026
134 of 158 checks passed


4 participants