
feat[Muon]: migrate slice logic to caller, support fused storage, and generalize color-group management#78716

Merged
GuoxiaWang merged 1 commit into PaddlePaddle:develop from xxyux:xxyux/muon-dev
Apr 21, 2026

Conversation

@xxyux
Contributor

@xxyux xxyux commented Apr 18, 2026

PR Category

Execute Infrastructure

PR Types

Improvements

Description

This PR makes two sets of changes to the Muon distributed optimizer:

  1. Generalize color-group handling in MuonShardingOptimizer

Previously, MuonShardingOptimizer hardcoded two color paths (None for the default sharding group and moe_expert for MoE experts). Adding any new parameter group required modifying the optimizer internals.

- Replace _build_color_to_group_info(hcg) (static, hardcoded) with _build_color_to_group_info_from_params(parameter_list, default_group), which dynamically scans param.color dicts at runtime; any new color is picked up automatically without code changes.
- Generalize step(), __init__ (local_opt_params), reduce_gradients, and _sharding_sync_parameters to iterate over _rank2params_2d_by_color instead of separate hardcoded loops.
- Clean up comments: fix incorrect descriptions, translate Chinese to English, and remove dead code and debug prints.
2. Move QKV/FFN split logic out of the optimizer; add V100 fp32 matmul fallback

- Remove the built-in QKV/FFN split logic (QKVInfo, qkv_info, intermediate_size, muon_qkv_update_mode, muon_ffn_split) from the optimizer core; model-specific slice strategies are now passed in via MuonParamInfo.slice_func by the caller.
- Add an ns_matmul_dtype parameter to Muon.__init__ and _zeropower_via_newtonschulz5: it auto-detects bfloat16 on Ampere+ GPUs (compute capability ≥ 8.0) and falls back to float32 on V100 and older, enabling CI on V100.
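The dynamic color scan in change 1 can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual code: _Param, build_color_to_params, and the exact shape of the color dict are stand-ins for what _build_color_to_group_info_from_params does with real Paddle parameters.

```python
from collections import defaultdict

class _Param:
    """Minimal stand-in for a Paddle parameter with an optional color dict."""
    def __init__(self, name, color=None):
        self.name = name
        if color is not None:
            self.color = color

def build_color_to_params(parameter_list):
    """Bucket parameters by their color at runtime; None is the default
    sharding bucket, and any previously unseen color is picked up
    automatically, with no optimizer-internal changes."""
    buckets = defaultdict(list)
    for p in parameter_list:
        color = getattr(p, "color", None)
        key = color["color"] if isinstance(color, dict) else None
        buckets[key].append(p.name)
    return dict(buckets)

params = [
    _Param("w0"),                                 # no color -> default sharding group
    _Param("w1", color={"color": "moe_expert"}),
    _Param("w2", color={"color": "vision"}),      # a new color needs no code change
]
print(build_color_to_params(params))
# {None: ['w0'], 'moe_expert': ['w1'], 'vision': ['w2']}
```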

Does this change numerical precision?
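On the precision question: which dtype the Newton-Schulz matmuls run in is decided by the new auto-detect. A minimal sketch of that decision, assuming paddle.device.cuda.get_device_capability() is available; the helper name pick_ns_matmul_dtype is illustrative, not the PR's API:

```python
def pick_ns_matmul_dtype():
    """Pick the matmul dtype for the Newton-Schulz iteration.

    Illustrative sketch: Ampere and newer GPUs (compute capability >= 8.0)
    get bfloat16; V100 (SM 7.0) and older, or environments without
    paddle/CUDA, fall back to float32.
    """
    try:
        import paddle
        major, _minor = paddle.device.cuda.get_device_capability()
        return "bfloat16" if major >= 8 else "float32"
    except Exception:
        # No paddle or no CUDA device: use the safe fp32 path.
        return "float32"

print(pick_ns_matmul_dtype())
```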

@paddle-bot

paddle-bot Bot commented Apr 18, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@codecov-commenter

codecov-commenter commented Apr 18, 2026

Codecov Report

❌ Patch coverage is 93.54839% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@bae4558). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...d/fleet/meta_optimizers/muon_sharding_optimizer.py 94.00% 3 Missing ⚠️
python/paddle/optimizer/muon.py 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #78716   +/-   ##
==========================================
  Coverage           ?   93.54%           
==========================================
  Files              ?        3           
  Lines              ?       62           
  Branches           ?        0           
==========================================
  Hits               ?       58           
  Misses             ?        4           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


…mpatibility

- muon_sharding_optimizer: replace hardcoded None/moe_expert color paths with
  generic _rank2params_2d_by_color iteration in step() and __init__ Step4;
  replace static _build_color_to_group_info(hcg) with dynamic
  _build_color_to_group_info_from_params(parameter_list, default_group) that
  scans param.color dicts at runtime; generalize reduce_gradients and
  _sharding_sync_parameters similarly; clean up comments (fix errors, translate
  Chinese to English, remove dead code and debug prints)

- muon: remove built-in QKV/FFN split logic (QKVInfo, qkv_info,
  intermediate_size, muon_qkv_update_mode, muon_ffn_split) from the optimizer;
  callers now pass slice strategies via MuonParamInfo.slice_func, keeping
  model-specific split logic out of the optimizer core; add ns_matmul_dtype
  parameter to Muon.__init__ and _zeropower_via_newtonschulz5 with auto-detect
  (bfloat16 on Ampere+, float32 on V100 and older) to enable CI on V100

- optimizer: allow Muon class to skip incompatible base-class checks

- test: update hybrid_parallel_sharding_muon_model and test_parallel_dygraph_muon
  to use current MuonParamInfo API (slice_func instead of deprecated qkv_info/
  intermediate_size); remove GPU capability >= 8 skipIf guard so tests run on V100

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
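The caller-side slice strategy described in the commit message can be sketched as follows. This is a plain-Python stand-in: the real MuonParamInfo.slice_func operates on Paddle tensors and its exact signature may differ; only the column-wise QKV split idea is shown.

```python
def qkv_slice_func(rows):
    """Hypothetical slice strategy for a fused QKV weight stored as a list
    of rows with 3 * hidden columns: return the Q, K, V column blocks so
    the Muon update can orthogonalize each projection separately."""
    hidden = len(rows[0]) // 3
    return [
        [row[i * hidden:(i + 1) * hidden] for row in rows]
        for i in range(3)
    ]

# Toy 4 x 12 fused weight (hidden = 4) with distinct entries per cell.
fused = [[c + 12 * r for c in range(12)] for r in range(4)]
q, k, v = qkv_slice_func(fused)
print(len(q), len(q[0]))  # 4 4
```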
@xxyux
Contributor Author

xxyux commented Apr 19, 2026

/re-run all-failed

@GuoxiaWang GuoxiaWang merged commit 03af577 into PaddlePaddle:develop Apr 21, 2026
102 of 105 checks passed
