The first version of PaddleFleet.
What's Changed
- Add preliminary code by @From00 in #1
- add spec utils by @FeixLiu in #2
- add enums.py and identity_op.py by @GuoxiaWang in #3
- add vpp_simulator by @Waynezee in #5
- PaddleFleet distributed initialization and ProcessGroup create. by @Hz188 in #8
- add codestyle workflow by @swgu98 in #6
- Complete parallel_state.py by @Hz188 in #12
- Trans transformer block/layer by @FeixLiu in #11
- [CodeStyle] Ignore `PLC0414` in `__init__.py` files by @SigureMo in #13
- Complete process_groups_config.py doc and fix typo by @Hz188 in #15
- [setup] Support source installation of Paddle-Fleet by @risemeup1 in #14
- improve parallel_state.py for EPHcg & Hcg by @Hz188 in #16
- [CI] Add approval workflow by @swgu98 in #10
- support pipeline_parallel schedules by @AlAuAu in #9
- dev global vars, yaml parse by @Hz188 in #18
- [CI] fix approval workflow by @swgu98 in #20
- Add Attention and RoPE classes by @lshpku in #17
- add mtp by @FeixLiu in #22
- unittest test_schedules bugfix by @AlAuAu in #24
- add gpt_model specs by @GuoxiaWang in #23
- add mlp layer by @blacksheep-Aristotle in #26
- Trans ln by @FeixLiu in #25
- Add PackedSeqParams class by @lshpku in #30
- add psp by @FeixLiu in #31
- Update LanguageModelEmbedding and add unittest by @lshpku in #27
- Update RoPE and add unittest by @lshpku in #28
- [CI] add test workflow by @swgu98 in #19
- add pipeline/utils.py by @blacksheep-Aristotle in #33
- [CI] change ut dir by @swgu98 in #34
- [CI] git name by @swgu98 in #35
- Placeholder for building transformer configuration parsing by @Hz188 in #32
- [CI] Add typos white list by @swgu98 in #36
- change sublayer to sublayer_spec by @FeixLiu in #39
- fix some test_gpt_model dependencies by @GuoxiaWang in #40
- Add global Timers for logging by @huangjiyi in #21
- add check_initialized for dp group by @Hz188 in #41
- add coverage scripts by @XieYunshen in #38
- fix coverage scripts by @XieYunshen in #45
- First successful run of GPTModel model definition by @GuoxiaWang in #44
- Change fleet core to paddlefleet by @From00 in #46
- fix coverage bug by @risemeup1 in #47
- Some fixes to successful run glm4.5 in PaddleFormers by @From00 in #49
- single card test by @swgu98 in #50
- mv paddlefleet to src by @risemeup1 in #52
- use paddle12.6 in single card test by @risemeup1 in #53
- Add GPTModelEstimator by @huangjiyi in #59
- Fix attention dim order by @lshpku in #60
- fix ffn_hidden_size is None when init gpt_mlp by @blacksheep-Aristotle in #56
- Add estimate_mfu by @huangjiyi in #62
- refine pyproject.toml by @risemeup1 in #63
- [CI] add uv pre-commit by @swgu98 in #65
- [CI] Add nemo megatron approval by @swgu98 in #51
- Add set_logging and get_logger by @huangjiyi in #64
- bugfix init GPTModel by @GuoxiaWang in #54
- [CodeStyle][Ruff] update ruff target-version to `py310` by @ooooo-create in #66
- Support custom op by @zhangbo9674 in #48
- 【lora】fix model for Lora by @xiaoguoguo626807 in #68
- Ci/multi card config by @swgu98 in #67
- [CI] add approval for model_parallel_config.py & transformer_config.py by @swgu98 in #70
- [MoE] Add Base MoE Layer by @hushenwei2000 in #61
- Delete sharded_state_dict to support FC save/load by @changeyoung98 in #71
- Align to PaddleFormers by @Waynezee in #72
- Ignore files generated from `uv sync` for custom ops by @ooooo-create in #69
- [Bug_Fix] fix attention_mask & skip check expert_tensor_parallel_group by @xuxinyi389 in #73
- Fix moe_layer config by @From00 in #74
- Fix save_tensors bugs and disable jit by @From00 in #75
- Add tensor parallel functions by @pkuzyc in #29
- Support TP Sharding EP For GLM4.5 by @xuxinyi389 in #76
- move spec utils to paddlefleet by @FeixLiu in #78
- Cherry pp layers by @FeixLiu in #80
- add non pipeline execution by @LiYuRio in #81
- Shared weight test by @FeixLiu in #86
- modify pylayer bug by @xiaoguoguo626807 in #87
- refine non-pp scheduler by @LiYuRio in #89
- Support MTP in GLM4.5 and add unittest by @lshpku in #55
- Use original cross_entropy and re-open the loss check in unit test by @pkuzyc in #84
- Fix rope dim order by @lshpku in #91
- support PipelineParallel by @AlAuAu in #92
- fix single card run by @huangjiyi in #90
- Fix bug in tensor_parallel unit tests by @pkuzyc in #93
- [MoE Layer] Fix EP Hang when No Tokens are Distributed in the Rank by @hushenwei2000 in #83
- pp License fix by @AlAuAu in #95
- [CI] add integration test glm by @swgu98 in #85
- Add sharded_state_dict for TP by @changeyoung98 in #94
- [CI] fix bypass by @swgu98 in #97
- Add instructions for copilot reviewer by @risemeup1 in #96
- [Feature] Add test instruction by @risemeup1 in #98
- disable test_layers.py by @swgu98 in #99
- [CI] Delete sed by @swgu98 in #101
- rename config fields to align huggingface by @Hz188 in #82
- Fix bias grad reduction of bias_geglu_back by @lshpku in #100
- fix config by @Waynezee in #108
- support pipeline_parallel_withinterleave by @AlAuAu in #102
- [Feature] Add nightly wheel publishing workflow by @swgu98 in #107
- [CI] Remove redundant AK/SK exports in nightly publish workflow by @swgu98 in #115
- support PipelineParallelWithInterleaveFthenB and VPPFthenBInBalancedMemory by @AlAuAu in #113
- turn off deepep on ampere and fix logging by @huangjiyi in #109
- add llava_model and clip_vit model by @blacksheep-Aristotle in #105
- support distributed_model by @AlAuAu in #111
- fix deterministic by @Waynezee in #116
- 【modelconfig】Change model layer name to support hf model by @xiaoguoguo626807 in #118
- support fp8 fusion node by @deepllz in #114
- Move sdpa before kv broadcast by @lshpku in #121
- Support fuse rope by @xuxinyi389 in #117
- model_config_and_dpo_support. by @wtmlon in #106
- Fix bugs in vocab_parallel_cross_entropy and VocabParallelEmbedding by @pkuzyc in #104
- Change name 2 by @xiaoguoguo626807 in #122
- Sequence parallel for GPTModel by @pkuzyc in #125
- Refine custom ops compile by @zhangbo9674 in #126
- add single card test and a100 test by @huangjiyi in #124
- Use Abi3 for building whl by @risemeup1 in #128
- Add setup test by @risemeup1 in #133
- add config by @Waynezee in #120
- add cp for paddlefleet by @Wennie396 in #129
- add coverage by @tianlef in #131
- Fix sharded_state_dict for single card by @changeyoung98 in #135
- fix numel block cpu by @huangjiyi in #136
- [CI] Add PR paddle wheel by @swgu98 in #137
- [CI]fix_uv_sync by @tianlef in #138
- Fix bugs in sequence parallel and add unit test by @pkuzyc in #139
- [CI] Revert paddleformers commit for integration test by @swgu98 in #140
- [Refactor] Split tokens_stable_unzip.cu into modular CUDA files by @ooooo-create in #141
- 【fused_moe】fix Moe fp8_utils.py bwd by @xiaoguoguo626807 in #142
- support matmul_bwd by @xuxinyi389 in #134
- Add dedicated FusedRMSNorm class by @lshpku in #147
- [CI] Add customop approval in `ci/check_approval.sh` by @ooooo-create in #145
- 【fp8】expert weight stop gradient = True can't apply_backward_hook by @xiaoguoguo626807 in #149
- [Pipeline Parallel] support pipeline parallel for gpt model by @LiYuRio in #112
- [CI] glm45 a100 by @swgu98 in #154
- [CI] add flags by @swgu98 in #155
- Support DeepEPTopKRouter by @xuxinyi389 in #146
- Gpt pp ut by @FeixLiu in #156
- [CI] Add qwen precision & Update CI by @swgu98 in #162
- [CI] Add version for wheel by @swgu98 in #163
- 【model name】update ppmodel state_dict name by @xiaoguoguo626807 in #160
- [CI] single card test on h20 by @swgu98 in #167
- GLM multi card test by @xuxinyi389 in #166
- Support fuse_swiglu_scale by @xuxinyi389 in #164
- add attn_mask_startend_row_indices for flashmask by @Wennie396 in #159
- 【config, pp】delete pipeline_dtype ; add model func by @xiaoguoguo626807 in #169
- Clean some useless code by @ooooo-create in #150
- [CI] Update config name by @swgu98 in #174
- [MoE Layer] Add BF16 GroupedGEMM and Unit Tests by @hushenwei2000 in #127
- [2025-12-11-17:21] Bump `uv.lock` by @ooooo-create in #173
- fix cp bugs and add unit test for context parallel by @Wennie396 in #144
- Precision Change by @Waynezee in #184
- Add recompute by @Waynezee in #178
- add fp8_dispatch && shared_expert_overlap && offline quant by @Waynezee in #158
- Fix DeepEPTopKRouter for sp by @From00 in #186
- Support GLM45 with pipeline parallel by @LiYuRio in #168
- Move `paddlefleet.extensions.ops` to `paddlefleet.ops` by @ooooo-create in #176
- [CI] Add `Merge PR to test branch` to `Approval` workflow and fix known-first-party in `pyproject.toml` by @ooooo-create in #190
- [CI] add `rerun` workflow by @ooooo-create in #180
- [CI]incremental coverage by @tianlef in #157
- cache cos and sin for rope by @huangjiyi in #153
- [CI]change loss by @tianlef in #194
- [DeepGEMM] Support `DeepGEMM` as a submodule by @ooooo-create in #191
- add empty layer by @FeixLiu in #189
- [Compat] Add triton to torch_proxy scope by @ooooo-create in #201
- Update `.github/actions/check-bypass/action.yml` by @ooooo-create in #202
- [DeepGEMM] Fix deep_gemm install by @ooooo-create in #203
- [CI] change to cli by @swgu98 in #198
- add_recompute_modules by @Waynezee in #196
- [CI]find error for log by @tianlef in #200
- [3rdparty] add check for uninitialized submodules by @ooooo-create in #204
- bug fix for moe by @FeixLiu in #199
- Revert "[CI]find error for log" by @swgu98 in #210
- fix by @swgu98 in #208
- [CI]a100 case add: gated_linear_unit: true by @tianlef in #212
- [CI]fix ci config for cli by @tianlef in #214
- [Infra] Add `instructions` for faster local dev and remove `cpplint`, `clang-format` local hooks by @ooooo-create in #187
- 【Lora】fix lora pylayer bug by @xiaoguoguo626807 in #220
- Add printing of incremental coverage information by @XieYunshen in #193
- [Pipeline Parallel] NoPipelineParallel bugfix by @AlAuAu in #197
- [CI] add sft+lora by @swgu98 in #216
- fix recompute by @Waynezee in #221
- Bump `uv.lock` by @ooooo-create in #177
- [CI] Add new workflow to auto update `uv.lock` by @ooooo-create in #183
- [CI] add moe_router_force_load_balancing by @swgu98 in #228
- [DeepEP] Add `DeepEP` as a submodule by @ooooo-create in #215
- [BugFix] Fix update_dependencies.yml with limited disk space by @ooooo-create in #233
- [CI] Add `reopened` activity to trigger `pull_request` event in `Approval.yml` by @ooooo-create in #236
- [CI]fix config for pretrain memory error by @tianlef in #231
- add dict feature in function eval_batch & rename empty layer config by @Hz188 in #222
- [CI]change loss by @tianlef in #238
- [CI]change config by @tianlef in #244
- [Compat] Refine `paddle.compat.enable_torch_proxy` usage by @ooooo-create in #243
- [CI] deal exit code 250 by @tianlef in #209
- update precision by @swgu98 in #245
- delete Random warning only print once by @xiaoguoguo626807 in #247
- support fused_swiglu_bwd by @xuxinyi389 in #239
- pp model support dpo. by @wtmlon in #181
- [CI]fix exit code of pt log file by @tianlef in #249
- [MoE Layer] Add Grouped GEMM Fused Expert Weights Version by @hushenwei2000 in #175
- unify subbatch by @xuxinyi389 in #240
- [CI] add release3.3 paddle by @swgu98 in #255
- [CI] add release3.3 single card by @swgu98 in #256
- [CI] change shell to formers by @swgu98 in #258
- [bugfix] fix pp empty layer config bug by @Hz188 in #259
- Formalize deep_gemm unittests by @A-nnonymous in #250
- fix lora bug by @xiaoguoguo626807 in #261
- Support rrattention in flashmask by @LLSGYN in #227
- fix_recompute_fused_rope by @huangjiyi in #264
- Fix loss diff for distributed strategies by @changeyoung98 in #254
- open fusion of swiglu by @xuxinyi389 in #251
- TopKRouter by @xuxinyi389 in #260
- Reduce GLM memory consumption by @zhangting2020 in #266
- [CI] del nemo megatron by @swgu98 in #275
- [CI] add qwen3moe by @swgu98 in #273
- [CI]Add glm dpo && coverage change by @tianlef in #274
- [CI] Grouped GEMM Integrated Test by @hushenwei2000 in #277
- fix flash_mask_cp by @Wennie396 in #219
- [BugFix] Add nvidia-nvshmem-cu12 limit to avoid multiple definitions by @ooooo-create in #285
- [MoE Layer] Implement barrier_ep for Synchronization by @hushenwei2000 in #272
- fix cp fused_rope by @Wennie396 in #278
- Fix TransToDataType dtype cast error by @sneaxiy in #290
- chore 🤖: Bump `uv.lock` (2026-01-04) by @github-actions[bot] in #291
- bug fix by @FeixLiu in #288
- Add sharded_state_dict for group_gemm by @changeyoung98 in #279
- remove unuse operations and disable sequence_parallel when tp <= 1 by @Waynezee in #289
- [3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #299
- [CI] single card unittest use uv build by @swgu98 in #296
- [3rdparty][DeepEP] Bump DeepEP by @ooooo-create in #300
- [CI] precision test by @swgu98 in #295
- [MoE Layer] Fix Deep GEMM k_group Kernel Calling by @hushenwei2000 in #305
- [CI] install dependences of paddlefleet with cache by @swgu98 in #306
- [Sonicmoe] Add Sonicmoe as a submodule by @ooooo-create in #287
- [CI]Fix exit code check logic for multi card unit test by @tianlef in #303
- use uv build --wheel by @ooooo-create in #317
- chore 🤖: Bump `uv.lock` (2026-01-06) by @github-actions[bot] in #313
- align config by @Waynezee in #304
- fix cp unittest by @Wennie396 in #307
- Add `check_patchelf_exists` and bump sonic-moe by @ooooo-create in #326
- fix seq_aux_loss by @xuxinyi389 in #318
- [CI] update precision method by @swgu98 in #315
- [MoE Layer] Fix Router topk_weight in noaux_tc Method by @hushenwei2000 in #329
- [Feature] Add dynamic CUDA version-based dependency resolution by @ooooo-create in #293
- [CI]add cpu compile by @tianlef in #328
- [CI] coverage change to release by @swgu98 in #334
- [CI]disable multi card by @tianlef in #335
- tokens_unzip_gather support ue8m0 by @DanielSun11 in #310
- [CI] coverage by @swgu98 in #336
- Qwen3 vl by @blacksheep-Aristotle in #323
- [Build] Add git hash by @ooooo-create in #333
- [CI]fix coverage by @tianlef in #340
- [Build] Remove .o files from wheel before packaging by @ooooo-create in #330
- [fix]GLM45 pretrain fp8 on cuda126 by @tianlef in #342
- [MoE Layer] Support deepgemm Padding to tile_M by @hushenwei2000 in #282
- fix ut by @Waynezee in #347
- [CI] nightly multi python by @swgu98 in #344
- fix pname miss in grouped moe by @liufengwei0103 in #325
- fix rope bug by @blacksheep-Aristotle in #338
- [CI] add cancel by @swgu98 in #349
- disable fp8 and deepep when cuda12.6 by @risemeup1 in #345
- [MoE Layer] Delete moe_deep_gemm Config by @hushenwei2000 in #312
- Fix bug for tokens_unzip_gather_kernel by @DanielSun11 in #341
- fix router precision by @xuxinyi389 in #348
- Fix the bug for MultiModalRope when mbs>1 by @pkuzyc in #351
- Fix tensor model parallel world size return logic by @XieYunshen in #353
- bump sonic-moe by @ooooo-create in #355
- [CE]ADD CE by @tianlef in #316
- [CI] paddle release tag by @swgu98 in #352
- Fix the bug when get cp rank and size in rope by @pkuzyc in #358
- fix layer_norm bug by @blacksheep-Aristotle in #350
- fix seq_aux_loss by @Wennie396 in #361
- [Recompute] adapt rr and support dict in selective recompute by @Waynezee in #294
- 【moe】add moe_fuse config only lora use by @xiaoguoguo626807 in #366
- Fix the mis-match name bug of gelu_pytorch_tanh act by @pkuzyc in #363
- [CI]fix coverage by @tianlef in #369
- [DeepEP] Switch to `paddlefleet.ops.deep_ep` by @ooooo-create in #301
- [CI] add timeout by @swgu98 in #380
- support glm vpp overlap by @LiYuRio in #234
- [ThirdParty] Bump sonic-moe version to reduce launch triton kernel overhead by @SigureMo in #381
- [CE]add multi version python pipe by @tianlef in #357
- [MoE Layer] Default use Paddle batched_gemm when enable moe_grouped_gemm by @hushenwei2000 in #370
- fix_rr_rules by @Waynezee in #383
- [MoE Layer] Add moe_ep_barrier configuration by @hushenwei2000 in #373
- [MoE Layer] Fix AllToAll Implementation when TP > 1 by @hushenwei2000 in #360
- Revert "[DeepEP] Switch to `paddlefleet.ops.deep_ep`" by @XieYunshen in #382
- add high_precision_rope by @blacksheep-Aristotle in #377
- fix_rope and seq_aux_loss by @Waynezee in #376
- Update Paddle dependency version by @swgu98 in #387
- [CI] Update grouped_gemm Unit Test for CUDA13 by @hushenwei2000 in #388
- Modify qwen3vl mrope computation logic by @qhpeklh5959 in #379
- [CE]Sonic moe by @tianlef in #386
- add manual by @swgu98 in #391
- manual wheel update by @swgu98 in #392
- adapter sonic_moe by @xingmingyyj in #365
- [CherryPick] fix rope in cp by @Waynezee in #398
- [ThirdParty] Bump sonic-moe version to patch paddle.empty to support distributed env (#402) by @SigureMo in #403
- fix by @swgu98 in #410
- [cherry-pick] fix NoPipelineParallel init by @huangjiyi in #421
- [cherry-pick][Docs] update CONTRIBUTING.md by @ooooo-create in #428
New Contributors
- @From00 made their first contribution in #1
- @GuoxiaWang made their first contribution in #3
- @Hz188 made their first contribution in #8
- @risemeup1 made their first contribution in #14
- @lshpku made their first contribution in #17
- @blacksheep-Aristotle made their first contribution in #26
- @XieYunshen made their first contribution in #38
- @zhangbo9674 made their first contribution in #48
- @xiaoguoguo626807 made their first contribution in #68
- @hushenwei2000 made their first contribution in #61
- @changeyoung98 made their first contribution in #71
- @xuxinyi389 made their first contribution in #73
- @pkuzyc made their first contribution in #29
- @LiYuRio made their first contribution in #81
- @deepllz made their first contribution in #114
- @wtmlon made their first contribution in #106
- @Wennie396 made their first contribution in #129
- @A-nnonymous made their first contribution in #250
- @LLSGYN made their first contribution in #227
- @zhangting2020 made their first contribution in #266
- @sneaxiy made their first contribution in #290
- @github-actions[bot] made their first contribution in #291
- @DanielSun11 made their first contribution in #310
- @liufengwei0103 made their first contribution in #325
- @qhpeklh5959 made their first contribution in #379
Full Changelog: https://github.com/PaddlePaddle/PaddleFleet/commits/v0.1.0