v4.0.0
New Features
- Architecture Optimization
a. Refactored the directory structure and optimized dependencies using a modular design, improving the extensibility and customizability of the architecture.
b. Decoupled model_type from template, simplifying support for models that have multiple templates under the same model_type.
c. Rewrote the Megatron-SWIFT training loop to depend on megatron-core instead of megatron-lm. (Compatible with Ascend NPU)
- Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the save_total_limit parameter, which automatically cleans up expired checkpoints while retaining the best-performing and latest weights.
e. Added the apply_wd_to_qk_layernorm parameter for Qwen3-Next/Qwen3.5 to support applying weight decay to the qk layernorm.
f. LoRA for multi-modal MoE models supports the --target_modules all-router configuration.
- RL
a. Support for the GDPO algorithm for computing advantages, enabled with the parameter --scale_rewards gdpo. (Thanks to @Auraithm)
b. GKD supports computing the KL divergence from top-k logits to save memory, enabled with the parameter --gkd_topk_logits.
c. GKD supports using a teacher server, which avoids explicitly loading the teacher model.
- Training
a. Added Muon-CLIP optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: Compatible with latest dependencies including python3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized generative reranker lm_head computation to reduce memory usage.
d. FSDP2 supports activation CPU offload; DeepSpeed elastic is also supported. (Thanks to @meichangsu1)
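The retention rule described for save_total_limit (item d under Megatron-SWIFT) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the Megatron-SWIFT implementation: the Checkpoint class, its metric field, the select_checkpoints_to_delete helper, and the greater_is_better flag are hypothetical names introduced here, and treating a lower metric (for example, loss) as better by default is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    step: int      # training iteration at which the checkpoint was saved
    metric: float  # evaluation metric recorded for it (hypothetical field)


def select_checkpoints_to_delete(
    checkpoints: List[Checkpoint],
    save_total_limit: Optional[int],
    greater_is_better: bool = False,
) -> List[Checkpoint]:
    """Return the checkpoints to delete, oldest first (illustrative policy only)."""
    if save_total_limit is None or len(checkpoints) <= save_total_limit:
        return []

    ordered = sorted(checkpoints, key=lambda c: c.step)
    latest = ordered[-1]
    pick = max if greater_is_better else min
    best = pick(ordered, key=lambda c: c.metric)

    # The best and the latest checkpoints are never deleted, so the number
    # kept can exceed a very small limit (e.g. save_total_limit=1).
    protected = {latest.step, best.step}
    n_keep = max(save_total_limit, len(protected))
    n_delete = len(ordered) - n_keep

    deletable = [c for c in ordered if c.step not in protected]
    return deletable[:n_delete]


# Example: five checkpoints with their evaluation loss.
history = [
    Checkpoint(100, 0.9),
    Checkpoint(200, 0.5),
    Checkpoint(300, 0.7),
    Checkpoint(400, 0.8),
    Checkpoint(500, 0.6),
]
expired = select_checkpoints_to_delete(history, save_total_limit=2)
print([c.step for c in expired])
```

In this sketch, a limit of 2 leaves only the lowest-loss checkpoint (step 200) and the most recent one (step 500). Older checkpoints are removed first, which is one plausible reading of "expired".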
New Models
- Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
- Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series
What's Changed
- [misc] update swift patch_conv3d by @Jintao-Huang in #7320
- add npu megatron multi-node example by @addsubmuldiv in #7321
- [bugfix] fix megatron convert by @Jintao-Huang in #7323
- [model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
- [reranker] refactor reranker by @Jintao-Huang in #7334
- [bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
- [bugfix] fix modelopt by @Jintao-Huang in #7339
- [docs] Update swift image 3.12 by @Jintao-Huang in #7332
- [bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
- fix find node ip by @tastelikefeet in #7350
- Fix multi-modal reranker doc by @tastelikefeet in #7354
- [bugfix] fix app_args by @Jintao-Huang in #7367
- [bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
- [bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
- [v4] refactor ms-swift v4 by @Jintao-Huang in #7238
- feat: support scale rewards "gdpo" by @Auraithm in #7348
- [infer] infer backend pt -> transformers by @Jintao-Huang in #7379
- [docs] update docs & update Copyright by @Jintao-Huang in #7384
- Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
- add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
- [bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
- [docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
- [bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
- [readme] update error timeline of news by @shizhengLi in #7404
- [bugfix] fix sp reranker by @Jintao-Huang in #7405
- [v4] fix ci by @Jintao-Huang in #7559
- [refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
- [grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
- Update the NPU-supported features table by @addsubmuldiv in #7562
- [bugfix] fix attn_impl by @Jintao-Huang in #7564
- [v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
- [bugfix] fix minimax tp by @Jintao-Huang in #7788
- fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
- [bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
- [generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
- update requirements by @Jintao-Huang in #7819
- [misc] update issue template by @Jintao-Huang in #7818
- [bugfix] fix dpo by @Jintao-Huang in #7824
- update wechat by @tastelikefeet in #7827
- [bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
- [model] support glm4_moe_lite by @Jintao-Huang in #7829
- [bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
- [megatron] support glm_moe_lite by @Jintao-Huang in #7833
- chore: epochs -> epoch by @zzc0430 in #7825
- [optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
- [bugfix] fix recompute_granularity none by @Jintao-Huang in #7842
- refactor patch model by @Jintao-Huang in #7841
- [bugfix] fix trainer by @Jintao-Huang in #7843
- fix ckpt_dir and get_choices for web-ui by @slin000111 in #7850
- correct sapo formula by @hjh0119 in #7852
- [fix] fix pass multiple value of data collator by @hjh0119 in #7855
- [v4] fix ppo by @Jintao-Huang in #7857
- [bugfix] set rollout server seed to avoid Identical completions by @hjh0119 in #7858
- [shell] update embedding/reranker shell by @Jintao-Huang in #7861
- [megatron] fix: remove vllm dependency in megatron rlhf by @Jintao-Huang in #7864
- [megatron] support megatron embedding by @Jintao-Huang in #7862
- [reranker] reranker padding_free right (default value) by @Jintao-Huang in #7869
- [bugfix] fix npu cast error after apply fsdp2 by @Silas-11 in #7870
- [megatron] support megatron reranker by @Jintao-Huang in #7630
- [bugfix] fix loss_scale by @Jintao-Huang in #7873
- [docs] update loss_scale docs by @Jintao-Huang in #7874
- [model] support olmoe by @qianhao0713 in #7140
- [megatron] update olmoe by @Jintao-Huang in #7877
- [bugfix] fix megatron kto pp + sp by @Jintao-Huang in #7882
- [feat] support deepspeed elastic by @meichangsu1 in #6955
- [docs] update megatron-swift wechat by @Jintao-Huang in #7888
- [docs] update swift image 3.12.3 by @Jintao-Huang in #7890
- [compat] compat transformers main branch (v5) by @Jintao-Huang in #7895
- [bugfix] Fix metric megatron by @Jintao-Huang in #7905
- [bugfix] fix dataset hash by @Jintao-Huang in #7916
- [model] support deepseek-ocr-2 by @hjh0119 in #7917
- [bugfix] fix glm template by @Jintao-Huang in #7928
- [bugfix] fix template_meta by @Jintao-Huang in #7930
- [compat] compat transformers5 rope by @Jintao-Huang in #7931
- [bugfix] fix template suffix by @Jintao-Huang in #7937
- support step3-vl-10b by @slin000111 in #7938
- [bugfix] fix gkd moe teacher init by @hjh0119 in #7940
- [compat] compat mcore_bridge transformers 5 by @Jintao-Huang in #7939
- Enhance NPU LoRA path with post-norm activation handling by @vx120 in #7929
- [megatron] fix megatron qwen3_next TP high grad_norm by @Jintao-Huang in #7941
- [docs] Upgrade MindSpeed to stable maintenance version. by @Ginray in #7943
- [bugfix] fix megatron tp init seed by @Jintao-Huang in #7944
- feat(swanlab): support email notification with dedicated arguments by @ciaoyizhen in #7949
- [bugfix] fix megatron lora TP all-reduce by @Jintao-Huang in #7911
- [megatron] support megatron all-router multimodal by @Jintao-Huang in #7951
- [megatron] support Qwen3-Next apply_wd_to_qk_layernorm by @Jintao-Huang in #7954
- [model] Support Qwen3-Coder-Next by @Jintao-Huang in #7958
- support PaddleOCR-VL-1.5 by @slin000111 in #7979
- [bugfix] fix apply_wd_to_qk_layernorm by @Jintao-Huang in #7980
- Fix typo in multi_turn.md regarding rollout logps by @Marquis03 in #7982
- fix: handle None padding_to in get_padding_to() for fused attention by @Mr-Neutr0n in #8002
- [bugfix] Fix args template type by @Jintao-Huang in #8005
- [trainer] update time format & fix resume from checkpoint train_speed by @Jintao-Huang in #8007
- [model] Support minicpmo-4.5 by @Jintao-Huang in #8015
- [bugfix] fix _set_property by @Jintao-Huang in #8019
- [infer/deploy] Update result path by @Jintao-Huang in #8022
- support GLM-OCR by @slin000111 in #8021
- [docs] add gpt bridge docs by @Jintao-Huang in #8023
- [model] support qwen3_5 / qwen3_5_moe by @Jintao-Huang in #8016
- fix swift client for reranker by @slin000111 in #8026
- [bugfix] fix megatron llama4 by @Jintao-Huang in #8027
- [feat] support activation cpu offload in fsdp and fsdp2 by @meichangsu1 in #7201
- fix: aligns GRPOConfig with the upstream trl && update docs by @Tohrusky in #8003
- [CI] fix ci temporary by @Jintao-Huang in #8045
- fix generation-batch-size&steps_per_generation check by @hjh0119 in #8048
- [docs] update swift image 3.12.5 by @Jintao-Huang in #8051
- [v4] refactor megatron-swift (use megatron-core) by @Jintao-Huang in #7945
- [megatron] fix get_mcore_model_config by @Jintao-Huang in #8057
- [model] support Qwen3.5-397B-A17B by @Jintao-Huang in #8058
- chore: bump trl to 0.28 by @hjh0119 in #8061
- [megatron] fix optimizer save by @Jintao-Huang in #8060
- [bugfix] fix download model vllm_engine by @Jintao-Huang in #8062
- [megatron] fix async save by @Jintao-Huang in #8055
- [docs] update docs by @Jintao-Huang in #8064
- fix moe ring attention by @tastelikefeet in #8067
- [bugfix] fix megatron-swift pp by @Jintao-Huang in #8071
- chore: bump vllm to 0.15.1 by @hjh0119 in #7867
- [megatron] support save_total_limit by @Jintao-Huang in #8056
- [misc] update requirements by @Jintao-Huang in #8072
- [misc] lint compat python3.12 by @Jintao-Huang in #8073
- [compat] compat transformers 5.2.0 by @Jintao-Huang in #8075
- [megatron] update megatron_swift parameter by @Jintao-Huang in #8077
- [bugfix] fix grpo move_modal_batches by @hjh0119 in #8078
- [bugfix] fix qwen3_5 fp8 gpt-bridge by @Jintao-Huang in #8076
- [misc] simplify megatron resample_data_iterator management by @hjh0119 in #8082
- [model] support GLM-5 (transformers) by @Jintao-Huang in #8066
- [model] support more qwen3.5 models by @Jintao-Huang in #8088
- [model] add qwen3.5 megatron/transformers shell by @Jintao-Huang in #8090
- [bugfix] fix megatron-swift mla & channel_loss by @Jintao-Huang in #8092
- [megatron] update seq aux log by @Jintao-Huang in #8100
- [megatron] add micro_batch_size check by @Jintao-Huang in #8103
- [bugfix] fix qwen3_omni all_linear aligner by @Jintao-Huang in #8105
- [bugfix] compat transformers 5.0 audio by @Jintao-Huang in #8104
- fix: add missing import re in utils.py by @zhaohan-alan in #8113
- lint pass by @Jintao-Huang in #8114
- [bugfix] fix dpo megatron by @Jintao-Huang in #8116
- [bugfix] fix model_type vllm_engine by @Jintao-Huang in #8117
- [bugfix] fix overlap grad_reduce by @Jintao-Huang in #8079
- fix(grpo): Fix NCCL timeout/hang in ZeRO-3 with dynamic batch sizes by @azusa-nami in #8102
- [megatron] multinode megatron (non-shared disk) by @Jintao-Huang in #8120
- [bugfix] fix grpo gdpo with None reward by @hjh0119 in #8125
- chore: bump liger-kernel to 0.7.0 by @hjh0119 in #8131
- [fix] adapt megatron and mindspeed for npu by @jiaqiw09 in #8121
- [orm] pass args to simplify orm construction by @hjh0119 in #8137
- [model] support qwen3.5 more models (fp8) by @Jintao-Huang in #8136
- [Fix] fix npu issues by @jiaqiw09 in #8141
- update swift_patch_conv3d by @Jintao-Huang in #8146
- [megatron] add warmup jit by @Jintao-Huang in #8147
- update faq by @slin000111 in #8128
- [examples] Update shell by @Jintao-Huang in #8149
- [megatron] fix save latest_checkpointed_iteration by @Jintao-Huang in #8151
- [bugfix] fix megatron vpp by @Jintao-Huang in #8153
- [bugfix] fix contiguous by @Jintao-Huang in #8158
- [docs] Update readme by @Jintao-Huang in #8144
- [bugfix] fix qwen3_5 agent template by @Jintao-Huang in #8161
- [model] support Qwen3.5-0.8B/2B/4B/9B series by @Jintao-Huang in #8162
- [shell] update shell by @Jintao-Huang in #8163
- [docs] update docs by @Jintao-Huang in #8168
- [misc] remove estimate_token for grpo by @hjh0119 in #8150
- [gkd] top-k-logits & teacher server by @hjh0119 in #7918
New Contributors
- @Auraithm made their first contribution in #7348
- @shizhengLi made their first contribution in #7404
- @Cccei000 made their first contribution in #7391
- @Silas-11 made their first contribution in #7802
- @qianhao0713 made their first contribution in #7140
- @Ginray made their first contribution in #7943
- @ciaoyizhen made their first contribution in #7949
- @Mr-Neutr0n made their first contribution in #8002
- @Tohrusky made their first contribution in #8003
- @zhaohan-alan made their first contribution in #8113
- @azusa-nami made their first contribution in #8102
- @jiaqiw09 made their first contribution in #8121
Full Changelog: v3.12.6...v4.0.0