Release v4.0.0 · modelscope/ms-swift

中文版

新特性

架构优化
a. 目录结构重构与依赖关系优化，使用模块化设计，提升架构的可扩展性和可定制性。
b. model_type与template解耦，简化同一 model_type 含多个 template 的模型支持流程。
c. Megatron-SWIFT 训练循环重写，使用 megatron-core 替代 megatron-lm 依赖。（兼容Ascend NPU）
Megatron-SWIFT
a. 新模型支持：Qwen3.5系列、GLM4.7-Flash、MiniMax-M2.1、OLMoE。
b. Embedding 任务支持，训练示例：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker 任务支持，训练示例：https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. 新增save_total_limit参数，自动清理过期 checkpoint，并保留指标最优和最新的权重。
e. Qwen3-Next/Qwen3.5 新增apply_wd_to_qk_layernorm参数，支持对 qk layernorm 应用权重衰减。
f. 多模态MoE模型lora支持 --target_modules all-router 配置。
RL
a. 支持GDPO算法计算优势，使用参数--scale_rewards gdpo。（感谢 @Auraithm 的贡献）
b. GKD 支持使用 top-k logits 计算KL以节约显存，使用参数 --gkd_topk_logits。
c. GKD 支持使用 teacher server，避免显式加载教师模型。
训练
a. 新增 muon clip 优化器支持，训练示例：https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh （感谢 @vx120 的贡献）
b. 依赖更新：兼容最新依赖 python3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0等。
c. generative reranker lm_head 部分计算优化，降低显存占用。
d. fsdp2支持激活 cpu offload；deepspeed elastic支持。（感谢招商 @meichangsu1 的贡献）

新模型

纯文本模型
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct系列（感谢 @qianhao0713 的贡献）
多模态模型
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B 系列。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker。训练脚本参考：https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it 系列

English Version

New Features

Architecture Optimization
a. Refactored the directory structure and optimized dependency relationships using a modular design, improving the extensibility and customizability of the architecture.
b. Decoupled model_type from template, simplifying support for models that have multiple templates under the same model_type.
c. Rewrote the Megatron-SWIFT training loop to depend on megatron-core instead of megatron-lm. (Compatible with Ascend NPU)
Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the save_total_limit parameter, which automatically cleans up expired checkpoints while retaining the weights with the best metric and the latest weights.
e. Added the apply_wd_to_qk_layernorm parameter for Qwen3-Next/Qwen3.5, which applies weight decay to the qk layernorm.
f. LoRA for multi-modal MoE models supports the --target_modules all-router configuration.
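
The retention rule described for save_total_limit (item d) can be pictured with a small sketch. This is not Megatron-SWIFT's code: the helper name `checkpoints_to_delete`, the (iteration, metric) tuples, and the assumption that a lower metric is better are all hypothetical, chosen only to illustrate "always keep the best and the latest, then drop the oldest of the rest".

```python
from typing import List, Tuple


def checkpoints_to_delete(
    checkpoints: List[Tuple[int, float]],
    save_total_limit: int,
    lower_is_better: bool = True,
) -> List[int]:
    """Return the iterations of checkpoints to remove (hypothetical helper).

    checkpoints: (iteration, eval_metric) pairs.
    The latest and the best checkpoint are always kept, so a limit of 1
    can still leave two checkpoints when best and latest differ.
    """
    if save_total_limit <= 0 or len(checkpoints) <= save_total_limit:
        return []
    by_iter = sorted(checkpoints, key=lambda c: c[0])
    latest = by_iter[-1][0]
    pick = min if lower_is_better else max
    best = pick(by_iter, key=lambda c: c[1])[0]
    keep = {latest, best}
    # Fill any remaining slots with the newest checkpoints not yet kept.
    for iteration, _ in reversed(by_iter):
        if len(keep) >= save_total_limit:
            break
        keep.add(iteration)
    return [it for it, _ in by_iter if it not in keep]


ckpts = [(100, 0.90), (200, 0.70), (300, 0.75), (400, 0.80)]
print(checkpoints_to_delete(ckpts, save_total_limit=2))  # [100, 300]
```

With a limit of 2, iteration 400 survives as the latest and iteration 200 as the best, so the other two are reported for deletion. How the real implementation chooses its metric and breaks ties is not stated in these notes.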
RL
a. Support for computing advantages with the GDPO algorithm via the parameter --scale_rewards gdpo. (Thanks to @Auraithm)
b. GKD supports computing the KL divergence from top-k logits to save GPU memory, via the parameter --gkd_topk_logits.
c. GKD supports using a teacher server, which avoids explicitly loading the teacher model.
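
As a rough illustration of why top-k logits save memory in GKD (item b), the sketch below computes a KL divergence using only the teacher's k largest logits for one token position. It is an assumption-laden toy, not the ms-swift implementation: the function names are hypothetical, and renormalizing both distributions over the selected k entries is one plausible choice among several.

```python
import math
from typing import List


def log_softmax(xs: List[float]) -> List[float]:
    """Numerically stable log-softmax over a short list of logits."""
    m = max(xs)
    log_total = math.log(sum(math.exp(x - m) for x in xs))
    return [x - m - log_total for x in xs]


def topk_kl(teacher_logits: List[float], student_logits: List[float], k: int) -> float:
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the k selected
    vocabulary entries before the divergence is computed.
    """
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    log_p = log_softmax([teacher_logits[i] for i in top])
    log_q = log_softmax([student_logits[i] for i in top])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher = [4.0, 1.0, 0.5, -2.0, -3.0]
student = [3.0, 2.0, 0.0, -1.0, -3.0]
print(topk_kl(teacher, student, k=2))
```

Under this scheme only k values per token need to be stored or transferred from the teacher rather than one value per vocabulary entry, which is where the memory saving would come from.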
Training
a. Added MuonClip optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized the lm_head computation of the generative reranker to reduce GPU memory usage.
d. FSDP2 supports activation CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1)

New Models

Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series

What's Changed

[misc] update swift patch_conv3d by @Jintao-Huang in #7320
add npu megatron multi-node example by @addsubmuldiv in #7321
[bugfix] fix megatron convert by @Jintao-Huang in #7323
[model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
[reranker] refactor reranker by @Jintao-Huang in #7334
[bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
[bugfix] fix modelopt by @Jintao-Huang in #7339
[docs] Update swift image 3.12 by @Jintao-Huang in #7332
[bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
fix find node ip by @tastelikefeet in #7350
Fix multi-modal reranker doc by @tastelikefeet in #7354
[bugfix] fix app_args by @Jintao-Huang in #7367
[bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
[bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
[v4] refactor ms-swift v4 by @Jintao-Huang in #7238
feat: support scale rewards "gdpo" by @Auraithm in #7348
[infer] infer backend pt -> transformers by @Jintao-Huang in #7379
[docs] update docs & update Copyright by @Jintao-Huang in #7384
Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
[bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
[docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
[bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
[readme] update error timeline of news by @shizhengLi in #7404
[bugfix] fix sp reranker by @Jintao-Huang in #7405
[v4] fix ci by @Jintao-Huang in #7559
[refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
[grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
Update the NPU-supported features table by @addsubmuldiv in #7562
[bugfix] fix attn_impl by @Jintao-Huang in #7564
[v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
[bugfix] fix minimax tp by @Jintao-Huang in #7788
fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
[bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
[generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
update requirements by @Jintao-Huang in #7819
[misc] update issue template by @Jintao-Huang in #7818
[bugfix] fix dpo by @Jintao-Huang in #7824
update wechat by @tastelikefeet in #7827
[bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
[model] support glm4_moe_lite by @Jintao-Huang in #7829
[bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
[megatron] support glm_moe_lite by @Jintao-Huang in #7833
chore: epochs -> epoch by @zzc0430 in #7825
[optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
[bugfix] fix recompute_granularity none by @Jintao-Huang in #7842
refactor patch model by @Jintao-Huang in #7841
[bugfix] fix trainer by @Jintao-Huang in #7843
fix ckpt_dir and get_choices for web-ui by @slin000111 in #7850
correct sapo formula by @hjh0119 in #7852
[fix] fix pass multiple value of data collator by @hjh0119 in #7855
[v4] fix ppo by @Jintao-Huang in #7857
[bugfix] set rollout server seed to avoid Identical completions by @hjh0119 in #7858
[shell] update embedding/reranker shell by @Jintao-Huang in #7861
[megatron] fix: remove vllm dependency in megatron rlhf by @Jintao-Huang in #7864
[megatron] support megatron embedding by @Jintao-Huang in #7862
[reranker] reranker padding_free right (default value) by @Jintao-Huang in #7869
[bugfix] fix npu cast error after apply fsdp2 by @Silas-11 in #7870
[megatron] support megatron reranker by @Jintao-Huang in #7630
[bugfix] fix loss_scale by @Jintao-Huang in #7873
[docs] update loss_scale docs by @Jintao-Huang in #7874
[model] support olmoe by @qianhao0713 in #7140
[megatron] update olmoe by @Jintao-Huang in #7877
[bugfix] fix megatron kto pp + sp by @Jintao-Huang in #7882
[feat] support deepspeed elastic by @meichangsu1 in #6955
[docs] update megatron-swift wechat by @Jintao-Huang in #7888
[docs] update swift image 3.12.3 by @Jintao-Huang in #7890
[compat] compat transformers main branch (v5) by @Jintao-Huang in #7895
[bugfix] Fix metric megatron by @Jintao-Huang in #7905
[bugfix] fix dataset hash by @Jintao-Huang in #7916
[model] support deepseek-ocr-2 by @hjh0119 in #7917
[bugfix] fix glm template by @Jintao-Huang in #7928
[bugfix] fix template_meta by @Jintao-Huang in #7930
[compat] compat transformers5 rope by @Jintao-Huang in #7931
[bugfix] fix template suffix by @Jintao-Huang in #7937
support step3-vl-10b by @slin000111 in #7938
[bugfix] fix gkd moe teacher init by @hjh0119 in #7940
[compat] compat mcore_bridge transformers 5 by @Jintao-Huang in #7939
Enhance NPU LoRA path with post-norm activation handling by @vx120 in #7929
[megatron] fix megatron qwen3_next TP high grad_norm by @Jintao-Huang in #7941
[docs]Upgrade MindSpeed to stable maintenance version. by @Ginray in #7943
[bugfix] fix megatron tp init seed by @Jintao-Huang in #7944
feat(swanlab): support email notification with dedicated arguments by @ciaoyizhen in #7949
[bugfix] fix megatron lora TP all-reduce by @Jintao-Huang in #7911
[megatron] support megatron all-router multimodal by @Jintao-Huang in #7951
[megatron] support Qwen3-Next apply_wd_to_qk_layernorm by @Jintao-Huang in #7954
[model] Support Qwen3-Coder-Next by @Jintao-Huang in #7958
support PaddleOCR-VL-1.5 by @slin000111 in #7979
[bugfix] fix apply_wd_to_qk_layernorm by @Jintao-Huang in #7980
Fix typo in multi_turn.md regarding rollout logps by @Marquis03 in #7982
fix: handle None padding_to in get_padding_to() for fused attention by @Mr-Neutr0n in #8002
[bugfix] Fix args template type by @Jintao-Huang in #8005
[trainer] update time format & fix resume from checkpoint train_speed by @Jintao-Huang in #8007
[model] Support minicpmo-4.5 by @Jintao-Huang in #8015
[bugfix] fix _set_property by @Jintao-Huang in #8019
[infer/deploy] Update result path by @Jintao-Huang in #8022
support GLM-OCR by @slin000111 in #8021
[docs] add gpt bridge docs by @Jintao-Huang in #8023
[model] support qwen3_5 / qwen3_5_moe by @Jintao-Huang in #8016
fix swift client for reranker by @slin000111 in #8026
[bugfix] fix megatron llama4 by @Jintao-Huang in #8027
[feat] support activation cpu offload in fsdp and fsdp2 by @meichangsu1 in #7201
fix: aligns GRPOConfig with the upstream trl && update docs by @Tohrusky in #8003
[CI] fix ci temporary by @Jintao-Huang in #8045
fix generation-batch-size&steps_per_generation check by @hjh0119 in #8048
[docs] update swift image 3.12.5 by @Jintao-Huang in #8051
[v4] refactor megatron-swift (use megatron-core) by @Jintao-Huang in #7945
[megatron] fix get_mcore_model_config by @Jintao-Huang in #8057
[model] support Qwen3.5-397B-A17B by @Jintao-Huang in #8058
chore: bump trl to 0.28 by @hjh0119 in #8061
[megatron] fix optimizer save by @Jintao-Huang in #8060
[bugfix] fix download model vllm_engine by @Jintao-Huang in #8062
[megatron] fix async save by @Jintao-Huang in #8055
[docs] update docs by @Jintao-Huang in #8064
fix moe ring attention by @tastelikefeet in #8067
[bugfix] fix megatron-swift pp by @Jintao-Huang in #8071
chore: bump vllm to 0.15.1 by @hjh0119 in #7867
[megatron] support save_total_limit by @Jintao-Huang in #8056
[misc] update requirements by @Jintao-Huang in #8072
[misc] lint compat python3.12 by @Jintao-Huang in #8073
[compat] compat transformers 5.2.0 by @Jintao-Huang in #8075
[megatron] update megatron_swift parameter by @Jintao-Huang in #8077
[bugfix] fix grpo move_modal_batches by @hjh0119 in #8078
[bugfix] fix qwen3_5 fp8 gpt-bridge by @Jintao-Huang in #8076
[misc] simplify megatron resample_data_iterator management by @hjh0119 in #8082
[model] support GLM-5 (transformers) by @Jintao-Huang in #8066
[model] support more qwen3.5 models by @Jintao-Huang in #8088
[model] add qwen3.5 megatron/transformers shell by @Jintao-Huang in #8090
[bugfix] fix megatron-swift mla & channel_loss by @Jintao-Huang in #8092
[megatron] update seq aux log by @Jintao-Huang in #8100
[megatron] add micro_batch_size check by @Jintao-Huang in #8103
[bugfix] fix qwen3_omni all_linear aligner by @Jintao-Huang in #8105
[bugfix] compat transformers 5.0 audio by @Jintao-Huang in #8104
fix: add missing import re in utils.py by @zhaohan-alan in #8113
lint pass by @Jintao-Huang in #8114
[bugfix] fix dpo megatron by @Jintao-Huang in #8116
[bugfix] fix model_type vllm_engine by @Jintao-Huang in #8117
[bugfix] fix overlap grad_reduce by @Jintao-Huang in #8079
fix(grpo): Fix NCCL timeout/hang in ZeRO-3 with dynamic batch sizes by @azusa-nami in #8102
[megatron] multinode megatron (non-shared disk) by @Jintao-Huang in #8120
[bugfix] fix grpo gdpo with None reward by @hjh0119 in #8125
chore: bump liger-kernel to 0.7.0 by @hjh0119 in #8131
[fix] adapt megatron and mindspeed for npu by @jiaqiw09 in #8121
[orm] pass args to simplify orm construction by @hjh0119 in #8137
[model] support qwen3.5 more models (fp8) by @Jintao-Huang in #8136
[Fix] fix npu issues by @jiaqiw09 in #8141
update swift_patch_conv3d by @Jintao-Huang in #8146
[megatron] add warmup jit by @Jintao-Huang in #8147
update faq by @slin000111 in #8128
[examples] Update shell by @Jintao-Huang in #8149
[megatron] fix save latest_checkpointed_iteration by @Jintao-Huang in #8151
[bugfix] fix megatron vpp by @Jintao-Huang in #8153
[bugfix] fix contiguous by @Jintao-Huang in #8158
[docs] Update readme by @Jintao-Huang in #8144
[bugfix] fix qwen3_5 agent template by @Jintao-Huang in #8161
[model] support Qwen3.5-0.8B/2B/4B/9B series by @Jintao-Huang in #8162
[shell] update shell by @Jintao-Huang in #8163
[docs] update docs by @Jintao-Huang in #8168
[misc] remove estimate_token for grpo by @hjh0119 in #8150
[gkd] top-k-logits & teacher server by @hjh0119 in #7918

New Contributors

@Auraithm made their first contribution in #7348
@shizhengLi made their first contribution in #7404
@Cccei000 made their first contribution in #7391
@Silas-11 made their first contribution in #7802
@qianhao0713 made their first contribution in #7140
@Ginray made their first contribution in #7943
@ciaoyizhen made their first contribution in #7949
@Mr-Neutr0n made their first contribution in #8002
@Tohrusky made their first contribution in #8003
@zhaohan-alan made their first contribution in #8113
@azusa-nami made their first contribution in #8102
@jiaqiw09 made their first contribution in #8121

Full Changelog: v3.12.6...v4.0.0
