Releases: modelscope/ms-swift
Patch release v4.0.2
Full Changelog: v4.0.1...v4.0.2
Patch release v4.0.1
Full Changelog: v4.0.0...v4.0.1
v4.0.0
New Features
- Architecture Optimization
a. Refactored the directory structure and optimized dependencies with a modular design, improving the architecture's scalability and customizability.
b. Decoupled `model_type` and `template` to simplify support for models that share one model_type across multiple templates.
c. Rewrote the Megatron-SWIFT training loop using megatron-core instead of the megatron-lm dependency. (Compatible with Ascend NPU)
- Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the `save_total_limit` parameter to automatically clean up stale checkpoints while retaining the best-performing and latest weights.
e. Added the `apply_wd_to_qk_layernorm` parameter for Qwen3-Next/Qwen3.5 to support applying weight decay to qk layernorm.
f. Multimodal MoE model LoRA supports the `--target_modules all-router` configuration.
- RL
a. Support for the GDPO algorithm to compute advantages, via `--scale_rewards gdpo`. (Thanks to @Auraithm)
b. GKD supports computing the KL divergence from top-k logits to save memory, via `--gkd_topk_logits`.
c. GKD supports using a teacher server, avoiding explicitly loading the teacher model.
- Training
a. Added MuonClip optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized the lm_head computation for generative rerankers to reduce memory usage.
d. FSDP2 supports activation CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1)
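As a hedged illustration of how the new flags above might be combined, here is a minimal launch sketch. The flag names (`--scale_rewards gdpo`, `--gkd_topk_logits`, `save_total_limit`) come from these notes; the `swift rlhf` subcommand, model name, and remaining arguments are assumptions, not verified against this release.

```shell
# Sketch only: GRPO training with the GDPO advantage scaling and
# checkpoint-cleanup options mentioned in these release notes.
# Subcommand, model, and dataset are illustrative assumptions.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3.5-9B \
    --scale_rewards gdpo \
    --save_total_limit 3 \
    --dataset <your_dataset>
```

`--save_total_limit 3` is described above as keeping the newest checkpoints plus the metric-best one, so older checkpoints are deleted automatically rather than accumulating on disk.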
New Models
- Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
- Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series
What's Changed
- [misc] update swift patch_conv3d by @Jintao-Huang in #7320
- add npu megatron multi-node example by @addsubmuldiv in #7321
- [bugfix] fix megatron convert by @Jintao-Huang in #7323
- [model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
- [reranker] refactor reranker by @Jintao-Huang in #7334
- [bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
- [bugfix] fix modelopt by @Jintao-Huang in #7339
- [docs] Update swift image 3.12 by @Jintao-Huang in #7332
- [bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
- fix find node ip by @tastelikefeet in #7350
- Fix multi-modal reranker doc by @tastelikefeet in #7354
- [bugfix] fix app_args by @Jintao-Huang in #7367
- [bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
- [bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
- [v4] refactor ms-swift v4 by @Jintao-Huang in #7238
- feat: support scale rewards "gdpo" by @Auraithm in #7348
- [infer] infer backend pt -> transformers by @Jintao-Huang in #7379
- [docs] update docs & update Copyright by @Jintao-Huang in #7384
- Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
- add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
- [bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
- [docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
- [bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
- [readme] update error timeline of news by @shizhengLi in #7404
- [bugfix] fix sp reranker by @Jintao-Huang in #7405
- [v4] fix ci by @Jintao-Huang in #7559
- [refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
- [grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
- Update the NPU-supported features table by @addsubmuldiv in #7562
- [bugfix] fix attn_impl by @Jintao-Huang in #7564
- [v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
- [bugfix] fix minimax tp by @Jintao-Huang in #7788
- fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
- [bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
- [generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
- update requirements by @Jintao-Huang in #7819
- [misc] update issue template by @Jintao-Huang in #7818
- [bugfix] fix dpo by @Jintao-Huang in #7824
- update wechat by @tastelikefeet in #7827
- [bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
- [model] support glm4_moe_lite by @Jintao-Huang in #7829
- [bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
- [megatron] support glm_moe_lite by @Jintao-Huang in #7833
- chore: epochs -> epoch by @zzc0430 in #7825
- [optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
- [bugfix] fix recompute_granularity none by @Jintao-Huang in https://gi...
Patch release v3.12.6
What's Changed
Full Changelog: v3.12.5...v3.12.6
Patch release v3.12.5
Full Changelog: v3.12.4...v3.12.5
Patch release v3.12.4
Full Changelog: v3.12.3...v3.12.4
Patch release v3.12.3
Full Changelog: v3.12.2...v3.12.3
Patch release v3.12.2
Full Changelog: v3.12.1...v3.12.2
v3.12.1
What's Changed
- [bugfix] fix glm4_7 agent_template by @Jintao-Huang in #7256
- [bugfix] fix DeepSeek-OCR vllm deploy by @hjh0119 in #7258
- [feat] add async reward function support for GRPO training by @hjh0119 in #7252
- [model] support medgemma by @slin000111 in #7261
- [megatron] Support MiniMaxAI/MiniMax-M2.1 by @Jintao-Huang in #7262
- Support muonclip optimizer by @vx120 in #7191
- add task_type by @slin000111 in #7265
- [bugfix] fix mtp save by @Jintao-Huang in #7267
- [feat] support megatron grpo entropy mask & log by @hjh0119 in #7263
- [model] support iquestcoder by @Jintao-Huang in #7271
- [bugfix] fix reward model adapters by @hjh0119 in #7293
- Fix the issue of repeated inference in multi-turn scheduler. by @Simon-ss7 in #7279
- [bugfix] auto-enable async engine for vLLM encode tasks by @hjh0119 in #7301
- [bugfix] fix vllm_engine load_format by @Jintao-Huang in #7302
- fix npu megatron cp by @addsubmuldiv in #7299
- [misc] Remove unnecessary clone operations during weight synchronization by @hjh0119 in #7308
- [model] support youtu-llm by @hjh0119 in #7306
- [megatron] fix gpt_bridge oom by @Jintao-Huang in #7310
- [misc] fix youtu agent template type-checking by @hjh0119 in #7311
- [bugfix] Fix duplicate 'load_format' argument being passed in rollout by @hjh0119 in #7312
New Contributors
- @Simon-ss7 made their first contribution in #7279
Full Changelog: v3.12.0...v3.12.1
v3.12.0
New Features
- Megatron-SWIFT
a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
b. New model support: GLM4 Dense; GLM4.7; GLM4.6v-Flash, GLM-4.1V.
c. `save_safetensors` supports resuming training from checkpoints; Mcore-Bridge loading and saving is now the recommended approach.
d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
e. `group_by_length` parameter support: groups samples of roughly similar length together (with a random factor) to speed up training in non-packing mode.
f. Support for the `--report_to` parameter to log and visualize training in wandb/swanlab.
g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
h. `train_dataloader_shuffle` parameter support to control whether the training dataset is shuffled.
i. Added a retry mechanism to template.encode to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
- RL
a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
b. GRPO adds the `num_generations_eval` parameter to set the number of generations during the eval stage.
c. Reduced the peak memory usage of GKD loss computation.
d. GRPO/GKD server mode supports IPv6 addresses.
e. Support for structured output sampling via `structured_outputs_regex`.
- Training
a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
b. Support for `--fsdp fsdp2` to use ms-swift's built-in FSDP2 configuration file.
c. `loss_scale` supports three basic strategies: 'default', 'last_round', and 'all', which can be combined with other strategies, e.g. 'last_round+ignore_empty_think'.
d. `cached_dataset` supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
e. Thinking template refactored: ThinkingTemplate functionality merged into Template; added the `enable_thinking` and `add_non_thinking_prefix` parameters.
f. Added the `SWIFT_PATCH_CONV3D` environment variable to work around slow conv3d execution with torch 2.9.
g. Support for the `swanlab_notification_method` parameter to specify how swanlab notifies when training completes or an error occurs.
h. `dataloader_prefetch_factor` parameter default changed from 10 to 2.
- Domestic Hardware (Thanks to the Ascend and CMB technical teams)
a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
b. Qwen3-VL hybrid operator support, see this PR: #7079
c. Updated the Megatron-SWIFT NPU performance/accuracy profiling documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
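As a hedged sketch of the training options above, the following shows how the hybrid `loss_scale`, built-in FSDP2 config, and conv3d workaround might appear in one launch. The flag names and the `SWIFT_PATCH_CONV3D` variable come from these notes; the `swift sft` subcommand, model, and dataset are illustrative assumptions.

```shell
# Sketch only: SFT launch combining options from these release notes.
# SWIFT_PATCH_CONV3D=1 enables the conv3d patch for torch 2.9.
# Model/dataset/subcommand are assumptions, not verified here.
SWIFT_PATCH_CONV3D=1 \
swift sft \
    --model ZhipuAI/GLM-4.7 \
    --fsdp fsdp2 \
    --loss_scale last_round+ignore_empty_think \
    --dataloader_prefetch_factor 2 \
    --report_to swanlab \
    --dataset <your_dataset>
```

The `last_round+ignore_empty_think` value illustrates the hybrid form described above: a basic strategy ('last_round') combined with another strategy via `+`.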
New Models
- Text-only models:
a. ZhipuAI/GLM-4.7 series
b. iic/QwenLong-L1.5-30B-A3B
c. gongjy/MiniMind2 (Thanks to @PiggerZZM)
- Multimodal models:
a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
b. Tencent-Hunyuan/HunyuanOCR
What's Changed
- [model] Support GLM4.6-V by @Jintao-Huang in #6948
- [model] support glm4_6v flash by @Jintao-Huang in #6959
- [bugfix] fix truncation_strategy left by @Jintao-Huang in #6961
- [bugfix] fix megatron save_checkpoint by @Jintao-Huang in #6963
- [feat] GKD support truncation strategy delete to resample by @hjh0119 in #6964
- [misc] megatron grpo check rollout_logps by @hjh0119 in #6970
- [misc] set default group_port for vllm client by @hjh0119 in #6972
- [grpo] support Off-Policy Sequence Masking by @hjh0119 in #6978
- [megatron, misc] support check_latest_model by @hjh0119 in #6988
- [bugfix] fix reranker_padding_free by @Jintao-Huang in #6989
- [megatron] fix eval_iters 1 by @Jintao-Huang in #6990
- Add dense_npu.sh for megatron lora training in huawei npu by @vx120 in #6976
- fix system swift pt by @Jintao-Huang in #7003
- [bugfix] fix qwen_vl_utils torchvision base64 by @Jintao-Huang in #7004
- [bugfix] fix liger_kernel flash_attn by @Jintao-Huang in #7005
- [bugfix] fix qwen3_vl bridge by @Jintao-Huang in #7006
- [bugfix] fix reranker padding_free & fix seq_cls omni padding_free by @Jintao-Huang in #7007
- [npu] add npu qwen3_omni sft example for mindspeed backend by @tongtong0613 in #7008
- [bugfix] qwen-omni3 vllm infer with USE_AUDIO_IN_VIDEO by @hjh0119 in #7009
- [bugfix] fix grpo sleep_level 2 causes gibberish outputs by @hjh0119 in #7017
- add npu vllm-ascend docs and examples by @addsubmuldiv in #7013
- [compat] fix mcore012 compat torch new by @Jintao-Huang in #7021
- [megatron] Megatron support random/non-random dataloader by @Jintao-Huang in #7016
- [bugfix] megatron add retry to avoid hang by @Jintao-Huang in #7023
- [trainer] refactor acc metrics by @Jintao-Huang in #7026
- [infer] update embddding/reranker demo by @Jintao-Huang in #7029
- [train] support embeding/reranker packing & support reranker/embedding cache_dataset by @Jintao-Huang in #6987
- update readme by @Jintao-Huang in #7033
- [misc] update swift image by @Jintao-Huang in #7039
- [bugfix] remove add_eos for rm in grpo by @hjh0119 in #7040
- [npu] Fix device mismatch in weight sync for HCCL communicator by @singing4you in #7036
- collect npu profiling data by @OneMondy in #6977
- [bugfix] fix null_ref_context by @Jintao-Huang in #7042
- [model] support hunyuan_ocr by @slin000111 in #7038
- update flash_attn version; fix mcore 0.15 hang by @Jintao-Huang in #7043
- [bugfix] fix grpo multi turn log_entropy by @hjh0119 in #7044
- [bugfix] fix dataloader megatron by @Jintao-Huang in #7050
- [grpo] support num_generations_eva...