Release v4.2.0 · modelscope/ms-swift

English Version

New Features

Megatron-SWIFT
a. Added model_type support: kimi_k25, hy_v3, llava_onevision. (llava_onevision contributed by @randydl)
b. Added support for GLM-5 shared-parameter MTP, which can be enabled via the --mtp_shared_weights argument.
c. Added support for Qwen3.5 FP8 training. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
d. Custom Megatron model documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Custom-Model.html
e. Added the --mtp_decoder_input_detach argument, which controls whether decoder_input is detached (gradient stopped) in the MTP branch, i.e., whether the MTP loss can backpropagate through decoder_input to the Embedding/ViT.
f. mlp_padding_free is now compatible with Sequence Parallelism.
g. Added support for FP8 quantization export via the megatron export command. Script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
h. Dropped compatibility with megatron-core versions 0.12 through 0.14.
RL
a. GKD/OPSD now support the generation_batch_size and steps_per_generation parameters.
b. GKD/OPSD teacher_server_api is now compatible with multimodal training.
c. GKD/OPSD is now compatible with padding_free.
d. Megatron GRPO/GKD weight synchronization now supports syncing LoRA weights only.
e. Added exception handling to swift rollout to prevent silent process hangs.
f. GRPO ref_sync_callback now supports layer-wise gather under ZeRO-3 to avoid OOM.
g. GRPO now requires TRL >= 0.26.
Training
a. Added support for Qwen3.5 Sequence Parallelism, controllable via the --sequence_parallel_size argument. (Contributed by @meichangsu1)
b. Added support for specifying loss_scale directly in the dataset for more flexible loss control. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
c. The datasets dependency is now compatible with 4.x releases.
d. cached_dataset is now compatible with the --truncation_strategy split strategy.
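
For the split truncation strategy named in item d above, the general idea can be sketched as follows. This is only an illustration under an assumption: that "split" means cutting an over-long token sequence into consecutive chunks of at most max_length tokens instead of discarding the overflow. The function name and the chunking details are hypothetical and are not taken from the ms-swift code.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_length: int) -> List[List[int]]:
    """Cut a token sequence into consecutive chunks of at most max_length tokens.

    Hypothetical helper for illustration; not part of ms-swift.
    """
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    # Step through the sequence in strides of max_length; the last chunk may be shorter.
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


if __name__ == "__main__":
    # A 7-token sample with max_length=3 yields three samples instead of one truncated one.
    print(split_into_chunks(list(range(7)), 3))
```

Under this reading, no tokens are lost, which is presumably why the strategy matters when samples are tokenized once and stored via cached_dataset.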
Hardware
a. NPU now supports Qwen3.5 training with either the transformers or the Megatron backend. When using the Megatron backend, set the environment variable USE_MCORE_GDN=0. (Contributed by @addsubmuldiv, @hazelduan)
b. Added AMD support documentation: https://swift.readthedocs.io/en/latest/BestPractices/AMD-support.html (Contributed by @Treemann)
c. Added RL training support for MetaX hardware. (Contributed by @suenphey)
d. NPU Megatron training is now compatible with megatron-core 0.15.3. (Contributed by @addsubmuldiv)
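
The version constraints mentioned in these notes (megatron-core 0.12 - 0.14 no longer supported, TRL >= 0.26 for GRPO) can be checked before upgrading. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and treating 0.15 as the megatron-core floor is an assumption inferred from the dropped range, not a documented minimum.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '0.15.3rc1' -> (0, 15, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def _pad(version: Tuple[int, ...], width: int) -> Tuple[int, ...]:
    # Right-pad with zeros so that (0, 26) and (0, 26, 0) compare as equal.
    return version + (0,) * (width - len(version))


def at_least(installed: str, minimum: str) -> bool:
    """True if the installed version is greater than or equal to the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return _pad(a, width) >= _pad(b, width)


def megatron_core_supported(installed: str) -> bool:
    # Assumption: since 0.12 - 0.14 were dropped, 0.15 is treated as the floor here.
    return at_least(installed, "0.15")


def trl_supported_for_grpo(installed: str) -> bool:
    # The release notes state that GRPO requires TRL >= 0.26.
    return at_least(installed, "0.26")


if __name__ == "__main__":
    print(megatron_core_supported("0.15.3"))
    print(trl_supported_for_grpo("0.25.1"))
```

In practice the installed version strings could be obtained with importlib.metadata.version, for example importlib.metadata.version("trl"). Pre-release suffixes are ignored by this sketch, so a stricter check would need a full version parser.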

New Models

Text-only Models
a. ZhipuAI/GLM-5.1
b. MiniMax/MiniMax-M2.7
c. moonshotai/Kimi-K2.6 (text-only)
d. Tencent-Hunyuan/Hy3-preview
e. AIDC-AI/Marco-Nano-Instruct series
Multimodal Models
a. Qwen/Qwen3.6-35B-A3B, Qwen/Qwen3.6-27B
b. Qwen3-ASR (Contributed by @xut806)
c. Added mixed-modality dataset training support for Gemma4 series models.
d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
e. OpenBMB/MiniCPM-o-4_5 now supports audio modality. (Contributed by @fanqiNO1)
f. allenai/Molmo2-4B (Contributed by @Kagura-0001)

What's Changed

[model] Support GLM-5.1 by @Jintao-Huang in #9038
[docs] update readme by @Jintao-Huang in #9043
[docs] update qwen3.5 best practice by @zhangfanTJU in #9039
[bugfix] sync template.padding_free with args after prepare_model for… by @yaoruda in #9031
[bugfix] fix gemma4 audio batch by @Jintao-Huang in #9045
[megatron] refactor forward_step_helper by @Jintao-Huang in #9048
[megatron] update megatron destroy_process_group by @Jintao-Huang in #9052
feat: add Qwen3-ASR model support (#8118) by @xut806 in #9034
[bugfix] fix multi-node server mode weight sync race condition by @sys-reasoner in #9060
update qwen_asr by @Jintao-Huang in #9061
[bugfix] fix qwen3_reranker mcore_model_type by @Jintao-Huang in #9062
[bugfix] fix qwen3 omni template by @addsubmuldiv in #9066
[docs] add AMD best practices by @Treemann in #9069
Update npu mindspeed doc and fix new version mindspeed's cp error by @addsubmuldiv in #9067
[bugfix] fix megatron vllm_engine_kwargs & cosine_max_len by @hjh0119 in #9072
[bugfix] fix transformers generate default top_k by @hjh0119 in #9071
[model] support MinerU2.5-Pro by @Jintao-Huang in #9074
[bugfix] fix megatron pt by @Jintao-Huang in #9075
[model] Support minimax 2.7 by @Jintao-Huang in #9079
[bugfix] fix gemma4 31b by @Jintao-Huang in #9080
[bugfix] fix vllm (0.19.0) qwen3_5 by @Jintao-Huang in #9086
[bugfix] fix gemma4 zero3 by @Jintao-Huang in #9083
[bugfix] fix gemma4 system by @Jintao-Huang in #9089
[bugfix] fix bge-m3 reranker by @Jintao-Huang in #9091
remove prompt id for megatron grpo by @hjh0119 in #9094
[docs] update npu docs en by @Jintao-Huang in #9097
[metax] support pynccl communicator in vllm by @suenphey in #9090
[bugfix] fix megatron finetune by @Jintao-Huang in #9099
[grpo] set default load_format auto by @hjh0119 in #9100
update qr code by @tastelikefeet in #9109
Optimize weight synchronization for LoRA adapter weights by @hjh0119 in #9077
support gemma4 vllm multi-modal inference by @hjh0119 in #9105
[bugfix] fix gptq transformers>=5.0 by @Jintao-Huang in #9042
[bugfix] Fix gemma4 image template by @Jintao-Huang in #9115
fix bugs by @hpsun1109 in #9120
[bugfix] fix vit_gc by @Jintao-Huang in #9125
[megatron] support qwen3.5 fp8 by @Jintao-Huang in #9106
fix chunked data slicing in multi-turn GRPO by @hjh0119 in #9128
[model] support qwen3.6 by @Jintao-Huang in #9129
[bugfix] fix vllm mtp by @Jintao-Huang in #9138
Update shell by @Jintao-Huang in #9140
[model] Support Marco by @Jintao-Huang in #9137
[bugfix] fix opsd transformer generate by @hjh0119 in #9145
[megatron] mtp_decoder_input_detach by @Jintao-Huang in #9146
[Feature] Add Molmo2 support (image + video inference, LoRA SFT) by @Kagura-0001 in #9063
[docs] update docs by @Jintao-Huang in #9148
[megatron] support mtp_shared_weights by @Jintao-Huang in #9151
update swift image 4.1 by @Jintao-Huang in #9153
feat: support audio input for minicpm-o-4_5 by @fanqiNO1 in #9147
[model] update minicpmo 4_5 by @Jintao-Huang in #9159
[bugfix] fix mtp keys by @Jintao-Huang in #9163
[bugfix] fix qwen3_omni infer by @Jintao-Huang in #9164
feat(qwen): add sequence parallel support for Qwen3.5 linear attention by @meichangsu1 in #9162
[bugfix] fix qwen3_5 sp compat transformers 5.5.4 by @Jintao-Huang in #9165
Fix megatron save oom by @Jintao-Huang in #9166
[bugfix] fix eval loss denominator under sequence_parallel by @YarivColbeci in #9152
Add sequence parallel compatibility with transformers >= 5.4.0 by @Jintao-Huang in #9167
[bugfix] fix megatron minimax save hang by @Jintao-Huang in #9171
[bugfix] fix optimizer deepspeed by @Jintao-Huang in #9173
[bugfix] Fix megatron save_total_limit & pp by @Jintao-Huang in #9175
[bugfix] fix docs by @Jintao-Huang in #9176
[bugfix] fix qwen3 omni audio 30s by @Jintao-Huang in #9182
[bugfix] fix grpo generate by @Jintao-Huang in #9183
[model] support Qwen/Qwen3.6-27B by @Jintao-Huang in #9184
[bugfix] fix qwen3.5 sp by @Jintao-Huang in #9189
[compat] compat peft 0.19 by @Jintao-Huang in #9192
[trainer] optimize use_logits_to_keep by @Jintao-Huang in #9194
[bugfix] fix seq_cls zero3 by @Jintao-Huang in #9190
npu qwen3.5 megatron padding_free fix by @addsubmuldiv in #9196
Support zero3 hierarchical gather in the ref sync callback by @hjh0119 in #9170
support multi-modal training for gkd teacher api by @hjh0119 in #9197
Fixing issue with video loading for Gemma 4 with relative paths by @perone in #9201
[bugfix] fix CI by @Jintao-Huang in #9209
NPU patch FLA by @hazelduan in #9195
[bugfix] fix cache_dataset truncation_strategy by @Jintao-Huang in #9210
support truncation_strategy split & cached_dataset (qwen3.5) by @Jintao-Huang in #9211
[docs] update swift image 4.1.3 by @Jintao-Huang in #9213
[dataset] support "loss_scale" in dataset by @Jintao-Huang in #9214
[model] support hy3 preview by @Jintao-Huang in #9198
[bugfix] fix agent_template test by @Jintao-Huang in #9215
[model] support kimi k2.6 (only text) by @Jintao-Huang in #9186
[template] remove template remove_response by @Jintao-Huang in #9217
support fa4 by @Jintao-Huang in #9218
[bugfix] fix ignore_data_skip by @Jintao-Huang in #9220
Npu patcher refactor by @addsubmuldiv in #9223
[bugfix] Fix lora llm resume from checkpoint by @Jintao-Huang in #9225
[model] refactor ling model_type by @Jintao-Huang in #9232
fix(rollout): add exception capture and non-blocking poll in by @hjh0119 in #9229
fix(colocate): vllm triton error by @hjh0119 in #9233
[model] support gemma4 mixed data by @Jintao-Huang in #9180
[bugfix] cached_dataset reduce disk usage by @Jintao-Huang in #9242
[docs] fix docs by @Jintao-Huang in #9244
[docs] update wechat by @Jintao-Huang in #9247
Document Qwen3.5 FLA patch for NPU support by @hazelduan in #9237
[docs] update megatron docs by @Jintao-Huang in #9249
[docs] update docs by @Jintao-Huang in #9250
Update datasets requirements by @Jintao-Huang in #9252
[megatron] remove megatron core 0.12-0.14 by @Jintao-Huang in #9260
update mlp_padding_free by @Jintao-Huang in #9262
Document FLA/MindSpeed replacement for Qwen3.5 on NPU by @hazelduan in #9238
Npu doc update by @addsubmuldiv in #9245
[bugfix] fix tool_call loss_scale by @Jintao-Huang in #9266
update requirements by @Jintao-Huang in #9275
[bugfix] fix qwen3_5 template by @Jintao-Huang in #9279
[Bug fix] Adapt SwiftMixin.create_optimizer signature for transformers >= 4.40 by @ys2025-AI in #9281
[bugfix] fix grpo rollout step by @hjh0119 in #9264
[bugfix] fix create_optimizer by @Jintao-Huang in #9282
[gkd] support buffers & fix some bugs by @hjh0119 in #9278

New Contributors

@zhangfanTJU made their first contribution in #9039
@yaoruda made their first contribution in #9031
@xut806 made their first contribution in #9034
@sys-reasoner made their first contribution in #9060
@Treemann made their first contribution in #9069
@suenphey made their first contribution in #9090
@Kagura-0001 made their first contribution in #9063
@fanqiNO1 made their first contribution in #9147
@YarivColbeci made their first contribution in #9152
@perone made their first contribution in #9201
@hazelduan made their first contribution in #9195
@ys2025-AI made their first contribution in #9281

Full Changelog: v4.1.0...v4.2.0
