v4.2.0
New Features
- Megatron-SWIFT
a. Added model_type support: kimi_k25, hy_v3, llava_onevision. (llava_onevision contributed by @randydl)
b. Added support for GLM-5 shared-parameter MTP, which can be enabled via the --mtp_shared_weights argument.
c. Added support for Qwen3.5 FP8 training. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
d. Custom Megatron model documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Custom-Model.html
e. Added support for controlling whether decoder_input is detached from the gradient graph in the MTP branch (i.e., whether the MTP loss can backpropagate gradients through decoder_input to the Embedding/ViT), configurable via the --mtp_decoder_input_detach argument.
f. mlp_padding_free is now compatible with Sequence Parallelism.
g. Added support for FP8 quantization export via the megatron export command. Script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
h. Removed dependency compatibility support for megatron-core versions 0.12 - 0.14.
- RL
a. GKD/OPSD now supports the generation_batch_size/steps_per_generation parameters.
b. GKD/OPSD teacher_server_api is now compatible with multimodal training.
c. GKD/OPSD is now compatible with padding_free.
d. Megatron GRPO/GKD weight synchronization now supports syncing LoRA weights only.
e. Added exception handling to swift rollout to prevent silent process hangs.
f. GRPO ref_sync_callback now supports layer-wise gather under ZeRO-3 to avoid OOM.
g. GRPO TRL dependency upgraded to >= 0.26.
- Training
a. Added support for Qwen3.5 Sequence Parallelism, controllable via the --sequence_parallel_size argument. (Contributed by @meichangsu1)
b. Added support for specifying loss_scale directly in the dataset for more flexible loss control. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
c. Dataset dependency is now compatible with datasets 4.x.
d. cached_dataset is now compatible with the --truncation_strategy split strategy.
- Hardware
a. NPU now supports Qwen3.5 training with transformers/Megatron backends. When using the Megatron backend, the USE_MCORE_GDN=0 environment variable must be set. (Contributed by @addsubmuldiv, @hazelduan)
b. Added AMD support documentation: https://swift.readthedocs.io/en/latest/BestPractices/AMD-support.html (Contributed by @Treemann)
c. Added RL training support for MetaX hardware. (Contributed by @suenphey)
d. NPU Megatron training is now compatible with megatron-core 0.15.3. (Contributed by @addsubmuldiv)
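Several items above change dependency requirements (megatron-core 0.12 - 0.14 dropped, TRL >= 0.26). The following is a minimal sketch, not part of ms-swift, for checking an existing environment before upgrading. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and the megatron-core floor of 0.15 is an assumption inferred from the removal of 0.12 - 0.14; the notes do not state an exact minimum.

```python
import re
from importlib import metadata

# Version floors read off the release notes. The megatron-core value is an
# assumption: the notes only say that 0.12 - 0.14 are no longer supported.
MINIMUMS = {
    "megatron-core": (0, 15),
    "trl": (0, 26),
}


def parse_version(text):
    """Return the leading numeric release components as a tuple of ints.

    Suffixes such as 'rc1' or '.dev0' are ignored, so a pre-release is
    treated like the release it precedes.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if the installed version string is at least the minimum tuple."""
    return parse_version(installed) >= minimum


def check_environment(minimums=MINIMUMS):
    """Map each package name to a short status string."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        report[package] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Packages that are not installed are reported rather than treated as errors, since megatron-core is only needed for the Megatron backend and TRL only for the RL features listed above.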
New Models
- Text-only Models
a. ZhipuAI/GLM-5.1
b. MiniMax/MiniMax-M2.7
c. moonshotai/Kimi-K2.6 (text-only)
d. Tencent-Hunyuan/Hy3-preview
e. AIDC-AI/Marco-Nano-Instruct series
- Multimodal Models
a. Qwen/Qwen3.6-35B-A3B, Qwen/Qwen3.6-27B
b. Qwen3-ASR (Contributed by @xut806)
c. Added mixed-modality dataset training support for Gemma4 series models.
d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
e. OpenBMB/MiniCPM-o-4_5 now supports audio modality. (Contributed by @fanqiNO1)
f. allenai/Molmo2-4B (Contributed by @Kagura-0001)
What's Changed
- [model] Support GLM-5.1 by @Jintao-Huang in #9038
- [docs] update readme by @Jintao-Huang in #9043
- [docs] update qwen3.5 best practice by @zhangfanTJU in #9039
- [bugfix] sync template.padding_free with args after prepare_model for… by @yaoruda in #9031
- [bugfix] fix gemma4 audio batch by @Jintao-Huang in #9045
- [megatron] refactor forward_step_helper by @Jintao-Huang in #9048
- [megatron] update megatron destroy_process_group by @Jintao-Huang in #9052
- feat: add Qwen3-ASR model support (#8118) by @xut806 in #9034
- [bugfix] fix multi-node server mode weight sync race condition by @sys-reasoner in #9060
- update qwen_asr by @Jintao-Huang in #9061
- [bugfix] fix qwen3_reranker mcore_model_type by @Jintao-Huang in #9062
- [bugfix] fix qwen3 omni template by @addsubmuldiv in #9066
- [docs] add AMD best practices by @Treemann in #9069
- Update npu mindspeed doc and fix new version mindspeed's cp error by @addsubmuldiv in #9067
- [bugfix] fix megatron vllm_engine_kwargs & cosine_max_len by @hjh0119 in #9072
- [bugfix] fix transformers generate default top_k by @hjh0119 in #9071
- [model] support MinerU2.5-Pro by @Jintao-Huang in #9074
- [bugfix] fix megatron pt by @Jintao-Huang in #9075
- [model] Support minimax 2.7 by @Jintao-Huang in #9079
- [bugfix] fix gemma4 31b by @Jintao-Huang in #9080
- [bugfix] fix vllm (0.19.0) qwen3_5 by @Jintao-Huang in #9086
- [bugfix] fix gemma4 zero3 by @Jintao-Huang in #9083
- [bugfix] fix gemma4 system by @Jintao-Huang in #9089
- [bugfix] fix bge-m3 reranker by @Jintao-Huang in #9091
- remove prompt id for megatron grpo by @hjh0119 in #9094
- [docs] update npu docs en by @Jintao-Huang in #9097
- [metax] support pynccl communicator in vllm by @suenphey in #9090
- [bugfix] fix megatron finetune by @Jintao-Huang in #9099
- [grpo] set default load_format auto by @hjh0119 in #9100
- update qr code by @tastelikefeet in #9109
- Optimize weight synchronization for LoRA adapter weights by @hjh0119 in #9077
- support gemma4 vllm multi-modal inference by @hjh0119 in #9105
- [bugfix] fix gptq transformers>=5.0 by @Jintao-Huang in #9042
- [bugfix] Fix gemma4 image template by @Jintao-Huang in #9115
- fix bugs by @hpsun1109 in #9120
- [bugfix] fix vit_gc by @Jintao-Huang in #9125
- [megatron] support qwen3.5 fp8 by @Jintao-Huang in #9106
- fix chunked data slicing in multi-turn GRPO by @hjh0119 in #9128
- [model] support qwen3.6 by @Jintao-Huang in #9129
- [bugfix] fix vllm mtp by @Jintao-Huang in #9138
- Update shell by @Jintao-Huang in #9140
- [model] Support Marco by @Jintao-Huang in #9137
- [bugfix] fix opsd transformer generate by @hjh0119 in #9145
- [megatron] mtp_decoder_input_detach by @Jintao-Huang in #9146
- [Feature] Add Molmo2 support (image + video inference, LoRA SFT) by @Kagura-0001 in #9063
- [docs] update docs by @Jintao-Huang in #9148
- [megatron] support mtp_shared_weights by @Jintao-Huang in #9151
- update swift image 4.1 by @Jintao-Huang in #9153
- feat: support audio input for minicpm-o-4_5 by @fanqiNO1 in #9147
- [model] update minicpmo 4_5 by @Jintao-Huang in #9159
- [bugfix] fix mtp keys by @Jintao-Huang in #9163
- [bugfix] fix qwen3_omni infer by @Jintao-Huang in #9164
- feat(qwen): add sequence parallel support for Qwen3.5 linear attention by @meichangsu1 in #9162
- [bugfix] fix qwen3_5 sp compat transformers 5.5.4 by @Jintao-Huang in #9165
- Fix megatron save oom by @Jintao-Huang in #9166
- [bugfix] fix eval loss denominator under sequence_parallel by @YarivColbeci in #9152
- Add sequence parallel compatibility with transformers >= 5.4.0 by @Jintao-Huang in #9167
- [bugfix] fix megatron minimax save hang by @Jintao-Huang in #9171
- [bugfix] fix optimizer deepspeed by @Jintao-Huang in #9173
- [bugfix] Fix megatron save_total_limit & pp by @Jintao-Huang in #9175
- [bugfix] fix docs by @Jintao-Huang in #9176
- [bugfix] fix qwen3 omni audio 30s by @Jintao-Huang in #9182
- [bugfix] fix grpo generate by @Jintao-Huang in #9183
- [model] support Qwen/Qwen3.6-27B by @Jintao-Huang in #9184
- [bugfix] fix qwen3.5 sp by @Jintao-Huang in #9189
- [compat] compat peft 0.19 by @Jintao-Huang in #9192
- [trainer] optimize use_logits_to_keep by @Jintao-Huang in #9194
- [bugfix] fix seq_cls zero3 by @Jintao-Huang in #9190
- npu qwen3.5 megatron padding_free fix by @addsubmuldiv in #9196
- Support zero3 hierarchical gather in the ref sync callback by @hjh0119 in #9170
- support multi-modal training for gkd teacher api by @hjh0119 in #9197
- Fixing issue with video loading for Gemma 4 with relative paths by @perone in #9201
- [bugfix] fix CI by @Jintao-Huang in #9209
- NPU patch FLA by @hazelduan in #9195
- [bugfix] fix cache_dataset truncation_strategy by @Jintao-Huang in #9210
- support truncation_strategy split & cached_dataset (qwen3.5) by @Jintao-Huang in #9211
- [docs] update swift image 4.1.3 by @Jintao-Huang in #9213
- [dataset] support "loss_scale" in dataset by @Jintao-Huang in #9214
- [model] support hy3 preview by @Jintao-Huang in #9198
- [bugfix] fix agent_template test by @Jintao-Huang in #9215
- [model] support kimi k2.6 (only text) by @Jintao-Huang in #9186
- [template] remove template remove_response by @Jintao-Huang in #9217
- support fa4 by @Jintao-Huang in #9218
- [bugfix] fix ignore_data_skip by @Jintao-Huang in #9220
- Npu patcher refactor by @addsubmuldiv in #9223
- [bugfix] Fix lora llm resume from checkpoint by @Jintao-Huang in #9225
- [model] refactor ling model_type by @Jintao-Huang in #9232
- fix(rollout): add exception capture and non-blocking poll in by @hjh0119 in #9229
- fix(colocate): vllm triton error by @hjh0119 in #9233
- [model] support gemma4 mixed data by @Jintao-Huang in #9180
- [bugfix] cached_dataset reduce disk usage by @Jintao-Huang in #9242
- [docs] fix docs by @Jintao-Huang in #9244
- [docs] update wechat by @Jintao-Huang in #9247
- Document Qwen3.5 FLA patch for NPU support by @hazelduan in #9237
- [docs] update megatron docs by @Jintao-Huang in #9249
- [docs] update docs by @Jintao-Huang in #9250
- Update datasets requirements by @Jintao-Huang in #9252
- [megatron] remove megatron core 0.12-0.14 by @Jintao-Huang in #9260
- update mlp_padding_free by @Jintao-Huang in #9262
- Document FLA/MindSpeed replacement for Qwen3.5 on NPU by @hazelduan in #9238
- Npu doc update by @addsubmuldiv in #9245
- [bugfix] fix tool_call loss_scale by @Jintao-Huang in #9266
- update requirements by @Jintao-Huang in #9275
- [bugfix] fix qwen3_5 template by @Jintao-Huang in #9279
- [Bug fix] Adapt SwiftMixin.create_optimizer signature for transformers >= 4.40 by @ys2025-AI in #9281
- [bugfix] fix grpo rollout step by @hjh0119 in #9264
- [bugfix] fix create_optimizer by @Jintao-Huang in #9282
- [gkd] support buffers & fix some bugs by @hjh0119 in #9278
New Contributors
- @zhangfanTJU made their first contribution in #9039
- @yaoruda made their first contribution in #9031
- @xut806 made their first contribution in #9034
- @sys-reasoner made their first contribution in #9060
- @Treemann made their first contribution in #9069
- @suenphey made their first contribution in #9090
- @Kagura-0001 made their first contribution in #9063
- @fanqiNO1 made their first contribution in #9147
- @YarivColbeci made their first contribution in #9152
- @perone made their first contribution in #9201
- @hazelduan made their first contribution in #9195
- @ys2025-AI made their first contribution in #9281
Full Changelog: v4.1.0...v4.2.0