Releases: modelscope/ms-swift
Patch release v4.2.3
Full Changelog: v4.2.2...v4.2.3
Patch release v4.2.2
Full Changelog: v4.2.1...v4.2.2
Patch release v4.2.1
Full Changelog: v4.2.0...v4.2.1
v4.2.0
中文版
新特性
- Megatron-SWIFT
a. 新增 model_type 支持:kimi_k25、hy_v3、llava_onevision。(llava_onevision 感谢 @randydl 的贡献)
b. 支持 GLM-5 共享参数 MTP,可通过--mtp_shared_weights参数启用。
c. 支持 Qwen3.5 FP8 训练,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
d. 自定义 Megatron 模型文档:https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/Custom-Model.html
e. 支持控制 MTP 分支中decoder_input是否停止梯度,即 MTP loss 能否直接通过 decoder_input 回传梯度到 Embedding/ViT,可通过--mtp_decoder_input_detach参数控制。
f. mlp_padding_free 参数兼容序列并行
g. 支持通过megatron export命令进行权重 FP8 量化导出,脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
h. 移除对 megatron-core 0.12 - 0.14 版本的依赖兼容支持。
- RL
a. GKD/OPSD 支持设置generation_batch_size/steps_per_generation参数。
b. GKD/OPSD teacher_server_api 兼容多模态训练。
c. GKD/OPSD 兼容 padding_free。
d. Megatron GRPO/GKD 权重同步支持仅同步 LoRA 权重。
e. swift rollout 新增异常捕获机制,避免进程静默卡死。
f. GRPO ref_sync_callback 支持在 ZeRO-3 下进行分层 gather,避免 OOM。
g. GRPO TRL 依赖版本升级至 >= 0.26。
- 训练
a. 支持 Qwen3.5 序列并行,可通过--sequence_parallel_size参数控制。(感谢 @meichangsu1 的贡献)
b. 支持在数据集中直接指定loss_scale,提供更灵活的控制方式,参考文档:https://swift.readthedocs.io/zh-cn/latest/Customization/Custom-dataset.html#id4
c. 数据集 datasets 依赖兼容 4.x 版本。
d. cached_dataset 与--truncation_strategy split策略兼容。
- 硬件
a. NPU 支持基于 transformers/Megatron 后端的 Qwen3.5 训练,使用 Megatron 后端时需开启USE_MCORE_GDN=0环境变量。(感谢 @addsubmuldiv、@hazelduan 的贡献)
b. 新增 AMD 支持文档:https://swift.readthedocs.io/zh-cn/latest/BestPractices/AMD-support.html (感谢 @Treemann 的贡献)
c. 支持 Metax 硬件的 RL 训练。(感谢 @suenphey 的贡献)
d. NPU Megatron 训练兼容 megatron-core 0.15.3。(感谢 @addsubmuldiv 的贡献)
新模型
- 纯文本模型
a. ZhipuAI/GLM-5.1
b. MiniMax/MiniMax-M2.7
c. moonshotai/Kimi-K2.6(仅含纯文本)
d. Tencent-Hunyuan/Hy3-preview
e. AIDC-AI/Marco-Nano-Instruct 系列
- 多模态模型
a. Qwen/Qwen3.6-35B-A3B、Qwen/Qwen3.6-27B
b. Qwen3-ASR(感谢 @xut806 的贡献)
c. Gemma4 系列模型混合模态数据集训练支持
d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
e. OpenBMB/MiniCPM-o-4_5 新增音频模态支持(感谢 @fanqiNO1 的贡献)
f. allenai/Molmo2-4B(感谢 @Kagura-0001 的贡献)
English Version
New Features
- Megatron-SWIFT
a. Added model_type support: kimi_k25, hy_v3, llava_onevision. (llava_onevision contributed by @randydl)
b. Added support for GLM-5 shared-parameter MTP, which can be enabled via the --mtp_shared_weights argument.
c. Added support for Qwen3.5 FP8 training. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
d. Custom Megatron model documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Custom-Model.html
e. Added support for controlling whether decoder_input stops gradients in the MTP branch (i.e., whether the MTP loss can backpropagate gradients through decoder_input to Embedding/ViT), configurable via the --mtp_decoder_input_detach argument.
f. mlp_padding_free is now compatible with Sequence Parallelism.
g. Added support for FP8 quantization export via the megatron export command. Script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
h. Removed dependency compatibility support for megatron-core versions 0.12 - 0.14.
- RL
a. GKD/OPSD now supports the generation_batch_size/steps_per_generation parameters.
b. GKD/OPSD teacher_server_api is now compatible with multimodal training.
c. GKD/OPSD is now compatible with padding_free.
d. Megatron GRPO/GKD weight synchronization now supports syncing LoRA weights only.
e. Added exception handling to swift rollout to prevent silent process hangs.
f. GRPO ref_sync_callback now supports layer-wise gather under ZeRO-3 to avoid OOM.
g. GRPO TRL dependency upgraded to >= 0.26.
- Training
a. Added support for Qwen3.5 Sequence Parallelism, controllable via the --sequence_parallel_size argument. (Contributed by @meichangsu1)
b. Added support for specifying loss_scale directly in the dataset for more flexible loss control. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
c. Dataset dependency is now compatible with datasets 4.x.
d. cached_dataset is now compatible with the --truncation_strategy split strategy.
- Hardware
a. NPU now supports Qwen3.5 training with transformers/Megatron backends. When using the Megatron backend, the USE_MCORE_GDN=0 environment variable must be set. (Contributed by @addsubmuldiv, @hazelduan)
b. Added AMD support documentation: https://swift.readthedocs.io/en/latest/BestPractices/AMD-support.html (Contributed by @Treemann)
c. Added RL training support for MetaX hardware. (Contributed by @suenphey)
d. NPU Megatron training is now compatible with megatron-core 0.15.3. (Contributed by @addsubmuldiv)
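As a rough illustration of the "syncing LoRA weights only" item under RL above, the sketch below filters a parameter dictionary down to its adapter entries before a sync step. It is not the ms-swift implementation: the function name `select_lora_weights`, the `"lora_"` name marker (borrowed from the common PEFT naming of `lora_A`/`lora_B` tensors), and the example parameter names are all assumptions made for illustration.

```python
def select_lora_weights(state_dict, marker="lora_"):
    """Return only the entries whose parameter name contains the LoRA marker.

    Assumption: adapter tensors are identifiable by name (e.g. "lora_A",
    "lora_B"), so everything else is a frozen base weight that the rollout
    engine already holds and does not need to receive again.
    """
    return {name: weight for name, weight in state_dict.items() if marker in name}


# Hypothetical parameter names; the values stand in for tensors.
state = {
    "layers.0.attn.q_proj.base_layer.weight": [0.1, 0.2],
    "layers.0.attn.q_proj.lora_A.default.weight": [0.3],
    "layers.0.attn.q_proj.lora_B.default.weight": [0.4],
}

to_sync = select_lora_weights(state)
print(sorted(to_sync))
```

Because the base weights are frozen during LoRA training, sending only the adapter tensors keeps the payload per sync step small relative to the full model.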
New Models
- Text-only Models
a. ZhipuAI/GLM-5.1
b. MiniMax/MiniMax-M2.7
c. moonshotai/Kimi-K2.6 (text-only)
d. Tencent-Hunyuan/Hy3-preview
e. AIDC-AI/Marco-Nano-Instruct series
- Multimodal Models
a. Qwen/Qwen3.6-35B-A3B, Qwen/Qwen3.6-27B
b. Qwen3-ASR (Contributed by @xut806)
c. Added mixed-modality dataset training support for Gemma4 series models.
d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
e. OpenBMB/MiniCPM-o-4_5 now supports audio modality. (Contributed by @fanqiNO1)
f. allenai/Molmo2-4B (Contributed by @Kagura-0001)
What's Changed
- [model] Support GLM-5.1 by @Jintao-Huang in #9038
- [docs] update readme by @Jintao-Huang in #9043
- [docs] update qwen3.5 best practice by @zhangfanTJU in #9039
- [bugfix] sync template.padding_free with args after prepare_model for… by @yaoruda in #9031
- [bugfix] fix gemma4 audio batch by @Jintao-Huang in #9045
- [megatron] refactor forward_step_helper by @Jintao-Huang in #9048
- [megatron] update megatron destroy_process_group by @Jintao-Huang in #9052
- feat: add Qwen3-ASR model support (#8118) by @xut806 in #9034
- [bugfix] fix multi-node server mode weight sync race condition by @sys-reasoner in #9060
- update qwen_asr by @Jintao-Huang in #9061
- [bugfix] fix qwen3_reranker mcore_model_type by @Jintao-Huang in #9062
- [bugfix] fix qwen3 omni template by @addsubmuldiv in #9066
- [docs] add AMD best practices by @Treemann in #9069
- Update npu mindspeed doc and fix new version mindspeed's cp error by @addsubmuldiv in #9067
- [bugfix] fix megatron vllm_engine_kwargs & cosine_max_len by @hjh0119 in #9072
- [bugfix] fix transformers generate default top_k by @hjh0119 in #9071
- [model] support MinerU2.5-Pro by @Jintao-Huang in #9074
- [bugfix] fix megatron pt by @Jintao-Huang in #9075
- [model] Support minimax 2.7 by @Jintao-Huang in #9079
- [bugfix] fix gemma4 31b by @Jintao-Huang in #9080
- [bugfix] fix vllm (0.19.0) qwen3_5 by @Jintao-Huang in #9086
- [bugfix] fix gemma4 zero3 by @Jintao-Huang in #9083
- [bugfix] fix gemma4 system by @Jintao-Huang in #9089
- [bugfix] fix bge-m3 reranker by @Jintao-Huang in #9091
- remove prompt id for megatron grpo by @hjh0119 in #9094
- [docs] update npu docs en by @Jintao-Huang in #9097
- [metax] support pynccl communicator in vllm by @suenphey in #9090
- [bugfix] fix megatron finetune by @Jintao-Huang in #9099
- [grpo] set default load_format auto by @hjh0119 in #9100
- update qr code by @tastelikefeet in #9109
- Optimize weight synchronization for LoRA adapter weights by @hjh0119 in #9077
- support gemma4 vllm multi-modal inference by @hjh0119 in #9105
- [bugfix] fix gptq transformers>=5.0 by @Jintao-Huang in #9042
- [bugfix] Fix gemma4 image template by @Jintao-Huang in #9115
- fix bugs by @hpsun1109 in #9120
- [bugfix] fix vit_gc by @Jintao-Huang in #9125
- [megatron] support qwen3.5 fp8 by @Jintao-Huang in #9106
- fix chunked data slicing in multi-turn GRPO by @hjh0119 in #9128
- [model] support qwen3.6 by @Jintao-Huang in #9129
- [bugfix] fix vllm mtp by @Jintao-Huang in #9138
- Update shell by @Jintao-Huang in #9140
- [model] Support Marco by @Jintao-Huang in https://github.com...
Patch release v4.1.3
Full Changelog: v4.1.2...v4.1.3
Patch release v4.1.2
Full Changelog: v4.1.1...v4.1.2
Patch release v4.1.1
Full Changelog: v4.1.0...v4.1.1
v4.1.0
中文版
新特性
- Megatron-SWIFT
a. mcore-bridge 从 ms-swift 拆分成独立 repo,为最先进模型提供 megatron-core 模型定义:https://github.com/modelscope/mcore-bridge
b. 支持 GRPO Router Replay,使用--router_replay_mode参数。 感谢招商技术团队 @XianlongLi 的贡献。
c. Qwen3.5 解除 TP 数受num_query_groups限制的约束,支持 CP 和序列 packing,并支持多模态 MTP。参考 Qwen3.5 最佳实践:https://swift.readthedocs.io/zh-cn/latest/BestPractices/Qwen3_5-Best-Practice.html
d. 新模型支持:GLM-5、Deepseek-v3.2 和 MiniMax2.5。
e. 支持 muon、dist_muon 优化器,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/megatron/muon.sh
f. 支持--tuner_type lora_llm,对 LLM 部分使用 LoRA 训练,对 ViT/Aligner 使用全参数训练。训练脚本参考:https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal/lora_llm_vit_full
- RL
a. OPSD 算法支持,支持设置教师模型为训练模型并支持设置 teacher_prompt,参考https://swift.readthedocs.io/zh-cn/latest/Instruction/GKD.html#opsd-on-policy-self-distillation
b. REAL 算法支持,使用--loss_type real参数。感谢招商技术团队 @li2zhi 的贡献。
c. 支持 QLoRA GRPO,参考 https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/qlora.sh
d. GRPO K3-KL 计算增加 clamp 操作稳定训练。
e. top-k 默认值从 50 修改为 -1,top-p 默认值从 0.95 修改为 1。
- 训练
a. 优化 yaml 启动方式的支持,参考:https://github.com/modelscope/ms-swift/tree/main/examples/yaml
b. 新增架构文档:https://swift.readthedocs.io/zh-cn/latest/Customization/Architecture.html
c. 新增 Metax 支持最佳实践:https://swift.readthedocs.io/zh-cn/latest/BestPractices/Metax-support.html
d. 新增通过uv安装 ms-swift 的支持。
新模型
- 纯文本模型
a. MiniMax/MiniMax-M2.5
b. deepseek-ai/DeepSeek-V3.2
c. Alibaba-AAIG/YuFeng-XGuard-Reason-0.6B系列 (感谢 @ciaoyizhen 的贡献)
- 多模态模型
a. google/gemma-4-E2B-it系列,脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4/train.sh
English Version
New Features
- Megatron-SWIFT
a. mcore-bridge has been split from ms-swift into an independent repository, providing megatron-core model definitions for state-of-the-art models: https://github.com/modelscope/mcore-bridge
b. Support for GRPO Router Replay via the --router_replay_mode parameter. Thanks to @XianlongLi from the CMB Tech team for the contribution.
c. Qwen3.5 removes the TP size restriction imposed by num_query_groups, with added support for CP, sequence packing, and multimodal MTP. Refer to the Qwen3.5 best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Qwen3_5-Best-Practice.html
d. New model support: GLM-5, DeepSeek-V3.2, and MiniMax2.5.
e. Support for muon and dist_muon optimizers. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/muon.sh
f. Support for --tuner_type lora_llm, enabling LoRA training on the LLM component and full-parameter training on ViT/Aligner. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/multimodal/lora_llm_vit_full
- RL
a. Support for the OPSD algorithm, with the ability to set the teacher model as the training model and configure teacher_prompt. Refer to: https://swift.readthedocs.io/zh-cn/latest/Instruction/GKD.html#opsd-on-policy-self-distillation
b. Support for the REAL algorithm via the --loss_type real parameter. Thanks to @li2zhi from the CMB Tech team for the contribution.
c. Support for QLoRA GRPO. Refer to: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/internal/qlora.sh
d. Added clamp operation to GRPO K3-KL computation for training stability.
e. Changed the default value of top-k from 50 to -1, and top-p from 0.95 to 1.
- Training
a. Improved support for YAML-based launch configurations. Refer to: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
b. Added architecture documentation: https://swift.readthedocs.io/zh-cn/latest/Customization/Architecture.html
c. Added Metax support best practices: https://swift.readthedocs.io/zh-cn/latest/BestPractices/Metax-support.html
d. Added support for installing ms-swift via uv.
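To make the top-k/top-p default change listed under RL above concrete, the following sketch shows how the two settings typically shape a sampling distribution. It is a standalone illustration, not ms-swift code: the function `filter_probs` is hypothetical, and it assumes the common sampler convention that a non-positive top-k and a top-p of 1 both mean "no filtering".

```python
import math


def filter_probs(logits, top_k=-1, top_p=1.0):
    """Return renormalized sampling probabilities after top-k / top-p filtering.

    Assumed convention: top_k <= 0 disables top-k filtering and
    top_p >= 1.0 disables nucleus (top-p) filtering.
    """
    # Softmax with max-subtraction for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    keep = set(order)
    if top_k > 0:
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Smallest prefix of tokens whose cumulative probability reaches top_p.
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus

    kept_total = sum(probs[i] for i in keep)
    return [probs[i] / kept_total if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]
old_style = filter_probs(logits, top_k=50, top_p=0.95)  # previous defaults
new_style = filter_probs(logits)                        # top_k=-1, top_p=1
print(old_style)
print(new_style)
```

Under these assumptions, the previous defaults cut off the low-probability tail (here the least likely token gets probability zero), while the new defaults leave the model's full distribution untouched.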
New Models
- Text-Only Models
a. MiniMax/MiniMax-M2.5
b. deepseek-ai/DeepSeek-V3.2
c. Alibaba-AAIG/YuFeng-XGuard-Reason-0.6B series (Thanks to @ciaoyizhen for the contribution)
- Multimodal Models
a. google/gemma-4-E2B-it series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4/train.sh
What's Changed
- [docs] update arch docs by @Jintao-Huang in #8185
- [docs] update qwen3.5 best practice by @Jintao-Huang in #8189
- [docs] fix docs by @Jintao-Huang in #8191
- feat(megatron): add on_save callback to MegatronCallback by @inzamam-iqbal in #8187
- [model] support qwen3.5 mtp by @Jintao-Huang in #8194
- [bugfix] fix minimax 2.1 enable_tp by @Jintao-Huang in #8199
- [megatron] comcat mcore 016 by @Jintao-Huang in #8204
- feat: Add YuFeng XGuard template support for training by @ciaoyizhen in #8179
- [bugfix] fix num_query_groups by @Jintao-Huang in #8206
- [bugfix] fix max_shard_size transformers 5.x by @Jintao-Huang in #8209
- Fix for load_dataset function to restore ability to use custom loader by @gusario in #8184
- [megatron] support GLM-5 megatron by @Jintao-Huang in #8085
- [bugfix] fix kimi k2 by @Jintao-Huang in #8229
- [megatron] support deepseek-v3.2 by @Jintao-Huang in #8226
- [model] support minimax 2.5 by @Jintao-Huang in #8235
- [docs] support uv by @Jintao-Huang in #8190
- [bugfix] fix megatron kimi_k2 by @Jintao-Huang in #8238
- [docs] add swift 4.0 image by @Jintao-Huang in #8242
- [docs] compat npu megatron by @Jintao-Huang in #8244
- [bugfix] fix eval-generation-config json parse by @hjh0119 in #8246
- [bugfix] fix megatron grpo log completion-length by @hjh0119 in #8247
- [bugfix] fix callbacks by @Jintao-Huang in #8250
- [compat] compat transformers 5.3.0 by @Jintao-Huang in #8249
- [fix] update ascend communication and fix megatron issue by @jiaqiw09 in #8243
- [megatron] Qwen3.5 supports larger num_query_groups (mcore 0.16) by @Jintao-Huang in #8253
- [docs] update docs modelscope.ai by @Jintao-Huang in #8258
- [doc] update qwen3.5 best practice doc by @hjh0119 in #8255
- [bugfix] fix accelerator by @Jintao-Huang in #8261
- add metax best practices by @qq1243196045 in #8251
- [docs] fix metax docs index by @Jintao-Huang in #8264
- [bugfix] fix gkd load teacher by @hjh0119 in #8265
- [bugfix] Fix qwen3 omni image_patch_size by @Jintao-Huang in #8236
- [docs] fix metax docs by @Jintao-Huang in #8270
- Perf: avoid intermediate tensor allocs via in-place div & optimized top-k flow by @hjh0119 in #8268
- [megatonr] update padding_free check by @Jintao-Huang in #8274
- [bugfix] fix weight sync with vllm_enable_lora and resume_from_checkpoint by @hjh0119 in #8275
- [bugfix] update sync method for different backend by @jiaqiw09 in #8273
- [megatron] _get_param_groups compat mcore016 by @Jintao-Huang in #8278
- [bugfix] fix trl import vllm_ascend by @hjh0119 in #8280
- [bugfix] ignore max_length error by @Jintao-Huang in #8279
- fix npu hccl timeout by @addsubmuldiv in #8281
- [bugfix] fix tuner_type by @Jintao-Huang in #8283
- [megatron] qwen3.5 use megatron-core GDN by @Jintao-Huang in #8282
- [docs] update docs by @Jintao-Huang in #8292
- [doc] qwen3.5 moe grpo examples by @hjh0119 in #8302
- [bugfix] fix tie_word_embeddings seq_cls by @Jintao-Huang in #8297
- [bugfix] fix megatron mcore 015 qwen3_5 by @Jintao-Huang in #8311
- update npu fsdp example by @addsubmuldiv in #8308
- [bugfix] fix mtp rope by @Jintao-Huang in #8316
- [megatron] support qwen3_5 packing by @Jintao-Huang in #8313
- [bugfix] fix megatron grpo ris by @hjh0119 in #8321
- [bugfix] fix megatron gkd tp top-k by @hjh0119 in https://...
Patch release v4.0.4
Full Changelog: v4.0.3...v4.0.4
Patch release v4.0.3
Full Changelog: v4.0.2...v4.0.3