Releases · modelscope/ms-swift

English Version

New Features

Megatron-SWIFT
a. GRPO training support on Megatron, documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GRPO.html
b. FP8 blockwise training support, including FP8 weight loading and exporting. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/fp8
c. MTP training support, training script: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora/mtp.sh
d. New model support: GPT-OSS, Llama4, InternVL3.5-GPT-OSS, etc.
e. Support for saving checkpoints once per epoch with --save_strategy epoch.
f. Compatible with megatron-core versions 0.12–0.15.
RL
a. New algorithm SAPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/SAPO.html
b. New algorithm CISPO supported, documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CISPO.html
c. Algorithms for mitigating training–inference mismatch, including TIS/MIS and rollout off-policy metrics. Docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html
d. Tree-rollout support, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/treepo.html (Thanks to @li2zhi of the China Merchants Bank (CMB) team for the contribution)
e. GKD training supports liger_kernel loss (--use_liger_kernel true).
f. New GRPO loss types added, docs: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/loss_types.html
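As a rough illustration of the training–inference mismatch item above, the sketch below assumes that TIS stands for truncated importance sampling: each token's policy-gradient term is reweighted by the ratio between the training engine's probability and the rollout engine's probability, capped at a constant. The function name and the `cap` parameter are hypothetical and are not taken from ms-swift; the linked documentation is the authority on the actual implementation.

```python
import math


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token weights min(exp(lp_train - lp_rollout), cap).

    Hypothetical helper: `cap` is an illustrative truncation threshold,
    not an ms-swift parameter.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(train_logprobs, rollout_logprobs)
    ]


# The first token agrees across engines, the second is far more likely under
# the training engine (weight gets capped), the third is less likely.
weights = truncated_is_weights([-1.0, -0.1, -2.0], [-1.0, -3.0, -1.0])
print(weights)
```

Capping the ratio bounds the variance that a few badly mismatched tokens would otherwise introduce, at the cost of a small bias.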
Training
a. Cached dataset refactoring for better offline tokenization of large datasets. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
b. --truncation_strategy split support for pretraining: long texts are split into multiple samples instead of being truncated, so tokens are not wasted.
c. Added packing_num_proc parameter support.
d. Qwen2.5-VL series models are now compatible with "qwen_vl_utils>=0.14".
e. MFU logging plugin support (Thanks to @y2logic).
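The idea behind the split truncation strategy listed above can be sketched as follows. This is a minimal illustration, assuming the split happens on token IDs in fixed-size, non-overlapping chunks; the function name is hypothetical, and how ms-swift handles the final short chunk or chunk boundaries is not stated in these notes.

```python
def split_into_samples(token_ids, max_length):
    """Cut one long token sequence into consecutive chunks of at most max_length.

    Hypothetical helper for illustration only; it keeps the short remainder
    as its own sample instead of discarding it.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# A 10-token document with max_length=4 yields three samples and loses no tokens.
print(split_into_samples(list(range(10)), 4))
```

With plain truncation, the same document would contribute only its first four tokens.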
Domestic Hardware Support (Thanks to the Ascend and CMB technical teams)
a. Megatron-SWIFT supports Ascend NPU, documentation: https://swift.readthedocs.io/en/latest/BestPractices/NPU-support.html
b. Ascend NPU mixed operators support Qwen2, Qwen3, Qwen3-MoE series models, accelerating training.

New Models

Text-only models:
a. moonshotai/Kimi-K2-Thinking
Multimodal models:
a. SenseNova/SenseNova-SI-InternVL3-2B series
b. mistralai/Ministral-3-3B-Instruct-2512 series
c. mistralai/Mistral-Small-3.2-24B-Instruct-2506

What's Changed

bump version 3.11.0.dev by @Jintao-Huang in #6560
[model] support Kimi-K2 by @Jintao-Huang in #6562
[bugfix] fix pp vit_lr by @Jintao-Huang in #6565
[bugfix] fix tools parse in gkd/grpo server mode by @hjh0119 in #6568
[bugfix] fix grpo with reward model by @hjh0119 in #6567
[bugfix] fix mcore-bridge vpp by @Jintao-Huang in #6581
qwen2.5-vl compat qwen_vl_utils version by @Jintao-Huang in #6584
[bugfix] fix packing_length by @Jintao-Huang in #6594
[dataset] support packing_num_proc by @Jintao-Huang in #6592
Fix emb loss scale by @tastelikefeet in #6597
[megatron] compat megatron-core 0.12-0.14 by @Jintao-Huang in #6599
[kto] fix kto loss_type=apo_zero_unpaired by @Jintao-Huang in #6601
Fix command line display for UI by @slin000111 in #6603
Support Megatron GRPO by @hjh0119 in #6025
[megatron] fix train_iters by @Jintao-Huang in #6611
[bugfix] fix modelscope patch_hub by @Jintao-Huang in #6612
[template] support add_eos by @Jintao-Huang in #6613
[dataset] refactor cached_dataset by @Jintao-Huang in #6561
[bugfix]fix add_eos in gkd/grpo for truncated sample encode by @hjh0119 in #6618
Support GKD Liger Kernel Loss by @hjh0119 in #6619
Support generative reranker right pad by @0russwest0 in #6573
update swift image 3.10.1 by @Jintao-Huang in #6622
[model] support mistral 2506 by @Jintao-Huang in #6624
update peft version by @Jintao-Huang in #6621
[bugfix] Fix multinode write conflict mcore-bridge (deepseek-v3) by @Jintao-Huang in #6626
Initialize chord dataset after accelerator setup in GRPOTrainer by @tongchen126 in #6638
[bugfix] fix megatron grpo max_epochs by @hjh0119 in #6646
[bugfix] fix megatron grpo server mode sync weight by @hjh0119 in #6648
[megatron] fix save barrier by @Jintao-Huang in #6653
[bugfix] fix megatron grpo rollout_group by @hjh0119 in #6655
[bugfix] fix chatml chat template by @Jintao-Huang in #6656
[bugfix] fix train_type full freeze_llm by @Jintao-Huang in #6651
[mcore-bridge] optimize gpt_bridge comm by @Jintao-Huang in #6659
[algo] support cispo algorithm by @hjh0119 in #6572
[model] Support SenseNova-SI by @hjh0119 in #6657
[megatron] fix swift export merge_lora by @Jintao-Huang in #6664
[bugfix] memory log is missing on Ascend NPU by @baymax591 in #6647
update doc by @tastelikefeet in #6665
[bugfix] Fix GKD with TRL >= 0.24 & GKD Liger by @hjh0119 in #6663
[template] support truncation_strategy split (swift pt) by @Jintao-Huang in #6672
[bugfix] fix qwen3_omni seq_cls by @Jintao-Huang in #6673
[bugfix] getattr error for activation_offloading in RM training by @hjh0119 in #6677
[bugfix] fix liger-kernel version check by @hjh0119 in #6679
[bugfix] fix qwen3_vl image_list fps by @Jintao-Huang in #6696
[bugfix] fix logprobs in vllm sampling params by @hjh0119 in #6698
[megatron] support global_aux_loss by @Jintao-Huang in #6699
[bugfix] fix megatron grpo local jsonl writer by @hjh0119 in #6700
fix type_type=rm eval trl>=0.25 by @Jintao-Huang in #6701
add npu fsdp example by @addsubmuldiv in #6697
add npu deepspeed example by @addsubmuldiv in https://github.com/model...

Contributors

addsubmuldiv, ji-huazhong, and 18 other contributors

30 Nov 06:35

Jintao-Huang

v3.10.3

479fb10

Patch release v3.10.3

Full Changelog: v3.10.2...v3.10.3

23 Nov 09:58

Jintao-Huang

v3.10.2

650e407

Patch release v3.10.2

Full Changelog: v3.10.1...v3.10.2

16 Nov 16:50

Jintao-Huang

v3.10.1

f7450e2

Patch release v3.10.1

Full Changelog: v3.10.0...v3.10.1

11 Nov 12:14

Jintao-Huang

v3.10.0

35c0542

v3.10.0

English Version

New Features

Megatron-SWIFT
a. Mcore-Bridge released. It supports loading and saving model weights directly in safetensors format, bidirectional conversion of LoRA incremental weights, and multi-node conversion. See the documentation at https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Mcore-Bridge.html and the training scripts at https://github.com/modelscope/ms-swift/tree/main/examples/megatron/mcore_bridge
b. Upgraded megatron-core version to 0.14.0.
c. Added vit_lr and aligner_lr parameter support for multimodal model training.
d. Added storage optimization parameters: async_save, save_retain_interval, etc.
e. Batched mrope support, which speeds up training of Qwen3-VL, Qwen2.5-VL, and other models.
RL
a. GRPO LoRA training weight synchronization speed optimization. Details: https://swift.readthedocs.io/en/latest/Instruction/GRPO/GetStarted/GRPO.html#memory-optimization-solutions-in-colocate-mode
b. GRPO training memory optimization to reduce peak memory consumption.
c. New RLVR algorithms supported: RLOO and REINFORCE++ Baseline. RLOO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/RLOO.html and REINFORCE++ Baseline documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/REINFORCEPP.html
d. GKD supports using vLLM to accelerate policy model rollout, with new parameter teacher_deepspeed for additional control of teacher model sharding strategy. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GKD.html
e. GSPO supports using liger_kernel to reduce memory usage.
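For readers unfamiliar with RLOO (REINFORCE Leave-One-Out), mentioned in the list above, the core idea is that each sampled completion is scored against the mean reward of the other completions for the same prompt. The sketch below shows only that baseline computation; the function name is hypothetical, and the ms-swift implementation may differ in details such as normalization or KL handling.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k completions of a single prompt.

    advantage_i = reward_i - mean(reward_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Four completions, two of them rewarded.
print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because each baseline excludes the sample it is applied to, the baseline stays independent of that sample's own reward.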
Training
a. Ray support added for PT/SFT/sampling/data distillation, documentation: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
b. Qwen3-VL and Qwen3-Omni support mixed modality data training; Qwen3-VL supports Ulysses sequence parallelism. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
c. Support for YAML-based training parameter configuration, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
d. Added FSDP2 training launch example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/fsdp2_lora
e. Added best practice for custom multimodal model registration: https://swift.readthedocs.io/en/latest/BestPractices/MLLM-Registration.html
f. InfoNCE loss in embedding training aligned with Qwen3-Embedding paper description. Documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
g. Added multi-label classification training example, scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/seq_cls/multi_label
h. agent_template supports seed-oss. Thanks to @hpsun1109 for the contribution.
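As background for the InfoNCE item in the list above, here is a minimal sketch of the plain in-batch InfoNCE loss, where each query's positive document sits on the diagonal of a similarity matrix and every other document in the batch acts as a negative. This is the textbook form only: the variant described in the Qwen3-Embedding paper includes additional terms that are not reproduced here, and the function name and default temperature are assumptions rather than ms-swift settings.

```python
import math


def info_nce_loss(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] is the similarity between query i and document j;
    the positive pair for query i is document i.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Well-separated positives give a lower loss than indistinguishable ones.
print(info_nce_loss([[1.0, 0.0], [0.0, 1.0]], temperature=1.0))
print(info_nce_loss([[1.0, 1.0], [1.0, 1.0]], temperature=1.0))
```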
Full Pipeline
a. swift export supports GPTQ-v2 quantization (thanks to @zzc0430 for the contribution), script: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq_v2.sh
b. The vLLM inference backend of swift deploy supports data-parallel (DP) deployment via the --vllm_data_parallel_size parameter. Thanks to @YushunXiang for the contribution.
c. swift deploy adds health/ping endpoints.
d. vLLM deployment added parameters vllm_mm_processor_cache_gb/vllm_engine_kwargs.
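A deployment script might poll the new health endpoint before sending traffic. The sketch below is an assumption-laden example: it guesses that the endpoint is served at the path `/health` and answers with HTTP 200 when the server is ready, neither of which is confirmed by these notes. The helper names are hypothetical, and the base URL must be whatever address your own `swift deploy` instance listens on.

```python
import urllib.error
import urllib.request


def health_url(base_url, path="/health"):
    """Join the server base URL and the (assumed) health path."""
    return base_url.rstrip("/") + path


def is_healthy(base_url, path="/health", timeout=2.0):
    """Return True if a GET on the health path answers with HTTP 200.

    Connection failures, timeouts, and HTTP error statuses all count as unhealthy.
    """
    try:
        with urllib.request.urlopen(health_url(base_url, path), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Calling `is_healthy(base_url)` in a retry loop with a short sleep is usually enough for a startup gate. If the `/ping` path is the one your setup exposes, pass `path="/ping"` instead.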

New Models

Text-only models:
a. Qwen/Qwen3Guard-Gen-0.6B series
b. MiniMax/MiniMax-M2
Multimodal models:
a. Qwen/Qwen3-VL-2B-Instruct series
b. deepseek-ai/DeepSeek-OCR, training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/deepseek_ocr
c. PaddlePaddle/PaddleOCR-VL
d. ZhipuAI/Glyph
e. PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking series
f. lmms-lab/LLaVA-OneVision-1.5-4B-Instruct series

What's Changed

[bugfix] fix image_list qwen2.5/3-omni by @Jintao-Huang in #6122
[model] Support Qwen3-VL dense by @Jintao-Huang in #6120
feat: support gptq_v2 quantization method by @zzc0430 in #6102
[bugfix] fix gptq_v2 by @Jintao-Huang in #6126
[bugfix] patch timeout & fix print_rich_table by @Jintao-Huang in #6137
Add the support for vLLM data parallel configuration in SwiftDeploy by @YushunXiang in #6114
[docs] update vllm deploy DP docs by @Jintao-Huang in #6139
[model] Support Qwen/Qwen3-VL-4B-Instruct series by @Jintao-Huang in #6143
Update loss_scale method call to pass through inputs.extra_kwargs by @CJack812 in #6160
[bugfix] fix qwen3_vl videos by @Jintao-Huang in #6162
Fix bug of sp/cp by @tastelikefeet in #6163
[deploy] update vllm_enable_prefix_caching by @Jintao-Huang in #6165
[bugfix] qwen3-vl support mixed data by @Jintao-Huang in #6161
[template] add_retry by @Jintao-Huang in #6138
[bugfix] Fix multimodal lazy_tokenize false by @Jintao-Huang in #6172
[template] update qwen3_vl grounding dataset format by @Jintao-Huang in #6178
[docs] update docs by @Jintao-Huang in #6180
[bugfix] add tools fields in inputs2reqeusts by @hjh0119 in #6054
[grpo] Optimize vLLM weight synchronization & update builtin accuracy reward by @hjh0119 in #5773
[model] support Qwen/Qwen3Guard-Gen-0.6B series by @Jintao-Huang in #6189
[template] Support qwen3 omni mixed data by @Jintao-Huang in #6196
[docs] update qwen3_vl best practice by @Jintao-Huang in #6206
[vllm] support vllm_mm_processor_cache_gb by @hjh0119 in #6210
[megatron] fix qwen3_vl new_special_tokens by @Jintao-Huang in #6213
[megatron] add mcore save_args by @Jintao-Huang in #6216
[bugfix] fix dtype warning by @Jintao-Huang in #6219
[bugfix] fix infer pt dp by @Jintao-Huang in #6222
support training for multimodal reranker by @0russwest0 in #6192
[bugfix] fix reward_trainer logger by @Jintao-Huang in #6240
[model] Support deepseek-ocr by @Jintao-Huang in #6238
[docs] update deepseek_ocr docs by @Jintao-Huang in #6242
[bugfix] fi...