Releases: modelscope/ms-swift
Patch release v4.0.2
Full Changelog: v4.0.1...v4.0.2
Patch release v4.0.1
Full Changelog: v4.0.0...v4.0.1
v4.0.0
New Features
- Architecture Optimization
a. Refactored the directory structure and optimized dependencies with a modular design, improving the architecture's scalability and customizability.
b. Decoupled `model_type` and `template` to simplify support for models that share one model_type across multiple templates.
c. Rewrote the Megatron-SWIFT training loop using megatron-core instead of the megatron-lm dependency. (Compatible with Ascend NPU)
- Megatron-SWIFT
a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
d. Added the `save_total_limit` parameter to automatically clean up stale checkpoints while retaining the best-performing and latest weights.
e. Added the `apply_wd_to_qk_layernorm` parameter for Qwen3-Next/Qwen3.5 to support applying weight decay to qk layernorm.
f. Multimodal MoE model LoRA supports the `--target_modules all-router` configuration.
- RL
a. Support for the GDPO algorithm to compute advantages, via `--scale_rewards gdpo`. (Thanks to @Auraithm)
b. GKD supports computing the KL divergence from top-k logits to save memory, via `--gkd_topk_logits`.
c. GKD supports using a teacher server, avoiding explicitly loading the teacher model.
- Training
a. Added MuonClip optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
c. Optimized the lm_head computation for generative rerankers to reduce memory usage.
d. FSDP2 supports activation CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1)
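As a hedged illustration of how the new flags above might be combined, here is a minimal launch sketch. The flag names (`--scale_rewards gdpo`, `--gkd_topk_logits`, `save_total_limit`) come from these notes; the `swift rlhf` subcommand, model name, and remaining arguments are assumptions, not verified against this release.

```shell
# Sketch only: GRPO training with the GDPO advantage scaling and
# checkpoint-cleanup options mentioned in these release notes.
# Subcommand, model, and dataset are illustrative assumptions.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3.5-9B \
    --scale_rewards gdpo \
    --save_total_limit 3 \
    --dataset <your_dataset>
```

`--save_total_limit 3` is described above as keeping the newest checkpoints plus the metric-best one, so older checkpoints are deleted automatically rather than accumulating on disk.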
New Models
- Text-only Models
a. Qwen/Qwen3-Coder-Next
b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
c. MiniMaxAI/MiniMax-M2.1
d. Tencent-YouTu-Research/Youtu-LLM-2B
e. IQuestLab/IQuest-Coder-V1-40B-Instruct
f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
- Multi-modal Models
a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
c. deepseek-ai/DeepSeek-OCR-2
d. ZhipuAI/GLM-OCR
e. PaddlePaddle/PaddleOCR-VL-1.5
f. OpenBMB/MiniCPM-o-4_5
g. stepfun-ai/Step3-VL-10B
h. google/medgemma-4b-it series
What's Changed
- [misc] update swift patch_conv3d by @Jintao-Huang in #7320
- add npu megatron multi-node example by @addsubmuldiv in #7321
- [bugfix] fix megatron convert by @Jintao-Huang in #7323
- [model] Support Qwen3-VL-Embedding/Qwen3-VL-Reranker by @Jintao-Huang in #7329
- [reranker] refactor reranker by @Jintao-Huang in #7334
- [bugfix] fix video base64 torchcodec by @Jintao-Huang in #7338
- [bugfix] fix modelopt by @Jintao-Huang in #7339
- [docs] Update swift image 3.12 by @Jintao-Huang in #7332
- [bugfix] fix get_chunked_inputs slice by @hjh0119 in #7346
- fix find node ip by @tastelikefeet in #7350
- Fix multi-modal reranker doc by @tastelikefeet in #7354
- [bugfix] fix app_args by @Jintao-Huang in #7367
- [bugfix] fix qwen2_vl video by @Jintao-Huang in #7376
- [bugfix] fix vllm moe model load_weights by @hjh0119 in #7362
- [v4] refactor ms-swift v4 by @Jintao-Huang in #7238
- feat: support scale rewards "gdpo" by @Auraithm in #7348
- [infer] infer backend pt -> transformers by @Jintao-Huang in #7379
- [docs] update docs & update Copyright by @Jintao-Huang in #7384
- Fix device mismatch in _forward_qwen3_vl_or_qwen3_omni when computing visual_pos_masks by @yaqiangsun in #7372
- add npu qwen3-next example and warning of ep size by @addsubmuldiv in #7390
- [bugfix] fix deepseek_v3_1 thinking template by @Jintao-Huang in #7388
- [docs] update docs & update dataset 'loss' by @Jintao-Huang in #7402
- [bugfix] Fix ref adapters trainable params 0 by @Jintao-Huang in #7403
- [readme] update error timeline of news by @shizhengLi in #7404
- [bugfix] fix sp reranker by @Jintao-Huang in #7405
- [v4] fix ci by @Jintao-Huang in #7559
- [refactor] reorganize reward and rollout modules into dedicated direct… by @hjh0119 in #7397
- [grpo] speedup grpo train stage encode with concurrent by @Cccei000 in #7391
- Update the NPU-supported features table by @addsubmuldiv in #7562
- [bugfix] fix attn_impl by @Jintao-Huang in #7564
- [v4] refactor ms-swift v4 (pipelines/arguments/swiftmixin/callback/tuner_plugin) by @Jintao-Huang in #7385
- [bugfix] fix minimax tp by @Jintao-Huang in #7788
- fix inputs_embeds for hunyuanOCR by @slin000111 in #7803
- [bugfix] fix deepspeed distributed weight offload code by @Silas-11 in #7802
- [generative_reranker] generative reranker logits memory optimization by @Jintao-Huang in #7816
- update requirements by @Jintao-Huang in #7819
- [misc] update issue template by @Jintao-Huang in #7818
- [bugfix] fix dpo by @Jintao-Huang in #7824
- update wechat by @tastelikefeet in #7827
- [bugfix] fix deepspeed optimizer offload code by @Silas-11 in #7821
- [model] support glm4_moe_lite by @Jintao-Huang in #7829
- [bugfix] fix hunyuan ocr by @Jintao-Huang in #7831
- [megatron] support glm_moe_lite by @Jintao-Huang in #7833
- chore: epochs -> epoch by @zzc0430 in #7825
- [optimizer] Set loss mask to compute the loss for multi-turn reasoning by @Simon-ss7 in #7838
- [bugfix] fix recompute_granularity none by @Jintao-Huang in https://gi...
Patch release v3.12.6
What's Changed
Full Changelog: v3.12.5...v3.12.6
Patch release v3.12.5
Full Changelog: v3.12.4...v3.12.5
Patch release v3.12.4
Full Changelog: v3.12.3...v3.12.4
Patch release v3.12.3
Full Changelog: v3.12.2...v3.12.3
Patch release v3.12.2
Full Changelog: v3.12.1...v3.12.2
v3.12.1
What's Changed
- [bugfix] fix glm4_7 agent_template by @Jintao-Huang in #7256
- [bugfix] fix DeepSeek-OCR vllm deploy by @hjh0119 in #7258
- [feat] add async reward function support for GRPO training by @hjh0119 in #7252
- [model] support medgemma by @slin000111 in #7261
- [megatron] Support MiniMaxAI/MiniMax-M2.1 by @Jintao-Huang in #7262
- Support muonclip optimizer by @vx120 in #7191
- add task_type by @slin000111 in #7265
- [bugfix] fix mtp save by @Jintao-Huang in #7267
- [feat] support megatron grpo entropy mask & log by @hjh0119 in #7263
- [model] support iquestcoder by @Jintao-Huang in #7271
- [bugfix] fix reward model adapters by @hjh0119 in #7293
- Fix the issue of repeated inference in multi-turn scheduler. by @Simon-ss7 in #7279
- [bugfix] auto-enable async engine for vLLM encode tasks by @hjh0119 in #7301
- [bugfix] fix vllm_engine load_format by @Jintao-Huang in #7302
- fix npu megatron cp by @addsubmuldiv in #7299
- [misc] Remove unnecessary clone operations during weight synchronization by @hjh0119 in #7308
- [model] support youtu-llm by @hjh0119 in #7306
- [megatron] fix gpt_bridge oom by @Jintao-Huang in #7310
- [misc] fix youtu agent template type-checking by @hjh0119 in #7311
- [bugfix] Fix duplicate 'load_format' argument being passed in rollout by @hjh0119 in #7312
New Contributors
- @Simon-ss7 made their first contribution in #7279
Full Changelog: v3.12.0...v3.12.1
v3.12.0
New Features
- Megatron-SWIFT
a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
b. New model support: GLM4 Dense; GLM4.7; GLM4.6v-Flash, GLM-4.1V.
c. `save_safetensors` supports resuming training from checkpoints; Mcore-Bridge loading and saving is now the recommended approach.
d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
e. `group_by_length` parameter support: groups samples of roughly similar length together (with a random factor) to speed up training in non-packing mode.
f. Support for the `--report_to` parameter to log and visualize training in wandb/swanlab.
g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
h. `train_dataloader_shuffle` parameter support to control whether the training dataset is shuffled.
i. Added a retry mechanism to template.encode to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
- RL
a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
b. GRPO adds the `num_generations_eval` parameter to set the number of generations during the eval stage.
c. Reduced the peak memory usage of GKD loss computation.
d. GRPO/GKD server mode supports IPv6 addresses.
e. Support for structured output sampling via `structured_outputs_regex`.
- Training
a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
b. Support for `--fsdp fsdp2` to use ms-swift's built-in FSDP2 configuration file.
c. `loss_scale` supports three basic strategies: 'default', 'last_round', and 'all', which can be combined with other strategies, e.g. 'last_round+ignore_empty_think'.
d. `cached_dataset` supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
e. Thinking template refactored: ThinkingTemplate functionality merged into Template; added the `enable_thinking` and `add_non_thinking_prefix` parameters.
f. Added the `SWIFT_PATCH_CONV3D` environment variable to work around slow conv3d execution with torch 2.9.
g. Support for the `swanlab_notification_method` parameter to specify how swanlab notifies when training completes or an error occurs.
h. `dataloader_prefetch_factor` parameter default changed from 10 to 2.
- Domestic Hardware (Thanks to the Ascend and CMB technical teams)
a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
b. Qwen3-VL hybrid operator support, see this PR: #7079
c. Updated the Megatron-SWIFT NPU performance/accuracy profiling documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
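As a hedged sketch of the training options above, the following shows how the hybrid `loss_scale`, built-in FSDP2 config, and conv3d workaround might appear in one launch. The flag names and the `SWIFT_PATCH_CONV3D` variable come from these notes; the `swift sft` subcommand, model, and dataset are illustrative assumptions.

```shell
# Sketch only: SFT launch combining options from these release notes.
# SWIFT_PATCH_CONV3D=1 enables the conv3d patch for torch 2.9.
# Model/dataset/subcommand are assumptions, not verified here.
SWIFT_PATCH_CONV3D=1 \
swift sft \
    --model ZhipuAI/GLM-4.7 \
    --fsdp fsdp2 \
    --loss_scale last_round+ignore_empty_think \
    --dataloader_prefetch_factor 2 \
    --report_to swanlab \
    --dataset <your_dataset>
```

The `last_round+ignore_empty_think` value illustrates the hybrid form described above: a basic strategy ('last_round') combined with another strategy via `+`.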
New Models
- Text-only models:
a. ZhipuAI/GLM-4.7 series
b. iic/QwenLong-L1.5-30B-A3B
c. gongjy/MiniMind2 (Thanks to @PiggerZZM)
- Multimodal models:
a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
b. Tencent-Hunyuan/HunyuanOCR
What's Changed
- [model] Support GLM4.6-V by @Jintao-Huang in #6948
- [model] support glm4_6v flash by @Jintao-Huang in #6959
- [bugfix] fix truncation_strategy left by @Jintao-Huang in #6961
- [bugfix] fix megatron save_checkpoint by @Jintao-Huang in #6963
- [feat] GKD support truncation strategy delete to resample by @hjh0119 in #6964
- [misc] megatron grpo check rollout_logps by @hjh0119 in #6970
- [misc] set default group_port for vllm client by @hjh0119 in #6972
- [grpo] support Off-Policy Sequence Masking by @hjh0119 in #6978
- [megatron, misc] support check_latest_model by @hjh0119 in #6988
- [bugfix] fix reranker_padding_free by @Jintao-Huang in #6989
- [megatron] fix eval_iters 1 by @Jintao-Huang in #6990
- Add dense_npu.sh for megatron lora training in huawei npu by @vx120 in #6976
- fix system swift pt by @Jintao-Huang in #7003
- [bugfix] fix qwen_vl_utils torchvision base64 by @Jintao-Huang in #7004
- [bugfix] fix liger_kernel flash_attn by @Jintao-Huang in #7005
- [bugfix] fix qwen3_vl bridge by @Jintao-Huang in #7006
- [bugfix] fix reranker padding_free & fix seq_cls omni padding_free by @Jintao-Huang in #7007
- [npu] add npu qwen3_omni sft example for mindspeed backend by @tongtong0613 in #7008
- [bugfix] qwen-omni3 vllm infer with USE_AUDIO_IN_VIDEO by @hjh0119 in #7009
- [bugfix] fix grpo sleep_level 2 causes gibberish outputs by @hjh0119 in #7017
- add npu vllm-ascend docs and examples by @addsubmuldiv in #7013
- [compat] fix mcore012 compat torch new by @Jintao-Huang in #7021
- [megatron] Megatron support random/non-random dataloader by @Jintao-Huang in #7016
- [bugfix] megatron add retry to avoid hang by @Jintao-Huang in #7023
- [trainer] refactor acc metrics by @Jintao-Huang in #7026
- [infer] update embddding/reranker demo by @Jintao-Huang in #7029
- [train] support embeding/reranker packing & support reranker/embedding cache_dataset by @Jintao-Huang in #6987
- update readme by @Jintao-Huang in #7033
- [misc] update swift image by @Jintao-Huang in #7039
- [bugfix] remove add_eos for rm in grpo by @hjh0119 in #7040
- [npu] Fix device mismatch in weight sync for HCCL communicator by @singing4you in #7036
- collect npu profiling data by @OneMondy in #6977
- [bugfix] fix null_ref_context by @Jintao-Huang in #7042
- [model] support hunyuan_ocr by @slin000111 in #7038
- update flash_attn version; fix mcore 0.15 hang by @Jintao-Huang in #7043
- [bugfix] fix grpo multi turn log_entropy by @hjh0119 in #7044
- [bugfix] fix dataloader megatron by @Jintao-Huang in #7050
- [grpo] support num_generations_eva...