Releases: modelscope/ms-swift

Patch release v4.0.2

14 Mar 14:20

Patch release v4.0.1

08 Mar 04:33

v4.0.0

03 Mar 08:25

English Version

New Features

  1. Architecture Optimization
    a. Refactored the directory structure and optimized dependencies with a modular design, improving the architecture's extensibility and customizability.
    b. Decoupled model_type and template, simplifying support for models that share one model_type across multiple templates.
    c. Rewrote the Megatron-SWIFT training loop to depend on megatron-core instead of megatron-lm. (Compatible with Ascend NPU)
  2. Megatron-SWIFT
    a. New model support: Qwen3.5 series, GLM4.7-Flash, MiniMax-M2.1, OLMoE.
    b. Embedding task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding
    c. Reranker task support. Training example: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker
    d. Added the save_total_limit parameter to automatically clean up stale checkpoints while retaining the best-performing and most recent weights.
    e. Added the apply_wd_to_qk_layernorm parameter for Qwen3-Next/Qwen3.5 to apply weight decay to the qk layernorm.
    f. LoRA for multimodal MoE models supports the --target_modules all-router configuration.
  3. RL
    a. Added the GDPO algorithm for advantage computation, enabled with --scale_rewards gdpo. (Thanks to @Auraithm)
    b. GKD can compute the KL term over top-k logits to save GPU memory, enabled with --gkd_topk_logits.
    c. GKD supports a teacher server, avoiding explicitly loading the teacher model.
  4. Training
    a. Added MuonClip optimizer support. Training example: https://github.com/modelscope/ms-swift/blob/main/examples/train/optimizer/muonclip.sh (Thanks to @vx120)
    b. Dependency updates: compatible with the latest dependencies, including Python 3.12, transformers 5.2.0, vllm 0.15.1, trl 0.28, liger-kernel 0.7.0, etc.
    c. Optimized the lm_head computation of the generative reranker to reduce memory usage.
    d. FSDP2 supports enabling CPU offload; added DeepSpeed elastic support. (Thanks to @meichangsu1)
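
The top-k KL trick mentioned under RL (GKD with --gkd_topk_logits) can be sketched in plain Python: the KL term is computed only over the teacher's k highest-logit vocabulary positions, renormalized, so the full vocab-sized probability tensors never need to be materialized at once. This is an illustrative sketch of the general technique, not the ms-swift implementation; the function name topk_kl is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_kl(teacher_logits, student_logits, k):
    """Approximate KL(teacher || student) for one token position,
    restricted to the teacher's top-k vocabulary entries (renormalized)."""
    # Indices of the teacher's k largest logits.
    idx = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    t = softmax([teacher_logits[i] for i in idx])
    s = softmax([student_logits[i] for i in idx])
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))
```

With identical teacher and student logits the truncated KL is zero, and it grows as the student's distribution diverges from the teacher's on the retained positions; memory scales with k instead of the vocabulary size.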

New Models

  1. Text-only Models
    a. Qwen/Qwen3-Coder-Next
    b. ZhipuAI/GLM-4.7-Flash, ZhipuAI/GLM-5
    c. MiniMaxAI/MiniMax-M2.1
    d. Tencent-YouTu-Research/Youtu-LLM-2B
    e. IQuestLab/IQuest-Coder-V1-40B-Instruct
    f. allenai/OLMoE-1B-7B-0924-Instruct series (Thanks to @qianhao0713)
  2. Multi-modal Models
    a. Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-9B series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5
    b. Qwen3-VL-Embedding, Qwen3-VL-Reranker. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/embedding/qwen3, https://github.com/modelscope/ms-swift/tree/main/examples/train/reranker/qwen3
    c. deepseek-ai/DeepSeek-OCR-2
    d. ZhipuAI/GLM-OCR
    e. PaddlePaddle/PaddleOCR-VL-1.5
    f. OpenBMB/MiniCPM-o-4_5
    g. stepfun-ai/Step3-VL-10B
    h. google/medgemma-4b-it series

What's Changed

Patch release v3.12.6

28 Feb 01:46

What's Changed

Full Changelog: v3.12.5...v3.12.6

Patch release v3.12.5

14 Feb 10:10

Patch release v3.12.4

03 Feb 16:44

Patch release v3.12.3

24 Jan 06:19

Patch release v3.12.2

17 Jan 07:20

v3.12.1

08 Jan 02:29

What's Changed

New Contributors

Full Changelog: v3.12.0...v3.12.1

v3.12.0

30 Dec 03:24

English Version

New Features

  1. Megatron-SWIFT
    a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
    b. New model support: GLM4 Dense, GLM4.7, GLM4.6v-Flash, GLM-4.1V.
    c. save_safetensors supports resuming training from a checkpoint; Mcore-Bridge loading and saving is now the recommended approach.
    d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
    e. Added the group_by_length parameter, which groups samples of similar length together (with a random factor) to speed up training in non-packing mode.
    f. Support for the --report_to parameter to log and visualize training in wandb/swanlab.
    g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
    h. Added the train_dataloader_shuffle parameter to control whether the training dataset is shuffled.
    i. Added a retry mechanism to template.encode to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
  2. RL
    a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
    b. GRPO adds the num_generations_eval parameter to set the number of generations during evaluation.
    c. Reduced the peak memory of GKD loss computation.
    d. GRPO/GKD server mode supports IPv6 addresses.
    e. Support for structured output sampling using structured_outputs_regex.
  3. Training
    a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
    b. Support for --fsdp fsdp2 to use ms-swift's built-in FSDP2 configuration file.
    c. loss_scale supports three basic strategies ('default', 'last_round', 'all') that can be mixed with other strategies, e.g. 'last_round+ignore_empty_think'.
    d. cached_dataset supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
    e. Refactored the thinking template: ThinkingTemplate functionality is merged into Template; added the enable_thinking and add_non_thinking_prefix parameters.
    f. Added the SWIFT_PATCH_CONV3D environment variable to work around slow conv3d execution under torch 2.9.
    g. Support for the swanlab_notification_method parameter to specify the swanlab notification method when training completes or an error occurs.
    h. The default value of dataloader_prefetch_factor changed from 10 to 2.
  4. Domestic Hardware (Thanks to the Ascend and China Merchants Bank technical teams)
    a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
    b. Qwen3-VL hybrid operator support, see this PR: #7079
    c. Updated Megatron-SWIFT NPU performance collection/accuracy collection documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
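
The group_by_length idea above (New Features, Megatron-SWIFT item e) can be illustrated with a small pure-Python sketch: shuffle the dataset, then sort within larger windows so each batch holds samples of similar length, which reduces padding in non-packing mode while keeping a random factor. This is a generic sketch of the bucketing technique, not the ms-swift implementation; the function name and the window size are illustrative choices.

```python
import random

def group_by_length(lengths, batch_size, seed=0):
    """Return batches of sample indices such that samples inside a
    batch have similar lengths, with randomness from an initial shuffle."""
    rng = random.Random(seed)
    idx = list(range(len(lengths)))
    rng.shuffle(idx)                      # the random factor
    window = batch_size * 50              # sort within a larger window
    batches = []
    for start in range(0, len(idx), window):
        mega = sorted(idx[start:start + window], key=lambda i: lengths[i])
        batches += [mega[i:i + batch_size]
                    for i in range(0, len(mega), batch_size)]
    return batches
```

Every index appears exactly once across the batches, and within each batch the sample lengths are non-decreasing, so padding per batch stays small.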

New Models

  1. Text-only models:
    a. ZhipuAI/GLM-4.7 series
    b. iic/QwenLong-L1.5-30B-A3B
    c. gongjy/MiniMind2 (Thanks to @PiggerZZM)
  2. Multimodal models:
    a. ZhipuAI/GLM-4.6V, ZhipuAI/GLM-4.6V-Flash series
    b. Tencent-Hunyuan/HunyuanOCR

What's Changed