v4.2.0

@Jintao-Huang Jintao-Huang released this 07 May 09:14
· 161 commits to main since this release

中文版

新特性

  1. Megatron-SWIFT
    a. 新增 model_type 支持:kimi_k25、hy_v3、llava_onevision。(llava_onevision 感谢 @randydl 的贡献)
    b. 支持 GLM-5 共享参数 MTP,可通过 --mtp_shared_weights 参数启用。
    c. 支持 Qwen3.5 FP8 训练,训练脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
    d. 自定义 Megatron 模型文档:https://swift.readthedocs.io/zh-cn/latest/Megatron-SWIFT/Custom-Model.html
    e. 支持控制 MTP 分支中 decoder_input 是否停止梯度,即 MTP loss 能否直接通过 decoder_input 回传梯度到 Embedding/ViT,可通过 --mtp_decoder_input_detach 参数控制。
    f. mlp_padding_free 参数兼容序列并行
    g. 支持通过 megatron export 命令进行权重 FP8 量化导出,脚本参考:https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
    h. 移除对 megatron-core 0.12 - 0.14 版本的依赖兼容支持。
  2. RL
    a. GKD/OPSD 支持设置 generation_batch_size/steps_per_generation 参数。
    b. GKD/OPSD teacher_server_api 兼容多模态训练。
    c. GKD/OPSD 兼容 padding_free。
    d. Megatron GRPO/GKD 权重同步支持仅同步 LoRA 权重。
    e. swift rollout 新增异常捕获机制,避免进程静默卡死。
    f. GRPO ref_sync_callback 支持在 ZeRO-3 下进行分层 gather,避免 OOM。
    g. GRPO TRL 依赖版本升级至 >= 0.26。
  3. 训练
    a. 支持 Qwen3.5 序列并行,可通过 --sequence_parallel_size 参数控制。(感谢 @meichangsu1 的贡献)
    b. 支持在数据集中直接指定 loss_scale,提供更灵活的控制方式,参考文档:https://swift.readthedocs.io/zh-cn/latest/Customization/Custom-dataset.html#id4
    c. 数据集 datasets 依赖兼容 4.x 版本。
    d. cached_dataset 与 --truncation_strategy split 策略兼容。
  4. 硬件
    a. NPU 支持基于 transformers/Megatron 后端的 Qwen3.5 训练,使用 Megatron 后端时需开启 USE_MCORE_GDN=0 环境变量。(感谢 @addsubmuldiv、@hazelduan 的贡献)
    b. 新增 AMD 支持文档:https://swift.readthedocs.io/zh-cn/latest/BestPractices/AMD-support.html (感谢 @Treemann 的贡献)
    c. 支持 Metax 硬件的 RL 训练。(感谢 @suenphey 的贡献)
    d. NPU Megatron 训练兼容 megatron-core 0.15.3。(感谢 @addsubmuldiv 的贡献)

新模型

  1. 纯文本模型
    a. ZhipuAI/GLM-5.1
    b. MiniMax/MiniMax-M2.7
    c. moonshotai/Kimi-K2.6(仅含纯文本)
    d. Tencent-Hunyuan/Hy3-preview
    e. AIDC-AI/Marco-Nano-Instruct 系列
  2. 多模态模型
    a. Qwen/Qwen3.6-35B-A3B、Qwen/Qwen3.6-27B
    b. Qwen3-ASR(感谢 @xut806 的贡献)
    c. Gemma4 系列模型混合模态数据集训练支持
    d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
    e. OpenBMB/MiniCPM-o-4_5 新增音频模态支持(感谢 @fanqiNO1 的贡献)
    f. allenai/Molmo2-4B(感谢 @Kagura-0001 的贡献)

English Version

New Features

  1. Megatron-SWIFT
    a. Added model_type support: kimi_k25, hy_v3, llava_onevision. (llava_onevision contributed by @randydl)
    b. Added support for GLM-5 shared-parameter MTP, which can be enabled via the --mtp_shared_weights argument.
    c. Added support for Qwen3.5 FP8 training. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/models/qwen3_5/fp8.sh
    d. Custom Megatron model documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Custom-Model.html
    e. Added an option to control whether decoder_input is detached (gradient stopped) in the MTP branch, i.e., whether the MTP loss can backpropagate through decoder_input to the Embedding/ViT. Configurable via the --mtp_decoder_input_detach argument.
    f. mlp_padding_free is now compatible with Sequence Parallelism.
    g. Added support for FP8 quantization export via the megatron export command. Script reference: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/quant.sh
    h. Dropped compatibility with megatron-core versions 0.12 through 0.14.
  2. RL
    a. GKD/OPSD now supports the generation_batch_size/steps_per_generation parameters.
    b. GKD/OPSD teacher_server_api is now compatible with multimodal training.
    c. GKD/OPSD is now compatible with padding_free.
    d. Megatron GRPO/GKD weight synchronization now supports syncing LoRA weights only.
    e. Added exception handling to swift rollout to prevent silent process hangs.
    f. GRPO ref_sync_callback now supports layer-wise gather under ZeRO-3 to avoid OOM.
    g. GRPO now requires TRL >= 0.26.
  3. Training
    a. Added support for Qwen3.5 Sequence Parallelism, controllable via the --sequence_parallel_size argument. (Contributed by @meichangsu1)
    b. Added support for specifying loss_scale directly in the dataset for more flexible loss control. Documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#supervised-fine-tuning
    c. The datasets dependency is now compatible with 4.x versions.
    d. cached_dataset is now compatible with the --truncation_strategy split strategy.
  4. Hardware
    a. NPU now supports Qwen3.5 training with transformers/Megatron backends. When using the Megatron backend, the USE_MCORE_GDN=0 environment variable must be set. (Contributed by @addsubmuldiv, @hazelduan)
    b. Added AMD support documentation: https://swift.readthedocs.io/en/latest/BestPractices/AMD-support.html (Contributed by @Treemann)
    c. Added RL training support for MetaX hardware. (Contributed by @suenphey)
    d. NPU Megatron training is now compatible with megatron-core 0.15.3. (Contributed by @addsubmuldiv)
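
The version constraints mentioned in the notes above can be checked before upgrading. The following is a minimal sketch, not part of ms-swift: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, the distribution names `megatron-core` and `trl` are assumed to be the PyPI names, and the 0.15 floor for megatron-core is inferred from the statement that 0.12 through 0.14 are no longer supported. Pre-release suffixes such as `.dev0` are ignored, which is a simplification.

```python
import re
from importlib import metadata

# Minimum versions inferred from the release notes (assumptions, see lead-in).
MINIMUMS = {
    "megatron-core": (0, 15),  # inferred: 0.12 - 0.14 are no longer supported
    "trl": (0, 26),            # stated: GRPO requires TRL >= 0.26
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare versions numerically, padding the shorter tuple with zeros."""
    width = max(len(installed), len(minimum))
    padded_installed = installed + (0,) * (width - len(installed))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_installed >= padded_minimum


def check_environment(minimums=MINIMUMS):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    labels = {True: "ok", False: "too old", None: "not installed"}
    for name, ok in check_environment().items():
        print(f"{name}: {labels[ok]}")
```

Running the script prints one status line per package. A package reported as "not installed" is not necessarily a problem, since megatron-core is only needed for the Megatron backend.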

New Models

  1. Text-only Models
    a. ZhipuAI/GLM-5.1
    b. MiniMax/MiniMax-M2.7
    c. moonshotai/Kimi-K2.6 (text-only)
    d. Tencent-Hunyuan/Hy3-preview
    e. AIDC-AI/Marco-Nano-Instruct series
  2. Multimodal Models
    a. Qwen/Qwen3.6-35B-A3B, Qwen/Qwen3.6-27B
    b. Qwen3-ASR (Contributed by @xut806)
    c. Added mixed-modality dataset training support for Gemma4 series models.
    d. OpenDataLab/MinerU2.5-Pro-2604-1.2B
    e. OpenBMB/MiniCPM-o-4_5 now supports audio modality. (Contributed by @fanqiNO1)
    f. allenai/Molmo2-4B (Contributed by @Kagura-0001)

Full Changelog: v4.1.0...v4.2.0