v3.12.0
New Features
- Megatron-SWIFT
a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
b. New model support: GLM4 Dense, GLM4.7, GLM4.6V-Flash, GLM-4.1V.
c. `save_safetensors` supports checkpoint resumption, with Mcore-Bridge loading and saving as the recommended approach.
d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
e. Added `group_by_length` parameter support, which groups samples of similar length together (with a random factor) to accelerate training in non-packing mode.
f. Support for the `--report_to` parameter to log and visualize training in wandb/swanlab.
g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
h. Added `train_dataloader_shuffle` parameter to control whether the training dataset is shuffled.
i. Added a retry mechanism to `template.encode` to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
- RL
a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
b. GRPO adds a `num_generations_eval` parameter to set the number of generations during the eval stage.
c. Optimized memory peak for GKD loss calculation.
d. GRPO/GKD server mode supports IPv6 addresses.
e. Support for structured output sampling via `structured_outputs_regex`.
- Training
a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
b. Support for `--fsdp fsdp2`, which uses ms-swift's built-in FSDP2 configuration file.
c. `loss_scale` supports three basic strategies ('default', 'last_round', 'all') and their hybrid use with other strategies, e.g. 'last_round+ignore_empty_think'.
d. `cached_dataset` supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
e. Refactored the thinking template: ThinkingTemplate functionality is merged into Template, with new `enable_thinking` and `add_non_thinking_prefix` parameters.
f. Added the `SWIFT_PATCH_CONV3D` environment variable to work around slow conv3d execution in torch 2.9 environments.
g. Support for the `swanlab_notification_method` parameter to specify the swanlab notification method when training completes or an error occurs.
h. Changed the default value of the `dataloader_prefetch_factor` parameter from 10 to 2.
- Domestic Hardware (Thanks to the Ascend and CMB technical teams)
a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
b. Qwen3-VL hybrid operator support; see PR #7079 for details.
c. Updated Megatron-SWIFT NPU performance collection/accuracy collection documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
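The `group_by_length` feature above can be illustrated conceptually: samples of similar length are batched together so each batch wastes less compute on padding. A minimal sketch, not ms-swift's actual implementation (the function name and random-shuffle detail are illustrative):

```python
import random


def group_by_length(lengths, batch_size, seed=0):
    """Return batches of sample indices with roughly similar lengths.

    Sort indices by sample length, cut the sorted order into contiguous
    batches, then shuffle the batch order (the random factor) so training
    still sees varied data while each batch pads minimally.
    """
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)
    return batches


lengths = [5, 120, 7, 118, 6, 119, 8, 121]
batches = group_by_length(lengths, batch_size=4)
# Each batch now contains only samples of comparable length.
```

In non-packing mode this reduces the number of padding tokens per batch, which is where the speedup comes from.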
New Models
- Text-only models:
a. ZhipuAI/GLM-4.7 series
b. iic/QwenLong-L1.5-30B-A3B
c. gongjy/MiniMind2 (thanks to @PiggerZZM for the contribution)
- Multimodal models:
a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
b. Tencent-Hunyuan/HunyuanOCR
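The Off-Policy Sequence Masking item in the RL section above (from DeepSeek-V3.2) addresses training/inference mismatch: sequences whose log-probabilities under the training policy diverge too far from the rollout engine's are dropped from the loss. A conceptual sketch only — not ms-swift's implementation, and the `threshold` name is hypothetical (see the linked documentation for the actual criterion and parameters):

```python
def off_policy_sequence_mask(train_logps, rollout_logps, threshold=0.5):
    """Return a per-sequence 0/1 mask.

    For each sequence, compare the summed token log-probs under the
    training policy with those recorded by the rollout (inference) engine.
    A sequence is kept only if the absolute gap stays below `threshold`;
    heavily off-policy sequences contribute nothing to the loss.
    """
    mask = []
    for t, r in zip(train_logps, rollout_logps):
        gap = abs(sum(t) - sum(r))
        mask.append(1.0 if gap < threshold else 0.0)
    return mask


train = [[-0.1, -0.2], [-2.0, -1.5]]    # log-probs under training policy
rollout = [[-0.12, -0.21], [-0.3, -0.4]]  # log-probs from rollout engine
off_policy_sequence_mask(train, rollout)  # second sequence is masked out
```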
What's Changed
- [model] Support GLM4.6-V by @Jintao-Huang in #6948
- [model] support glm4_6v flash by @Jintao-Huang in #6959
- [bugfix] fix truncation_strategy left by @Jintao-Huang in #6961
- [bugfix] fix megatron save_checkpoint by @Jintao-Huang in #6963
- [feat] GKD support truncation strategy delete to resample by @hjh0119 in #6964
- [misc] megatron grpo check rollout_logps by @hjh0119 in #6970
- [misc] set default group_port for vllm client by @hjh0119 in #6972
- [grpo] support Off-Policy Sequence Masking by @hjh0119 in #6978
- [megatron, misc] support check_latest_model by @hjh0119 in #6988
- [bugfix] fix reranker_padding_free by @Jintao-Huang in #6989
- [megatron] fix eval_iters 1 by @Jintao-Huang in #6990
- Add dense_npu.sh for megatron lora training in huawei npu by @vx120 in #6976
- fix system swift pt by @Jintao-Huang in #7003
- [bugfix] fix qwen_vl_utils torchvision base64 by @Jintao-Huang in #7004
- [bugfix] fix liger_kernel flash_attn by @Jintao-Huang in #7005
- [bugfix] fix qwen3_vl bridge by @Jintao-Huang in #7006
- [bugfix] fix reranker padding_free & fix seq_cls omni padding_free by @Jintao-Huang in #7007
- [npu] add npu qwen3_omni sft example for mindspeed backend by @tongtong0613 in #7008
- [bugfix] qwen-omni3 vllm infer with USE_AUDIO_IN_VIDEO by @hjh0119 in #7009
- [bugfix] fix grpo sleep_level 2 causes gibberish outputs by @hjh0119 in #7017
- add npu vllm-ascend docs and examples by @addsubmuldiv in #7013
- [compat] fix mcore012 compat torch new by @Jintao-Huang in #7021
- [megatron] Megatron support random/non-random dataloader by @Jintao-Huang in #7016
- [bugfix] megatron add retry to avoid hang by @Jintao-Huang in #7023
- [trainer] refactor acc metrics by @Jintao-Huang in #7026
- [infer] update embddding/reranker demo by @Jintao-Huang in #7029
- [train] support embeding/reranker packing & support reranker/embedding cache_dataset by @Jintao-Huang in #6987
- update readme by @Jintao-Huang in #7033
- [misc] update swift image by @Jintao-Huang in #7039
- [bugfix] remove add_eos for rm in grpo by @hjh0119 in #7040
- [npu] Fix device mismatch in weight sync for HCCL communicator by @singing4you in #7036
- collect npu profiling data by @OneMondy in #6977
- [bugfix] fix null_ref_context by @Jintao-Huang in #7042
- [model] support hunyuan_ocr by @slin000111 in #7038
- update flash_attn version; fix mcore 0.15 hang by @Jintao-Huang in #7043
- [bugfix] fix grpo multi turn log_entropy by @hjh0119 in #7044
- [bugfix] fix dataloader megatron by @Jintao-Huang in #7050
- [grpo] support num_generations_eval by @hjh0119 in #7046
- fix dpo sp by @tastelikefeet in #7051
- fix GKD trainer use_kd for mllm and optimize encoding by @hjh0119 in #7057
- [bugfix] fix megatron seq_cls lora bridge by @Jintao-Huang in #7054
- [feat] rollout support ipv6 address by @hjh0119 in #7071
- fix mistral3 vllm backend ignore consolidated.safetensors by @hjh0119 in #7074
- [cli] megatron compat accelerate by @Jintao-Huang in #7073
- Add support for megatron lora in huawei NPU by @vx120 in #7068
- [megatron] Update megatron shells by @Jintao-Huang in #6967
- [bugfix] fix megatron ref_adapter by @Jintao-Huang in #7077
- fix by @tastelikefeet in #7082
- [npu] Fix the failure in mcore version check on NPU device by @tongtong0613 in #7078
- [bugfix] fix mps by @Jintao-Huang in #7086
- Update release by @Jintao-Huang in #7093
- performance optimized for qwen3_vl by @OneMondy in #7087
- [bugfix] fix mtp qwen3_next by @Jintao-Huang in #7048
- [megatron] update qwen3_next megatron layer_norm by @Jintao-Huang in #7097
- qwen3_vl_fuse by @addsubmuldiv in #7079
- add qwen3_vl hangs docs by @Jintao-Huang in #7115
- [template] refactor retry by @Jintao-Huang in #7116
- support SWIFT_PATCH_CONV3D by @Jintao-Huang in #7122
- support fsdp2 by @hjh0119 in #7118
- [bugfix] fix response_prefix by @Jintao-Huang in #7126
- [template] refactor thinking template & loss_scale by @Jintao-Huang in #7096
- [docs] update args docs & fix hunyuan ocr by @Jintao-Huang in #7143
- [docs] fp8 test_convert_precision by @Jintao-Huang in #7148
- Support Ulysses for seq_cls/embedding/reranker by @0russwest0 in #7147
- Update FAQ by @slin000111 in #7151
- [misc] LLM v0.13.0 compatibility by @hjh0119 in #7152
- fix: correct KL metrics in rollout importance sampling by @hjh0119 in #7145
- [train] support group_by_length by @Jintao-Huang in #7149
- [bugfix] fix AssertionError vp_stage must be a kwarg in train_valid_t… by @donpromax in #7158
- [model] support minimind by @PiggerZZM in #7136
- [dataset] support cache_dataset sample by @Jintao-Huang in #7165
- [bugfix] fix interleave_prob by @Jintao-Huang in #7166
- [bugfix] fix megatron mcore-bridge lora target_modules by @Jintao-Huang in #7175
- [model] support GLM-4.7 by @Jintao-Huang in #7173
- [megatron] support glm4 dense by @Jintao-Huang in #7177
- [misc] support disable_gradient_checkpointing context by @hjh0119 in #7180
- [feat] Optimize the peak memory usage of the GKD JSD loss. by @hjh0119 in #7164
- [bugfix] fix mcore-bridge gate_up_proj by @Jintao-Huang in #7181
- [megatron] support glm4_6 flash megatron by @Jintao-Huang in #7172
- [bugfix] fix megatron nan by @Jintao-Huang in #7187
- [bugfix] Fix dacite deserialization error for objects field in RolloutInferRequest by @hjh0119 in #7189
- [template] remove compat and update docs by @Jintao-Huang in #7192
- [bugfix] fix qwen3_omni position_ids dtype by @Jintao-Huang in #7194
- add msprobe support by @Vectorwh in #7178
- update faq by @slin000111 in #7195
- [model] support glm4.6v-flash padding_free/packing by @Jintao-Huang in #7197
- [misc] add megatron trainer state to align with transformer trainer by @hjh0119 in #7199
- [bugfix] fix rope_scaling by @Jintao-Huang in #7198
- [bugfix] fix glm4_6v-flash agent template by @Jintao-Huang in #7203
- [args] update dataloader_prefetch_factor by @Jintao-Huang in #7207
- [megatron] megatron support padding_free false by @Jintao-Huang in #7205
- update npu document by @addsubmuldiv in #7212
- [swanlab] Update swanlab notification method by @Jintao-Huang in #7213
- [misc] support structured_outputs_regex by @hjh0119 in #7215
- [megatron] support qwen3_omni dense by @Jintao-Huang in #7217
- [feat] support megatron gkd by @hjh0119 in #7216
- [bugfix] fix megatron grpo by @hjh0119 in #7222
- [megatron] support swanlab megatron by @Jintao-Huang in #7211
- [misc] update swift image by @Jintao-Huang in #7230
- [bugfix] fix megatron non-padding_free qwen3_vl cp by @Jintao-Huang in #7233
- Make the default value of fsdp compatible with transformers less than 4.57 by @slin000111 in #7235
- [misc] megatron grpo support non-padding-free by @hjh0119 in #7218
- [model] support qwenlong L1.5 by @Jintao-Huang in #7237
- Verify sequence parallel for seq_cls by @slin000111 in #7240
- [feat] support dense/moe mixed models in Megatron GKD by @hjh0119 in #7241
- [bugfix] fix lora in vllm >= v0.12 by @liuyanyi in #7245
- [bugfix] fix megatron gkd mixed model by @hjh0119 in #7247
New Contributors
- @vx120 made their first contribution in #6976
- @singing4you made their first contribution in #7036
- @OneMondy made their first contribution in #6977
- @donpromax made their first contribution in #7158
- @PiggerZZM made their first contribution in #7136
- @Vectorwh made their first contribution in #7178
Full Changelog: v3.11.0...v3.12.0