v3.12.0
New Features
- Megatron-SWIFT
a. GKD algorithm supports Megatron training. Documentation reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/GKD.html
b. New model support: GLM4 Dense, GLM4.7, GLM4.6V-Flash, GLM-4.1V.
c. `save_safetensors` supports checkpoint resumption, with Mcore-Bridge loading and saving as the recommended approach.
d. Non-padding-free training mode supports more training stages: GRPO/DPO/KTO/RM/sequence classification.
e. Added `group_by_length` parameter support, which groups samples of similar length together (with a random factor) to accelerate training in non-packing mode.
f. Support for the `--report_to` parameter to log and visualize training in wandb/swanlab.
g. Qwen3-Next uses Zero-Centered RMSNorm, aligned with transformers.
h. Added `train_dataloader_shuffle` parameter to control whether the training dataset is shuffled.
i. Added a retry mechanism to `template.encode` to prevent Megatron training from hanging when fetching images/videos fails due to network issues.
- RL
a. Added Off-Policy Sequence Masking (from DeepSeek-V3.2). Documentation reference: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/training_inference_mismatch.html#off-policy-sequence-masking
b. GRPO adds a `num_generations_eval` parameter to set the number of generations during the eval stage.
c. Optimized memory peak for GKD loss calculation.
d. GRPO/GKD server mode supports IPv6 addresses.
e. Support for structured output sampling via `structured_outputs_regex`.
- Training
a. Embedding/reranker/sequence classification tasks support sequence packing and sequence parallelism. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel
b. Support for `--fsdp fsdp2`, which uses ms-swift's built-in FSDP2 configuration file.
c. `loss_scale` supports three basic strategies ('default', 'last_round', 'all') and their hybrid use with other strategies, e.g. 'last_round+ignore_empty_think'.
d. `cached_dataset` supports embedding/reranker/sequence classification training tasks. Training script reference: https://github.com/modelscope/ms-swift/tree/main/examples/train/cached_dataset
e. Refactored the thinking template: ThinkingTemplate functionality is merged into Template, with new `enable_thinking` and `add_non_thinking_prefix` parameters.
f. Added the `SWIFT_PATCH_CONV3D` environment variable to work around slow conv3d execution in torch 2.9 environments.
g. Support for the `swanlab_notification_method` parameter to specify the swanlab notification method when training completes or an error occurs.
h. Changed the default value of the `dataloader_prefetch_factor` parameter from 10 to 2.
- Domestic Hardware (Thanks to the Ascend and CMB technical teams)
a. Added more training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/ascend
b. Qwen3-VL hybrid operator support; see PR #7079 for details.
c. Updated Megatron-SWIFT NPU performance collection/accuracy collection documentation, reference: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Ascend.html
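The `group_by_length` feature above can be illustrated conceptually: samples of similar length are batched together so each batch wastes less compute on padding. A minimal sketch, not ms-swift's actual implementation (the function name and random-shuffle detail are illustrative):

```python
import random


def group_by_length(lengths, batch_size, seed=0):
    """Return batches of sample indices with roughly similar lengths.

    Sort indices by sample length, cut the sorted order into contiguous
    batches, then shuffle the batch order (the random factor) so training
    still sees varied data while each batch pads minimally.
    """
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)
    return batches


lengths = [5, 120, 7, 118, 6, 119, 8, 121]
batches = group_by_length(lengths, batch_size=4)
# Each batch now contains only samples of comparable length.
```

In non-packing mode this reduces the number of padding tokens per batch, which is where the speedup comes from.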
New Models
- Text-only models:
a. ZhipuAI/GLM-4.7 series
b. iic/QwenLong-L1.5-30B-A3B
c. gongjy/MiniMind2 (thanks to @PiggerZZM for the contribution)
- Multimodal models:
a. ZhipuAI/GLM-4.6V; ZhipuAI/GLM-4.6V-Flash series
b. Tencent-Hunyuan/HunyuanOCR
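The Off-Policy Sequence Masking item in the RL section above (from DeepSeek-V3.2) addresses training/inference mismatch: sequences whose log-probabilities under the training policy diverge too far from the rollout engine's are dropped from the loss. A conceptual sketch only — not ms-swift's implementation, and the `threshold` name is hypothetical (see the linked documentation for the actual criterion and parameters):

```python
def off_policy_sequence_mask(train_logps, rollout_logps, threshold=0.5):
    """Return a per-sequence 0/1 mask.

    For each sequence, compare the summed token log-probs under the
    training policy with those recorded by the rollout (inference) engine.
    A sequence is kept only if the absolute gap stays below `threshold`;
    heavily off-policy sequences contribute nothing to the loss.
    """
    mask = []
    for t, r in zip(train_logps, rollout_logps):
        gap = abs(sum(t) - sum(r))
        mask.append(1.0 if gap < threshold else 0.0)
    return mask


train = [[-0.1, -0.2], [-2.0, -1.5]]    # log-probs under training policy
rollout = [[-0.12, -0.21], [-0.3, -0.4]]  # log-probs from rollout engine
off_policy_sequence_mask(train, rollout)  # second sequence is masked out
```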
What's Changed
- [model] Support GLM4.6-V by @Jintao-Huang in #6948
- [model] support glm4_6v flash by @Jintao-Huang in #6959
- [bugfix] fix truncation_strategy left by @Jintao-Huang in #6961
- [bugfix] fix megatron save_checkpoint by @Jintao-Huang in #6963
- [feat] GKD support truncation strategy delete to resample by @hjh0119 in #6964
- [misc] megatron grpo check rollout_logps by @hjh0119 in #6970
- [misc] set default group_port for vllm client by @hjh0119 in #6972
- [grpo] support Off-Policy Sequence Masking by @hjh0119 in #6978
- [megatron, misc] support check_latest_model by @hjh0119 in #6988
- [bugfix] fix reranker_padding_free by @Jintao-Huang in #6989
- [megatron] fix eval_iters 1 by @Jintao-Huang in #6990
- Add dense_npu.sh for megatron lora training in huawei npu by @vx120 in #6976
- fix system swift pt by @Jintao-Huang in #7003
- [bugfix] fix qwen_vl_utils torchvision base64 by @Jintao-Huang in #7004
- [bugfix] fix liger_kernel flash_attn by @Jintao-Huang in #7005
- [bugfix] fix qwen3_vl bridge by @Jintao-Huang in #7006
- [bugfix] fix reranker padding_free & fix seq_cls omni padding_free by @Jintao-Huang in #7007
- [npu] add npu qwen3_omni sft example for mindspeed backend by @tongtong0613 in #7008
- [bugfix] qwen-omni3 vllm infer with USE_AUDIO_IN_VIDEO by @hjh0119 in #7009
- [bugfix] fix grpo sleep_level 2 causes gibberish outputs by @hjh0119 in #7017
- add npu vllm-ascend docs and examples by @addsubmuldiv in #7013
- [compat] fix mcore012 compat torch new by @Jintao-Huang in #7021
- [megatron] Megatron support random/non-random dataloader by @Jintao-Huang in #7016
- [bugfix] megatron add retry to avoid hang by @Jintao-Huang in #7023
- [trainer] refactor acc metrics by @Jintao-Huang in #7026
- [infer] update embddding/reranker demo by @Jintao-Huang in #7029
- [train] support embeding/reranker packing & support reranker/embedding cache_dataset by @Jintao-Huang in #6987
- update readme by @Jintao-Huang in #7033
- [misc] update swift image by @Jintao-Huang in #7039
- [bugfix] remove add_eos for rm in grpo by @hjh0119 in #7040
- [npu] Fix device mismatch in weight sync for HCCL communicator by @singing4you in #7036
- collect npu profiling data by @OneMondy in #6977
- [bugfix] fix null_ref_context by @Jintao-Huang in #7042
- [model] support hunyuan_ocr by @slin000111 in #7038
- update flash_attn version; fix mcore 0.15 hang by @Jintao-Huang in #7043
- [bugfix] fix grpo multi turn log_entropy by @hjh0119 in #7044
- [bugfix] fix dataloader megatron by @Jintao-Huang in #7050
- [grpo] support num_generations_eval by @hjh0119 in #7046
- fix dpo sp by @tastelikefeet in #7051
- fix GKD trainer use_kd for mllm and optimize encoding by @hjh0119 in #7057
- [bugfix] fix megatron seq_cls lora bridge by @Jintao-Huang in #7054
- [feat] rollout support ipv6 address by @hjh0119 in #7071
- fix mistral3 vllm backend ignore consolidated.safetensors by @hjh0119 in #7074
- [cli] megatron compat accelerate by @Jintao-Huang in #7073
- Add support for megatron lora in huawei NPU by @vx120 in #7068
- [megatron] Update megatron shells by @Jintao-Huang in #6967
- [bugfix] fix megatron ref_adapter by @Jintao-Huang in #7077
- fix by @tastelikefeet in #7082
- [npu] Fix the failure in mcore version check on NPU device by @tongtong0613 in #7078
- [bugfix] fix mps by @Jintao-Huang in #7086
- Update release by @Jintao-Huang in #7093
- performance optimized for qwen3_vl by @OneMondy in #7087
- [bugfix] fix mtp qwen3_next by @Jintao-Huang in #7048
- [megatron] update qwen3_next megatron layer_norm by @Jintao-Huang in #7097
- qwen3_vl_fuse by @addsubmuldiv in #7079
- add qwen3_vl hangs docs by @Jintao-Huang in #7115
- [template] refactor retry by @Jintao-Huang in #7116
- support SWIFT_PATCH_CONV3D by @Jintao-Huang in #7122
- support fsdp2 by @hjh0119 in #7118
- [bugfix] fix response_prefix by @Jintao-Huang in #7126
- [template] refactor thinking template & loss_scale by @Jintao-Huang in #7096
- [docs] update args docs & fix hunyuan ocr by @Jintao-Huang in #7143
- [docs] fp8 test_convert_precision by @Jintao-Huang in #7148
- Support Ulysses for seq_cls/embedding/reranker by @0russwest0 in #7147
- Update FAQ by @slin000111 in #7151
- [misc] LLM v0.13.0 compatibility by @hjh0119 in #7152
- fix: correct KL metrics in rollout importance sampling by @hjh0119 in #7145
- [train] support group_by_length by @Jintao-Huang in #7149
- [bugfix] fix AssertionError vp_stage must be a kwarg in train_valid_t… by @donpromax in #7158
- [model] support minimind by @PiggerZZM in #7136
- [dataset] support cache_dataset sample by @Jintao-Huang in #7165
- [bugfix] fix interleave_prob by @Jintao-Huang in #7166
- [bugfix] fix megatron mcore-bridge lora target_modules by @Jintao-Huang in #7175
- [model] support GLM-4.7 by @Jintao-Huang in #7173
- [megatron] support glm4 dense by @Jintao-Huang in #7177
- [misc] support disable_gradient_checkpointing context by @hjh0119 in #7180
- [feat] Optimize the peak memory usage of the GKD JSD loss. by @hjh0119 in #7164
- [bugfix] fix mcore-bridge gate_up_proj by @Jintao-Huang in #7181
- [megatron] support glm4_6 flash megatron by @Jintao-Huang in #7172
- [bugfix] fix megatron nan by @Jintao-Huang in #7187
- [bugfix] Fix dacite deserialization error for objects field in RolloutInferRequest by @hjh0119 in #7189
- [template] remove compat and update docs by @Jintao-Huang in #7192
- [bugfix] fix qwen3_omni position_ids dtype by @Jintao-Huang in #7194
- add msprobe support by @Vectorwh in #7178
- update faq by @slin000111 in #7195
- [model] support glm4.6v-flash padding_free/packing by @Jintao-Huang in #7197
- [misc] add megatron trainer state to align with transformer trainer by @hjh0119 in #7199
- [bugfix] fix rope_scaling by @Jintao-Huang in #7198
- [bugfix] fix glm4_6v-flash agent template by @Jintao-Huang in #7203
- [args] update dataloader_prefetch_factor by @Jintao-Huang in #7207
- [megatron] megatron support padding_free false by @Jintao-Huang in #7205
- update npu document by @addsubmuldiv in #7212
- [swanlab] Update swanlab notification method by @Jintao-Huang in #7213
- [misc] support structured_outputs_regex by @hjh0119 in #7215
- [megatron] support qwen3_omni dense by @Jintao-Huang in #7217
- [feat] support megatron gkd by @hjh0119 in #7216
- [bugfix] fix megatron grpo by @hjh0119 in #7222
- [megatron] support swanlab megatron by @Jintao-Huang in #7211
- [misc] update swift image by @Jintao-Huang in #7230
- [bugfix] fix megatron non-padding_free qwen3_vl cp by @Jintao-Huang in #7233
- Make the default value of fsdp compatible with transformers less than 4.57 by @slin000111 in #7235
- [misc] megatron grpo support non-padding-free by @hjh0119 in #7218
- [model] support qwenlong L1.5 by @Jintao-Huang in #7237
- Verify sequence parallel for seq_cls by @slin000111 in #7240
- [feat] support dense/moe mixed models in Megatron GKD by @hjh0119 in #7241
- [bugfix] fix lora in vllm >= v0.12 by @liuyanyi in #7245
- [bugfix] fix megatron gkd mixed model by @hjh0119 in #7247
New Contributors
- @vx120 made their first contribution in #6976
- @singing4you made their first contribution in #7036
- @OneMondy made their first contribution in #6977
- @donpromax made their first contribution in #7158
- @PiggerZZM made their first contribution in #7136
- @Vectorwh made their first contribution in #7178
Full Changelog: v3.11.0...v3.12.0