modelscope · Jintao-Huang · Mar 20, 2026 · Mar 24, 2026
diff --git a/docs/source/BestPractices/Qwen3_5-Best-Practice.md b/docs/source/BestPractices/Qwen3_5-Best-Practice.md
@@ -309,7 +309,6 @@ swift infer \
 
 Megatron-SWIFT训练Qwen3.5的提示：
 - 全参数训练：参考[这个例子](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh)。
-- 关于MTP训练：ms-swift暂不支持多模态MTP的训练。如果你只训练纯文本数据，请设置`SKIP_MULTIMODAL_MTP_VALIDATION=1`环境变量，忽略检查。
 - TP 限制解除：使用 "megatron-core>=0.16" 可解除 TP 受到的 `num_query_groups` 限制。
 - 默认 `GatedDeltaNet` 使用 transformers 实现（为保证稳定性，暂时保持默认行为不变）。使用 "megatron-core>=0.16"并设置环境变量 `SWIFT_USE_MCORE_GDN=1`可切换至 mcore 实现，支持 GDN 的 TP 并降低显存。
 - padding_free/packing的支持：packing可以提升训练速度，你需要设置`SWIFT_USE_MCORE_GDN=1`环境变量。参考[这个例子](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh)。

diff --git a/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md b/docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
@@ -307,7 +307,6 @@ swift infer \
 Tips for training Qwen3.5 with Megatron-SWIFT:
 
 - Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
-- Regarding MTP training: ms-swift currently does not support multimodal MTP training. If you are only training on pure text data, please set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the validation check.
 - TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
 - By default, `GatedDeltaNet` uses the transformers implementation (to ensure stability, the default behavior remains unchanged for now). Using `megatron-core>=0.16` and setting the environment variable `SWIFT_USE_MCORE_GDN=1` switches to the mcore implementation, which supports TP for GDN and reduces memory usage.
 - Support for padding_free/packing: Packing can improve training speed. You need to set the `SWIFT_USE_MCORE_GDN=1` environment variable. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).

diff --git a/examples/models/qwen3_5/mcore_full.sh b/examples/models/qwen3_5/mcore_full.sh
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 MAX_PIXELS=1003520 \
 VIDEO_MAX_PIXELS=50176 \
 FPS_MAX_FRAMES=12 \
-SKIP_MULTIMODAL_MTP_VALIDATION=1 \
 megatron sft \
     --model Qwen/Qwen3.5-35B-A3B \
     --save_safetensors true \

diff --git a/examples/models/qwen3_5/packing.sh b/examples/models/qwen3_5/packing.sh
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 MAX_PIXELS=1003520 \
 VIDEO_MAX_PIXELS=50176 \
 FPS_MAX_FRAMES=12 \
-SKIP_MULTIMODAL_MTP_VALIDATION=1 \
 SWIFT_USE_MCORE_GDN=1 \
 megatron sft \
     --model Qwen/Qwen3.5-35B-A3B \

diff --git a/swift/megatron/model/mm_gpt_model.py b/swift/megatron/model/mm_gpt_model.py
@@ -36,10 +36,6 @@ def __init__(self,
         self.share_embeddings_and_output_weights = self.language_model.share_embeddings_and_output_weights
         self.megatron_model_meta = get_megatron_model_meta(self.args.model_type)
         self.visual = None
-        if self.args.mtp_num_layers:
-            skip_validation = get_env_args('SKIP_MULTIMODAL_MTP_VALIDATION', bool, False)
-            if not skip_validation:
-                raise ValueError('MTP currently does not support multimodal models.')
         if pre_process and self.megatron_model_meta.visual_cls is not None:
             self.visual = self.megatron_model_meta.visual_cls(config)
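
For reference, the guard removed from `mm_gpt_model.py` behaved roughly as sketched below. This is a minimal standalone illustration, not the ms-swift code: `parse_bool_env` and `check_mtp_allowed` are hypothetical names, and the way swift's `get_env_args` parses boolean values is an assumption here. After this change, no such check runs and the `SKIP_MULTIMODAL_MTP_VALIDATION` variable no longer needs to be set.

```python
import os


def parse_bool_env(name: str, default: bool = False) -> bool:
    """Hypothetical stand-in for get_env_args(name, bool, default).

    Assumption: '1' and 'true' (case-insensitive) count as True.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ('1', 'true')


def check_mtp_allowed(mtp_num_layers) -> None:
    """Mirror the removed guard: reject MTP unless the skip flag is set."""
    if mtp_num_layers:
        skip_validation = parse_bool_env('SKIP_MULTIMODAL_MTP_VALIDATION', False)
        if not skip_validation:
            raise ValueError('MTP currently does not support multimodal models.')
```

With the variable unset, `check_mtp_allowed(1)` raises `ValueError`; with `SKIP_MULTIMODAL_MTP_VALIDATION=1` exported, the same call returns without error, which is why the example scripts previously set it.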