1 change: 0 additions & 1 deletion docs/source/BestPractices/Qwen3_5-Best-Practice.md
@@ -309,7 +309,6 @@ swift infer \

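 Tips for training Qwen3.5 with Megatron-SWIFT:
 - Full-parameter training: refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
-- MTP training: ms-swift does not yet support multimodal MTP training. If you train on text-only data, set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the check.
 - TP restriction lifted: using "megatron-core>=0.16" lifts the `num_query_groups` restriction on TP.
 - By default, `GatedDeltaNet` uses the transformers implementation (the default behavior is kept unchanged for now to ensure stability). Using "megatron-core>=0.16" and setting the environment variable `SWIFT_USE_MCORE_GDN=1` switches to the mcore implementation, which supports TP for GDN and reduces GPU memory usage.
 - padding_free/packing support: packing can speed up training; it requires the `SWIFT_USE_MCORE_GDN=1` environment variable. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).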
1 change: 0 additions & 1 deletion docs/source_en/BestPractices/Qwen3_5-Best-Practice.md
@@ -307,7 +307,6 @@ swift infer \
 Tips for training Qwen3.5 with Megatron-SWIFT:

 - Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
-- Regarding MTP training: ms-swift currently does not support multimodal MTP training. If you are only training on pure text data, please set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the validation check.
 - TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
 - By default, `GatedDeltaNet` uses the transformers implementation (to ensure stability, the default behavior remains unchanged for now). Using `megatron-core>=0.16` and setting the environment variable `SWIFT_USE_MCORE_GDN=1` switches to the mcore implementation, which supports TP for GDN and reduces memory usage.
 - Support for padding_free/packing: Packing can improve training speed. You need to set the `SWIFT_USE_MCORE_GDN=1` environment variable. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
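The version and environment-variable constraints stated in these tips could be checked before a job is launched. The following is a minimal sketch, not part of ms-swift: the helper names `parse_version`, `meets_minimum`, and `preflight` are hypothetical, and the version parser is deliberately simplified (it ignores pre-release suffixes such as `rc1`).

```python
import re


def parse_version(text: str) -> tuple:
    """Extract the leading numeric components of a version string.

    Simplified on purpose: '0.16.0rc1' becomes (0, 16, 0), so pre-release
    suffixes are ignored rather than ordered as PEP 440 would order them.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.16") -> bool:
    """Return True if the installed megatron-core version is >= minimum."""
    return parse_version(installed) >= parse_version(minimum)


def preflight(mcore_version: str, use_mcore_gdn: bool, packing: bool) -> list:
    """Collect violations of the constraints described in the tips."""
    problems = []
    if use_mcore_gdn and not meets_minimum(mcore_version):
        problems.append("SWIFT_USE_MCORE_GDN=1 requires megatron-core>=0.16")
    if packing and not use_mcore_gdn:
        problems.append("padding_free/packing requires SWIFT_USE_MCORE_GDN=1")
    return problems
```

An empty list from `preflight` means none of the documented constraints is violated; it says nothing about other requirements of the training scripts.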
1 change: 0 additions & 1 deletion examples/models/qwen3_5/mcore_full.sh
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 MAX_PIXELS=1003520 \
 VIDEO_MAX_PIXELS=50176 \
 FPS_MAX_FRAMES=12 \
-SKIP_MULTIMODAL_MTP_VALIDATION=1 \
 megatron sft \
     --model Qwen/Qwen3.5-35B-A3B \
     --save_safetensors true \
1 change: 0 additions & 1 deletion examples/models/qwen3_5/packing.sh
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 MAX_PIXELS=1003520 \
 VIDEO_MAX_PIXELS=50176 \
 FPS_MAX_FRAMES=12 \
-SKIP_MULTIMODAL_MTP_VALIDATION=1 \
 SWIFT_USE_MCORE_GDN=1 \
 megatron sft \
     --model Qwen/Qwen3.5-35B-A3B \
4 changes: 0 additions & 4 deletions swift/megatron/model/mm_gpt_model.py
@@ -36,10 +36,6 @@ def __init__(self,
         self.share_embeddings_and_output_weights = self.language_model.share_embeddings_and_output_weights
         self.megatron_model_meta = get_megatron_model_meta(self.args.model_type)
         self.visual = None
-        if self.args.mtp_num_layers:
-            skip_validation = get_env_args('SKIP_MULTIMODAL_MTP_VALIDATION', bool, False)
-            if not skip_validation:
-                raise ValueError('MTP currently does not support multimodal models.')
         if pre_process and self.megatron_model_meta.visual_cls is not None:
             self.visual = self.megatron_model_meta.visual_cls(config)
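
For context, the `mtp_num_layers` check in this hunk, which accounts for the four deleted lines, is an environment-gated validation: it raises unless an opt-out variable is set. The following is a minimal standalone sketch of that pattern. `env_flag` is a hypothetical stand-in for ms-swift's `get_env_args`, whose exact parsing rules are not shown in this diff, and `check_mtp` is an illustrative name rather than a function in the repository.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper).

    Assumption: '1', 'true', and 'yes' (case-insensitive) count as True;
    the real get_env_args may parse values differently.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


def check_mtp(mtp_num_layers, env_var: str = "SKIP_MULTIMODAL_MTP_VALIDATION") -> None:
    """Raise when MTP layers are requested and the opt-out flag is not set."""
    if mtp_num_layers and not env_flag(env_var):
        raise ValueError("MTP currently does not support multimodal models.")
```

With the guard deleted, no error is raised for a non-zero `mtp_num_layers`, which is why the scripts and docs in this PR no longer set the variable.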
