Commit 749f6d4

[docs] Update the NPU dependency versions and scripts. (#9500)
1 parent 139a3d7 commit 749f6d4

3 files changed

Lines changed: 88 additions & 22 deletions

docs/source/BestPractices/NPU-support.md

Lines changed: 17 additions & 12 deletions
@@ -24,6 +24,7 @@
 | torch_npu | >= 2.7.1.post4 |

 For base environment setup, see the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch). The examples in this document were verified on 8 * Ascend 910B3 64G.
+Note: The officially recommended version compatibility matrix for the vLLM Ascend series has been updated to CANN 9.0.0, torch 2.9.0, torch_npu 2.9.0, vllm-ascend 0.18.0 (A3), and vllm-ascend 0.19.1 (A5). For details, see the [vLLM Ascend installation guide](https://docs.vllm.ai/projects/ascend/en/v0.18.0/installation.html)

 | Top-level feature | Feature | Status |
 | -------- | ------------------- | -------- |
@@ -41,7 +42,7 @@
 | | QLoRA | Not yet supported |
 | RLHF | GRPO | Supported |
 | | PPO | Supported |
-| Performance optimization | Fused operators such as FA | Supported |
+| Performance optimization | Fused operators such as FA | Supported |
 | | Liger-Kernel | Not yet supported |
 | Deployment | PT | Supported |
 | | vLLM | Supported |
@@ -61,11 +62,11 @@
 | SFT | Qwen3-30B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
+| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
 | SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
+| SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
+| SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |

 ### Verified RL Combinations

@@ -168,7 +169,7 @@ cd ms-swift
 pip install -e .

 # Install torch_npu
-pip install torch_npu==2.7.1.post4 decorator
+pip install torch_npu==2.9.0 decorator
 # If you want to use deepspeed (to reduce memory usage, with some speed overhead)
 pip install deepspeed

@@ -198,16 +199,16 @@ print(torch.randn(10, device='npu:0'))
 If you need MindSpeed (Megatron-LM), install the required dependencies as follows.

 ```shell
-# 1. Clone Megatron-LM and switch to v0.15.3
+# 1. Clone Megatron-LM and switch to v0.16.0
 git clone https://github.com/NVIDIA/Megatron-LM.git
 cd Megatron-LM
-git checkout v0.15.3
+git checkout v0.16.0
 cd ..

 # 2. Clone and install MindSpeed
 git clone https://gitcode.com/Ascend/MindSpeed.git
 cd MindSpeed
-git checkout core_r0.15.3
+git checkout core_r0.16.0
 pip install -e .
 cd ..

@@ -217,11 +218,14 @@ cd mcore-bridge
 pip install -e .
 cd ..

-# 4. Set environment variables
+# 4. Download and install triton-ascend
+pip install triton-ascend==3.2.1 --extra-index-url=https://triton-ascend.osinfra.cn/pypi/simple
+
+# 5. Set environment variables
 export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
 export MEGATRON_LM_PATH=<your_local_megatron_lm_path>

-# 5. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
+# 6. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
 export USE_MCORE_GDN=0
 ```

@@ -258,8 +262,9 @@ Qwen3.5 modeling.chunk_gated_delta_rule

 - This patch mainly covers the **gated-delta-rule path of Qwen3.5 linear attention**
 - It is not equivalent to “fully replacing the entire fla package with MindSpeed”;
-- To make this path effective, ensure that MindSpeed can be imported correctly in the current environment.
-- Verified versions for accuracy alignment: torch 2.7.1 + MindSpeed 0.12.1 + flash-linear-attention 4.1.0 + triton-ascend 3.2.0 + transformers 5.2.0
+- To make this path effective, ensure that MindSpeed and triton-ascend can be imported correctly in the current environment
+- Verified versions for accuracy alignment: torch 2.9.0 + MindSpeed 0.16.0 + flash-linear-attention 0.4.2 + triton-ascend 3.2.1 + transformers 5.2.0
+

 When running Qwen3.5 with Megatron-SWIFT on NPU, note the following version and feature constraints:

docs/source_en/BestPractices/NPU-support.md

Lines changed: 14 additions & 10 deletions
@@ -22,6 +22,7 @@ Recommended base environment versions:
 | CANN | >= 8.5.1 |
 | torch | >= 2.7.1 |
 | torch_npu | >= 2.7.1.post4 |
+Note: The officially recommended version compatibility matrix for the vLLM Ascend series has been updated to CANN 9.0.0, torch 2.9.0, torch_npu 2.9.0, vllm-ascend 0.18.0 for A3, and vllm-ascend 0.19.1 for A5. For details, see the [vLLM Ascend installation guide](https://docs.vllm.ai/projects/ascend/en/v0.18.0/installation.html).

 For base environment setup, see the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch). The examples in this document were verified on 8 * Ascend 910B3 64G.
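
The minimum versions in the table above can be checked before going further. A minimal sketch, assuming GNU `sort -V` is available and that its ordering is close enough to Python version ordering for these strings (it is not a full PEP 440 comparison). `version_ge` is a hypothetical helper name, not part of ms-swift.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on GNU `sort -V`, which only approximates PEP 440 ordering.
version_ge() {
    # The smaller of the two versions sorts first; A >= B when that one is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (commented out because it needs torch_npu to be installed):
# installed="$(pip show torch_npu 2>/dev/null | awk '/^Version:/ {print $2}')"
# version_ge "$installed" "2.7.1.post4" || echo "torch_npu is older than 2.7.1.post4"
```

The commented lines show one way to feed it the version reported by `pip show`.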

@@ -61,11 +62,11 @@ For base environment setup, see the [Ascend PyTorch installation guide](https://
 | SFT | Qwen3-30B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Qwen3-32B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Qwen3-VL-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
+| SFT | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
 | SFT | InternVL3-8B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
 | SFT | Ovis2.5-2B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
-| SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
+| SFT | Qwen3.5-27B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |
+| SFT | Qwen3.5-35B-A3B | FSDP1/FSDP2/deepspeed/Megatron | Atlas 900 A2 PODc/A3 SuperPoD |

 ### Verified RL Combinations

@@ -170,7 +171,7 @@ cd ms-swift
 pip install -e .

 # Install torch_npu
-pip install torch_npu==2.7.1.post4 decorator
+pip install torch_npu==2.9.0 decorator
 # If you want to use deepspeed (to reduce memory usage, with some speed overhead)
 pip install deepspeed

@@ -200,16 +201,16 @@ print(torch.randn(10, device='npu:0'))
 If you need MindSpeed(Megatron-LM), install the required dependencies as follows.

 ```shell
-# 1. Clone Megatron-LM and switch to v0.15.3
+# 1. Clone Megatron-LM and switch to v0.16.0
 git clone https://github.com/NVIDIA/Megatron-LM.git
 cd Megatron-LM
-git checkout v0.15.3
+git checkout v0.16.0
 cd ..

 # 2. Clone and install MindSpeed
 git clone https://gitcode.com/Ascend/MindSpeed.git
 cd MindSpeed
-git checkout core_r0.15.3
+git checkout core_r0.16.0
 pip install -e .
 cd ..

@@ -219,11 +220,14 @@ cd mcore-bridge
 pip install -e .
 cd ..

-# 4. Set environment variables
+# 4. Download and install triton-ascend
+pip install triton-ascend==3.2.1 --extra-index-url=https://triton-ascend.osinfra.cn/pypi/simple
+
+# 5. Set environment variables
 export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path>
 export MEGATRON_LM_PATH=<your_local_megatron_lm_path>

-# 5. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
+# 6. Disable Megatron GDN if you need to fall back to the transformers GatedDeltaNet implementation
 export USE_MCORE_GDN=0
 ```
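
Step 5 appends to `PYTHONPATH` unconditionally, so re-sourcing the snippet adds duplicate entries, and an unset `PYTHONPATH` yields a leading empty entry. A small sketch of an idempotent variant, assuming the Megatron-LM checkout contains its usual top-level `megatron/` directory. `setup_megatron_env` is a hypothetical helper name introduced here for illustration.

```shell
#!/usr/bin/env bash
# setup_megatron_env DIR: export MEGATRON_LM_PATH and add DIR to PYTHONPATH once.
# Returns non-zero if DIR does not look like a Megatron-LM checkout.
setup_megatron_env() {
    local repo="$1"
    # Assumption: a Megatron-LM checkout has a top-level `megatron` package directory.
    if [ ! -d "$repo/megatron" ]; then
        echo "error: '$repo' has no megatron/ directory" >&2
        return 1
    fi
    export MEGATRON_LM_PATH="$repo"
    case ":${PYTHONPATH:-}:" in
        *":$repo:"*) ;;  # already on PYTHONPATH, nothing to do
        *) export PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$repo" ;;
    esac
}

# Usage: setup_megatron_env "$PWD/Megatron-LM"
```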

@@ -262,7 +266,7 @@ Therefore:
 - This patch mainly covers the **gated-delta-rule path of Qwen3.5 linear attention**.
 - It is not equivalent to “fully replacing the entire fla package with MindSpeed”.
 - To make this path effective, ensure that MindSpeed can be imported correctly in the current environment.
-- Verified versions for accuracy alignment: torch 2.7.1 + MindSpeed 0.12.1 + flash-linear-attention 4.1.0 + triton-ascend 3.2.0 + transformers 5.2.0
+- Verified versions for accuracy alignment: torch 2.9.0 + MindSpeed 0.16.0 + flash-linear-attention 0.4.2 + triton-ascend 3.2.1 + transformers 5.2.0

 When running Qwen3.5 with Megatron-SWIFT on NPU, note the following version and feature constraints:
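
The accuracy-verified version set quoted in this hunk can be compared against an existing environment. A rough sketch, assuming the input is `pip freeze` output in `name==version` form. `check_pins` is a hypothetical helper; MindSpeed is left out because the steps above install it from source with `pip install -e .`, and a local version suffix (for example `+cpu`) would be reported as a mismatch.

```shell
#!/usr/bin/env bash
# check_pins: read `pip freeze` output on stdin and report packages whose
# version differs from the verified set (torch, flash-linear-attention,
# triton-ascend, transformers). Exits non-zero on any mismatch or missing package.
check_pins() {
    awk -F'==' '
        BEGIN {
            want["torch"] = "2.9.0"
            want["flash-linear-attention"] = "0.4.2"
            want["triton-ascend"] = "3.2.1"
            want["transformers"] = "5.2.0"
        }
        NF == 2 {
            # Normalise the name: lowercase, underscores to hyphens.
            name = tolower($1)
            gsub(/_/, "-", name)
            if (name in want) {
                seen[name] = 1
                if ($2 != want[name]) {
                    printf "MISMATCH %s: have %s, want %s\n", name, $2, want[name]
                    bad = 1
                }
            }
        }
        END {
            for (n in want) {
                if (!(n in seen)) {
                    printf "MISSING %s (want %s)\n", n, want[n]
                    bad = 1
                }
            }
            exit bad
        }
    '
}

# Usage: pip freeze | check_pins
```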

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# NPU stability environment variables
+export HCCL_OP_BASE_FFTS_MODE_ENABLE=TRUE
+export MULTI_STREAM_MEMORY_REUSE=1
+# NPU memory management environment variables
+export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+# NPU performance environment variables
+export TASK_QUEUE_ENABLE=2
+
+NPROC_PER_NODE=8 \
+ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+megatron sft \
+--model Qwen/Qwen3.5-35B-A3B \
+--save_safetensors true \
+--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
+'AI-ModelScope/alpaca-gpt4-data-en#500' \
+'swift/self-cognition#500' \
+--tuner_type lora \
+--lora_rank 8 \
+--lora_alpha 32 \
+--target_modules all-linear \
+\
+--tensor_model_parallel_size 2 \
+--expert_model_parallel_size 4 \
+--moe_permute_fusion true \
+--moe_grouped_gemm true \
+--moe_shared_expert_overlap true \
+--moe_aux_loss_coeff 1e-6 \
+--sequence_parallel true \
+--recompute_granularity full \
+--recompute_method uniform \
+--recompute_num_layers 1 \
+\
+--micro_batch_size 1 \
+--global_batch_size 8 \
+--finetune true \
+--cross_entropy_loss_fusion true \
+--gradient_accumulation_fusion false \
+--masked_softmax_fusion false \
+\
+--lr 1e-4 \
+--lr_warmup_fraction 0.05 \
+--min_lr 1e-5 \
+--num_train_epochs 16 \
+\
+--output_dir output/Qwen3.5-35B-A3B \
+--save_steps 2000 \
+--max_length 1024 \
+--system 'You are a helpful assistant.' \
+\
+--dataloader_num_workers 4 \
+--dataset_num_proc 4 \
+--no_save_optim true \
+--no_save_rng true \
+\
+--attention_backend flash \
+--model_author swift \
+--model_name swift-robot
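
The parallel layout implied by this script can be worked out with a little arithmetic. A minimal sketch, assuming the usual Megatron relation data-parallel size = world size / (TP * PP * CP), with pipeline and context parallel sizes taken as 1 because the script does not set them. Expert parallelism is not modelled, and `dp_size` and `grad_accum_steps` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# dp_size WORLD TP [PP] [CP]: data-parallel size = WORLD / (TP * PP * CP).
dp_size() {
    local world="$1" tp="$2" pp="${3:-1}" cp="${4:-1}"
    local model_parallel=$((tp * pp * cp))
    if [ $((world % model_parallel)) -ne 0 ]; then
        echo "error: world size $world is not divisible by TP*PP*CP=$model_parallel" >&2
        return 1
    fi
    echo $((world / model_parallel))
}

# grad_accum_steps GLOBAL MICRO DP: micro-batches accumulated per optimizer step.
grad_accum_steps() {
    local global="$1" micro="$2" dp="$3"
    if [ $((global % (micro * dp))) -ne 0 ]; then
        echo "error: global batch $global is not divisible by micro*DP=$((micro * dp))" >&2
        return 1
    fi
    echo $((global / (micro * dp)))
}

# Values taken from the script: 8 devices, TP=2, micro batch 1, global batch 8.
# dp="$(dp_size 8 2)"          # prints 4
# grad_accum_steps 8 1 "$dp"   # prints 2
```

Under these assumptions the script runs with a data-parallel size of 4 and accumulates 2 micro-batches per optimizer step.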
