Your current environment
Hardware environment: Atlas 800I A2 with 64 GB of memory per NPU card
Using image: quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
🐛 Describe the bug
Command:

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --max-model-len 4096 \
  --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
Error:

```text
RuntimeError: NPU out of memory. Tried to allocate 2.00 GiB (NPU 0; 60.96 GiB total capacity; 55.91 GiB already allocated; 55.91 GiB current active; 1.80 GiB free; 58.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
```
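Not part of the original report, but since the traceback itself suggests `max_split_size_mb`, here is a hedged sketch of two mitigations to try. It assumes the torch_npu caching allocator honors `PYTORCH_NPU_ALLOC_CONF` the same way the CUDA allocator honors `PYTORCH_CUDA_ALLOC_CONF`; the specific values (256 MB split size, 0.80 utilization) are guesses, not tested settings:

```shell
# Assumption: torch_npu reads PYTORCH_NPU_ALLOC_CONF analogously to
# PYTORCH_CUDA_ALLOC_CONF; limits the largest cached block to reduce
# fragmentation, per the hint in the OOM message.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:256"

# Lowering --gpu-memory-utilization leaves more headroom for the
# allocations made outside the KV cache (e.g. graph capture).
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --max-model-len 4096 \
  --gpu-memory-utilization 0.80 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```

If the OOM persists, it would help to note whether it also reproduces without the `cudagraph_mode` setting, since full-decode graph capture reserves additional memory.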