
[Bug]: Out-of-memory error when following the v0.12.0rc1 tutorial for Qwen3-Next #5020

@Aurorahsy

Your current environment

Hardware environment: Atlas 800I A2 with 64 GB of NPU card memory
Using image: quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler

🐛 Describe the bug

Command:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
--tensor-parallel-size 4 --max-model-len 4096 \
--gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Error:

RuntimeError: NPU out of memory. Tried to allocate 2.00 GiB (NPU 0; 60.96 GiB total capacity; 55.91 GiB already allocated; 55.91 GiB current active; 1.80 GiB free; 58.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
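
The allocator message above suggests trying max_split_size_mb to reduce fragmentation. A minimal sketch of a retry with that hint and a slightly lower memory-utilization target, assuming the Ascend PyTorch plugin reads PYTORCH_NPU_ALLOC_CONF (the NPU counterpart of PYTORCH_CUDA_ALLOC_CONF) and that 0.8 utilization is acceptable for this deployment:

# Sketch only: apply the allocator hint from the error message and lower the
# KV-cache memory target. PYTORCH_NPU_ALLOC_CONF and the value 256 are
# assumptions, not a confirmed fix for this issue.
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --max-model-len 4096 \
  --gpu-memory-utilization 0.8 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'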
