Your current environment
Hardware environment: Atlas 800I A2 with 64 GB of memory per NPU card
Using image: quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
🐛 Describe the bug
Command:

```shell
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --max-model-len 4096 \
  --gpu-memory-utilization 0.85 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
Error:

```text
RuntimeError: NPU out of memory. Tried to allocate 2.00 GiB (NPU 0; 60.96 GiB total capacity; 55.91 GiB already allocated; 55.91 GiB current active; 1.80 GiB free; 58.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
```
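Not part of the original report, but since the traceback itself suggests `max_split_size_mb`, here is a hedged sketch of two mitigations to try. It assumes the torch_npu caching allocator honors `PYTORCH_NPU_ALLOC_CONF` the same way the CUDA allocator honors `PYTORCH_CUDA_ALLOC_CONF`; the specific values (256 MB split size, 0.80 utilization) are guesses, not tested settings:

```shell
# Assumption: torch_npu reads PYTORCH_NPU_ALLOC_CONF analogously to
# PYTORCH_CUDA_ALLOC_CONF; limits the largest cached block to reduce
# fragmentation, per the hint in the OOM message.
export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:256"

# Lowering --gpu-memory-utilization leaves more headroom for the
# allocations made outside the KV cache (e.g. graph capture).
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 --max-model-len 4096 \
  --gpu-memory-utilization 0.80 \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```

If the OOM persists, it would help to note whether it also reproduces without the `cudagraph_mode` setting, since full-decode graph capture reserves additional memory.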