### Your current environment
Bug Summary
When running Qwen3.6-35B with MTP speculative decoding (method: mtp) on a 128 GB unified-memory system (NVIDIA GB10 Blackwell), vLLM estimates the CUDA graph memory as a large negative value (-35.69 GiB). The negative estimate inflates the memory that the profiler reports as available for the KV cache. At --gpu-memory-utilization 0.8, this leads to an immediate OOM. Lowering the utilization to 0.55 works around the crash, but vLLM still allocates a 72.44 GiB KV cache, which is larger than the entire 0.55 × 128 GB ≈ 70 GB budget.
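The budget arithmetic behind the workaround can be checked with a few lines of Python. This is only a rough sketch: it assumes the budget is the utilization fraction times total memory, and it evaluates both readings of "128 GB" (decimal GB and binary GiB), since it is not stated here which one the device exposes. `budget_gib` is a hypothetical helper, not a vLLM function.

```python
GIB = 1024**3
GB = 1000**3


def budget_gib(total_bytes: int, utilization: float) -> float:
    """Memory budget in GiB, assuming budget = utilization * total memory."""
    return total_bytes * utilization / GIB


KV_CACHE_GIB = 72.44  # "Available KV cache memory" value from the startup log
UTILIZATION = 0.55    # --gpu-memory-utilization used in the workaround

for label, total in (("128 GB", 128 * GB), ("128 GiB", 128 * GIB)):
    budget = budget_gib(total, UTILIZATION)
    exceeds = KV_CACHE_GIB > budget
    print(f"{label}: budget {budget:.2f} GiB, KV cache exceeds budget: {exceeds}")
```

Under either reading, the reported KV cache is larger than the whole budget, before model weights and activations are counted.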
Environment:
- GPU: NVIDIA DGX Spark GB10 (128 GB Unified Memory)
- CUDA Version: 13.2.0
- vLLM Version: 0.22.1rc1.dev23+g6bdabbad5.d20260531
- Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4 (KV-Cache dtype: fp8)
Command Line Arguments:

    python3 -m vllm.entrypoints.openai.api_server \
        --model RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
        --max-model-len 131072 \
        --gpu-memory-utilization 0.55 \
        --enable-chunked-prefill \
        --kv-cache-dtype fp8 \
        --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

Relevant Log Output:

    (EngineCore pid=257) WARNING 06-06 16:17:06 [compilation.py:1407] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend... setting cudagraph_mode=PIECEWISE
    (EngineCore pid=257) INFO 06-06 16:17:06 [gpu_model_runner.py:6283] Profiling CUDA graph memory: PIECEWISE=51 (largest=512)
    (EngineCore pid=257) INFO 06-06 16:17:13 [gpu_model_runner.py:6369] Estimated CUDA graph memory: -35.69 GiB total
    (EngineCore pid=257) INFO 06-06 16:17:14 [gpu_worker.py:469] Available KV cache memory: 72.44 GiB

Expected Behavior:
CUDA graph memory profiling should return a non-negative estimate, and that estimate should be subtracted from the available GPU memory. A negative estimate (possibly an underflow) does the opposite: it adds memory to the pool handed to the KV cache block allocator.
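To illustrate why the sign matters, here is a minimal sketch. It assumes a simplified model in which the KV cache size equals the memory remaining after other usage minus the CUDA graph estimate. This is not vLLM's actual accounting; `kv_cache_gib` and `clamp_estimate` are hypothetical names introduced for illustration.

```python
def kv_cache_gib(remaining_gib: float, cudagraph_gib: float,
                 clamp_estimate: bool = False) -> float:
    """KV cache size under the assumed model: remaining - graph estimate."""
    if clamp_estimate:
        # Guard: never let a negative estimate add memory to the pool.
        cudagraph_gib = max(0.0, cudagraph_gib)
    return remaining_gib - cudagraph_gib


REPORTED_KV_GIB = 72.44      # "Available KV cache memory" from the log
REPORTED_GRAPH_GIB = -35.69  # "Estimated CUDA graph memory" from the log

# Solve the assumed model for the memory left before the graph subtraction.
remaining = REPORTED_KV_GIB + REPORTED_GRAPH_GIB

inflated = kv_cache_gib(remaining, REPORTED_GRAPH_GIB)
guarded = kv_cache_gib(remaining, REPORTED_GRAPH_GIB, clamp_estimate=True)
print(f"as reported: {inflated:.2f} GiB, with clamped estimate: {guarded:.2f} GiB")
```

Under this assumed model, a non-negative estimate would leave roughly 36.75 GiB or less for the KV cache instead of 72.44 GiB.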
Steps to Reproduce:
1. Load a >30B parameter model using fp8 kv-cache.
2. Enable chunked prefill.
3. Enable MTP speculative decoding with num_speculative_tokens: 3.
4. Observe the Estimated CUDA graph memory in the startup logs.
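The last step can be automated by scanning captured startup logs for a negative estimate. The sketch below assumes the log line keeps the wording shown in the log output above; `find_negative_estimates` is a hypothetical helper, not part of vLLM.

```python
import re

# Matches e.g. "Estimated CUDA graph memory: -35.69 GiB total"
PATTERN = re.compile(r"Estimated CUDA graph memory:\s*(-?\d+(?:\.\d+)?)\s*GiB")


def find_negative_estimates(log_text: str) -> list[float]:
    """Return every CUDA graph memory estimate in the log that is below zero."""
    values = [float(m.group(1)) for m in PATTERN.finditer(log_text)]
    return [v for v in values if v < 0]


sample = (
    "(EngineCore pid=257) INFO 06-06 16:17:13 [gpu_model_runner.py:6369] "
    "Estimated CUDA graph memory: -35.69 GiB total"
)
print(find_negative_estimates(sample))
```

An empty list means no negative estimate was logged; any returned value indicates the behavior described in this report.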
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.