[Bug]: Negative CUDA graph memory estimation (-35 GiB) with MTP speculative decoding leads to severe KV cache over-allocation and OOM #44740

@sto1

Your current environment

Bug Summary
When running Qwen3.6-35B with MTP speculative decoding (method: mtp) on a 128 GB Unified Memory system (NVIDIA GB10 Blackwell), vLLM estimates the CUDA graph memory as a large negative value (-35.69 GiB). The negative estimate artificially inflates the KV cache pool that the memory profiler reports as available. At a --gpu-memory-utilization of 0.8, this leads to an immediate OOM. Lowering the utilization to 0.55 works around the OOM, but vLLM still allocates a suspiciously large 72.44 GiB KV cache.

Environment:

- GPU: NVIDIA DGX Spark GB10 (128 GB Unified Memory)
- CUDA Version: 13.2.0
- vLLM Version: 0.22.1rc1.dev23+g6bdabbad5.d20260531
- Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4 (KV-Cache dtype: fp8)

Command Line Arguments:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.55 \
  --enable-chunked-prefill \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

Relevant Log Output:

```text
(EngineCore pid=257) WARNING 06-06 16:17:06 [compilation.py:1407] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend... setting cudagraph_mode=PIECEWISE
(EngineCore pid=257) INFO 06-06 16:17:06 [gpu_model_runner.py:6283] Profiling CUDA graph memory: PIECEWISE=51 (largest=512)
(EngineCore pid=257) INFO 06-06 16:17:13 [gpu_model_runner.py:6369] Estimated CUDA graph memory: -35.69 GiB total
(EngineCore pid=257) INFO 06-06 16:17:14 [gpu_worker.py:469] Available KV cache memory: 72.44 GiB
```
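
Some quick arithmetic relates the two logged figures. The sketch below assumes, without confirmation from the vLLM source, that the profiler subtracts the CUDA graph estimate when computing the available KV cache memory, so a negative estimate ends up being added. The variable names are illustrative, and the 128 figure is the nominal system memory rather than what CUDA actually reports.

```python
# Figures copied from the startup log in this report.
cudagraph_estimate_gib = -35.69
available_kv_gib = 72.44

# Assumption (not verified against vLLM): available KV memory is computed as
# (other terms) - cudagraph_estimate. Under that assumption, an estimate of
# zero would have left this much memory for the KV cache.
kv_if_estimate_were_zero = available_kv_gib + cudagraph_estimate_gib

# Nominal budget for --gpu-memory-utilization 0.55 on a 128 GB system.
# Treating 128 GB as 128 GiB is the generous case; the real budget is smaller.
nominal_budget = 0.55 * 128

print(f"KV cache with a zero graph estimate: {kv_if_estimate_were_zero:.2f} GiB")
print(f"Nominal utilization budget: {nominal_budget:.2f} GiB")
print(f"Reported KV cache exceeds the budget: {available_kv_gib > nominal_budget}")
```

If the assumption holds, the reported KV cache alone is larger than the whole utilization budget, before model weights and activations are counted.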

Expected Behavior:
CUDA graph memory profiling should return a non-negative value that is subtracted from the available GPU memory, rather than a negative estimate that inflates the memory handed to the KV cache block allocator.
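
One possible shape for such a safeguard is sketched below. It is a minimal illustration, not vLLM's implementation: `sanitize_cudagraph_estimate` is a hypothetical helper, and clamping to zero only hides the symptom, while the root cause of the negative measurement would still need to be found.

```python
import logging

logger = logging.getLogger(__name__)

GIB = 1 << 30


def sanitize_cudagraph_estimate(estimate_bytes: int) -> int:
    """Hypothetical guard: treat a negative estimate as unreliable and clamp it to 0."""
    if estimate_bytes < 0:
        logger.warning(
            "CUDA graph memory estimate is negative (%.2f GiB); clamping to 0.",
            estimate_bytes / GIB,
        )
        return 0
    return estimate_bytes


# The value from this report would be clamped instead of inflating the KV cache.
print(sanitize_cudagraph_estimate(int(-35.69 * GIB)))
```

With a guard like this, the KV cache budget would at least never grow because of the CUDA graph term.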

Steps to Reproduce:

1. Load a >30B parameter model using fp8 kv-cache.
2. Enable chunked prefill.
3. Enable MTP speculative decoding with num_speculative_tokens: 3.
4. Observe the Estimated CUDA graph memory in the startup logs.
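
To check the last step automatically, a small script can scan the captured startup log for the estimate and flag negative values. This is a sketch that assumes the message format shown in the log excerpt of this report; `find_cudagraph_estimates` is a hypothetical helper, not part of vLLM.

```python
import re

# Matches the message format shown in the log excerpt of this report.
ESTIMATE_RE = re.compile(r"Estimated CUDA graph memory: (-?\d+(?:\.\d+)?) GiB")


def find_cudagraph_estimates(log_text: str) -> list[float]:
    """Return every CUDA graph memory estimate (in GiB) found in the log text."""
    return [float(m.group(1)) for m in ESTIMATE_RE.finditer(log_text)]


sample_log = (
    "(EngineCore pid=257) INFO 06-06 16:17:13 [gpu_model_runner.py:6369] "
    "Estimated CUDA graph memory: -35.69 GiB total\n"
    "(EngineCore pid=257) INFO 06-06 16:17:14 [gpu_worker.py:469] "
    "Available KV cache memory: 72.44 GiB\n"
)

estimates = find_cudagraph_estimates(sample_log)
negative = [e for e in estimates if e < 0]
print(negative)
```

In practice, `sample_log` would be replaced with the contents of the server's captured stdout or log file.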

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
