Describe the bug
When attempting to serve the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model using the vllm/vllm-openai:cu130-nightly Docker image, the server immediately crashes with an ImportError: libcudart.so.12: cannot open shared object file: No such file or directory.
This occurs because the nixl_ep library (used for Expert Parallelism in MoE models) is compiled against CUDA 12 and explicitly looks for the CUDA 12 runtime library. However, the cu130-nightly image only contains the CUDA 13 runtime, causing the import to fail.
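The failure mode can be demonstrated independently of vLLM: loading a versioned soname fails with exactly this OSError whenever only a different major version of the runtime is installed. A minimal sketch (the helper name is mine; the comments describe what I believe happens on the cu130-nightly image):

```python
import ctypes

def can_load(soname: str) -> bool:
    """Return True if the dynamic linker can resolve the given soname."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

# On the cu130-nightly image (assumption based on the traceback below):
#   can_load("libcudart.so.13") -> True   (CUDA 13 runtime is present)
#   can_load("libcudart.so.12") -> False  (nixl_ep's link target is absent,
#                                          which surfaces as the ImportError)
print(can_load("libcudart.so.12"))
```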
Steps/Code to reproduce bug
Download the model's custom reasoning parser plugin, then run the vLLM Docker container using the cu130-nightly tag:
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py
docker run --rm -it --gpus all \
-e VLLM_NVFP4_GEMM_BACKEND=marlin \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e HF_TOKEN=$HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $(pwd)/super_v3_reasoning_parser.py:/app/super_v3_reasoning_parser.py \
-p 8000:8000 \
vllm/vllm-openai:cu130-nightly \
--model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--served-model-name nemotron-3-super \
--host 0.0.0.0 \
--port 8000 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--enable-chunked-prefill \
--max-num-seqs 4 \
--max-model-len 1000000 \
--moe-backend marlin \
--mamba_ssm_cache_dtype float32 \
--quantization fp4 \
--speculative_config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
--reasoning-parser-plugin /app/super_v3_reasoning_parser.py \
--reasoning-parser super_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Expected behavior
The vLLM server should start successfully and begin serving the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model without throwing dependency import errors related to the CUDA runtime environment.
Additional context
I encountered this issue on a DGX Spark while following the deployment instructions in the unreleased Spark Deployment Guide.
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 4, in <module>
from vllm.entrypoints.cli.main import main
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
from vllm.entrypoints.cli.benchmark.mm_processor import (
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/mm_processor.py", line 5, in <module>
from vllm.benchmarks.mm_processor import add_cli_args, main
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/mm_processor.py", line 25, in <module>
from vllm.benchmarks.datasets import (
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/__init__.py", line 4, in <module>
from vllm.benchmarks.datasets.datasets import (
File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/datasets.py", line 44, in <module>
from vllm.lora.utils import get_adapter_absolute_path
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 18, in <module>
from vllm.lora.layers import (
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/__init__.py", line 4, in <module>
from vllm.lora.layers.column_parallel_linear import (
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/column_parallel_linear.py", line 20, in <module>
from .base_linear import BaseLinearLayerWithLoRA
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/base_linear.py", line 28, in <module>
from .utils import _get_lora_device
File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/utils.py", line 10, in <module>
from vllm.model_executor.layers.fused_moe.fused_moe import try_get_optimal_moe_config
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 19, in <module>
from vllm.model_executor.layers.fused_moe.layer import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 47, in <module>
from vllm.model_executor.layers.fused_moe.unquantized_fused_moe_method import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py", line 26, in <module>
from vllm.model_executor.layers.fused_moe.oracle.unquantized import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/unquantized.py", line 14, in <module>
from vllm.model_executor.layers.fused_moe.all2all_utils import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/all2all_utils.py", line 46, in <module>
from .nixl_ep_prepare_finalize import (
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/nixl_ep_prepare_finalize.py", line 5, in <module>
import nixl_ep
File "/usr/local/lib/python3.12/dist-packages/nixl_ep/__init__.py", line 23, in <module>
from . import nixl_ep_cpp as _nixl_ep_cpp
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
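A possible workaround (untested here; the library path assumes the wheel's standard install layout under dist-packages) is to layer the CUDA 12 runtime wheel from PyPI on top of the image so the soname resolves, though the proper fix would be a nixl_ep build linked against CUDA 13:

```dockerfile
# Untested workaround sketch: add the CUDA 12 runtime alongside the CUDA 13 one
FROM vllm/vllm-openai:cu130-nightly
RUN pip install nvidia-cuda-runtime-cu12
# The wheel ships libcudart.so.12 under site-packages/nvidia/cuda_runtime/lib
ENV LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib:${LD_LIBRARY_PATH}
```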