ImportError: libcudart.so.12 missing in CUDA 13 nightly image when running Nemotron-3-Super-120B-A12B-NVFP4 #150

@jinho2020

Description

Describe the bug

When attempting to serve the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model using the vllm/vllm-openai:cu130-nightly Docker image, the server immediately crashes with an ImportError: libcudart.so.12: cannot open shared object file: No such file or directory.

This occurs because the nixl_ep library (used for Expert Parallelism in MoE models) is compiled against CUDA 12 and explicitly looks for the CUDA 12 runtime library. However, the cu130-nightly image only contains the CUDA 13 runtime, causing the import to fail.
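To confirm this mismatch inside the container, a quick probe (a minimal sketch; `probe_cudart` is a hypothetical helper, not part of vLLM) can check which CUDA runtime major versions the dynamic loader can actually resolve:

```python
import ctypes

def probe_cudart(major: int) -> bool:
    """Return True if libcudart.so.<major> is resolvable by the dynamic loader."""
    try:
        ctypes.CDLL(f"libcudart.so.{major}")
        return True
    except OSError:
        return False

# In the cu130-nightly image this is expected to report 12 as missing
# and 13 as found, matching the ImportError below.
for major in (12, 13):
    print(f"libcudart.so.{major}:", "found" if probe_cudart(major) else "missing")
```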

Steps/Code to reproduce bug

First download the reasoning parser plugin, then run the vLLM Docker container using the cu130-nightly tag:

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/raw/main/super_v3_reasoning_parser.py

docker run --rm -it --gpus all \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $(pwd)/super_v3_reasoning_parser.py:/app/super_v3_reasoning_parser.py \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
    --model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --served-model-name nemotron-3-super \
    --host 0.0.0.0 \
    --port 8000 \
    --async-scheduling \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 1 \
    --data-parallel-size 1 \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --max-num-seqs 4 \
    --max-model-len 1000000 \
    --moe-backend marlin \
    --mamba_ssm_cache_dtype float32 \
    --quantization fp4 \
    --speculative_config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}' \
    --reasoning-parser-plugin /app/super_v3_reasoning_parser.py \
    --reasoning-parser super_v3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

Expected behavior

The vLLM server should start successfully and serve the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model without CUDA runtime import errors.

Additional context

I encountered this issue on a DGX Spark while following the deployment instructions in the unreleased Spark Deployment Guide.

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/__init__.py", line 4, in <module>
    from vllm.entrypoints.cli.benchmark.mm_processor import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/benchmark/mm_processor.py", line 5, in <module>
    from vllm.benchmarks.mm_processor import add_cli_args, main
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/mm_processor.py", line 25, in <module>
    from vllm.benchmarks.datasets import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/__init__.py", line 4, in <module>
    from vllm.benchmarks.datasets.datasets import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/benchmarks/datasets/datasets.py", line 44, in <module>
    from vllm.lora.utils import get_adapter_absolute_path
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/utils.py", line 18, in <module>
    from vllm.lora.layers import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/__init__.py", line 4, in <module>
    from vllm.lora.layers.column_parallel_linear import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/column_parallel_linear.py", line 20, in <module>
    from .base_linear import BaseLinearLayerWithLoRA
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/base_linear.py", line 28, in <module>
    from .utils import _get_lora_device
  File "/usr/local/lib/python3.12/dist-packages/vllm/lora/layers/utils.py", line 10, in <module>
    from vllm.model_executor.layers.fused_moe.fused_moe import try_get_optimal_moe_config
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/__init__.py", line 19, in <module>
    from vllm.model_executor.layers.fused_moe.layer import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 47, in <module>
    from vllm.model_executor.layers.fused_moe.unquantized_fused_moe_method import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py", line 26, in <module>
    from vllm.model_executor.layers.fused_moe.oracle.unquantized import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/oracle/unquantized.py", line 14, in <module>
    from vllm.model_executor.layers.fused_moe.all2all_utils import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/all2all_utils.py", line 46, in <module>
    from .nixl_ep_prepare_finalize import (
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/nixl_ep_prepare_finalize.py", line 5, in <module>
    import nixl_ep
  File "/usr/local/lib/python3.12/dist-packages/nixl_ep/__init__.py", line 23, in <module>
    from . import nixl_ep_cpp as _nixl_ep_cpp
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
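If the root cause is as described above, one possible stopgap (untested; the wheel name, search path, and preload approach are assumptions, not a confirmed fix) is to install the CUDA 12 runtime wheel (`pip install nvidia-cuda-runtime-cu12`) alongside CUDA 13 and preload its libcudart.so.12 with RTLD_GLOBAL before vLLM imports nixl_ep:

```python
import ctypes
import glob

# Assumption: the nvidia-cuda-runtime-cu12 wheel is installed and ships
# libcudart.so.12 under dist-packages/nvidia/cuda_runtime/lib.
candidates = sorted(
    glob.glob(
        "/usr/local/lib/python3*/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12*"
    )
)
if candidates:
    # RTLD_GLOBAL exports the library's symbols so the later dlopen of
    # nixl_ep's C++ extension can resolve libcudart.so.12.
    ctypes.CDLL(candidates[0], mode=ctypes.RTLD_GLOBAL)
    print("preloaded", candidates[0])
else:
    print("libcudart.so.12 not found; install nvidia-cuda-runtime-cu12 first")
```

This only papers over the dependency mismatch; the proper fix is presumably a nixl_ep build compiled against CUDA 13 in the cu130 images.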

Labels: bug