vllm container crashes on standup #724

@shashwatj07

Component

I don't know

Describe the bug

vLLM fails to start inside the containers; EngineCore exits during CUDA device initialization:

kubectl logs <prefill/decode_pod_name>
(EngineCore_DP0 pid=316) Process EngineCore_DP0:
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     self.driver_worker.init_device()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/v1/worker/gpu_worker.py", line 209, in init_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     current_platform.set_device(self.device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm-source/vllm/platforms/cuda.py", line 123, in set_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     torch.cuda.set_device(device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py", line 567, in set_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     torch._C._cuda_setDevice(device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]   File "/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py", line 410, in _lazy_init
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936]     torch._C._cuda_init()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
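
Error 803 corresponds to the driver-mismatch result code in the CUDA driver API, so one way to narrow this down might be to ask the driver library directly, without going through torch or vLLM. The following is only a diagnostic sketch, not something from the standup scripts: the helper names `probe_cuda_driver` and `format_cuda_version` are hypothetical, and it assumes the driver library is visible inside the pod as `libcuda.so.1`. It relies on the driver API calls `cuInit` and `cuDriverGetVersion`.

```python
import ctypes


def format_cuda_version(raw):
    """Decode the integer reported by cuDriverGetVersion (1000*major + 10*minor)."""
    return f"{raw // 1000}.{(raw % 1000) // 10}"


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return (cuInit status, driver CUDA version) or None if the library is missing.

    lib_name is an assumption; adjust it if the driver is mounted elsewhere.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        # The driver library is not visible in this container at all.
        return None

    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    lib.cuDriverGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.cuDriverGetVersion.restype = ctypes.c_int

    # A status of 0 means success; 803 would match the error in the log above.
    init_status = lib.cuInit(0)

    version = ctypes.c_int(0)
    lib.cuDriverGetVersion(ctypes.byref(version))
    return init_status, format_cuda_version(version.value)


if __name__ == "__main__":
    result = probe_cuda_driver()
    if result is None:
        print("libcuda.so.1 could not be loaded in this container")
    else:
        status, version = result
        print(f"cuInit status={status}, driver supports CUDA {version}")
```

If this is run inside the failing pod (for example through `kubectl exec`) and `cuInit` also returns 803, the mismatch is likely between the host kernel driver and the user-space driver library in the container rather than anything specific to vLLM.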

Steps to reproduce

./scripts/standup.sh -c gpu

Additional context or screenshots

No response
