Open
Description
Component
I don't know
Describe the bug
Launching vLLM inside the containers fails with the following error:
kubectl logs <prefill/decode_pod_name>
(EngineCore_DP0 pid=316) Process EngineCore_DP0:
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] super().__init__(
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] self._init_executor()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] self.driver_worker.init_device()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] self.worker.init_device() # type: ignore
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/v1/worker/gpu_worker.py", line 209, in init_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] current_platform.set_device(self.device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm-source/vllm/platforms/cuda.py", line 123, in set_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] torch.cuda.set_device(device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py", line 567, in set_device
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] torch._C._cuda_setDevice(device)
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] File "/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py", line 410, in _lazy_init
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] torch._C._cuda_init()
(EngineCore_DP0 pid=316) ERROR 02-24 00:11:22 [core.py:936] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination
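One possible way to narrow this down without vLLM or PyTorch in the picture is to call the CUDA driver's `cuInit` directly from inside the pod. The snippet below is only a sketch and has not been run on the affected cluster. It assumes the driver library is exposed in the container as `libcuda.so.1`; `probe_cuda_driver` and `describe` are hypothetical helper names, not part of vLLM or PyTorch.

```python
import ctypes

# CUDA driver API result codes. 803 is the code reported in the traceback.
CUDA_SUCCESS = 0
CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly and return its integer result code.

    Returns None when the driver library cannot be loaded at all.
    The default library name is an assumption about the container image.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def describe(code):
    """Turn a probe result into a short human-readable message."""
    if code is None:
        return "driver library could not be loaded"
    if code == CUDA_SUCCESS:
        return "driver initialised"
    if code == CUDA_ERROR_SYSTEM_DRIVER_MISMATCH:
        return "error 803: display driver / CUDA driver version mismatch"
    return f"cuInit failed with code {code}"


if __name__ == "__main__":
    print(describe(probe_cuda_driver()))
```

If this also reports 803, the failure would be reproducible below the PyTorch layer, which would point at the driver setup in the container or on the node rather than at vLLM itself.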
Steps to reproduce
./scripts/standup.sh -c gpu
Additional context or screenshots
No response