What happened:
Trying to deploy an LLM with the vLLM library as a pod on an EKS 1.34 cluster, on a p5en.48xlarge managed node. The NVIDIA device plugin (v0.18.0) is deployed as a DaemonSet. The model container fails with the following runtime error:
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
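For reference, the same failure should be reproducible without vLLM by forcing CUDA initialization directly. A minimal sketch, assuming the PyTorch build that ships in the vLLM image:

```python
# Minimal sketch: force the same lazy CUDA initialization that vLLM
# reaches via torch.cuda.get_device_capability() (see traceback below).
import torch

# torch.cuda.init() ends in torch._C._cuda_init(); on the problematic
# AMI this raises the same "Error 802: system not yet initialized".
torch.cuda.init()
print(torch.cuda.get_device_capability(0))
```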
The same Docker image works on an identical 1.34 cluster with the same instance type in one AWS account, using AMI amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251023. With the same infrastructure in another AWS account, but with AMI version amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251108, we hit this error.
Working version: amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251023
Problematic version: amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251108
Logs:
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] WorkerProc failed to start.
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] Traceback (most recent call last):
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] worker = WorkerProc(*args, **kwargs)
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 405, in __init__
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] wrapper.init_worker(all_kwargs)
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 212, in init_worker
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] worker_class = resolve_obj_by_qualname(
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2680, in resolve_obj_by_qualname
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] module = importlib.import_module(module_name)
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/importlib/__init__.py", line 126, in import_module
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] return _bootstrap._gcd_import(name[level:], package, level)
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap_external>", line 940, in exec_module
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/worker/gpu_worker.py", line 34, in <module>
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] from vllm.v1.worker.gpu_model_runner import GPUModelRunner
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/worker/gpu_model_runner.py", line 69, in <module>
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] from vllm.v1.attention.backends.flash_attn import AttentionMetadata
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/attention/backends/flash_attn.py", line 154, in <module>
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] class FlashAttentionMetadataBuilder(
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/v1/attention/backends/flash_attn.py", line 175, in FlashAttentionMetadataBuilder
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] if get_flash_attn_version() == 3 else AttentionCGSupport.UNIFORM_BATCH
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/attention/utils/fa_utils.py", line 37, in get_flash_attn_version
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] and is_fa_version_supported(3)) else 2
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 57, in is_fa_version_supported
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] return _is_fa3_supported(device)[0]
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 43, in _is_fa3_supported
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] if torch.cuda.get_device_capability(device)[0] < 8
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 600, in get_device_capability
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] prop = get_device_properties(device)
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 616, in get_device_properties
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] _lazy_init() # will define _get_device_properties
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] ^^^^^^^^^^^^
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] File "/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py", line 412, in _lazy_init
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] torch._C._cuda_init()
chat-model ERROR 12-04 16:10:33 [multiproc_executor.py:597] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized
What you expected to happen:
The latest AMI version (v20251108) should initialize the GPUs and serve the model just like v20251023 does.
How to reproduce it (as minimally and precisely as possible):
1. Create an EKS 1.34 cluster with a p5en.48xlarge managed node group using AMI amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251108.
2. Deploy the NVIDIA device plugin v0.18.0 as a DaemonSet.
3. Deploy a vLLM model-serving pod that requests GPUs; the worker processes fail at startup with the CUDA error above, as the driver-level check below also shows.
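To check whether the failure sits below PyTorch, a driver-level probe can be run in the same container. A sketch, assuming libcuda.so.1 is injected by the NVIDIA container toolkit (802 is CUDA_ERROR_SYSTEM_NOT_READY in the driver API):

```python
# Sketch: query the CUDA driver API via ctypes, bypassing PyTorch.
# Assumes libcuda.so.1 is present inside the container.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

# Expected: 0 (CUDA_SUCCESS) on the working AMI; 802
# (CUDA_ERROR_SYSTEM_NOT_READY) on the problematic one.
rc = cuda.cuInit(0)
print("cuInit rc:", rc)

if rc == 0:
    count = ctypes.c_int()
    cuda.cuDeviceGetCount(ctypes.byref(count))
    print("device count:", count.value)
```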
Environment:
- AWS Region: US-East-1
- Instance Type(s): p5en.48xlarge
- Cluster Kubernetes version: 1.34
- Node Kubernetes version: 1.34
- AMI Version:
  - Working version: amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251023
  - Problematic version: amazon-eks-node-al2023-x86_64-nvidia-1.34-v20251108