What happened:
The HAMi device plugin detects my A30.
I start a pod with 1 gpu resource, the scheduler correctly assigns the GPU to the pod.
The pod starts and runs, but as soon as any GPU workload or nvidia-smi is triggered, the process hangs forever at:
[HAMI-core Msg(672:140535013861184:libvgpu.c:837)]: Initializing.....
This is how I specified the GPU in my pod manifest:
resources:
  limits:
    nvidia.com/gpu: 1
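For reference, the test pod can be created along these lines (a minimal sketch: cuda-gpu-test is the pod name used in the steps below, while the image tag and the sleep command are illustrative assumptions, not necessarily exactly what I used):

```bash
# Sketch of the test pod creation. The image tag and the sleep command are
# illustrative assumptions; the resource request is the part that matters.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.8.0-base-ubuntu22.04   # any CUDA-enabled image
      command: ["sleep", "infinity"]               # keep the pod running so it can be exec'd into
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```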
What you expected to happen:
I expect nvidia-smi and compute workloads to work inside the container that was assigned a GPU by HAMi.
How to reproduce it (as minimally and precisely as possible):
- A30 GPU, Compute Mode: Default, MIG Disabled
- containerd 2.0.4 (default runtime nvidia in config.toml)
- Kubernetes 1.32.3
- nvidia driver 570.86.15
- nvidia-container-toolkit 1.17.5
- cuda-toolkit 12.8
- Start a pod that requests `nvidia.com/gpu: 1` (manifest as above)
- Exec into the pod with `kubectl exec -it pod/cuda-gpu-test -- /bin/bash`
- Run `nvidia-smi` (see also the condensed one-liner after this list)
- nvidia-smi gets stuck at `[HAMI-core Msg(672:140535013861184:libvgpu.c:837)]: Initializing.....`
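For convenience, the last two steps can be condensed into a single command, which should hit the same hang:

```bash
# Run nvidia-smi directly in the test pod; expected to hang indefinitely at the
# HAMI-core "Initializing....." message quoted above.
kubectl exec -it pod/cuda-gpu-test -- nvidia-smi
```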
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host: nvidia-smi.txt
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`): config.toml.log
- The hami-device-plugin container logs: device-plugin.log
- The hami-scheduler container logs: scheduler.log
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`): kubelet.log
- Any relevant kernel output lines from `dmesg`: dmesg.log
Environment:
- HAMi version: v2.5.0
- nvidia driver or other AI device driver version: 570.86.15
- Docker version from `docker version`: n/a (using containerd 2.0.4)
- Docker command, image and tag used
- Kernel version from `uname -a`: Linux k8s-worker 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Others: see "How to reproduce it" above