What happened:
The HAMi device plugin detects my A30.
I start a pod with 1 gpu resource, the scheduler correctly assigns the GPU to the pod.
The pod starts and runs, but as soon as any GPU workload or nvidia-smi is triggered, the process hangs forever at:
[HAMI-core Msg(672:140535013861184:libvgpu.c:837)]: Initializing.....
This is how I specified the GPU in my pod manifest:
resources:
  limits:
    nvidia.com/gpu: 1
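For reference, the test pod can be created along these lines (a minimal sketch: cuda-gpu-test is the pod name used in the steps below, while the image tag and the sleep command are illustrative assumptions, not necessarily exactly what I used):

```bash
# Sketch of the test pod creation. The image tag and the sleep command are
# illustrative assumptions; the resource request is the part that matters.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.8.0-base-ubuntu22.04   # any CUDA-enabled image
      command: ["sleep", "infinity"]               # keep the pod running so it can be exec'd into
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```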
What you expected to happen:
I expect nvidia-smi and compute workloads to work inside the container that was assigned a GPU by HAMi.
How to reproduce it (as minimally and precisely as possible):
- A30 GPU, Compute Mode: Default, MIG Disabled
- containerd 2.0.4 (default runtime nvidia in config.toml)
- Kubernetes 1.32.3
- nvidia driver 570.86.15
- nvidia-container-toolkit 1.17.5
- cuda-toolkit 12.8
- Start a pod that requests `nvidia.com/gpu: 1` (manifest as above)
- Exec into the pod with `kubectl exec -it pod/cuda-gpu-test -- /bin/bash`
- Run `nvidia-smi` (see also the condensed one-liner after this list)
- nvidia-smi gets stuck at `[HAMI-core Msg(672:140535013861184:libvgpu.c:837)]: Initializing.....`
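For convenience, the last two steps can be condensed into a single command, which should hit the same hang:

```bash
# Run nvidia-smi directly in the test pod; expected to hang indefinitely at the
# HAMI-core "Initializing....." message quoted above.
kubectl exec -it pod/cuda-gpu-test -- nvidia-smi
```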
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host: nvidia-smi.txt
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`): config.toml.log
- The hami-device-plugin container logs: device-plugin.log
- The hami-scheduler container logs: scheduler.log
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`): kubelet.log
- Any relevant kernel output lines from `dmesg`: dmesg.log
Environment:
- HAMi version: v2.5.0
- nvidia driver or other AI device driver version: 570.86.15
- Docker version from `docker version`: n/a (using containerd 2.0.4)
- Docker command, image and tag used
- Kernel version from `uname -a`: Linux k8s-worker 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
- Others: see "How to reproduce it" above