Open
Labels: bug
Description
After installing the NVIDIA DRA driver using Helm on a KIND cluster, the nvidia-dra-driver-gpu-kubelet-plugin pod remains stuck in Init:0/1 status indefinitely. The init container continuously retries and fails to find the NVIDIA ML library (libnvidia-ml.so.1) despite the NVIDIA driver being present on the node.
Environment
- Cluster Type: KIND (Kubernetes in Docker)
- Cluster Name: kind-dra-1
- OS: Linux 6.12.53-1-lts
- NVIDIA Driver Version: 580.95.05
- GPU: NVIDIA GeForce RTX 4060
- DRA Driver Version: v25.8.0-dev
- Namespace: nvidia-dra-driver-gpu
Steps to Reproduce
- Set up the KIND cluster:
  export KIND_CLUSTER_NAME="kind-dra-1"
  ./demo/clusters/kind/create-cluster.sh
- Install the NVIDIA DRA driver via Helm:
  ./demo/clusters/kind/install-dra-driver-gpu.sh
- Check pod status:
  kubectl get pods -n nvidia-dra-driver-gpu -w
Observed Behavior
NAME                                              READY   STATUS     RESTARTS   AGE
nvidia-dra-driver-gpu-controller-b65d7c4d9-rbr7t  1/1     Running    0          10m
nvidia-dra-driver-gpu-kubelet-plugin-qsp2x        0/2     Init:0/1   0          10m
The kubelet-plugin pod remains stuck with:
- Status: Init:0/1
- Ready: 0/2
- Init Container: Running but never completes
Error Logs
Init container logs show continuous retry attempts:
create symlink: /driver-root -> /driver-root-parent/
2025-10-18T01:31:55Z /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].
Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.
2025-10-18T01:32:05Z /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].
2025-10-18T01:32:15Z /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].
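To narrow down which of the two files the check cannot see, a lookup of the same kind can be run by hand. The following is only a rough sketch and not the DRA driver's actual init script: the function name `check_driver_root` is hypothetical, and the list of library directories is an assumption that may not match every distribution.

```shell
#!/bin/sh
# Hypothetical helper (not part of the DRA driver): reports whether
# nvidia-smi and libnvidia-ml.so.1 exist under a given driver root.
check_driver_root() {
    root="${1:-/}"
    rc=0

    # nvidia-smi is looked up at usr/bin/nvidia-smi, the path shown in the log.
    if [ -e "${root%/}/usr/bin/nvidia-smi" ]; then
        echo "nvidia-smi: found"
    else
        echo "nvidia-smi: not found"
        rc=1
    fi

    # Assumed candidate library directories; adjust for the distribution in use.
    lib=""
    for dir in usr/lib64 usr/lib/x86_64-linux-gnu usr/lib/aarch64-linux-gnu usr/lib lib64 lib; do
        if [ -e "${root%/}/${dir}/libnvidia-ml.so.1" ]; then
            lib="${root%/}/${dir}/libnvidia-ml.so.1"
            break
        fi
    done

    if [ -n "$lib" ]; then
        echo "libnvidia-ml.so.1: $lib"
    else
        echo "libnvidia-ml.so.1: not found"
        rc=1
    fi

    # Non-zero when either file is missing.
    return "$rc"
}
```

Calling `check_driver_root /` on the machine and again inside the KIND node container may show whether the driver files are visible in both places. With KIND, the "host" filesystem that a pod mounts is that of the node container, so a driver installed on the physical machine is not necessarily present there.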
riccardo32