NVIDIA DRA Driver Kubelet Plugin Pod Stuck in Init:0/1 Status #692

@nmn3m

Description

After installing the NVIDIA DRA driver using Helm on a KIND cluster, the nvidia-dra-driver-gpu-kubelet-plugin pod remains stuck in Init:0/1 status indefinitely. The init container continuously retries and fails to find the NVIDIA ML library (libnvidia-ml.so.1) despite the NVIDIA driver being present on the node.

Environment

  • Cluster Type: KIND (Kubernetes in Docker)
  • Cluster Name: kind-dra-1
  • OS: Linux 6.12.53-1-lts
  • NVIDIA Driver Version: 580.95.05
  • GPU: NVIDIA GeForce RTX 4060
  • DRA Driver Version: v25.8.0-dev
  • Namespace: nvidia-dra-driver-gpu

Steps to Reproduce

  1. Set up KIND cluster:

    export KIND_CLUSTER_NAME="kind-dra-1"
    ./demo/clusters/kind/create-cluster.sh

  2. Install NVIDIA DRA driver via Helm:

    ./demo/clusters/kind/install-dra-driver-gpu.sh

  3. Check pod status:

    kubectl get pods -n nvidia-dra-driver-gpu -w

Observed Behavior

NAME                                               READY   STATUS     RESTARTS   AGE
nvidia-dra-driver-gpu-controller-b65d7c4d9-rbr7t   1/1     Running    0          10m
nvidia-dra-driver-gpu-kubelet-plugin-qsp2x         0/2     Init:0/1   0          10m

The kubelet-plugin pod remains stuck with:

  • Status: Init:0/1
  • Ready: 0/2
  • Init Container: Running but never completes

Error Logs

Init container logs show continuous retry attempts:

create symlink: /driver-root -> /driver-root-parent/
2025-10-18T01:31:55Z  /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].

Check failed. Has the NVIDIA GPU driver been set up? It is expected to be installed under NVIDIA_DRIVER_ROOT (currently set to '/') in the host filesystem. If that path appears to be unexpected: review the DRA driver's 'nvidiaDriverRoot' Helm chart variable. Otherwise, review if the GPU driver has actually been installed under that path.

2025-10-18T01:32:05Z  /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].
2025-10-18T01:32:15Z  /driver-root (/ on host): nvidia-smi: '/driver-root/usr/bin/nvidia-smi', libnvidia-ml.so.1: not found, current contents: [].
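
To narrow down whether the two files named in the log exist under a given root, a small check along these lines may help. This is only a sketch, not the DRA driver's actual init logic: `check_driver_root` is a hypothetical helper, and the library directories it searches are assumptions about common distribution layouts.

```shell
# Hypothetical helper (not part of the DRA driver): reports whether
# nvidia-smi and libnvidia-ml.so.1 can be found under a driver root.
# Usage: check_driver_root [ROOT]   (ROOT defaults to /)
check_driver_root() {
  local root="${1:-/}"
  local prefix="${root%/}"          # strip a trailing slash, so "/" becomes ""
  local smi="${prefix}/usr/bin/nvidia-smi"
  local lib=""
  local dir status=0

  if [ -e "$smi" ]; then
    echo "nvidia-smi: found at $smi"
  else
    echo "nvidia-smi: not found at $smi"
    status=1
  fi

  # Assumed search locations; distributions differ in where they place the library.
  for dir in usr/lib usr/lib64 lib lib64; do
    lib="$(find "${prefix}/${dir}" -maxdepth 2 -name 'libnvidia-ml.so.1' 2>/dev/null | head -n 1)" || true
    if [ -n "$lib" ]; then
      break
    fi
  done

  if [ -n "$lib" ]; then
    echo "libnvidia-ml.so.1: found at $lib"
  else
    echo "libnvidia-ml.so.1: not found under ${prefix:-/}"
    status=1
  fi

  # 0 only when both files were found.
  return "$status"
}
```

Running `check_driver_root /` on the machine checks the root that NVIDIA_DRIVER_ROOT is currently set to. In a KIND cluster each node is itself a Docker container, so it may be worth comparing against the node's filesystem as well, for example with `docker exec <node-name> ls /usr/bin/nvidia-smi`, where the node names can be listed with `kind get nodes --name kind-dra-1`.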

Metadata

  • Labels: bug
  • Status: Backlog