GPU operator fails to start up. gpu-operator-resources errors. #5247

@khteh

Description

clientVersion:
  buildDate: "2024-10-16T15:15:29Z"
  compiler: gc
  gitCommit: cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a
  gitTreeState: clean
  gitVersion: v1.27.16
  goVersion: go1.22.5
  major: "1"
  minor: "27"
  platform: linux/amd64
kustomizeVersion: v5.0.1
serverVersion:
  buildDate: "2024-10-16T15:16:32Z"
  compiler: gc
  gitCommit: cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a
  gitTreeState: clean
  gitVersion: v1.27.16
  goVersion: go1.22.5
  major: "1"
  minor: "27"
  platform: linux/amd64

Summary

gpu-operator-node-feature-discovery-worker-lchx9              1/1     Running                 0             2m19s
gpu-operator-76998cd846-cwfrv                                 1/1     Running                 0             2m19s
gpu-operator-node-feature-discovery-master-6fbc745786-l828r   1/1     Running                 0             2m19s
nvidia-operator-validator-hhgp7                               0/1     Init:CrashLoopBackOff   4 (34s ago)   119s
nvidia-dcgm-exporter-zsjdz                                    0/1     Init:CrashLoopBackOff   4 (29s ago)   118s
nvidia-container-toolkit-daemonset-rxrx9                      0/1     Init:CrashLoopBackOff   4 (29s ago)   119s
gpu-feature-discovery-s7lsd                                   0/1     Init:CrashLoopBackOff   4 (19s ago)   118s
nvidia-device-plugin-daemonset-qz9rq                          0/1     Init:CrashLoopBackOff   4 (19s ago)   119s
  Warning  Failed     3m5s (x4 over 3m45s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
  Warning  BackOff         3m5s (x5 over 3m44s)  kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-device-plugin-daemonset-mcn5z_gpu-operator-resources(73d22052-9175-4d74-9cbb-71367483b3f3)
  Warning  FailedMount     2m11s                 kubelet  MountVolume.SetUp failed for volume "run-nvidia" : hostPath type check failed: /run/nvidia is not a directory
  Normal   SandboxChanged  2m5s                  kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          29s (x4 over 2m2s)    kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1" already present on machine
  Normal   Created         29s (x4 over 2m2s)    kubelet  Created container toolkit-validation
  Warning  Failed          29s (x4 over 2m1s)    kubelet  Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
  Warning  BackOff  6s (x12 over 2m1s)  kubelet  Back-off restarting failed container toolkit-validation in pod nvidia-device-plugin-daemonset-mcn5z_gpu-operator-resources(73d22052-9175-4d74-9cbb-71367483b3f3)
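The events above name two host-side conditions: `libnvidia-ml.so.1` cannot be opened, and `/run/nvidia` is not a directory. A minimal sketch for checking both on the GPU node follows. The function names are hypothetical and not part of the GPU operator. The library check only approximates what `nvidia-container-cli` does, since the hook may search a different root than the host's default loader paths.

```python
import ctypes
import os


def is_directory(path: str) -> bool:
    """Return True only if path exists and is a directory."""
    return os.path.isdir(path)


def can_load_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


def check_host(run_dir: str = "/run/nvidia",
               lib: str = "libnvidia-ml.so.1") -> dict:
    """Run both checks and return a mapping of description -> result."""
    return {
        f"{run_dir} is a directory": is_directory(run_dir),
        f"{lib} is loadable": can_load_library(lib),
    }


if __name__ == "__main__":
    # Defaults are taken from the paths and library name in the events above.
    for check, ok in check_host().items():
        print(f"{'OK  ' if ok else 'FAIL'}  {check}")
```

Run it directly on the affected node, not inside a pod. A `FAIL` on the library line suggests that the NVIDIA driver's user-space libraries are not visible on the default loader path, but it does not by itself identify the cause.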

What Should Happen Instead?

The GPU operator pods in gpu-operator-resources should start without errors.

Reproduction Steps

  1. ...
  2. ...

Introspection Report

inspection-report-20250923_152230.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

no