Open
Description
clientVersion:
  buildDate: "2024-10-16T15:15:29Z"
  compiler: gc
  gitCommit: cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a
  gitTreeState: clean
  gitVersion: v1.27.16
  goVersion: go1.22.5
  major: "1"
  minor: "27"
  platform: linux/amd64
kustomizeVersion: v5.0.1
serverVersion:
  buildDate: "2024-10-16T15:16:32Z"
  compiler: gc
  gitCommit: cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a
  gitTreeState: clean
  gitVersion: v1.27.16
  goVersion: go1.22.5
  major: "1"
  minor: "27"
  platform: linux/amd64
Summary
gpu-operator-node-feature-discovery-worker-lchx9              1/1   Running                 0             2m19s
gpu-operator-76998cd846-cwfrv                                 1/1   Running                 0             2m19s
gpu-operator-node-feature-discovery-master-6fbc745786-l828r   1/1   Running                 0             2m19s
nvidia-operator-validator-hhgp7                               0/1   Init:CrashLoopBackOff   4 (34s ago)   119s
nvidia-dcgm-exporter-zsjdz                                    0/1   Init:CrashLoopBackOff   4 (29s ago)   118s
nvidia-container-toolkit-daemonset-rxrx9                      0/1   Init:CrashLoopBackOff   4 (29s ago)   119s
gpu-feature-discovery-s7lsd                                   0/1   Init:CrashLoopBackOff   4 (19s ago)   118s
nvidia-device-plugin-daemonset-qz9rq                          0/1   Init:CrashLoopBackOff   4 (19s ago)   119s
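For triage, the listing above can be reduced to just the pods that are not healthy. The following is a minimal sketch, assuming the text has the usual `kubectl get pods` column order (name, ready count, status); the function name `failing_pods` is hypothetical and not part of any tool mentioned in this report.

```python
def failing_pods(listing):
    """Return (name, status) pairs for pods that are not fully ready.

    Assumes whitespace-separated columns in the order NAME, READY, STATUS,
    as in the pod listing pasted in this report.
    """
    failing = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and a header row (its READY column has no "/").
        if len(fields) < 3 or "/" not in fields[1]:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_want = ready.partition("/")
        if ready_now != ready_want or status not in ("Running", "Completed"):
            failing.append((name, status))
    return failing


if __name__ == "__main__":
    sample = (
        "gpu-operator-76998cd846-cwfrv 1/1 Running 0 2m19s\n"
        "nvidia-operator-validator-hhgp7 0/1 Init:CrashLoopBackOff 4 (34s ago) 119s\n"
    )
    for name, status in failing_pods(sample):
        print(name, status)
```

Applied to the full listing, this would report the five pods stuck in Init:CrashLoopBackOff and skip the three Running ones.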
Warning Failed 3m5s (x4 over 3m45s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Warning BackOff 3m5s (x5 over 3m44s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-device-plugin-daemonset-mcn5z_gpu-operator-resources(73d22052-9175-4d74-9cbb-71367483b3f3)
Warning FailedMount 2m11s kubelet MountVolume.SetUp failed for volume "run-nvidia" : hostPath type check failed: /run/nvidia is not a directory
Normal SandboxChanged 2m5s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29s (x4 over 2m2s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1" already present on machine
Normal Created 29s (x4 over 2m2s) kubelet Created container toolkit-validation
Warning Failed 29s (x4 over 2m1s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
Warning BackOff 6s (x12 over 2m1s) kubelet Back-off restarting failed container toolkit-validation in pod nvidia-device-plugin-daemonset-mcn5z_gpu-operator-resources(73d22052-9175-4d74-9cbb-71367483b3f3)
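The events above name two concrete conditions: `libnvidia-ml.so.1` could not be loaded, and `/run/nvidia` is not a directory. The following is a minimal sketch for checking both, assuming it is run with Python 3 directly on the affected node; the function name `check_node` and the result keys are hypothetical. It only reports the two conditions and does not explain their cause. The container hook may also resolve the library under a different root than the host's default loader path, so treat the result as a hint rather than a diagnosis.

```python
import ctypes
import os


def check_node(lib_name="libnvidia-ml.so.1", run_dir="/run/nvidia"):
    """Check the two conditions named in the kubelet events.

    Defaults are the library and path quoted in the events; both can be
    overridden for testing.
    """
    try:
        # CDLL raises OSError if the dynamic loader cannot find or open the library.
        ctypes.CDLL(lib_name)
        loadable = True
    except OSError:
        loadable = False
    return {
        "library_loadable": loadable,
        "run_dir_is_directory": os.path.isdir(run_dir),
    }


if __name__ == "__main__":
    for check, ok in check_node().items():
        print(f"{check}: {'ok' if ok else 'FAILED'}")
```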
What Should Happen Instead?
All GPU operator pods should reach Running, with no init container errors in the gpu-operator-resources namespace.
Reproduction Steps
- ...
- ...
Introspection Report
inspection-report-20250923_152230.tar.gz
Can you suggest a fix?
Are you interested in contributing with a fix?
no