Description
Bare metal K3S v1.33.3+k3s1 on kernel 6.15.11-2-MANJARO.
Not a new install; this setup had been stable for many months. After rebooting the node with the GPU, the device plugin pod now crash-loops with this message:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: ldcache error: process /sbin/ldconfig terminated with signal 9
I'm confused by the OCI, Legacy and ldcache references.
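For anyone decoding the message: signal 9 is SIGKILL, so /sbin/ldconfig was killed from outside rather than exiting with an error of its own. Below is a minimal sketch for checking this on the host. It assumes a systemd journal is available and that the kernel logged anything about the kill, neither of which is confirmed in this report.

```shell
#!/bin/sh
# Map the signal number from the error message to its name.
sig_name="$(kill -l 9)"
echo "signal 9 = SIG${sig_name}"

# Search this boot's kernel log for mentions of ldconfig or the OOM killer.
# Assumption: journalctl exists and the current user may read kernel messages.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager 2>/dev/null \
        | grep -iE 'ldconfig|out of memory|oom-kill' \
        || echo "no matching kernel messages"
fi
```

An empty result does not rule anything out; SIGKILL can also come from sources that leave no kernel log line.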
Chart reference in ArgoCD:
- repoURL: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  targetRevision: 0.17.3
Helm Values File:
---
# yaml-language-server: $schema=https://json.schemastore.org/helmfile
nodeSelector:
  nvidia.feature.node.kubernetes.io/gpu.3060: "true"
runtimeClassName: nvidia
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.feature.node.kubernetes.io/gpu.3060
              operator: In
              values:
                - "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          renameByDefault: false
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 6
# Subcharts
nfd: {}
gfd:
  enabled: false
- NFD is already installed via its own Helm Chart.
Current versions on host:
$ pacman -Q libnvidia-container
libnvidia-container 1.17.8-1
$ pacman -Q nvidia-container-toolkit
nvidia-container-toolkit 1.17.8-1
$ pacman -Q nvidia-utils
nvidia-utils 575.64.05-1
$ nvidia-smi
Tue Sep 2 15:33:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:09:00.0  On |                  N/A |
|  0%   54C    P3             30W /  170W |    1665MiB /  12288MiB |     37%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
$ nvidia-container-cli info
NVRM version: 575.64.05
CUDA version: 12.9
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3060
Brand: GeForce
GPU UUID: GPU-ace6a26d-6a78-9562-4fbc-69984c397347
Bus Location: 00000000:09:00.0
Architecture: 8.6
$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/libnvidia-ml.so.575.64.05
/usr/lib/libnvidia-cfg.so.575.64.05
/usr/lib/libcuda.so.575.64.05
/usr/lib/libcudadebugger.so.575.64.05
/usr/lib/libnvidia-gpucomp.so.575.64.05
/usr/lib/libnvidia-ptxjitcompiler.so.575.64.05
/usr/lib/libnvidia-allocator.so.575.64.05
/usr/lib/libnvidia-pkcs11.so.575.64.05
/usr/lib/libnvidia-pkcs11-openssl3.so.575.64.05
/usr/lib/libnvidia-nvvm.so.575.64.05
/usr/lib/libnvidia-ngx.so.575.64.05
/usr/lib/libnvidia-encode.so.575.64.05
/usr/lib/libnvidia-opticalflow.so.575.64.05
/usr/lib/libnvcuvid.so.575.64.05
/usr/lib/libnvidia-eglcore.so.575.64.05
/usr/lib/libnvidia-glcore.so.575.64.05
/usr/lib/libnvidia-tls.so.575.64.05
/usr/lib/libnvidia-glsi.so.575.64.05
/usr/lib/libnvidia-fbc.so.575.64.05
/usr/lib/libnvidia-rtcore.so.575.64.05
/usr/lib/libnvoptix.so.575.64.05
/usr/lib/libGLX_nvidia.so.575.64.05
/usr/lib/libEGL_nvidia.so.575.64.05
/usr/lib/libGLESv2_nvidia.so.575.64.05
/usr/lib/libGLESv1_CM_nvidia.so.575.64.05
/usr/lib/libnvidia-glvkspirv.so.575.64.05
/usr/lib32/libnvidia-ml.so.575.64.05
/usr/lib32/libcuda.so.575.64.05
/usr/lib32/libnvidia-gpucomp.so.575.64.05
/usr/lib32/libnvidia-ptxjitcompiler.so.575.64.05
/usr/lib32/libnvidia-allocator.so.575.64.05
/usr/lib32/libnvidia-encode.so.575.64.05
/usr/lib32/libnvidia-opticalflow.so.575.64.05
/usr/lib32/libnvcuvid.so.575.64.05
/usr/lib32/libnvidia-eglcore.so.575.64.05
/usr/lib32/libnvidia-glcore.so.575.64.05
/usr/lib32/libnvidia-tls.so.575.64.05
/usr/lib32/libnvidia-glsi.so.575.64.05
/usr/lib32/libnvidia-fbc.so.575.64.05
/usr/lib32/libGLX_nvidia.so.575.64.05
/usr/lib32/libEGL_nvidia.so.575.64.05
/usr/lib32/libGLESv2_nvidia.so.575.64.05
/usr/lib32/libGLESv1_CM_nvidia.so.575.64.05
/usr/lib32/libnvidia-glvkspirv.so.575.64.05
/lib/firmware/nvidia/575.64.05/gsp_ga10x.bin
/lib/firmware/nvidia/575.64.05/gsp_tu10x.bin
From K3S config:
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi']
runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia-cdi'.options]
BinaryName = "/usr/bin/nvidia-container-runtime.cdi"
SystemdCgroup = true
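The "legacy" in the error refers to the NVIDIA container runtime mode: in that mode a prestart hook calls nvidia-container-cli, which runs ldconfig to refresh the container's library cache. The sketch below prints the relevant settings from the toolkit's own config file. The path is the usual default and is an assumption here, since the report does not show that file; `show_runtime_settings` is a hypothetical helper name.

```shell
#!/bin/sh
# Assumed default location of the NVIDIA Container Toolkit config.
# Pass another path as the first argument if the host differs.
cfg="${1:-/etc/nvidia-container-runtime/config.toml}"

# Hypothetical helper: print section headers plus any "mode" and "ldconfig"
# keys (including commented-out ones), prefixed with their line numbers.
show_runtime_settings() {
    grep -nE '^[[:space:]]*(\[|#?[[:space:]]*(mode|ldconfig)[[:space:]]*=)' "$1"
}

if [ -r "$cfg" ]; then
    show_runtime_settings "$cfg"
else
    echo "not readable: $cfg"
fi
```

The `mode` key under `[nvidia-container-runtime]` and the `ldconfig` key under `[nvidia-container-cli]` are the two values that line up with the "legacy" and "ldcache" parts of the message.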
$ k get all -n nvidia
NAME                             READY   STATUS                  RESTARTS       AGE
pod/nvidia-device-plugin-268bb   0/2     Init:CrashLoopBackOff   16 (50s ago)   60m

NAME                                                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                 AGE
daemonset.apps/nvidia-device-plugin                      1         1         0       1            0           nvidia.feature.node.kubernetes.io/gpu.3060=true                               60m
daemonset.apps/nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/mps.capable=true,nvidia.feature.node.kubernetes.io/gpu.3060=true   60m
Nvidia packages have not been updated recently on the host:
$ ls -ltR /var/cache/pacman/pkg/nvidia*.zst
.rw-r--r-- root root 80 KB Mon Aug 4 11:36:53 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.22.65.06-1-any.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jul 22 13:48:56 2025 /var/cache/pacman/pkg/nvidia-utils-575.64.05-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jul 1 17:02:35 2025 /var/cache/pacman/pkg/nvidia-utils-575.64.03-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 334 MB Tue Jun 17 14:26:14 2025 /var/cache/pacman/pkg/nvidia-utils-575.64-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 79 KB Tue Jun 3 21:00:32 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.21.57.08-1-any.pkg.tar.zst
.rw-r--r-- root root 4.3 MB Sun Jun 1 11:33:12 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.8-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 79 KB Thu May 1 23:38:55 2025 /var/cache/pacman/pkg/nvidia-driver-assistant-0.21.51.03-1-any.pkg.tar.zst
.rw-r--r-- root root 4.3 MB Sat Apr 26 11:27:21 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.6-1-x86_64.pkg.tar.zst
.rw-r--r-- root root 4.2 MB Thu Mar 13 11:17:51 2025 /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.5-1-x86_64.pkg.tar.zst
Toni500github