The mps-control-daemon is restarting when GPU drivers are installed through the GKE daemonset. Tested with the example Pod from https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/demo/specs/quickstart/gpu-test-mps.yaml
Kubernetes version: 1.33.2-gke.1240000
Image type: UBUNTU_CONTAINERD
logs:
$ kubectl logs -n nvidia mps-control-daemon-d8478efa-5c09-461d-b7b9-f59f320396a8-5cv5nr8 -p
chroot: failed to run command 'sh': No such file or directory
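The `chroot` failure suggests the configured driver root on the node may not contain a chroot-able filesystem (no `sh` inside it). A hedged way to check this from outside the DRA driver is a node debug pod; `<gpu-node-name>` is a placeholder for one of the affected GPU nodes, and `/host` is where `kubectl debug node` mounts the host filesystem:

```shell
# Sketch: inspect the driver root the DRA plugin is configured with
# (nvidiaDriverRoot: /opt/nvidia) directly on an affected node.
NODE=<gpu-node-name>
kubectl debug node/"$NODE" -it --image=ubuntu -- \
  sh -c 'ls /host/opt/nvidia; chroot /host/opt/nvidia sh -c "echo ok"'
```

If the second command fails the same way as the daemon log, the directory holds driver files but not a root filesystem that `chroot` can execute `sh` in.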
GPU driver installation:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
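Before installing the DRA chart, it may help to confirm the driver installer finished on all GPU nodes. A minimal check, assuming the daemonset name and namespace from the GKE manifest above:

```shell
# Sketch: wait for the GKE driver installer daemonset to complete its rollout.
kubectl -n kube-system rollout status ds/nvidia-driver-installer
kubectl -n kube-system get pods -l k8s-app=nvidia-driver-installer -o wide
```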
DRA helm chart installation:
cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /opt/nvidia
controller:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
kubeletPlugin:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
EOF
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.0-rc.4" \
--namespace nvidia \
-f dra_values.yaml
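After the install, the restarts can be observed directly; a hedged verification sketch (the label selector is an assumption based on common Helm chart conventions, not confirmed for this chart):

```shell
# Sketch: look for restarting mps-control-daemon containers and pull
# their previous-instance logs, as shown in the log excerpt above.
kubectl -n nvidia get pods -o wide
kubectl -n nvidia logs -l app.kubernetes.io/name=nvidia-dra-driver-gpu \
  --all-containers --tail=20 -p
```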