Node status changes to NotReady when a GPU is allocated through DRA; `kubectl describe node` indicates several restarts of containerd.
Full node describe: https://gist.github.com/kasia-kujawa/abd0db19c42179bff3134f3e96ae652a
Steps to reproduce:
- Install the GPU Operator
- Install the NVIDIA DRA driver
- Create a Deployment that uses DRA allocation:
```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
      - name: ts-ctr0
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
        resources:
          claims:
          - name: shared-gpus
            request: ts-gpu
      - name: ts-ctr1
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
        resources:
          claims:
          - name: shared-gpus
            request: ts-gpu
      resourceClaims:
      - name: shared-gpus
        resourceClaimName: timeslicing
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
```
- Check that the pods are running
- Observe the GPU node status: after a while it changes to NotReady
- After a while the node returns to Ready status, and then changes back to NotReady again
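The status flapping above can be observed with commands along these lines (`<gpu-node>` is a placeholder for the actual GPU node name):

```shell
# Watch the node's Ready condition flip between Ready and NotReady
kubectl get node <gpu-node> --watch

# Look for kubelet/containerd-related events recorded on the node
kubectl describe node <gpu-node> | grep -i -A 2 containerd
```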
Environment:
GKE 1.33.2-gke.1240000, Image: UBUNTU_CONTAINERD

Helm versions:
```
helm ls -A
NAME                   NAMESPACE  REVISION  UPDATED                                STATUS    CHART                              APP VERSION
gpu-operator           nvidia     1         2025-08-25 14:59:05.532858 +0200 CEST  deployed  gpu-operator-v25.3.2               v25.3.2
nvidia-dra-driver-gpu  nvidia     1         2025-08-25 15:36:56.375659 +0200 CEST  deployed  nvidia-dra-driver-gpu-25.3.0-rc.4  25.3.0-rc.4
```
GPU Operator configuration (workaround from NVIDIA/nvidia-container-toolkit#1222 (comment)):
```yaml
toolkit:
  repository: ghcr.io/nvidia
  image: container-toolkit
  version: 8334ddec-ubuntu20.04
  env:
  - name: RUNTIME_CONFIG_SOURCES
    value: "file"
```
NVIDIA DRA driver configuration:
```yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /run/nvidia/driver
controller:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
kubeletPlugin:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
```
I also see this issue when installing the GPU drivers through https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml and installing the NVIDIA Container Toolkit on the node using the instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#with-apt-ubuntu-debian