Node status changed to NotReady when GPU is allocated through DRA #501

@kasia-kujawa

Description

Node status changes to NotReady when a GPU is allocated through DRA. The node describe output also indicates several restarts of containerd.
full node describe: https://gist.github.com/kasia-kujawa/abd0db19c42179bff3134f3e96ae652a

Steps to reproduce:

  1. Install the GPU Operator
  2. Install the NVIDIA DRA driver
  3. Create a Deployment that uses DRA allocation:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
        - name: ts-ctr0
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
        - name: ts-ctr1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
      resourceClaims:
        - name: shared-gpus
          resourceClaimName: timeslicing
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

  4. Check that the pods are running
  5. Observe the GPU node status; after a while it changes to NotReady
  6. After a while the node returns to Ready, then changes back to NotReady again
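For steps 5 and 6, a minimal way to observe the status flapping is to watch the node list and filter for NotReady nodes. The commands below are a sketch (the node name and cluster context are assumptions, not taken from this report); the `awk` filter is demonstrated on sample `kubectl get nodes` output:

```shell
# Watch node status transitions live (requires kubectl pointed at the affected cluster):
#   kubectl get nodes -w
# Check for containerd restarts in the node's events:
#   kubectl describe node <gpu-node-name> | grep -i containerd
# The filter below extracts the names of NotReady nodes from `kubectl get nodes`
# output; demonstrated here on sample output (node name is illustrative):
printf 'NAME   STATUS     ROLES   AGE   VERSION\ngke-node-1   NotReady   <none>   5d   v1.33.2-gke.1240000\n' |
  awk 'NR>1 && $2=="NotReady" {print $1}'   # prints: gke-node-1
```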

Environment:
GKE 1.33.2-gke.1240000, Image: UBUNTU_CONTAINERD
Helm versions:

 helm ls -A
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                   APP VERSION
gpu-operator            nvidia          1               2025-08-25 14:59:05.532858 +0200 CEST   deployed        gpu-operator-v25.3.2                    v25.3.2    
nvidia-dra-driver-gpu   nvidia          1               2025-08-25 15:36:56.375659 +0200 CEST   deployed        nvidia-dra-driver-gpu-25.3.0-rc.4       25.3.0-rc.4

GPU Operator configuration (workaround from NVIDIA/nvidia-container-toolkit#1222 (comment)):

toolkit:
  repository: ghcr.io/nvidia
  image: container-toolkit
  version: 8334ddec-ubuntu20.04
  env:
    - name: RUNTIME_CONFIG_SOURCES
      value: "file"

NVIDIA DRA driver configuration:

resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /run/nvidia/driver
controller:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
kubeletPlugin:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"

I also see this issue when installing GPU drivers through https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml and installing the NVIDIA Container Toolkit on the node following the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#with-apt-ubuntu-debian

Labels: kind/bug