Node status changed to NotReady when GPU is allocated through DRA #501

@kasia-kujawa

Description

Node status changes to NotReady when a GPU is allocated through DRA. The node describe output also indicates several restarts of containerd.
full node describe: https://gist.github.com/kasia-kujawa/abd0db19c42179bff3134f3e96ae652a

Steps to reproduce:

  1. Install the GPU Operator
  2. Install the NVIDIA DRA driver
  3. Create a Deployment that uses DRA allocation:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
        - name: ts-ctr0
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
        - name: ts-ctr1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
      resourceClaims:
        - name: shared-gpus
          resourceClaimName: timeslicing
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

  4. Check that the pods are running
  5. Observe the GPU node status; after a while it changes to NotReady
  6. After a while the node returns to Ready, then changes back to NotReady again
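For steps 5 and 6, a minimal way to observe the status flapping is to watch the node list and filter for NotReady nodes. The commands below are a sketch (the node name and cluster context are assumptions, not taken from this report); the `awk` filter is demonstrated on sample `kubectl get nodes` output:

```shell
# Watch node status transitions live (requires kubectl pointed at the affected cluster):
#   kubectl get nodes -w
# Check for containerd restarts in the node's events:
#   kubectl describe node <gpu-node-name> | grep -i containerd
# The filter below extracts the names of NotReady nodes from `kubectl get nodes`
# output; demonstrated here on sample output (node name is illustrative):
printf 'NAME   STATUS     ROLES   AGE   VERSION\ngke-node-1   NotReady   <none>   5d   v1.33.2-gke.1240000\n' |
  awk 'NR>1 && $2=="NotReady" {print $1}'   # prints: gke-node-1
```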

Environment:
GKE 1.33.2-gke.1240000, Image: UBUNTU_CONTAINERD
Helm versions:

 helm ls -A
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                   APP VERSION
gpu-operator            nvidia          1               2025-08-25 14:59:05.532858 +0200 CEST   deployed        gpu-operator-v25.3.2                    v25.3.2    
nvidia-dra-driver-gpu   nvidia          1               2025-08-25 15:36:56.375659 +0200 CEST   deployed        nvidia-dra-driver-gpu-25.3.0-rc.4       25.3.0-rc.4

GPU Operator configuration (workaround from NVIDIA/nvidia-container-toolkit#1222 (comment)):

toolkit:
  repository: ghcr.io/nvidia
  image: container-toolkit
  version: 8334ddec-ubuntu20.04
  env:
    - name: RUNTIME_CONFIG_SOURCES
      value: "file"

NVIDIA DRA driver configuration:

resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /run/nvidia/driver
controller:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
kubeletPlugin:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"

I also see this issue when installing GPU drivers through https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml and installing the NVIDIA Container Toolkit on the node following the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#with-apt-ubuntu-debian

Labels: kind/bug