Unable to use DRA with MIG on earlier used GPU #473

@kasia-kujawa

Description

I can successfully use DRA with MIG on a new node (configured as described below), but I cannot use DRA with MIG on a GPU that was previously used by a Pod requesting a GPU through DRA. The kubelet plugin fails with:

kubectl logs -n nvidia nvidia-dra-driver-gpu-kubelet-plugin-dtg7h -p
Defaulted container "compute-domains" out of: compute-domains, gpus, init-container (init)
I0818 20:23:58.526514       1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0818 20:23:58.526580       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0818 20:23:58.526589       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I0818 20:23:58.526595       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0818 20:23:58.526601       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0818 20:23:58.553994       1 mount_linux.go:324] Detected umount with safe 'not mounted' behavior
I0818 20:23:58.555460       1 device_state.go:71] using devRoot=/driver-root
Error: error creating driver: unable to create base CDI spec file: unable to get all GPU device specs: no NVIDIA device nodes found
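
For context, the error above is the kubelet plugin failing while generating its base CDI spec because it finds no NVIDIA device files under /dev. A minimal diagnostic sketch, assuming shell access to the affected node (the /dev paths are the standard NVIDIA device-node locations, not something shown in this issue):

```shell
# Check for the NVIDIA device nodes the kubelet plugin enumerates when it
# builds its base CDI spec; "no NVIDIA device nodes found" means none exist.
if ls /dev/nvidia* >/dev/null 2>&1; then
  dev_status="NVIDIA device nodes present"
  ls /dev/nvidia*
else
  dev_status="no /dev/nvidia* device nodes on this node"
fi
echo "$dev_status"
```

After enabling MIG and rebooting, the /dev/nvidia* nodes typically do not reappear until the driver is re-initialized (for example by running `sudo nvidia-smi` once, or by the driver installer running again), which would be consistent with the crash loop described in the reproduction steps below.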

Environment:

Kubernetes version: 1.33.2-gke.1240000
Image type: UBUNTU_CONTAINERD
GPU: A100

GPU driver installation:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
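
Before installing the DRA driver it may help to confirm the installer pods are healthy. A hypothetical check (the kube-system namespace and the k8s-app=nvidia-driver-installer label are assumptions based on the GKE manifest, not quoted in this issue):

```shell
# Sanity check: confirm the GKE driver-installer DaemonSet pods are running
# before installing the DRA driver. Guarded so it degrades gracefully where
# kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  installer_out=$(kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer 2>&1)
else
  installer_out="kubectl not available in this shell"
fi
echo "$installer_out"
```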

DRA helm chart installation:

cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /opt/nvidia

controller:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
kubeletPlugin:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
EOF

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.0-rc.4" \
  --namespace nvidia \
  -f dra_values.yaml
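
A quick post-install sanity check may be useful here; this is a sketch, not part of the original report. `kubectl get resourceslices` uses the standard resource.k8s.io API and lists the devices a DRA driver advertises to the scheduler:

```shell
# Verify the DRA driver pods came up and that the driver is publishing
# devices. Guarded so it degrades gracefully where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n nvidia
  # ResourceSlices list the devices the DRA driver advertises:
  kubectl get resourceslices
  dra_check="done"
else
  dra_check="kubectl not available in this shell"
fi
echo "$dra_check"
```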

Steps to reproduce

  1. Create some Pods that request a GPU through DRA, e.g.:
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
        - name: ts-ctr0
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
        - name: ts-ctr1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
      resourceClaims:
        - name: shared-gpus
          resourceClaimName: timeslicing
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
  2. Check that all pods are running, then delete all resources created in step 1.
  3. Configure MIG on the node:
sudo nvidia-smi -mig 1
sudo reboot


wget https://github.com/NVIDIA/mig-parted/releases/download/v0.12.1/nvidia-mig-manager-0.12.1-1.x86_64.tar.gz && \
  tar -xzf nvidia-mig-manager-0.12.1-1.x86_64.tar.gz
cd nvidia-mig-manager-0.12.1-1 && \
  wget https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/heads/main/demo/specs/quickstart/mig-parted-config.yaml && \
  sudo -E ./nvidia-mig-parted apply -f mig-parted-config.yaml -c half-balanced

  4. Verify the MIG configuration; it should be:
nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab966f59-814f-63b9-35b2-b44c17c104d5)
  MIG 3g.20gb     Device  0: (UUID: MIG-0be7b14e-f145-573e-b3c6-ac58c99f34e1)
  MIG 2g.10gb     Device  1: (UUID: MIG-f6e8a7c9-9b65-55d1-a7c9-040decbb3765)
  MIG 1g.5gb      Device  2: (UUID: MIG-563be5fe-733e-5ce1-ab54-baaa523cff28)
  MIG 1g.5gb      Device  3: (UUID: MIG-ba84be96-dc4a-5792-ac4f-139f79bf6e7b)
  5. Check that the nvidia-dra-driver-gpu-kubelet-plugin Pod is constantly restarting.
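
The restart loop in step 5 can be observed like this (a sketch; the pod name is taken from the log snippet at the top of this issue and will differ per cluster — the `gpus` container name comes from the "Defaulted container" line in those logs):

```shell
# Watch the RESTARTS counter climb and pull the previous (crashed) container
# instance's logs to confirm the "no NVIDIA device nodes found" error.
# Guarded so it degrades gracefully where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n nvidia
  kubectl logs -n nvidia nvidia-dra-driver-gpu-kubelet-plugin-dtg7h -c gpus -p
  loop_check="done"
else
  loop_check="kubectl not available in this shell"
fi
echo "$loop_check"
```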

Labels

kind/bug