Unable to use DRA with MIG on earlier used GPU #473

@kasia-kujawa

Description

I can successfully use DRA with MIG on a new node (configured as described below), but I cannot use DRA with MIG on a GPU that was previously used by a Pod requesting a GPU through DRA. The kubelet plugin fails with:

kubectl logs -n nvidia nvidia-dra-driver-gpu-kubelet-plugin-dtg7h -p
Defaulted container "compute-domains" out of: compute-domains, gpus, init-container (init)
I0818 20:23:58.526514       1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0818 20:23:58.526580       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0818 20:23:58.526589       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I0818 20:23:58.526595       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0818 20:23:58.526601       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0818 20:23:58.553994       1 mount_linux.go:324] Detected umount with safe 'not mounted' behavior
I0818 20:23:58.555460       1 device_state.go:71] using devRoot=/driver-root
Error: error creating driver: unable to create base CDI spec file: unable to get all GPU device specs: no NVIDIA device nodes found
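
For context, the error above is the kubelet plugin failing while generating its base CDI spec because it finds no NVIDIA device files under /dev. A minimal diagnostic sketch, assuming shell access to the affected node (the /dev paths are the standard NVIDIA device-node locations, not something shown in this issue):

```shell
# Check for the NVIDIA device nodes the kubelet plugin enumerates when it
# builds its base CDI spec; "no NVIDIA device nodes found" means none exist.
if ls /dev/nvidia* >/dev/null 2>&1; then
  dev_status="NVIDIA device nodes present"
  ls /dev/nvidia*
else
  dev_status="no /dev/nvidia* device nodes on this node"
fi
echo "$dev_status"
```

After enabling MIG and rebooting, the /dev/nvidia* nodes typically do not reappear until the driver is re-initialized (for example by running `sudo nvidia-smi` once, or by the driver installer running again), which would be consistent with the crash loop described in the reproduction steps below.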

Environment:

Kubernetes version: 1.33.2-gke.1240000
Image type: UBUNTU_CONTAINERD
GPU: A100

GPU driver installation:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
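
Before installing the DRA driver it may help to confirm the installer pods are healthy. A hypothetical check (the kube-system namespace and the k8s-app=nvidia-driver-installer label are assumptions based on the GKE manifest, not quoted in this issue):

```shell
# Sanity check: confirm the GKE driver-installer DaemonSet pods are running
# before installing the DRA driver. Guarded so it degrades gracefully where
# kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  installer_out=$(kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer 2>&1)
else
  installer_out="kubectl not available in this shell"
fi
echo "$installer_out"
```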

DRA helm chart installation:

cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /opt/nvidia

controller:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
kubeletPlugin:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "nvidia.com/gpu.present"
                operator: "Exists"
EOF

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.0-rc.4" \
  --namespace nvidia \
  -f dra_values.yaml
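
A quick post-install sanity check may be useful here; this is a sketch, not part of the original report. `kubectl get resourceslices` uses the standard resource.k8s.io API and lists the devices a DRA driver advertises to the scheduler:

```shell
# Verify the DRA driver pods came up and that the driver is publishing
# devices. Guarded so it degrades gracefully where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n nvidia
  # ResourceSlices list the devices the DRA driver advertises:
  kubectl get resourceslices
  dra_check="done"
else
  dra_check="kubectl not available in this shell"
fi
echo "$dra_check"
```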

Steps to reproduce

  1. Create some Pods that request a GPU through DRA, e.g.:
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
        - name: ts-ctr0
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
        - name: ts-ctr1
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
          command: ["bash", "-c"]
          args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
          resources:
            claims:
              - name: shared-gpus
                request: ts-gpu
      resourceClaims:
        - name: shared-gpus
          resourceClaimName: timeslicing
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
  2. Check that all pods are running, then delete all resources created in step 1.
  3. Configure MIG on the node:
sudo nvidia-smi -mig 1
sudo reboot


wget https://github.com/NVIDIA/mig-parted/releases/download/v0.12.1/nvidia-mig-manager-0.12.1-1.x86_64.tar.gz && \
  tar -xzf nvidia-mig-manager-0.12.1-1.x86_64.tar.gz
cd nvidia-mig-manager-0.12.1-1 && \
  wget https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/heads/main/demo/specs/quickstart/mig-parted-config.yaml && \
  sudo -E ./nvidia-mig-parted apply -f mig-parted-config.yaml -c half-balanced

  4. Verify the MIG configuration; it should be:
nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab966f59-814f-63b9-35b2-b44c17c104d5)
  MIG 3g.20gb     Device  0: (UUID: MIG-0be7b14e-f145-573e-b3c6-ac58c99f34e1)
  MIG 2g.10gb     Device  1: (UUID: MIG-f6e8a7c9-9b65-55d1-a7c9-040decbb3765)
  MIG 1g.5gb      Device  2: (UUID: MIG-563be5fe-733e-5ce1-ab54-baaa523cff28)
  MIG 1g.5gb      Device  3: (UUID: MIG-ba84be96-dc4a-5792-ac4f-139f79bf6e7b)
  5. Check that the nvidia-dra-driver-gpu-kubelet-plugin Pod is constantly restarting.
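
The restart loop in step 5 can be observed like this (a sketch; the pod name is taken from the log snippet at the top of this issue and will differ per cluster — the `gpus` container name comes from the "Defaulted container" line in those logs):

```shell
# Watch the RESTARTS counter climb and pull the previous (crashed) container
# instance's logs to confirm the "no NVIDIA device nodes found" error.
# Guarded so it degrades gracefully where kubectl is unavailable.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n nvidia
  kubectl logs -n nvidia nvidia-dra-driver-gpu-kubelet-plugin-dtg7h -c gpus -p
  loop_check="done"
else
  loop_check="kubectl not available in this shell"
fi
echo "$loop_check"
```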

Labels

kind/bug