I'm able to successfully use DRA with MIG on a new node (configured as described below), but I'm not able to use DRA with MIG on a GPU that was previously used by a Pod requesting a GPU through DRA.
kubectl logs -n nvidia nvidia-dra-driver-gpu-kubelet-plugin-dtg7h -p
Defaulted container "compute-domains" out of: compute-domains, gpus, init-container (init)
I0818 20:23:58.526514 1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0818 20:23:58.526580 1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=false
I0818 20:23:58.526589 1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I0818 20:23:58.526595 1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=false
I0818 20:23:58.526601 1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0818 20:23:58.553994 1 mount_linux.go:324] Detected umount with safe 'not mounted' behavior
I0818 20:23:58.555460 1 device_state.go:71] using devRoot=/driver-root
Error: error creating driver: unable to create base CDI spec file: unable to get all GPU device specs: no NVIDIA device nodes found
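The "no NVIDIA device nodes found" error suggests the plugin sees no /dev/nvidia* entries under its devRoot after the MIG reconfiguration. A minimal sketch for checking this on the node (my assumption about the failure, not taken from the driver source; the helper name is made up):

```shell
# Count /dev/nvidia* character devices under a given device root.
# On the node itself, call it with "/"; inside the plugin container,
# the devRoot from the log above is /driver-root/.
check_nvidia_dev_nodes() {
  local devroot="${1:-/}"
  local count
  count=$(find "${devroot}dev" -maxdepth 1 -name 'nvidia*' 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "no NVIDIA device nodes under ${devroot}dev"
    return 1
  fi
  echo "found $count NVIDIA device node(s) under ${devroot}dev"
}
```

If this reports zero nodes after the reboot, re-running `nvidia-smi` once (which recreates the device nodes) may be enough to let the plugin start.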
Environment:
Kubernetes version: 1.33.2-gke.1240000
Image type: UBUNTU_CONTAINERD
GPU: A100
GPU driver installation:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
DRA helm chart installation:
cat <<EOF > dra_values.yaml
resources.gpus.enabled: "true"
gpuResourcesEnabledOverride: "true"
nvidiaDriverRoot: /opt/nvidia
controller:
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
kubeletPlugin:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "nvidia.com/gpu.present"
            operator: "Exists"
EOF
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.3.0-rc.4" \
--namespace nvidia \
-f dra_values.yaml
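Before configuring MIG, it's worth confirming the install itself came up healthy by scanning `kubectl get pods -n nvidia` for restart-looping pods. A sketch with sample output inlined for illustration (the controller pod name here is invented; the kubelet-plugin name is from the log above):

```shell
# On a live cluster, replace the inlined sample with:
#   kubectl get pods -n nvidia | awk 'NR>1 && $4+0 > 0 {print $1}'
pods='NAME                                               READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-gpu-controller-7c9d5b6f4-x2k8p   1/1     Running   0          2m
nvidia-dra-driver-gpu-kubelet-plugin-dtg7h         2/2     Running   5          2m'
# Print the names of pods whose RESTARTS column is non-zero.
restarting=$(printf '%s\n' "$pods" | awk 'NR>1 && $4+0 > 0 {print $1}')
echo "restarting: ${restarting:-none}"
```

In the failure described here, the kubelet-plugin pod is the one that shows a climbing restart count after MIG is enabled.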
Steps to reproduce
- Create some pods that request a GPU through DRA, e.g.:
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test5
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  namespace: gpu-test5
  name: timeslicing
spec:
  devices:
    requests:
    - name: ts-gpu
      deviceClassName: gpu.nvidia.com
    config:
    - requests: ["ts-gpu"]
      opaque:
        driver: gpu.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: GpuConfig
          sharing:
            strategy: TimeSlicing
            timeSlicingConfig:
              interval: Long
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod0
  namespace: gpu-test5
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod0
  template:
    metadata:
      labels:
        app: pod0
    spec:
      containers:
      - name: ts-ctr0
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
        resources:
          claims:
          - name: shared-gpus
            request: ts-gpu
      - name: ts-ctr1
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
        command: ["bash", "-c"]
        args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
        resources:
          claims:
          - name: shared-gpus
            request: ts-gpu
      resourceClaims:
      - name: shared-gpus
        resourceClaimName: timeslicing
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
- Check that all pods are running and delete all resources created in step 1.
- Configure MIG on the node
sudo nvidia-smi -mig 1
sudo reboot
wget https://github.com/NVIDIA/mig-parted/releases/download/v0.12.1/nvidia-mig-manager-0.12.1-1.x86_64.tar.gz && \
tar -xzf nvidia-mig-manager-0.12.1-1.x86_64.tar.gz
cd nvidia-mig-manager-0.12.1-1 && \
wget https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/heads/main/demo/specs/quickstart/mig-parted-config.yaml && \
sudo -E ./nvidia-mig-parted apply -f mig-parted-config.yaml -c half-balanced
- Verify MIG configuration - it should be:
nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab966f59-814f-63b9-35b2-b44c17c104d5)
MIG 3g.20gb Device 0: (UUID: MIG-0be7b14e-f145-573e-b3c6-ac58c99f34e1)
MIG 2g.10gb Device 1: (UUID: MIG-f6e8a7c9-9b65-55d1-a7c9-040decbb3765)
MIG 1g.5gb Device 2: (UUID: MIG-563be5fe-733e-5ce1-ab54-baaa523cff28)
MIG 1g.5gb Device 3: (UUID: MIG-ba84be96-dc4a-5792-ac4f-139f79bf6e7b)
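One quick way to confirm the half-balanced profile applied is to count the MIG devices in the `nvidia-smi -L` output. A sketch, with the output from above inlined as the sample (on the node, pipe the real command instead):

```shell
smi='GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ab966f59-814f-63b9-35b2-b44c17c104d5)
  MIG 3g.20gb Device 0: (UUID: MIG-0be7b14e-f145-573e-b3c6-ac58c99f34e1)
  MIG 2g.10gb Device 1: (UUID: MIG-f6e8a7c9-9b65-55d1-a7c9-040decbb3765)
  MIG 1g.5gb Device 2: (UUID: MIG-563be5fe-733e-5ce1-ab54-baaa523cff28)
  MIG 1g.5gb Device 3: (UUID: MIG-ba84be96-dc4a-5792-ac4f-139f79bf6e7b)'
# Count lines that describe a MIG device instance.
mig_count=$(printf '%s\n' "$smi" | grep -c 'MIG .* Device')
echo "MIG devices: $mig_count"   # half-balanced on an A100-40GB yields 4
```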
- Check that the nvidia-dra-driver-gpu-kubelet-plugin Pod is constantly restarting.