I am working on a demo that uses DRA to run a deployment across mixed GPU models. Running it against a GKE node with P4 GPUs fails in NodePrepareResources.
Pods:
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k get po
NAME READY STATUS RESTARTS AGE
ccc-gpu-5969dcb484-c9gkd 1/1 Running 0 21m
ccc-gpu-67c77c9bdf-hlgwt 0/1 ContainerCreating 0 10m
ccc-gpu-deb-5f54d99d8f-zbb5p 1/1 Running 0 12m
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k describe po ccc-gpu-67c77c9bdf-hlgwt
IPs: <none>
Controlled By: ReplicaSet/ccc-gpu-67c77c9bdf
Containers:
ctr:
Container ID:
Image: ubuntu:22.04
Image ID:
Port: <none>
Host Port: <none>
Command:
bash
-c
Args:
while [ 1 ]; do date; echo $(nvidia-smi -L || echo Waiting...); sleep 60; done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2vpsr (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-2vpsr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: cloud.google.com/compute-class=inference-1x8x24
Tolerations: cloud.google.com/compute-class=inference-1x8x24:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/ccc-gpu-67c77c9bdf-hlgwt to gke-drabeta-n1-standard-4-4xp4-f7feecbe-4h4q
Warning FailedPrepareDynamicResources 1s (x8 over 10m) kubelet Failed to prepare dynamic resources: NodePrepareResources failed for claim default/ccc-gpu-67c77c9bdf-hlgwt-gpu-428dx: error preparing devices for claim e9504d19-2894-4331-aa0b-2c4536de9322: prepare devices failed: error applying GPU config: error setting timeslice config for requests '[gpu gpu gpu gpu]' in claim 'e9504d19-2894-4331-aa0b-2c4536de9322': error setting time slice: error running nvidia-smi: exit status 3
Looking at the driver log:
I1220 17:52:58.757018 1 driver.go:97] NodePrepareResource is called: number of claims: 1
E1220 17:53:10.813853 1 nvlib.go:534]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported