DRA driver fails with P4 #222

@johnbelamaric

I am working on a demo of using DRA to have a deployment with mixed GPU models. Trying it against a GKE node with P4 GPUs results in a failure in NodePrepareResources.

Pods:

[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k get po
NAME                           READY   STATUS              RESTARTS   AGE
ccc-gpu-5969dcb484-c9gkd       1/1     Running             0          21m
ccc-gpu-67c77c9bdf-hlgwt       0/1     ContainerCreating   0          10m
ccc-gpu-deb-5f54d99d8f-zbb5p   1/1     Running             0          12m
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$ k describe po ccc-gpu-67c77c9bdf-hlgwt

IPs:              <none>
Controlled By:    ReplicaSet/ccc-gpu-67c77c9bdf
Containers:
  ctr:
    Container ID:
    Image:         ubuntu:22.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      while [ 1 ]; do date; echo $(nvidia-smi -L || echo Waiting...); sleep 60; done
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2vpsr (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-2vpsr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              cloud.google.com/compute-class=inference-1x8x24
Tolerations:                 cloud.google.com/compute-class=inference-1x8x24:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                         Age               From               Message
  ----     ------                         ----              ----               -------
  Normal   Scheduled                      10m               default-scheduler  Successfully assigned default/ccc-gpu-67c77c9bdf-hlgwt to gke-drabeta-n1-standard-4-4xp4-f7feecbe-4h4q
  Warning  FailedPrepareDynamicResources  1s (x8 over 10m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim default/ccc-gpu-67c77c9bdf-hlgwt-gpu-428dx: error preparing devices for claim e9504d19-2894-4331-aa0b-2c4536de9322: prepare devices failed: error applying GPU config: error setting timeslice config for requests '[gpu gpu gpu gpu]' in claim 'e9504d19-2894-4331-aa0b-2c4536de9322': error setting time slice: error running nvidia-smi: exit status 3
[hi on] jbelamaric@jbelamaric:~/proj/gh/johnbelamaric/GoogleCloudPlatform/kubernetes-engine-samples/autoscaling/custom-compute-classes/dra$

Looking at the driver log:

I1220 17:52:58.757018       1 driver.go:97] NodePrepareResource is called: number of claims: 1
E1220 17:53:10.813853       1 nvlib.go:534]
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported
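
The log suggests the failure comes from applying the `Default` timeslice policy to a GPU that does not support time-slicing at all. The `exit status 3` in the event is consistent with that: nvidia-smi documents return code 3 as the requested operation not being available on the target device. One possible way to tolerate this is sketched below in Go. It is an illustration under stated assumptions, not the driver's actual code: `ErrNotSupported`, `timeSliceSetter`, and `applyTimeSlice` are hypothetical names, and the sketch assumes that skipping a GPU is acceptable when the requested policy is `Default`, since there is then nothing to change.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotSupported is a hypothetical sentinel standing in for the
// "Not Supported" result seen in the driver log.
var ErrNotSupported = errors.New("not supported")

// timeSliceSetter is a hypothetical hook for whatever actually applies the
// policy to one GPU (for example, a wrapper around nvidia-smi).
type timeSliceSetter func(gpuIndex int, policy string) error

// applyTimeSlice applies policy to each GPU in gpus. If a GPU reports that
// the operation is unsupported and the requested policy is "Default", the
// GPU is skipped (assumption: the default policy needs no change). Any other
// failure is returned, wrapped with the GPU index and policy.
func applyTimeSlice(set timeSliceSetter, gpus []int, policy string) ([]int, error) {
	var skipped []int
	for _, idx := range gpus {
		err := set(idx, policy)
		switch {
		case err == nil:
			// Policy applied; nothing more to do for this GPU.
		case errors.Is(err, ErrNotSupported) && policy == "Default":
			skipped = append(skipped, idx)
		default:
			return skipped, fmt.Errorf("setting timeslice %q for GPU %d: %w", policy, idx, err)
		}
	}
	return skipped, nil
}
```

With a setter that always returns `ErrNotSupported`, as the P4 appears to, a `Default` request over four GPUs would return all four indices as skipped and no error, while any non-default policy would still fail loudly.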

Labels: kind/bug
