GPU sharing on cuda compute capability >=7.5 by guptaNswati · Pull Request #231 · kubernetes-sigs/dra-driver-nvidia-gpu

guptaNswati · 2025-01-24T00:12:18Z

This adds a check that allows GPU sharing only on GPUs with a CUDA compute capability of 7.5 or higher. On GPUs below that threshold, both time-slicing and MPS are skipped. It references the two issues and the related MR listed below:

#41
https://github.com/NVIDIA/cloud-native-team/issues/97
https://github.com/NVIDIA/cloud-native-team/issues/96

Tested on GeForce 980 and Titan.

Logs when called on incompatible GPUs:
$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xbnr2 -n nvidia

I0130 23:08:07.073619       1 driver.go:108] NodeUnprepareResource is called: number of claims: 1
E0130 23:08:07.123606       1 nvlib.go:534] 
Failed to set timeslice policy with value Default for GPU 0 : Not Supported
Failed to set timeslice for requested devices : Not Supported

No MPS server is running:
$ kubectl apply -f demo/specs/quickstart/gpu-test-mps.yaml

$ kubectl get pods -A
NAMESPACE            NAME                                                           READY   STATUS              RESTARTS   AGE
gpu-test-mps         test-pod                                                       0/2     ContainerCreating   0          31m
kube-system          coredns-668d6bf9bc-hwhxl                                       1/1     Running             0          34m
kube-system          coredns-668d6bf9bc-rb964                                       1/1     Running             0          34m
kube-system          etcd-k8s-dra-driver-cluster-control-plane                      1/1     Running             0          34m
kube-system          kindnet-gxfdc                                                  1/1     Running             0          34m
kube-system          kindnet-r88xt                                                  1/1     Running             0          34m
kube-system          kube-apiserver-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
kube-system          kube-controller-manager-k8s-dra-driver-cluster-control-plane   1/1     Running             0          34m
kube-system          kube-proxy-m7m4t                                               1/1     Running             0          34m
kube-system          kube-proxy-tx7bp                                               1/1     Running             0          34m
kube-system          kube-scheduler-k8s-dra-driver-cluster-control-plane            1/1     Running             0          34m
local-path-storage   local-path-provisioner-58cc7856b6-x77dz                        1/1     Running             0          34m
nvidia               nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-66wkq    1/1     Running             0          32m
nvidia               nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg          1/1     Running             0          32m

$ kubectl logs nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-vqhfg  -n nvidia
I0131 00:51:41.457384       1 device_state.go:73] using devRoot=/driver-root
I0131 00:52:26.105473       1 driver.go:97] NodePrepareResource is called: number of claims: 1
I0131 00:53:34.078698       1 driver.go:97] NodePrepareResource is called: number of claims: 1

$ kubectl get pods -n gpu-test-mps
NAME       READY   STATUS              RESTARTS   AGE
test-pod   0/2     ContainerCreating   0          29m

$ kubectl describe pod test-pod -n gpu-test-mps
Warning  FailedPrepareDynamicResources  31s (x25 over 30m)  kubelet            Failed to prepare dynamic resources: NodePrepareResources failed for claim gpu-test-mps/test-pod-shared-gpu-wfk6r: error preparing devices for claim 84b5789b-1f09-4d93-a3d3-a9fb61542cf9: prepare devices failed: error applying GPU config: GPU sharing is not available on this device UUID=GPU-34e8d7ba-0e4d-ac00-6852-695d5d404f51

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati · 2025-01-31T01:30:36Z

cc @elezar PTAL as you also reviewed #58

elezar · 2025-02-03T10:53:48Z

Thanks @guptaNswati. I will need to check how this differs from #58.

elezar · 2025-02-03T14:16:20Z

+		if deviceType.Gpu != nil {
+			cudaCCv := "v" + strings.TrimPrefix(deviceType.Gpu.cudaComputeCapability, "v")
+			gpuUUID := deviceType.Gpu.UUID
+			if semver.Compare(semver.Canonical(cudaCCv), semver.Canonical("v7.5")) >= 0 {


@guptaNswati where does the v7.5 threshold come from? In #58 we check for >= v7.0 and for MPS specifically, v3.5 is mentioned.

guptaNswati · 2025-02-03

I picked it from our device-plugin code, which checks if it's Volta: https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/device.go#L51
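To make the threshold question concrete, here is a minimal sketch of a compute-capability comparison that parses the major and minor parts numerically. It is not the code in this PR, which uses `semver.Compare`; the helper names `parseComputeCapability` and `atLeast` are hypothetical. The point it illustrates is that whichever threshold is chosen (7.5, 7.0, or 3.5), the comparison should be numeric per component, since a plain string comparison would order a value such as "10.0" before "7.5".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseComputeCapability splits a string such as "7.5" or "v7.5" into its
// major and minor components. Hypothetical helper, not part of the driver.
func parseComputeCapability(cc string) (major, minor int, err error) {
	parts := strings.Split(strings.TrimPrefix(cc, "v"), ".")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid compute capability %q", cc)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("invalid major version in %q: %w", cc, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("invalid minor version in %q: %w", cc, err)
	}
	return major, minor, nil
}

// atLeast reports whether compute capability cc is >= minimum, comparing
// the major component first and the minor component second.
func atLeast(cc, minimum string) (bool, error) {
	major, minor, err := parseComputeCapability(cc)
	if err != nil {
		return false, err
	}
	minMajor, minMinor, err := parseComputeCapability(minimum)
	if err != nil {
		return false, err
	}
	if major != minMajor {
		return major > minMajor, nil
	}
	return minor >= minMinor, nil
}

func main() {
	// The 7.5 threshold is the one proposed in this PR; it is under discussion.
	for _, cc := range []string{"5.2", "7.0", "7.5", "10.0"} {
		ok, err := atLeast(cc, "7.5")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("compute capability %s >= 7.5: %t\n", cc, ok)
	}
}
```

Keeping the threshold as a parameter rather than a literal makes it cheap to change once the right value is settled.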

elezar · 2025-02-03T14:20:51Z

+	// allow devices only with CUDA compute capability >= 7.5, as time-slicing and MPS do not work with older architectures
+	shareableAllocatableDevices := make(AllocatableDevices)
+	for device, deviceType := range allocatableDevices {
+		if deviceType.Gpu != nil {


Does this mean that we don't timeslice MIG devices?

In general, does it make sense to factor these checks into a function where we can better test the various combinations of options?

cyclinder · 2025-03-12

These changes are also needed in the unprepare function.

elezar · 2025-02-03T14:22:14Z

 		}
-		mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), allocatableDevices)
+
+		mpsControlDaemon := s.mpsManager.NewMpsControlDaemon(string(claim.UID), shareableAllocatableDevices)


Should we distinguish between timeslicing-sharable and MPS-sharable devices?
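One way to read the two review suggestions (factor the checks into a testable function, and distinguish time-slicing from MPS) is a per-strategy lookup like the sketch below. All names here (`SharingStrategy`, `computeCapability`, `supportsSharing`) are hypothetical and do not mirror the driver's types. The threshold values are placeholders taken from numbers mentioned in this thread (7.0 and 3.5); assigning 7.0 to time-slicing is an assumption, and neither value has been verified. MIG devices are not modelled.

```go
package main

import "fmt"

// SharingStrategy names a GPU sharing mode. Hypothetical type for illustration.
type SharingStrategy string

const (
	TimeSlicing SharingStrategy = "TimeSlicing"
	MPS         SharingStrategy = "MPS"
)

// computeCapability is a parsed CUDA compute capability (major.minor).
type computeCapability struct {
	Major, Minor int
}

// atLeast reports whether c is greater than or equal to o.
func (c computeCapability) atLeast(o computeCapability) bool {
	if c.Major != o.Major {
		return c.Major > o.Major
	}
	return c.Minor >= o.Minor
}

// minimumCapability holds one threshold per strategy. The values are
// PLACEHOLDERS based on numbers mentioned in the review thread and are
// not confirmed minimums.
var minimumCapability = map[SharingStrategy]computeCapability{
	TimeSlicing: {Major: 7, Minor: 0},
	MPS:         {Major: 3, Minor: 5},
}

// supportsSharing reports whether a GPU with the given compute capability
// meets the placeholder threshold for the requested strategy.
func supportsSharing(strategy SharingStrategy, cc computeCapability) (bool, error) {
	minimum, ok := minimumCapability[strategy]
	if !ok {
		return false, fmt.Errorf("unknown sharing strategy %q", strategy)
	}
	return cc.atLeast(minimum), nil
}

func main() {
	gpu := computeCapability{Major: 5, Minor: 2}
	for _, s := range []SharingStrategy{TimeSlicing, MPS} {
		ok, err := supportsSharing(s, gpu)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s on %d.%d: supported=%t\n", s, gpu.Major, gpu.Minor, ok)
	}
}
```

Because the function takes plain values and returns a boolean, the combinations of strategy and device can be covered with a table-driven test without touching NVML.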

klueska · 2025-06-16T17:27:33Z

I don't think we should silently ignore requests to do time-slicing.

The way I'd like to see this take form is:

- If no time-slicing config is specified, don't attempt to call any time-slicing APIs
- If a user explicitly asks for Time-Slicing in the ResourceClaim.config, but the GPU doesn't support it, we error out
- If a user explicitly asks for Time-Slicing in the ResourceClaim.config, and the GPU does support it, we honour it
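The three rules above can be sketched as a small decision function. This is an illustration only, under the assumption that "no config specified" is represented by a nil pointer; `TimeSlicingConfig`, `ErrTimeSlicingUnsupported`, `applyTimeSlicing`, and the `setTimeSlice` callback are hypothetical names, not the driver's API.

```go
package main

import (
	"errors"
	"fmt"
)

// TimeSlicingConfig is a hypothetical stand-in for the time-slicing settings
// a user may supply in the ResourceClaim config.
type TimeSlicingConfig struct {
	Interval string
}

// ErrTimeSlicingUnsupported is returned when time-slicing is explicitly
// requested for a GPU that does not support it.
var ErrTimeSlicingUnsupported = errors.New("time-slicing requested but not supported by this GPU")

// applyTimeSlicing implements the three rules:
//  1. no config: do not call any time-slicing API;
//  2. config present, GPU unsupported: return an error;
//  3. config present, GPU supported: apply the requested setting.
//
// setTimeSlice is a hypothetical callback that would wrap the real call.
func applyTimeSlicing(cfg *TimeSlicingConfig, supported bool, setTimeSlice func(interval string) error) error {
	if cfg == nil {
		return nil
	}
	if !supported {
		return ErrTimeSlicingUnsupported
	}
	return setTimeSlice(cfg.Interval)
}

func main() {
	set := func(interval string) error {
		fmt.Println("setting time-slice interval:", interval)
		return nil
	}
	cfg := &TimeSlicingConfig{Interval: "Default"}

	fmt.Println(applyTimeSlicing(nil, false, set)) // rule 1: nothing is called
	fmt.Println(applyTimeSlicing(cfg, false, set)) // rule 2: explicit error
	fmt.Println(applyTimeSlicing(cfg, true, set))  // rule 3: setting is applied
}
```

Returning an error in the second case, instead of filtering the device out, is what keeps an explicit request from being silently ignored.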

guptaNswati · 2025-06-17T22:26:08Z

> I don't think we should silently ignore requests to do time-slicing.
>
> The way I'd like to see this take form is to
>
> - If no time-slicing config is specified, don't attempt to call any time-slicing APIs
> - If a user explicitly asks for Time-Slicing in the ResourceClaim.config, but the GPU doesn't support it, we error out
> - If a user explicitly asks for Time-Slicing in the ResourceClaim.config, and the GPU does support it, we honour it

Ack. Need to rewrite this.

jgehrcke · 2025-12-04T16:31:54Z

Let's close this for now; but we can (and should!) certainly pick up the ideas in here again if desired.

cyclinder · 2025-12-05T02:18:59Z

@jgehrcke Hi, can I pick this one?

guptaNswati changed the title ~~Draft:MPS on cuda compute capability >3.5~~ Draft: MPS on cuda compute capability >3.5 Jan 24, 2025

GPU sharing on cuda compute capability >=7.5

86de1cb

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati force-pushed the when-to-startMPS branch from 58f6bfa to 86de1cb Compare January 31, 2025 01:14

guptaNswati requested a review from klueska January 31, 2025 01:28

guptaNswati changed the title ~~Draft: MPS on cuda compute capability >3.5~~ GPU sharing on cuda compute capability >=7.5 Jan 31, 2025

guptaNswati requested a review from elezar January 31, 2025 01:29

elezar reviewed Feb 3, 2025

guptaNswati mentioned this pull request Jun 5, 2025

GPU sharing: revisit MPS support (change semantics of config, and daemon control) #362

Open

klueska linked an issue Jun 16, 2025 that may be closed by this pull request

GPU sharing: fix time-slicing config for old devices (Tesla P4) #363

Open

klueska added this to the v25.12.0 milestone Aug 13, 2025

klueska added the kind/bug Categorizes issue or PR as related to a bug. label Aug 13, 2025

klueska assigned guptaNswati Aug 14, 2025

klueska modified the milestones: v25.12.0, unscheduled Nov 26, 2025

jgehrcke closed this Dec 4, 2025

guptaNswati wants to merge 1 commit into kubernetes-sigs:main from guptaNswati:when-to-startMPS

5 participants