Conversation

@guptaNswati (Contributor)

This adds a basic admission controller that intercepts a Pod's nvidia.com/gpu request and translates it to a DRA-style ResourceClaim. Details here: https://github.com/NVIDIA/cloud-native-team/issues/171

Most of the boilerplate code is taken from this webhook-demo, covering:

  • setting up the HTTPS server
  • reviewing the API request
  • responding to the API request
  • generating the TLS certs and keys

with the updated admission-controller API (k8s.io/api/admission/v1), referenced from https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/

More logic is still needed to properly handle multiple requests, cluster-wide requests, GPU sharing, and other advanced GPU-sharing features.
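For context, here is a minimal sketch of the admission flow described above (decode the AdmissionReview, inspect the Pod, respond with a JSONPatch) using k8s.io/api/admission/v1. The handler name, cert paths, and the empty patch are placeholders, not the code in this PR.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
)

// serveMutate reviews the API request and responds to it with a JSONPatch.
func serveMutate(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, fmt.Sprintf("decoding AdmissionReview: %v", err), http.StatusBadRequest)
        return
    }

    var pod corev1.Pod
    if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
        http.Error(w, fmt.Sprintf("decoding Pod: %v", err), http.StatusBadRequest)
        return
    }

    // The real logic builds RFC 6902 operations that remove nvidia.com/gpu
    // requests and add resourceClaims entries; left empty in this sketch.
    patch := []map[string]interface{}{}
    patchBytes, _ := json.Marshal(patch)

    patchType := admissionv1.PatchTypeJSONPatch
    review.Response = &admissionv1.AdmissionResponse{
        UID:       review.Request.UID,
        Allowed:   true,
        Patch:     patchBytes,
        PatchType: &patchType,
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(review)
}

func main() {
    // Serving over HTTPS is mandatory for admission webhooks; paths are placeholders.
    http.HandleFunc("/mutate", serveMutate)
    http.ListenAndServeTLS(":8443", "/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key", nil)
}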

copy-pr-bot bot commented Apr 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guptaNswati marked this pull request as draft April 23, 2025 19:13
Comment on lines 39 to 40
gpuClaimName = "nvidia-gpu-resourceclaim"
gpuTemplateName = "nvidia-gpu-resourceclaim-template"
Member

Should these be configurable via a config file or CDI?

}
}

// Escape "nvidia.com/gpu" for JSON Patch
Member


Questions: Are there alternatives to patching using JSON? Could we construct these patches from the object directly?

Contributor Author


I think there should be. I need to look into it. Do you see any limitations with JSON patches?
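One hedged option, assuming a diff-based approach is acceptable: mutate a deep copy of the Pod and compute the patch from the two JSON encodings instead of hand-writing escaped paths. gomodules.xyz/jsonpatch/v2 (the library controller-runtime uses for this) can do the diff; whether it fits this PR is an open question. buildPatch and the claim-appending step are illustrative only.

package webhook

import (
    "encoding/json"

    jsonpatch "gomodules.xyz/jsonpatch/v2"
    corev1 "k8s.io/api/core/v1"
)

// buildPatch (illustrative) mutates a copy of the Pod and derives the JSON
// Patch operations by diffing the original and mutated encodings.
func buildPatch(pod *corev1.Pod) ([]jsonpatch.Operation, error) {
    original, err := json.Marshal(pod)
    if err != nil {
        return nil, err
    }

    mutated := pod.DeepCopy()
    for i := range mutated.Spec.Containers {
        c := &mutated.Spec.Containers[i]
        delete(c.Resources.Requests, "nvidia.com/gpu")
        delete(c.Resources.Limits, "nvidia.com/gpu")
        // ... append resource claims to c.Resources.Claims and
        // mutated.Spec.ResourceClaims here ...
    }

    modified, err := json.Marshal(mutated)
    if err != nil {
        return nil, err
    }

    // CreatePatch returns RFC 6902 operations, so no manual "~1" escaping of
    // "nvidia.com/gpu" is needed.
    return jsonpatch.CreatePatch(original, modified)
}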

for i, c := range pod.Spec.Containers {
    foundGPU := false

    if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
Member


Out of scope for the initial PR, but worth noting as a follow-up / extension: this will not work for mixed MIG mode or shared resources where the resource name is NOT nvidia.com/gpu.

Contributor Author


Yes, we need to discuss how we should approach those cases.
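One possible direction for that follow-up, not a decided design: match any nvidia.com/-prefixed resource (e.g. nvidia.com/mig-1g.5gb in mixed MIG mode) instead of only the literal nvidia.com/gpu. The helper name is hypothetical.

package webhook

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// nvidiaGPUResources returns the nvidia.com/* resource names and quantities
// requested by a container, covering MIG profiles and other variants.
func nvidiaGPUResources(c corev1.Container) map[corev1.ResourceName]int64 {
    found := map[corev1.ResourceName]int64{}
    for name, qty := range c.Resources.Requests {
        if strings.HasPrefix(string(name), "nvidia.com/") {
            found[name] = qty.Value()
        }
    }
    return found
}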

Member


Is it common for us to generate the certs ourselves? Would a customer provide these under certain conditions?

Contributor Author


The webhook will be part of our deployment, so the certs should be generated by us. Customers should not have to know about or deal with them, unless a restrictive enterprise environment needs to use its own certs, in which case they should be able to override with their own. We can make it configurable.
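A minimal sketch of that override: default to the certs we generate and mount, but let restrictive environments point the webhook at their own files. The flag names are placeholders, not part of this PR.

package main

import (
    "flag"
    "log"
    "net/http"
)

func main() {
    certFile := flag.String("tls-cert-file", "/etc/webhook/certs/tls.crt", "path to the TLS certificate")
    keyFile := flag.String("tls-key-file", "/etc/webhook/certs/tls.key", "path to the TLS private key")
    flag.Parse()

    // Handlers elided; only the TLS wiring is shown here.
    log.Fatal(http.ListenAndServeTLS(":8443", *certFile, *keyFile, nil))
}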

@guptaNswati (Contributor Author) commented Apr 25, 2025

Thank you for the initial review @elezar. Obvious things to improve:

  • refactor the code
  • change the logging
  • alternative to JSON patches

Next, I also want to add:

  • how to handle multiple GPU requests per pod (add a count field or multiple resourceclaims; a sketch follows this list)
  • how to handle per-node changes (using labels)

Edits:

  • refactor the code: Done
  • change the logging: Done
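
A sketch of the "multiple resourceclaims" option mentioned in the list above: read the nvidia.com/gpu quantity and emit one claim reference per GPU, mirroring the claim and template names used in this PR. The helper and its signature are illustrative, not the PR's actual code.

package webhook

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// claimsForContainer (hypothetical helper) returns one container-level claim
// reference and one pod-level resourceClaims entry per requested GPU.
func claimsForContainer(c corev1.Container) ([]corev1.ResourceClaim, []corev1.PodResourceClaim) {
    qty, ok := c.Resources.Requests["nvidia.com/gpu"]
    if !ok {
        return nil, nil
    }

    templateName := "nvidia-gpu-resourceclaim-template" // matches gpuTemplateName above
    var containerClaims []corev1.ResourceClaim
    var podClaims []corev1.PodResourceClaim
    for i := int64(0); i < qty.Value(); i++ {
        name := fmt.Sprintf("nvidia-gpu-resourceclaim-%d", i)
        containerClaims = append(containerClaims, corev1.ResourceClaim{Name: name})
        podClaims = append(podClaims, corev1.PodResourceClaim{
            Name:                      name,
            ResourceClaimTemplateName: &templateName,
        })
    }
    return containerClaims, podClaims
}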

})
}
}
// append one claim per GPU
Contributor Author


Test output from the updated code:

$ kubectl logs gpu-mutating-webhook-7f676685c8-8w5sf -n nvidia-dra-driver-gpu
2025/05/02 19:52:34 Handling webhook request ...
I0502 19:52:34.675245       1 main.go:49] skip mutation for { v1 pods}/UPDATE
2025/05/02 19:52:34 Webhook request handled successfully
2025/05/02 19:53:05 Handling webhook request ...
I0502 19:53:05.135191       1 main.go:89] removed container["main"].Resources.Requests: {remove /spec/containers/0/resources/requests/nvidia.com~1gpu <nil>}
I0502 19:53:05.135219       1 main.go:93] removed container["main"].Resources.Limits: {remove /spec/containers/0/resources/limits/nvidia.com~1gpu <nil>}
I0502 19:53:05.135226       1 main.go:100] created container["main"] empty claims array: {add /spec/containers/0/resources/claims []}
I0502 19:53:05.135236       1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-0]}
I0502 19:53:05.135245       1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-1]}
I0502 19:53:05.135249       1 main.go:123] created pod["swati-gpu-pod"] empty claims array: {add /spec/resourceClaims []}
I0502 19:53:05.135256       1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-0" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-0 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
I0502 19:53:05.135264       1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-1" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-1 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
2025/05/02 19:53:05 Webhook request handled successfully

$ kubectl get resourceclaim 
NAME                                             STATE                AGE
swati-gpu-pod-nvidia-gpu-resourceclaim-0-x6dln   allocated,reserved   3m10s
swati-gpu-pod-nvidia-gpu-resourceclaim-1-hkbgk   allocated,reserved   3m10s

Member


Could we construct unit tests that exercise the same logic?
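
One rough shape for such a test, assuming the claim-generation logic is factored into a small helper like the claimsForContainer sketch earlier in this thread (the helper and its signature are hypothetical):

package webhook

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func TestClaimsForContainer(t *testing.T) {
    c := corev1.Container{
        Name: "main",
        Resources: corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                "nvidia.com/gpu": resource.MustParse("2"),
            },
        },
    }

    // claimsForContainer is the hypothetical helper sketched above.
    containerClaims, podClaims := claimsForContainer(c)
    if len(containerClaims) != 2 || len(podClaims) != 2 {
        t.Fatalf("expected 2 claims, got %d container and %d pod claims",
            len(containerClaims), len(podClaims))
    }
    if podClaims[0].ResourceClaimTemplateName == nil ||
        *podClaims[0].ResourceClaimTemplateName != "nvidia-gpu-resourceclaim-template" {
        t.Errorf("unexpected pod resourceClaims entry: %+v", podClaims[0])
    }
    // Further assertions would check that nvidia.com/gpu requests/limits are
    // removed, mirroring the webhook log output above.
}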

gopkg.in/yaml.v3
# k8s.io/api v0.32.0
## explicit; go 1.23.0
k8s.io/api/admission/v1
Member


We probably need to also add the vendor/k8s.io/api/admission/ folder to the change set.

klueska added this to the unscheduled milestone Aug 13, 2025
klueska added the feature (issue/PR that proposes a new feature or functionality) label Aug 13, 2025