Draft: GPU Mutating Webhook #326
base: main
Conversation
cmd/gpu-mutating-webhook/main.go (outdated)
```go
gpuClaimName    = "nvidia-gpu-resourceclaim"
gpuTemplateName = "nvidia-gpu-resourceclaim-template"
```
Should these be configurable via a config file or CDI?
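One possible direction, as a minimal sketch: expose the names as command-line flags (the flag names and wiring here are assumptions, not part of this PR), which a config file or Helm values could later feed:

```go
package main

import "flag"

// Hypothetical flags overriding the currently hard-coded names; the
// defaults mirror the constants in main.go.
var (
	gpuClaimName = flag.String("gpu-claim-name",
		"nvidia-gpu-resourceclaim", "base name for generated ResourceClaims")
	gpuTemplateName = flag.String("gpu-template-name",
		"nvidia-gpu-resourceclaim-template", "ResourceClaimTemplate to reference")
)
```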
cmd/gpu-mutating-webhook/main.go (outdated)
```go
	}
}

// Escape "nvidia.com/gpu" for JSON Patch
```
Questions: Are there alternatives to patching using JSON? Could we construct these patches from the object directly?
I think there should be; I need to look into it. Do you see any limitations with JSON patches?
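One note while we look into it: the AdmissionResponse API currently only supports PatchTypeJSONPatch, so some form of JSON Patch is required in the response, though the ops can be built from typed structs rather than raw strings. A minimal sketch (the names here are illustrative, not from the PR):

```go
package main

import "strings"

// patchOp is one RFC 6902 JSON Patch operation; building ops as typed
// values keeps the path escaping in one place.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// escapeJSONPointer applies RFC 6901 escaping ("~" -> "~0", "/" -> "~1"),
// which is why "nvidia.com/gpu" shows up as "nvidia.com~1gpu" in the
// webhook logs further down this thread.
func escapeJSONPointer(s string) string {
	s = strings.ReplaceAll(s, "~", "~0")
	return strings.ReplaceAll(s, "/", "~1")
}
```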
cmd/gpu-mutating-webhook/main.go (outdated)
```go
for i, c := range pod.Spec.Containers {
	foundGPU := false

	if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
```
Out of scope for the initial PR, but worth noting as a follow-up / extension: this will not work for mixed MIG mode or shared resources where the resource name is NOT nvidia.com/gpu.
Yes, we need to discuss how we should approach those cases. One possible starting point is sketched below.
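A sketch of one approach, assuming MIG-style names follow the usual nvidia.com/ prefix (the helper is hypothetical, not part of this PR): match any extended resource under the nvidia.com/ domain rather than only the literal nvidia.com/gpu name.

```go
package main

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isNvidiaGPUResource reports whether a resource name looks like an
// NVIDIA-provided extended resource, e.g. "nvidia.com/gpu" or a MIG
// profile such as "nvidia.com/mig-1g.5gb". Hypothetical helper.
func isNvidiaGPUResource(name corev1.ResourceName) bool {
	return strings.HasPrefix(string(name), "nvidia.com/")
}
```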
Is it common for us to generate the certs ourselves? Would a customer provide these under certain conditions?
The webhook will be our deployment, so the certs should be generated by us. Customers should not have to know about or deal with them, except in restrictive enterprise environments that need to use their own certs, in which case they should be able to override ours. We can make it configurable, e.g. along the lines of the sketch below.
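A minimal sketch of that override, assuming flag names of our own choosing (both flags are assumptions): generated certs are mounted at a default path, and a customer can point the flags at their own files.

```go
package main

import (
	"crypto/tls"
	"flag"
	"log"
	"net/http"
)

func main() {
	// Hypothetical flags; the defaults would be the paths where our own
	// generated certs (e.g. from cert-manager or an init job) are mounted.
	certFile := flag.String("tls-cert-file", "/etc/webhook/certs/tls.crt", "serving certificate")
	keyFile := flag.String("tls-key-file", "/etc/webhook/certs/tls.key", "serving key")
	flag.Parse()

	server := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	log.Fatal(server.ListenAndServeTLS(*certFile, *keyFile))
}
```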
Thank you for the initial review @elezar. Obvious things to improve:
Next I also want to add:
Edits:
Signed-off-by: Swati Gupta <[email protected]>
```go
	})
}
}
// append one claim per GPU
```
Test run with the updated code:
$ kubectl logs gpu-mutating-webhook-7f676685c8-8w5sf -n nvidia-dra-driver-gpu
2025/05/02 19:52:34 Handling webhook request ...
I0502 19:52:34.675245 1 main.go:49] skip mutation for { v1 pods}/UPDATE
2025/05/02 19:52:34 Webhook request handled successfully
2025/05/02 19:53:05 Handling webhook request ...
I0502 19:53:05.135191 1 main.go:89] removed container["main"].Resources.Requests: {remove /spec/containers/0/resources/requests/nvidia.com~1gpu <nil>}
I0502 19:53:05.135219 1 main.go:93] removed container["main"].Resources.Limits: {remove /spec/containers/0/resources/limits/nvidia.com~1gpu <nil>}
I0502 19:53:05.135226 1 main.go:100] created container["main"] empty claims array: {add /spec/containers/0/resources/claims []}
I0502 19:53:05.135236 1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-0]}
I0502 19:53:05.135245 1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-1]}
I0502 19:53:05.135249 1 main.go:123] created pod["swati-gpu-pod"] empty claims array: {add /spec/resourceClaims []}
I0502 19:53:05.135256 1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-0" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-0 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
I0502 19:53:05.135264 1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-1" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-1 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
2025/05/02 19:53:05 Webhook request handled successfully
$ kubectl get resourceclaim
NAME STATE AGE
swati-gpu-pod-nvidia-gpu-resourceclaim-0-x6dln allocated,reserved 3m10s
swati-gpu-pod-nvidia-gpu-resourceclaim-1-hkbgk allocated,reserved 3m10s
Could we construct unit tests that exercise the same logic?
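For illustration, a sketch of what such a test could look like (the mutatePod function name and the expected op count are assumptions based on the log output above, not the actual code):

```go
package main

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestMutatePodTranslatesGPURequests(t *testing.T) {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "swati-gpu-pod"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "main",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("2"),
					},
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("2"),
					},
				},
			}},
		},
	}

	// mutatePod is assumed to return the JSON Patch ops shown in the
	// webhook logs above.
	patches, err := mutatePod(pod)
	if err != nil {
		t.Fatalf("mutatePod: %v", err)
	}

	// 2 removes + 2 empty-array creates + 2 container claim adds +
	// 2 pod resourceClaim adds, matching the eight ops in the log above.
	want := 8
	if len(patches) != want {
		t.Fatalf("got %d patch ops, want %d", len(patches), want)
	}
}
```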
```
gopkg.in/yaml.v3
# k8s.io/api v0.32.0
## explicit; go 1.23.0
k8s.io/api/admission/v1
```
We probably need to also add the vendor/k8s.io/api/admission/ folder to the change set.
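If the repo vendors dependencies, re-running the standard vendoring command should pick that folder up:
$ go mod vendor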
This adds a basic admission controller that intercepts a Pod's nvidia.com/gpu requests and translates them to DRA-style ResourceClaims. Details here: https://github.com/NVIDIA/cloud-native-team/issues/171
Most of the boilerplate code is taken from the webhook-demo referenced in https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/, updated to the current admission-controller API: k8s.io/api/admission/v1.
We still need to add logic for properly handling multiple requests, cluster-wide requests, GPU sharing, and other advanced GPU features. A sketch of the overall request/response flow follows.
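For context on the shape of such a controller, a minimal sketch of the flow (all names are illustrative; buildPatches stands in for the real translation logic in cmd/gpu-mutating-webhook/main.go):

```go
package main

import (
	"encoding/json"
	"io"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// buildPatches is a stand-in for the real translation logic, which would
// emit the remove/add JSON Patch ops shown in the logs above.
func buildPatches(pod *corev1.Pod) []interface{} { return []interface{}{} }

// serveMutate decodes an AdmissionReview, builds a JSON Patch for the
// Pod, and writes the review back with the response populated.
func serveMutate(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	patch, _ := json.Marshal(buildPatches(&pod))
	patchType := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &patchType,
	}

	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}
```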