Conversation

@guptaNswati (Contributor)

This adds a basic admission controller that intercepts a Pod's nvidia.com/gpu request and translates it to a DRA-style ResourceClaim. Details here: https://github.com/NVIDIA/cloud-native-team/issues/171

Most of the boilerplate code is taken from this webhook-demo, covering:

  • setting up the HTTPS server
  • reviewing the API request
  • responding to the API request
  • generating the TLS certs and keys

with the updated admission-controller API (k8s.io/api/admission/v1), referenced from https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/

More logic is still needed to properly handle multiple requests, cluster-wide requests, GPU sharing, and other advanced GPU-sharing features.
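For context, here is a minimal sketch of the admission flow described above (decode the AdmissionReview, inspect the Pod, respond with a JSONPatch) using k8s.io/api/admission/v1. The handler name, cert paths, and the empty patch are placeholders, not the code in this PR.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    corev1 "k8s.io/api/core/v1"
)

// serveMutate reviews the API request and responds to it with a JSONPatch.
func serveMutate(w http.ResponseWriter, r *http.Request) {
    var review admissionv1.AdmissionReview
    if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
        http.Error(w, fmt.Sprintf("decoding AdmissionReview: %v", err), http.StatusBadRequest)
        return
    }

    var pod corev1.Pod
    if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
        http.Error(w, fmt.Sprintf("decoding Pod: %v", err), http.StatusBadRequest)
        return
    }

    // The real logic builds RFC 6902 operations that remove nvidia.com/gpu
    // requests and add resourceClaims entries; left empty in this sketch.
    patch := []map[string]interface{}{}
    patchBytes, _ := json.Marshal(patch)

    patchType := admissionv1.PatchTypeJSONPatch
    review.Response = &admissionv1.AdmissionResponse{
        UID:       review.Request.UID,
        Allowed:   true,
        Patch:     patchBytes,
        PatchType: &patchType,
    }

    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(review)
}

func main() {
    // Serving over HTTPS is mandatory for admission webhooks; paths are placeholders.
    http.HandleFunc("/mutate", serveMutate)
    http.ListenAndServeTLS(":8443", "/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key", nil)
}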

copy-pr-bot bot commented Apr 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guptaNswati marked this pull request as draft April 23, 2025 19:13
Comment on lines 39 to 40
gpuClaimName = "nvidia-gpu-resourceclaim"
gpuTemplateName = "nvidia-gpu-resourceclaim-template"
Member

Should these be configurable via a config file or CDI?

}
}

// Escape "nvidia.com/gpu" for JSON Patch
Member


Questions: Are there alternatives to patching using JSON? Could we construct these patches from the object directly?

Contributor Author


I think there should be. I need to look into it. Do you see any limitations with JSON patches?
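One hedged option, assuming a diff-based approach is acceptable: mutate a deep copy of the Pod and compute the patch from the two JSON encodings instead of hand-writing escaped paths. gomodules.xyz/jsonpatch/v2 (the library controller-runtime uses for this) can do the diff; whether it fits this PR is an open question. buildPatch and the claim-appending step are illustrative only.

package webhook

import (
    "encoding/json"

    jsonpatch "gomodules.xyz/jsonpatch/v2"
    corev1 "k8s.io/api/core/v1"
)

// buildPatch (illustrative) mutates a copy of the Pod and derives the JSON
// Patch operations by diffing the original and mutated encodings.
func buildPatch(pod *corev1.Pod) ([]jsonpatch.Operation, error) {
    original, err := json.Marshal(pod)
    if err != nil {
        return nil, err
    }

    mutated := pod.DeepCopy()
    for i := range mutated.Spec.Containers {
        c := &mutated.Spec.Containers[i]
        delete(c.Resources.Requests, "nvidia.com/gpu")
        delete(c.Resources.Limits, "nvidia.com/gpu")
        // ... append resource claims to c.Resources.Claims and
        // mutated.Spec.ResourceClaims here ...
    }

    modified, err := json.Marshal(mutated)
    if err != nil {
        return nil, err
    }

    // CreatePatch returns RFC 6902 operations, so no manual "~1" escaping of
    // "nvidia.com/gpu" is needed.
    return jsonpatch.CreatePatch(original, modified)
}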

for i, c := range pod.Spec.Containers {
    foundGPU := false

    if _, ok := c.Resources.Requests["nvidia.com/gpu"]; ok {
Member


Out of scope for the initial PR, but worth noting as a follow-up / extension: this will not work for mixed MIG mode or shared resources where the resource name is NOT nvidia.com/gpu.

Contributor Author


Yes, we need to discuss how we should approach those cases.
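One possible direction for that follow-up, not a decided design: match any nvidia.com/-prefixed resource (e.g. nvidia.com/mig-1g.5gb in mixed MIG mode) instead of only the literal nvidia.com/gpu. The helper name is hypothetical.

package webhook

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
)

// nvidiaGPUResources returns the nvidia.com/* resource names and quantities
// requested by a container, covering MIG profiles and other variants.
func nvidiaGPUResources(c corev1.Container) map[corev1.ResourceName]int64 {
    found := map[corev1.ResourceName]int64{}
    for name, qty := range c.Resources.Requests {
        if strings.HasPrefix(string(name), "nvidia.com/") {
            found[name] = qty.Value()
        }
    }
    return found
}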

Member


Is it common for us to generate the certs ourselves? Would a customer provide these under certain conditions?

Contributor Author


The webhook will be part of our deployment, so the certs should be generated by us. Customers should not have to know about or deal with them, unless a restrictive enterprise environment needs to use its own certs, in which case they should be able to override with their own. We can make it configurable.
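A minimal sketch of that override: default to the certs we generate and mount, but let restrictive environments point the webhook at their own files. The flag names are placeholders, not part of this PR.

package main

import (
    "flag"
    "log"
    "net/http"
)

func main() {
    certFile := flag.String("tls-cert-file", "/etc/webhook/certs/tls.crt", "path to the TLS certificate")
    keyFile := flag.String("tls-key-file", "/etc/webhook/certs/tls.key", "path to the TLS private key")
    flag.Parse()

    // Handlers elided; only the TLS wiring is shown here.
    log.Fatal(http.ListenAndServeTLS(":8443", *certFile, *keyFile, nil))
}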

@guptaNswati (Contributor Author) commented Apr 25, 2025

Thank you for the initial review @elezar. Obvious things to improve:

  • refactor the code
  • change the logging
  • alternative to JSON patches

Next, I also want to add:

  • how to handle multiple GPU requests per pod (add a count field or multiple resourceclaims; a sketch follows this list)
  • how to handle per-node changes (using labels)

Edits:

  • refactor the code: Done
  • change the logging: Done
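
A sketch of the "multiple resourceclaims" option mentioned in the list above: read the nvidia.com/gpu quantity and emit one claim reference per GPU, mirroring the claim and template names used in this PR. The helper and its signature are illustrative, not the PR's actual code.

package webhook

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// claimsForContainer (hypothetical helper) returns one container-level claim
// reference and one pod-level resourceClaims entry per requested GPU.
func claimsForContainer(c corev1.Container) ([]corev1.ResourceClaim, []corev1.PodResourceClaim) {
    qty, ok := c.Resources.Requests["nvidia.com/gpu"]
    if !ok {
        return nil, nil
    }

    templateName := "nvidia-gpu-resourceclaim-template" // matches gpuTemplateName above
    var containerClaims []corev1.ResourceClaim
    var podClaims []corev1.PodResourceClaim
    for i := int64(0); i < qty.Value(); i++ {
        name := fmt.Sprintf("nvidia-gpu-resourceclaim-%d", i)
        containerClaims = append(containerClaims, corev1.ResourceClaim{Name: name})
        podClaims = append(podClaims, corev1.PodResourceClaim{
            Name:                      name,
            ResourceClaimTemplateName: &templateName,
        })
    }
    return containerClaims, podClaims
}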

})
}
}
// append one claim per GPU
Contributor Author


Test output from the updated code:

$ kubectl logs gpu-mutating-webhook-7f676685c8-8w5sf -n nvidia-dra-driver-gpu
2025/05/02 19:52:34 Handling webhook request ...
I0502 19:52:34.675245       1 main.go:49] skip mutation for { v1 pods}/UPDATE
2025/05/02 19:52:34 Webhook request handled successfully
2025/05/02 19:53:05 Handling webhook request ...
I0502 19:53:05.135191       1 main.go:89] removed container["main"].Resources.Requests: {remove /spec/containers/0/resources/requests/nvidia.com~1gpu <nil>}
I0502 19:53:05.135219       1 main.go:93] removed container["main"].Resources.Limits: {remove /spec/containers/0/resources/limits/nvidia.com~1gpu <nil>}
I0502 19:53:05.135226       1 main.go:100] created container["main"] empty claims array: {add /spec/containers/0/resources/claims []}
I0502 19:53:05.135236       1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-0]}
I0502 19:53:05.135245       1 main.go:112] added to container["main"].Resources.Claims: {add /spec/containers/0/resources/claims/- map[name:nvidia-gpu-resourceclaim-1]}
I0502 19:53:05.135249       1 main.go:123] created pod["swati-gpu-pod"] empty claims array: {add /spec/resourceClaims []}
I0502 19:53:05.135256       1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-0" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-0 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
I0502 19:53:05.135264       1 main.go:136] added ResourceClaim "nvidia-gpu-resourceclaim-1" (template="nvidia-gpu-resourceclaim-template") to "swati-gpu-pod": {add /spec/resourceClaims/- map[name:nvidia-gpu-resourceclaim-1 resourceClaimTemplateName:nvidia-gpu-resourceclaim-template]}
2025/05/02 19:53:05 Webhook request handled successfully

$ kubectl get resourceclaim 
NAME                                             STATE                AGE
swati-gpu-pod-nvidia-gpu-resourceclaim-0-x6dln   allocated,reserved   3m10s
swati-gpu-pod-nvidia-gpu-resourceclaim-1-hkbgk   allocated,reserved   3m10s

Member


Could we construct unit tests that exercise the same logic?
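
One rough shape for such a test, assuming the claim-generation logic is factored into a small helper like the claimsForContainer sketch earlier in this thread (the helper and its signature are hypothetical):

package webhook

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func TestClaimsForContainer(t *testing.T) {
    c := corev1.Container{
        Name: "main",
        Resources: corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                "nvidia.com/gpu": resource.MustParse("2"),
            },
        },
    }

    // claimsForContainer is the hypothetical helper sketched above.
    containerClaims, podClaims := claimsForContainer(c)
    if len(containerClaims) != 2 || len(podClaims) != 2 {
        t.Fatalf("expected 2 claims, got %d container and %d pod claims",
            len(containerClaims), len(podClaims))
    }
    if podClaims[0].ResourceClaimTemplateName == nil ||
        *podClaims[0].ResourceClaimTemplateName != "nvidia-gpu-resourceclaim-template" {
        t.Errorf("unexpected pod resourceClaims entry: %+v", podClaims[0])
    }
    // Further assertions would check that nvidia.com/gpu requests/limits are
    // removed, mirroring the webhook log output above.
}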

gopkg.in/yaml.v3
# k8s.io/api v0.32.0
## explicit; go 1.23.0
k8s.io/api/admission/v1
Member


We probably need to also add the vendor/k8s.io/api/admission/ folder to the change set.

klueska added this to the unscheduled milestone Aug 13, 2025
klueska added the feature (issue/PR that proposes a new feature or functionality) label Aug 13, 2025