-
Notifications
You must be signed in to change notification settings - Fork 94
Add label app.kubernetes.io/name:nvidia-dra-driver-gpu to all project components #381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds an app: nvidia-dra-driver-gpu label across templates, Helm charts, and controller code to simplify filtering of GPU driver components.
- Inject
AppLabelKey/AppLabelValueinto all Kubernetes template manifests - Update Helm
values.yamlforcontrollerandkubeletPluginsections - Propagate and assign new
appLabelKey/appLabelValueconstants in controller code
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| templates/compute-domain-workload-claim-template.tmpl.yaml | Added {{ .AppLabelKey }} label in metadata |
| templates/compute-domain-daemon.tmpl.yaml | Added {{ .AppLabelKey }} label in metadata |
| templates/compute-domain-daemon-claim-template.tmpl.yaml | Added {{ .AppLabelKey }} label in metadata |
| deployments/helm/nvidia-dra-driver-gpu/values.yaml | Added app: nvidia-dra-driver-gpu under controller and kubeletPlugin |
| cmd/gpu-kubelet-plugin/driver.go | Bumped import from v1beta1 → v1beta2 |
| cmd/gpu-kubelet-plugin/device_state.go | Bumped import from v1beta1 → v1beta2 |
| cmd/compute-domain-controller/resourceclaimtemplate.go | Extended ResourceClaimTemplateTemplateData with AppLabelKey/AppLabelValue |
| cmd/compute-domain-controller/daemonset.go | Extended DaemonSetTemplateData and updated buffer/variable names |
| cmd/compute-domain-controller/computedomain.go | Introduced appLabelKey & appLabelValue consts |
Comments suppressed due to low confidence (6)
cmd/compute-domain-controller/resourceclaimtemplate.go:53
- New fields
AppLabelKeyandAppLabelValuehave been added but there are no existing unit tests verifying that these values are correctly injected into generated templates. Consider adding tests for template data binding.
AppLabelValue string
templates/compute-domain-workload-claim-template.tmpl.yaml:11
- Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}
templates/compute-domain-daemon.tmpl.yaml:11
- Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}
templates/compute-domain-daemon-claim-template.tmpl.yaml:11
- Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}
deployments/helm/nvidia-dra-driver-gpu/values.yaml:64
- The
labels:block undercontrolleris indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:
deployments/helm/nvidia-dra-driver-gpu/values.yaml:87
- The
labels:block underkubeletPluginis indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:
910006f to
b7d3b0b
Compare
|
Is there a reason the existing ComputeDomainLabelKey isn't sufficient for what you are trying to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds an "app" label (with the value "nvidia-dra-driver-gpu") to various project components to facilitate easier filtering and debugging.
- Updated YAML templates to include the new app label.
- Updated Helm values to add the app label to controller and kubeletPlugin sections.
- Enhanced Go source code to pass the app label values through the necessary structures and API calls.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| templates/compute-domain-workload-claim-template.tmpl.yaml | App label added to resource metadata labels. |
| templates/compute-domain-daemon.tmpl.yaml | App label added to resource metadata labels. |
| templates/compute-domain-daemon-claim-template.tmpl.yaml | App label added to resource metadata labels. |
| deployments/helm/nvidia-dra-driver-gpu/values.yaml | Helm values updated to include app label for controller and kubeletPlugin. |
| cmd/compute-domain-controller/resourceclaimtemplate.go | New fields added for managing app label data. |
| cmd/compute-domain-controller/daemonset.go | Updated to use new app label fields and corrected variable naming for consistency. |
| cmd/compute-domain-controller/computedomain.go | Defined app label constants used across the codebase. |
We want a label that is generic, for non |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds an "app" label to various project components to improve filtering during debugging. The key changes include:
- Adding the "app" label to multiple YAML templates across compute domain and helm deployment files.
- Introducing a CLI flag ("chart-name") to propagate the label value consistently.
- Updating resource templating code to include the new AppLabelValue.
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| templates/compute-domain-workload-claim-template.tmpl.yaml | Added app label to workload claim template metadata. |
| templates/compute-domain-daemon.tmpl.yaml | Added app label to daemon template metadata. |
| templates/compute-domain-daemon-claim-template.tmpl.yaml | Added app label to daemon claim template metadata. |
| deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml | Added app label (using .Chart.Name) to kubelet plugin metadata. |
| deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml | Added app label and HELM_CHART_NAME env variable to controller metadata. |
| cmd/compute-domain-controller/resourceclaimtemplate.go | Introduced AppLabelValue in template data and assigned it from the chartName field. |
| cmd/compute-domain-controller/main.go | Added a required CLI flag for chart-name with a default value. |
| cmd/compute-domain-controller/daemonset.go | Introduced AppLabelValue in daemonset template data and assigned it from m.config.chartName. |
| cmd/compute-domain-controller/controller.go | Updated ManagerConfig and Run method to pass chartName to computedomain managers. |
| cmd/compute-domain-controller/computedomain.go | Minor formatting changes (addition of a newline) in constant definitions. |
deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml
Outdated
Show resolved
Hide resolved
deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the mps-daemon template seems to be missing
Yeah on that one, there is already an |
change it to |
MPS now included |
| Finalizer: computeDomainFinalizer, | ||
| ComputeDomainLabelKey: computeDomainLabelKey, | ||
| ComputeDomainLabelValue: cd.UID, | ||
| AppLabelValue: m.config.chartName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why call this AppLabelValue here, but HelmChartValue in:
https://github.com/NVIDIA/k8s-dra-driver-gpu/pull/381/files#diff-eb2c222c2a3789dcc8b91ee29a09d7e4c0f7e287b8331e7df07767d1661da7d6R204
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, fixed
cmd/gpu-kubelet-plugin/sharing.go
Outdated
| NodeName: m.nodeName, | ||
| MpsControlDaemonNamespace: m.namespace, | ||
| MpsControlDaemonName: m.name, | ||
| HelmChartName: m.manager.config.flags.chartName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going tow ork because you call the variable AppLabelValue in the template itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, fixed
|
Thanks!
In the PR description, can you a command (input and output) that demonstrates the problem you were addressing here? Are you saying some components have certain labels set, and others do not? I like to especially understand how these labels relate to the labels we apply via Helm templating:
Are you basically trying to bring |
| // and work queue management. | ||
| type ManagerConfig struct { | ||
| // chartName is the Helm chart name to use for the app label value | ||
| chartName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the app label value
Because I recently worked myself into the various labels we want to use here, I'd like for us to be more precise. What is the "app label"? Can we maybe just use the actual name of the label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, further below I see a literal "app" used as label name. How standard is that?
In the Helm template logic we include the Helm chart name as the value for a label named app.kuberenetes.io/name:
app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }}
| app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How standard is that?
We use it by default in all our GPU-Operator components, is quite a common practice.
In the Helm template logic we include the Helm chart name as the value for a label named app.kuberenetes.io/name:
Yes, but this value is dynamic; use could be overwritten using Helm flags. We need a constant label to look for all our components during a debugging session, or when running the must-gather.sh support script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this value is dynamic; use could be overwritten using Helm flags
About label value
I see. :) Even if nobody is really doing that (overriding the name -- is someone doing that? maybe!) I get your point. We want something predictable. OK!
About label name
I also see that app.kubernetes.io/name is clumsy. On the other hand, app is overused and super generic. Just for the exercise: do we have a better idea than 'app' that is also short/concise? Something that we would like better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Overlapping with what I commented before, with maybe a fresh perspective: I think this is probably super important, and I want to align our mental models here about problem definition, and solution space. In general, we probably can describe the problem as: we have two orthogonal templating systems (Helm templates, Go templates), and they are not yet aligned in terms of which labels they apply to resources. So, we want to achieve consistency between both. Is that right? Is that what we're aiming to do here? We're probably looking for that one command that shows all entities that were created directly or indirectly from this Helm chart. Is that your goal? If yes: that's an important goal, ack (e.g., for finding orphaned resources). What's the command that in your opinion we'd like to run for achieving this goal? |
| namespace: {{ include "nvidia-dra-driver-gpu.namespace" . }} | ||
| labels: | ||
| app: {{ .Chart.Name }} | ||
| {{- include "nvidia-dra-driver-gpu.labels" . | nindent 4 }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the patch raises two interesting points:
-
include "nvidia-dra-driver-gpu.labelsalready includes a label with the Helm chart name. That label's name isapp.kubernetes.io/name. -
We hardcode
appto be the chart name -- that is, there is no other choice. By that logic, we could remove one layer of indirection, and renameAppLabelValuetoChartNamein Go code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, yes and no. Because the CharName can be overwritten with Helm flags. What we are looking at here is a constant and persistent AppLabel that we can use to filter all our components in a cluster. Regardless of how complicated the user has installed them, a constant label gives us that single point to filter for the project components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you gave a response to my first bullet point above (but not to the second? it's fine, this question will come back later).
CharName can be overwritten with Helm flags. What we are looking at here is a constant and persistent AppLabel
I think you're saying that we should not use a label value that is or depends on Chart.Name. And I think you just convinced me of that. Because that's dynamic, and not predictable.
However, the patch currently contains: app: {{ .Chart.Name }}.
And now I am confused. I think it shows that it's good to (together) narrow down the exact purpose/goal that we want to achieve here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are totally right, how should we call that variable? AppName?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.Chart.Name cannot be overridden. It is a property of the chart. That said -- once we move this as an operand under GPU operator, the .Chart.Name will be gpu-operator-- not sure if that matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that means we will have a new entry to the assets folder on the gpu-operator repo. if so, all the assets there have the app:gpu-operator label.
but in case a user wants to deploy this stand alone, having the label as proposed in this PR still makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once we move this as an operand under GPU operator, the .Chart.Name will be gpu-operator-- not sure if that matters.
That matters!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.Chart.Name cannot be overridden.
ack -- btw, ref docs: https://helm.sh/docs/chart_template_guide/builtin_objects/
And for the record, I wondered about the helm install ... --generate-name arg which I had used, documented here: https://helm.sh/docs/helm/helm_install/
--generate-name generate the name (and omit the NAME parameter)
This is not the chart name (and these docs aren't great).
The intent of this patch is to make the DRA project behave similarly to the GPU-Operator, which has the app: gpu-operator in all components So you can look for the operator like Given that users might install the GPU-Operator and NVIDIA-DRA-Driver into a custom-named namespace, our official way of requesting debugging info from customers is asking them to run this script https://github.com/NVIDIA/gpu-operator/blob/main/hack/must-gather.sh. The idea is to have a label we can trust that will always be set to all components of this project, regardless of how or where the users deploy the project. |
Not really, as time has shown, users could deploy this without Helm, or using a very custom Helm deploy command, which would potentially overwrite our labels, as they are variable. |
|
Further above, I asked
and
and we haven't written that down yet. Let's do that! I believe that this is important for us to understand what we're really doing here. We want one command that currently has unexpected output. And after the patch, it has expected output. (this helps us to understand, for example: do we want to filter only pods, or also other objects? I was assuming that we do not only want to filter pods. In the script that you linked I see we do something like |
… components Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Thanks for confirming. Many thoughts were exchanged here, I try to summarize what I have understood so far in terms of problem/solution: (maybe I am wrong! :)) Goal / problemI think we narrowed down the goal to:
Method / solutionLet's pick a custom, but concise and expressive LABEL_NAME. Maybe I think as LABEL_VALUE we want to keep using And then apply this to all pods. ImplementationAs pragmatic as possible (I don't think we need CLI args, or anything dynamic -- just hard-code string literals in the relevant places for now). About this:
OK. But this is more of a future-proofing aspect, right? For this PR maybe we should still assume Helm installation. Right? That's important for potential changes that this PR makes to templates in |
This patch proposes the
app.kubernetes.io/namelabel to all managed components of both the computeDomain controller and the kubelet plugin, enabling better identification and organization of Kubernetes resources. The changes primarily involve adding support for theappNameconfiguration, updating templates to include the new label, and modifying Helm charts to propagate the label value.As recommended in https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels we use the label
app.kubernetes.io/nameacross all project components.kubectl get pods -lapp.kubernetes.io/name=nvidia-dra-driver-gpu -A -ojsonpath={.items[].metadata.namespace} --ignore-not-foundThen becomes sufficient to identify all pods deployed from the project, both controllers and managed pods.
This label is of interest only in pods, as objects like the ComputeDomain CRD can be easily retrieved by filtering for it's name
Will retrieve all computeDomains in the cluster.
This is important when debugging in production environments, where we don't know how the user has deployed the project initially, then having a reliable method to query all project pods becomes a handy tool.
Configuration Updates:
Added the
appNamefield toManagerConfig,DaemonSetTemplateData,ResourceClaimTemplateTemplateData, and other relevant structs to store the value for the new Kubernetes label. (cmd/compute-domain-controller/controller.go[1]cmd/compute-domain-controller/daemonset.go[2]cmd/compute-domain-controller/resourceclaimtemplate.go[3]cmd/gpu-kubelet-plugin/main.go[4]cmd/gpu-kubelet-plugin/sharing.go[5]Updated CLI flags in
main.gofiles to allow users to specify theappNamevalue via the--chart-nameor--app-nameflag, defaulting tonvidia-dra-driver-gpu. (cmd/compute-domain-controller/main.go[1]cmd/gpu-kubelet-plugin/main.go[2]Template Modifications:
app.kubernetes.io/namelabel to metadata sections in Kubernetes resource templates, such as DaemonSet, ResourceClaimTemplate, and MPS Control Daemon templates. (templates/compute-domain-daemon-claim-template.tmpl.yaml[1]templates/compute-domain-daemon.tmpl.yaml[2]templates/compute-domain-workload-claim-template.tmpl.yaml[3]templates/mps-control-daemon.tmpl.yaml[4]Helm Chart Updates:
Chart.Namevalue as theHELM_CHART_NAMEenvironment variable, ensuring theapp.kubernetes.io/namelabel is populated dynamically. (deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml[1]deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml[2]These changes collectively improve Kubernetes resource labeling, making it easier to identify and manage resources deployed by the application.