
Conversation

@ArangoGutierrez (Collaborator) commented May 31, 2025

This patch proposes adding the app.kubernetes.io/name label to all managed components of both the computeDomain controller and the kubelet plugin, enabling better identification and organization of Kubernetes resources. The changes primarily involve adding support for the appName configuration, updating templates to include the new label, and modifying Helm charts to propagate the label value.

As recommended in https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels, we use the label app.kubernetes.io/name across all project components.
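
For reference, the Kubernetes documentation recommends a small set of common labels, of which this patch adopts the first. A sketch with illustrative values (only app.kubernetes.io/name is added by this PR; the other values below are made up):

# Recommended common labels; all values besides app.kubernetes.io/name
# are invented here purely for illustration.
app.kubernetes.io/name: nvidia-dra-driver-gpu
app.kubernetes.io/instance: nvidia-dra-driver-gpu-release
app.kubernetes.io/version: "1.0.0"
app.kubernetes.io/component: kubelet-plugin
app.kubernetes.io/part-of: nvidia-dra-driver-gpu
app.kubernetes.io/managed-by: Helm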

kubectl get pods -lapp.kubernetes.io/name=nvidia-dra-driver-gpu -A -ojsonpath={.items[].metadata.namespace} --ignore-not-found

This then becomes sufficient to identify all pods deployed from the project, both controllers and managed pods.

This label is of interest only for pods, as objects like the ComputeDomain CRD can easily be retrieved by filtering for its name:

kubectl get computedomain -A

will retrieve all ComputeDomains in the cluster.

This is important when debugging in production environments, where we don't know how the user initially deployed the project; a reliable method to query all project pods becomes a handy tool.

Configuration Updates:

  • Added the appName field to ManagerConfig, DaemonSetTemplateData, ResourceClaimTemplateTemplateData, and other relevant structs to store the value for the new Kubernetes label. (cmd/compute-domain-controller/controller.go [1] cmd/compute-domain-controller/daemonset.go [2] cmd/compute-domain-controller/resourceclaimtemplate.go [3] cmd/gpu-kubelet-plugin/main.go [4] cmd/gpu-kubelet-plugin/sharing.go [5])

  • Updated CLI flags in main.go files to allow users to specify the appName value via the --chart-name or --app-name flag, defaulting to nvidia-dra-driver-gpu. (cmd/compute-domain-controller/main.go [1] cmd/gpu-kubelet-plugin/main.go [2])

Template Modifications:

  • Added the app.kubernetes.io/name label to metadata sections in Kubernetes resource templates, such as the DaemonSet, ResourceClaimTemplate, and MPS Control Daemon templates; see the sketch below. (templates/compute-domain-daemon-claim-template.tmpl.yaml [1] templates/compute-domain-daemon.tmpl.yaml [2] templates/compute-domain-workload-claim-template.tmpl.yaml [3] templates/mps-control-daemon.tmpl.yaml [4])
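
As a sketch, the labeled metadata in one of these Go templates could look like the following. The template field names here (.Name, .Namespace, .AppName) are assumptions based on the appName plumbing described above, not the file's actual contents:

# Hypothetical excerpt from templates/compute-domain-daemon.tmpl.yaml;
# field names are assumed for illustration.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .Name }}
  namespace: {{ .Namespace }}
  labels:
    app.kubernetes.io/name: {{ .AppName }}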

Helm Chart Updates:

  • Updated Helm charts to pass the Chart.Name value as the HELM_CHART_NAME environment variable, ensuring the app.kubernetes.io/name label is populated dynamically; a sketch follows below. (deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml [1] deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml [2])
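
A hedged sketch of that propagation in the controller chart template; the surrounding container spec (including the container name) is assumed, and only the env entry reflects this change:

# Hypothetical excerpt from deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml;
# the container name is an assumption.
containers:
  - name: compute-domain-controller
    env:
      - name: HELM_CHART_NAME
        value: "{{ .Chart.Name }}"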

These changes collectively improve Kubernetes resource labeling, making it easier to identify and manage resources deployed by the application.

Copilot AI left a comment

Pull Request Overview

Adds an app: nvidia-dra-driver-gpu label across templates, Helm charts, and controller code to simplify filtering of GPU driver components.

  • Inject AppLabelKey/AppLabelValue into all Kubernetes template manifests
  • Update Helm values.yaml for controller and kubeletPlugin sections
  • Propagate and assign new appLabelKey/appLabelValue constants in controller code

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • templates/compute-domain-daemon.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • deployments/helm/nvidia-dra-driver-gpu/values.yaml: added app: nvidia-dra-driver-gpu under controller and kubeletPlugin
  • cmd/gpu-kubelet-plugin/driver.go: bumped import from v1beta1 to v1beta2
  • cmd/gpu-kubelet-plugin/device_state.go: bumped import from v1beta1 to v1beta2
  • cmd/compute-domain-controller/resourceclaimtemplate.go: extended ResourceClaimTemplateTemplateData with AppLabelKey/AppLabelValue
  • cmd/compute-domain-controller/daemonset.go: extended DaemonSetTemplateData and updated buffer/variable names
  • cmd/compute-domain-controller/computedomain.go: introduced appLabelKey & appLabelValue consts
Comments suppressed due to low confidence (6)

cmd/compute-domain-controller/resourceclaimtemplate.go:53

  • New fields AppLabelKey and AppLabelValue have been added but there are no existing unit tests verifying that these values are correctly injected into generated templates. Consider adding tests for template data binding.
AppLabelValue           string

templates/compute-domain-workload-claim-template.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}

templates/compute-domain-daemon.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}

templates/compute-domain-daemon-claim-template.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}
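
To illustrate the indentation point, here is a sketch with the templated line aligned to a sibling label; the surrounding structure and the sibling key are assumed, though both field names appear elsewhere in this PR:

# Hypothetical metadata block: the templated line must sit at the same
# indentation (here 6 spaces) as its sibling labels to produce valid YAML.
spec:
  metadata:
    labels:
      {{ .ComputeDomainLabelKey }}: "{{ .ComputeDomainLabelValue }}"
      {{ .AppLabelKey }}: {{ .AppLabelValue }}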

deployments/helm/nvidia-dra-driver-gpu/values.yaml:64

  • The labels: block under controller is indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:

deployments/helm/nvidia-dra-driver-gpu/values.yaml:87

  • The labels: block under kubeletPlugin is indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:

@ArangoGutierrez ArangoGutierrez force-pushed the v1beta2 branch 2 times, most recently from 910006f to b7d3b0b on June 2, 2025 09:05
@ArangoGutierrez ArangoGutierrez requested a review from Copilot June 2, 2025 09:12
@klueska (Collaborator) commented Jun 2, 2025

Is there a reason the existing ComputeDomainLabelKey isn't sufficient for what you are trying to do?

Copilot AI left a comment

Pull Request Overview

This PR adds an "app" label (with the value "nvidia-dra-driver-gpu") to various project components to facilitate easier filtering and debugging.

  • Updated YAML templates to include the new app label.
  • Updated Helm values to add the app label to controller and kubeletPlugin sections.
  • Enhanced Go source code to pass the app label values through the necessary structures and API calls.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: app label added to resource metadata labels
  • templates/compute-domain-daemon.tmpl.yaml: app label added to resource metadata labels
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: app label added to resource metadata labels
  • deployments/helm/nvidia-dra-driver-gpu/values.yaml: Helm values updated to include app label for controller and kubeletPlugin
  • cmd/compute-domain-controller/resourceclaimtemplate.go: new fields added for managing app label data
  • cmd/compute-domain-controller/daemonset.go: updated to use new app label fields and corrected variable naming for consistency
  • cmd/compute-domain-controller/computedomain.go: defined app label constants used across the codebase

@ArangoGutierrez (Collaborator, Author) commented Jun 2, 2025

Is there a reason the existing ComputeDomainLabelKey isn't sufficient for what you are trying to do?

We want a label that is generic, for non-ComputeDomain users as well: users who want the GPU bits but not the ComputeDomain part. So the label nvidia-dra-driver-gpu (pointing to the name of the git project itself) is generic and more telling. I feel applying a ComputeDomain label to components meant for GPU enablement could be misleading.

Copilot AI left a comment

Pull Request Overview

This PR adds an "app" label to various project components to improve filtering during debugging. The key changes include:

  • Adding the "app" label to multiple YAML templates across compute-domain and Helm deployment files.
  • Introducing a CLI flag ("chart-name") to propagate the label value consistently.
  • Updating resource templating code to include the new AppLabelValue.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: added app label to workload claim template metadata
  • templates/compute-domain-daemon.tmpl.yaml: added app label to daemon template metadata
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: added app label to daemon claim template metadata
  • deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml: added app label (using .Chart.Name) to kubelet plugin metadata
  • deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml: added app label and HELM_CHART_NAME env variable to controller metadata
  • cmd/compute-domain-controller/resourceclaimtemplate.go: introduced AppLabelValue in template data and assigned it from the chartName field
  • cmd/compute-domain-controller/main.go: added a required CLI flag for chart-name with a default value
  • cmd/compute-domain-controller/daemonset.go: introduced AppLabelValue in daemonset template data and assigned it from m.config.chartName
  • cmd/compute-domain-controller/controller.go: updated ManagerConfig and Run method to pass chartName to computedomain managers
  • cmd/compute-domain-controller/computedomain.go: minor formatting changes (addition of a newline) in constant definitions

@klueska (Collaborator) left a comment:

the mps-daemon template seems to be missing

@ArangoGutierrez (Collaborator, Author) replied:

the mps-daemon template seems to be missing

Yeah, on that one there is already an app: {{ .MpsControlDaemonName }} label on the template. How can we handle that? Should the existing logic be renamed to a different key so we can have the same app: chartname label in all components? If so, what key name should we use here?

@ArangoGutierrez ArangoGutierrez requested a review from klueska June 2, 2025 11:19
@klueska (Collaborator) commented Jun 2, 2025

the mps-daemon template seems to be missing

Yeah, on that one there is already an app: {{ .MpsControlDaemonName }} label on the template. How can we handle that? Should the existing logic be renamed to a different key so we can have the same app: chartname label in all components? If so, what key name should we use here?

change it to component

@ArangoGutierrez (Collaborator, Author) replied:

change it to component

MPS now included
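
For illustration, the resolved MPS template labels might then look like this. This is a sketch only: the field names are assumed, and the label key shown matches the final PR title rather than the interim app key discussed above:

# Hypothetical excerpt from templates/mps-control-daemon.tmpl.yaml after the rename;
# .AppName is an assumed field, mirroring the appName plumbing in this PR.
labels:
  component: {{ .MpsControlDaemonName }}
  app.kubernetes.io/name: {{ .AppName }}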

Finalizer: computeDomainFinalizer,
ComputeDomainLabelKey: computeDomainLabelKey,
ComputeDomainLabelValue: cd.UID,
AppLabelValue: m.config.chartName,
@ArangoGutierrez (Collaborator, Author) replied:

My bad, fixed

NodeName: m.nodeName,
MpsControlDaemonNamespace: m.namespace,
MpsControlDaemonName: m.name,
HelmChartName: m.manager.config.flags.chartName,
A collaborator commented:

This isn't going to work, because you call the variable AppLabelValue in the template itself.

@ArangoGutierrez (Collaborator, Author) replied:

My bad, fixed

@jgehrcke (Collaborator) commented Jun 2, 2025

Thanks!

we need a label identifier so it is easier to filter when debugging

In the PR description, can you add a command (input and output) that demonstrates the problem you were addressing here? Are you saying some components have certain labels set, and others do not?

I'd especially like to understand how these labels relate to the labels we apply via Helm templating:

helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}

Are you basically trying to bring include "nvidia-dra-driver-gpu.labels" (example) to those specs not generated from Helm templates?
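
For context, such a labels helper in _helpers.tpl typically has the following shape. This is a sketch based on the standard helm create scaffold; the two include lines are quoted elsewhere in this thread, while the instance and managed-by entries are assumptions about the project's actual helper:

# Assumed shape of the project's labels helper; only the chart and name
# lines are confirmed by this conversation.
{{- define "nvidia-dra-driver-gpu.labels" -}}
helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}
app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}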

// and work queue management.
type ManagerConfig struct {
	// chartName is the Helm chart name to use for the app label value
	chartName string
@jgehrcke (Collaborator) commented:

for the app label value

Because I recently worked myself into the various labels we want to use here, I'd like for us to be more precise. What is the "app label"? Can we maybe just use the actual name of the label?

@jgehrcke (Collaborator) added:

Oh, further below I see a literal "app" used as label name. How standard is that?

In the Helm template logic we include the Helm chart name as the value for a label named app.kubernetes.io/name:

app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }}

@ArangoGutierrez (Collaborator, Author) replied:

How standard is that?

We use it by default in all our GPU-Operator components; it's quite a common practice.

In the Helm template logic we include the Helm chart name as the value for a label named app.kubernetes.io/name:

Yes, but this value is dynamic; users could overwrite it using Helm flags. We need a constant label to look for all our components during a debugging session, or when running the must-gather.sh support script.

@jgehrcke (Collaborator) replied:

Yes, but this value is dynamic; users could overwrite it using Helm flags

About label value

I see. :) Even if nobody is really doing that (overriding the name -- is someone doing that? maybe!) I get your point. We want something predictable. OK!

About label name

I also see that app.kubernetes.io/name is clumsy. On the other hand, app is overused and super generic. Just for the exercise: do we have a better idea than 'app' that is also short/concise? Something that we would like better?


@jgehrcke (Collaborator) commented Jun 2, 2025

Overlapping with what I commented before, with maybe a fresh perspective:

I think this is probably super important, and I want to align our mental models here about problem definition, and solution space.

In general, we probably can describe the problem as: we have two orthogonal templating systems (Helm templates, Go templates), and they are not yet aligned in terms of which labels they apply to resources. So, we want to achieve consistency between both. Is that right? Is that what we're aiming to do here?

We're probably looking for that one command that shows all entities that were created directly or indirectly from this Helm chart. Is that your goal? If yes: that's an important goal, ack (e.g., for finding orphaned resources).

What's the command that in your opinion we'd like to run for achieving this goal?

  namespace: {{ include "nvidia-dra-driver-gpu.namespace" . }}
  labels:
    app: {{ .Chart.Name }}
    {{- include "nvidia-dra-driver-gpu.labels" . | nindent 4 }}
@jgehrcke (Collaborator) commented:

This part of the patch raises two interesting points:

  • include "nvidia-dra-driver-gpu.labels" already includes a label with the Helm chart name. That label's name is app.kubernetes.io/name.

  • We hardcode app to be the chart name -- that is, there is no other choice. By that logic, we could remove one layer of indirection, and rename AppLabelValue to ChartName in Go code.

@ArangoGutierrez (Collaborator, Author) replied:

Umm, yes and no, because the ChartName can be overwritten with Helm flags. What we are after here is a constant and persistent AppLabel that we can use to filter all our components in a cluster. Regardless of how the user has installed them, a constant label gives us a single point to filter for the project components.

@jgehrcke (Collaborator) commented Jun 4, 2025:

I think you gave a response to my first bullet point above (but not to the second? it's fine, this question will come back later).

ChartName can be overwritten with Helm flags. What we are after here is a constant and persistent AppLabel

I think you're saying that we should not use a label value that is or depends on Chart.Name. And I think you just convinced me of that. Because that's dynamic, and not predictable.

However, the patch currently contains: app: {{ .Chart.Name }}.

And now I am confused. I think it shows that it's good to (together) narrow down the exact purpose/goal that we want to achieve here!

@ArangoGutierrez (Collaborator, Author) replied:

You are totally right. What should we call that variable, then? AppName?

A collaborator commented:

.Chart.Name cannot be overridden. It is a property of the chart. That said -- once we move this as an operand under the GPU operator, the .Chart.Name will be gpu-operator -- not sure if that matters.

@ArangoGutierrez (Collaborator, Author) replied:

I guess that means we will have a new entry in the assets folder on the gpu-operator repo. If so, all the assets there have the app: gpu-operator label. But in case a user wants to deploy this standalone, having the label as proposed in this PR still makes sense.

A collaborator commented:

once we move this as an operand under GPU operator, the .Chart.Name will be gpu-operator-- not sure if that matters.

That matters!

@jgehrcke (Collaborator) commented Jun 5, 2025:

.Chart.Name cannot be overridden.

ack -- btw, ref docs: https://helm.sh/docs/chart_template_guide/builtin_objects/

And for the record, I wondered about the helm install ... --generate-name arg which I had used, documented here: https://helm.sh/docs/helm/helm_install/

--generate-name                              generate the name (and omit the NAME parameter)

This is not the chart name (and these docs aren't great).

@ArangoGutierrez (Collaborator, Author) commented:

What's the command that in your opinion we'd like to run for achieving this goal?

The intent of this patch is to make the DRA project behave similarly to the GPU-Operator, which has the app: gpu-operator label on all components;

see https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/templates/operator.yaml#L14

So you can look for the operator like this:

https://github.com/NVIDIA/gpu-operator/blob/c5966a3d79ebe116f1c022e1efcfa5044108d3e9/hack/must-gather.sh#L54C34-L54C52

Given that users might install the GPU-Operator and NVIDIA-DRA-Driver into a custom-named namespace, our official way of requesting debugging info from customers is asking them to run this script https://github.com/NVIDIA/gpu-operator/blob/main/hack/must-gather.sh.

The idea is to have a label we can trust that will always be set to all components of this project, regardless of how or where the users deploy the project.

@ArangoGutierrez ArangoGutierrez requested a review from jgehrcke June 3, 2025 13:46
@ArangoGutierrez (Collaborator, Author) commented:

We're probably looking for that one command that shows all entities that were created directly or indirectly from this Helm chart. Is that your goal? If yes: that's an important goal, ack (e.g., for finding orphaned resources).

Not really, as time has shown, users could deploy this without Helm, or using a very custom Helm deploy command, which would potentially overwrite our labels, as they are variable.

@jgehrcke (Collaborator) commented Jun 4, 2025

Further above, I asked

In the PR description, can you add a command (input and output) that demonstrates the problem you were addressing here?

and

What's the command that in your opinion we'd like to run for achieving this goal?

and we haven't written that down yet. Let's do that! I believe that this is important for us to understand what we're really doing here. We want one command that currently has unexpected output. And after the patch, it has expected output.

(this helps us to understand, for example: do we want to filter only pods, or also other objects? I was assuming that we do not only want to filter pods. In the script that you linked I see we do something like OPERATOR_NAMESPACE=$($K get pods -lapp=gpu-operator ...., and this only looks up pods -- in this case, I hope we can extract more objects than just pods; but let me know if that is overly ambitious)

@ArangoGutierrez ArangoGutierrez changed the title from "Add label app:nvidia-dra-driver-gpu to all project components" to "Add label app.kubernetes.io/name:nvidia-dra-driver-gpu to all project components" Jun 4, 2025
@jgehrcke (Collaborator) commented Jun 5, 2025

This label is of interest only for pods, as objects like the ComputeDomain CRD can easily be retrieved by filtering for its name

Thanks for confirming.

Many thoughts were exchanged here; let me try to summarize what I have understood so far in terms of problem/solution:

(maybe I am wrong! :))

Goal / problem

I think we narrowed down the goal to:

  • A command of the form kubectl get pods -A -l<LABEL_NAME>=<LABEL_VALUE> must be able to identify precisely all pods deployed from this project.

  • That method should work in the future regardless of whether this was installed via Helm or not, and also keep working after this project has been moved under the GPU operator.

Method / solution

Let's pick a custom, but concise and expressive LABEL_NAME. Maybe part-of?

I think as LABEL_VALUE we want to keep using nvidia-dra-driver-gpu.

And then apply this to all pods.

Implementation

As pragmatic as possible (I don't think we need CLI args, or anything dynamic -- just hard-code string literals in the relevant places for now).
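
A minimal sketch of that pragmatic approach, using the label name from the final PR title (part-of, suggested above, would slot in the same way):

# Hard-coded string literals, applied to every pod spec the project generates;
# no CLI args or templating indirection involved.
metadata:
  labels:
    app.kubernetes.io/name: nvidia-dra-driver-gpu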


About this:

users could deploy this without Helm

OK. But this is more of a future-proofing aspect, right? For this PR maybe we should still assume Helm installation. Right?

That's important for potential changes that this PR makes to templates in deployments/helm/nvidia-dra-driver-gpu/templates. If this PR changes anything in there, we'd need to wonder: how do we bring that change to users that don't deploy with Helm? I don't want to think about this today.

@klueska klueska added this to the unscheduled milestone Aug 13, 2025
@klueska klueska added the feature label Aug 13, 2025