
Conversation

@ArangoGutierrez (Collaborator) commented May 31, 2025

This patch proposes adding the app.kubernetes.io/name label to all managed components of both the computeDomain controller and the kubelet plugin, enabling better identification and organization of Kubernetes resources. The changes primarily involve adding support for the appName configuration, updating templates to include the new label, and modifying Helm charts to propagate the label value.

As recommended in https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/#labels, we use the label app.kubernetes.io/name across all project components.
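
For reference, the Kubernetes documentation recommends a small set of common labels, of which this patch adopts the first. A sketch with illustrative values (only app.kubernetes.io/name is added by this PR; the other values below are made up):

# Recommended common labels; all values besides app.kubernetes.io/name
# are invented here purely for illustration.
app.kubernetes.io/name: nvidia-dra-driver-gpu
app.kubernetes.io/instance: nvidia-dra-driver-gpu-release
app.kubernetes.io/version: "1.0.0"
app.kubernetes.io/component: kubelet-plugin
app.kubernetes.io/part-of: nvidia-dra-driver-gpu
app.kubernetes.io/managed-by: Helm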

kubectl get pods -lapp.kubernetes.io/name=nvidia-dra-driver-gpu -A -ojsonpath={.items[].metadata.namespace} --ignore-not-found

This then becomes sufficient to identify all pods deployed from the project, both controllers and managed pods.

This label is of interest only for pods, as objects like the ComputeDomain CRD can easily be retrieved by filtering for its name:

kubectl get computedomain -A

will retrieve all ComputeDomains in the cluster.

This is important when debugging in production environments, where we don't know how the user initially deployed the project; a reliable method to query all project pods becomes a handy tool.

Configuration Updates:

  • Added the appName field to ManagerConfig, DaemonSetTemplateData, ResourceClaimTemplateTemplateData, and other relevant structs to store the value for the new Kubernetes label. (cmd/compute-domain-controller/controller.go [1] cmd/compute-domain-controller/daemonset.go [2] cmd/compute-domain-controller/resourceclaimtemplate.go [3] cmd/gpu-kubelet-plugin/main.go [4] cmd/gpu-kubelet-plugin/sharing.go [5])

  • Updated CLI flags in main.go files to allow users to specify the appName value via the --chart-name or --app-name flag, defaulting to nvidia-dra-driver-gpu. (cmd/compute-domain-controller/main.go [1] cmd/gpu-kubelet-plugin/main.go [2])

Template Modifications:

  • Added the app.kubernetes.io/name label to metadata sections in Kubernetes resource templates, such as the DaemonSet, ResourceClaimTemplate, and MPS Control Daemon templates; see the sketch below. (templates/compute-domain-daemon-claim-template.tmpl.yaml [1] templates/compute-domain-daemon.tmpl.yaml [2] templates/compute-domain-workload-claim-template.tmpl.yaml [3] templates/mps-control-daemon.tmpl.yaml [4])
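
As a sketch, the labeled metadata in one of these Go templates could look like the following. The template field names here (.Name, .Namespace, .AppName) are assumptions based on the appName plumbing described above, not the file's actual contents:

# Hypothetical excerpt from templates/compute-domain-daemon.tmpl.yaml;
# field names are assumed for illustration.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .Name }}
  namespace: {{ .Namespace }}
  labels:
    app.kubernetes.io/name: {{ .AppName }}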

Helm Chart Updates:

  • Updated Helm charts to pass the Chart.Name value as the HELM_CHART_NAME environment variable, ensuring the app.kubernetes.io/name label is populated dynamically; a sketch follows below. (deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml [1] deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml [2])
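
A hedged sketch of that propagation in the controller chart template; the surrounding container spec (including the container name) is assumed, and only the env entry reflects this change:

# Hypothetical excerpt from deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml;
# the container name is an assumption.
containers:
  - name: compute-domain-controller
    env:
      - name: HELM_CHART_NAME
        value: "{{ .Chart.Name }}"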

These changes collectively improve Kubernetes resource labeling, making it easier to identify and manage resources deployed by the application.

Copilot AI left a comment

Pull Request Overview

Adds an app: nvidia-dra-driver-gpu label across templates, Helm charts, and controller code to simplify filtering of GPU driver components.

  • Inject AppLabelKey/AppLabelValue into all Kubernetes template manifests
  • Update Helm values.yaml for controller and kubeletPlugin sections
  • Propagate and assign new appLabelKey/appLabelValue constants in controller code

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • templates/compute-domain-daemon.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: added {{ .AppLabelKey }} label in metadata
  • deployments/helm/nvidia-dra-driver-gpu/values.yaml: added app: nvidia-dra-driver-gpu under controller and kubeletPlugin
  • cmd/gpu-kubelet-plugin/driver.go: bumped import from v1beta1 to v1beta2
  • cmd/gpu-kubelet-plugin/device_state.go: bumped import from v1beta1 to v1beta2
  • cmd/compute-domain-controller/resourceclaimtemplate.go: extended ResourceClaimTemplateTemplateData with AppLabelKey/AppLabelValue
  • cmd/compute-domain-controller/daemonset.go: extended DaemonSetTemplateData and updated buffer/variable names
  • cmd/compute-domain-controller/computedomain.go: introduced appLabelKey & appLabelValue consts
Comments suppressed due to low confidence (6)

cmd/compute-domain-controller/resourceclaimtemplate.go:53

  • New fields AppLabelKey and AppLabelValue have been added but there are no existing unit tests verifying that these values are correctly injected into generated templates. Consider adding tests for template data binding.
AppLabelValue           string

templates/compute-domain-workload-claim-template.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}

templates/compute-domain-daemon.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}

templates/compute-domain-daemon-claim-template.tmpl.yaml:11

  • Indentation is off by two spaces; this line should align with the surrounding labels (6 spaces before the template expression) to produce valid YAML.
{{ .AppLabelKey }}: {{ .AppLabelValue }}
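
To illustrate the indentation point, here is a sketch with the templated line aligned to a sibling label; the surrounding structure and the sibling key are assumed, though both field names appear elsewhere in this PR:

# Hypothetical metadata block: the templated line must sit at the same
# indentation (here 6 spaces) as its sibling labels to produce valid YAML.
spec:
  metadata:
    labels:
      {{ .ComputeDomainLabelKey }}: "{{ .ComputeDomainLabelValue }}"
      {{ .AppLabelKey }}: {{ .AppLabelValue }}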

deployments/helm/nvidia-dra-driver-gpu/values.yaml:64

  • The labels: block under controller is indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:

deployments/helm/nvidia-dra-driver-gpu/values.yaml:87

  • The labels: block under kubeletPlugin is indented only 2 spaces but should be 4 spaces to match the rest of that section.
labels:

@ArangoGutierrez ArangoGutierrez force-pushed the v1beta2 branch 2 times, most recently from 910006f to b7d3b0b on June 2, 2025 09:05
@ArangoGutierrez ArangoGutierrez requested a review from Copilot June 2, 2025 09:12
@klueska (Collaborator) commented Jun 2, 2025

Is there a reason the existing ComputeDomainLabelKey isn't sufficient for what you are trying to do?

Copilot AI left a comment

Pull Request Overview

This PR adds an "app" label (with the value "nvidia-dra-driver-gpu") to various project components to facilitate easier filtering and debugging.

  • Updated YAML templates to include the new app label.
  • Updated Helm values to add the app label to controller and kubeletPlugin sections.
  • Enhanced Go source code to pass the app label values through the necessary structures and API calls.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: app label added to resource metadata labels
  • templates/compute-domain-daemon.tmpl.yaml: app label added to resource metadata labels
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: app label added to resource metadata labels
  • deployments/helm/nvidia-dra-driver-gpu/values.yaml: Helm values updated to include app label for controller and kubeletPlugin
  • cmd/compute-domain-controller/resourceclaimtemplate.go: new fields added for managing app label data
  • cmd/compute-domain-controller/daemonset.go: updated to use new app label fields and corrected variable naming for consistency
  • cmd/compute-domain-controller/computedomain.go: defined app label constants used across the codebase

@ArangoGutierrez (Collaborator, Author) commented Jun 2, 2025

Is there a reason the existing ComputeDomainLabelKey isn't sufficient for what you are trying to do?

We want a label that is generic, for non-ComputeDomain users as well: users who want the GPU bits but not the ComputeDomain part. So the label nvidia-dra-driver-gpu (pointing to the name of the git project itself) is generic and more telling. I feel applying a ComputeDomain label to components meant for GPU enablement could be misleading.

Copilot AI left a comment

Pull Request Overview

This PR adds an "app" label to various project components to improve filtering during debugging. The key changes include:

  • Adding the "app" label to multiple YAML templates across compute-domain and Helm deployment files.
  • Introducing a CLI flag ("chart-name") to propagate the label value consistently.
  • Updating resource templating code to include the new AppLabelValue.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Summary per file:

  • templates/compute-domain-workload-claim-template.tmpl.yaml: added app label to workload claim template metadata
  • templates/compute-domain-daemon.tmpl.yaml: added app label to daemon template metadata
  • templates/compute-domain-daemon-claim-template.tmpl.yaml: added app label to daemon claim template metadata
  • deployments/helm/nvidia-dra-driver-gpu/templates/kubeletplugin.yaml: added app label (using .Chart.Name) to kubelet plugin metadata
  • deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml: added app label and HELM_CHART_NAME env variable to controller metadata
  • cmd/compute-domain-controller/resourceclaimtemplate.go: introduced AppLabelValue in template data and assigned it from the chartName field
  • cmd/compute-domain-controller/main.go: added a required CLI flag for chart-name with a default value
  • cmd/compute-domain-controller/daemonset.go: introduced AppLabelValue in daemonset template data and assigned it from m.config.chartName
  • cmd/compute-domain-controller/controller.go: updated ManagerConfig and Run method to pass chartName to computedomain managers
  • cmd/compute-domain-controller/computedomain.go: minor formatting changes (addition of a newline) in constant definitions

@klueska (Collaborator) left a comment:

the mps-daemon template seems to be missing

@ArangoGutierrez (Collaborator, Author) replied:

the mps-daemon template seems to be missing

Yeah, on that one there is already an app: {{ .MpsControlDaemonName }} label on the template. How can we handle that? Should the existing logic be renamed to a different key so we can have the same app: chartname label in all components? If so, what key name should we use here?

@ArangoGutierrez ArangoGutierrez requested a review from klueska June 2, 2025 11:19
@klueska (Collaborator) commented Jun 2, 2025

the mps-daemon template seems to be missing

Yeah, on that one there is already an app: {{ .MpsControlDaemonName }} label on the template. How can we handle that? Should the existing logic be renamed to a different key so we can have the same app: chartname label in all components? If so, what key name should we use here?

change it to component

@ArangoGutierrez (Collaborator, Author) replied:

change it to component

MPS now included
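
For illustration, the resolved MPS template labels might then look like this. This is a sketch only: the field names are assumed, and the label key shown matches the final PR title rather than the interim app key discussed above:

# Hypothetical excerpt from templates/mps-control-daemon.tmpl.yaml after the rename;
# .AppName is an assumed field, mirroring the appName plumbing in this PR.
labels:
  component: {{ .MpsControlDaemonName }}
  app.kubernetes.io/name: {{ .AppName }}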

Finalizer: computeDomainFinalizer,
ComputeDomainLabelKey: computeDomainLabelKey,
ComputeDomainLabelValue: cd.UID,
AppLabelValue: m.config.chartName,
@ArangoGutierrez (Collaborator, Author) replied:

My bad, fixed

NodeName: m.nodeName,
MpsControlDaemonNamespace: m.namespace,
MpsControlDaemonName: m.name,
HelmChartName: m.manager.config.flags.chartName,
A collaborator commented:

This isn't going to work, because you call the variable AppLabelValue in the template itself.

@ArangoGutierrez (Collaborator, Author) replied:

My bad, fixed

@jgehrcke (Collaborator) commented Jun 2, 2025

Thanks!

we need a label identifier so it is easier to filter when debugging

In the PR description, can you add a command (input and output) that demonstrates the problem you were addressing here? Are you saying some components have certain labels set, and others do not?

I'd especially like to understand how these labels relate to the labels we apply via Helm templating:

helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}

Are you basically trying to bring include "nvidia-dra-driver-gpu.labels" (example) to those specs not generated from Helm templates?
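
For context, such a labels helper in _helpers.tpl typically has the following shape. This is a sketch based on the standard helm create scaffold; the two include lines are quoted elsewhere in this thread, while the instance and managed-by entries are assumptions about the project's actual helper:

# Assumed shape of the project's labels helper; only the chart and name
# lines are confirmed by this conversation.
{{- define "nvidia-dra-driver-gpu.labels" -}}
helm.sh/chart: {{ include "nvidia-dra-driver-gpu.chart" . }}
app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}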

// and work queue management.
type ManagerConfig struct {
	// chartName is the Helm chart name to use for the app label value
	chartName string
@jgehrcke (Collaborator) commented:

for the app label value

Because I recently worked myself into the various labels we want to use here, I'd like for us to be more precise. What is the "app label"? Can we maybe just use the actual name of the label?

@jgehrcke (Collaborator) added:

Oh, further below I see a literal "app" used as label name. How standard is that?

In the Helm template logic we include the Helm chart name as the value for a label named app.kubernetes.io/name:

app.kubernetes.io/name: {{ include "nvidia-dra-driver-gpu.name" . }}

@ArangoGutierrez (Collaborator, Author) replied:

How standard is that?

We use it by default in all our GPU-Operator components; it's quite a common practice.

In the Helm template logic we include the Helm chart name as the value for a label named app.kubernetes.io/name:

Yes, but this value is dynamic; users could overwrite it using Helm flags. We need a constant label to look for all our components during a debugging session, or when running the must-gather.sh support script.

@jgehrcke (Collaborator) replied:

Yes, but this value is dynamic; users could overwrite it using Helm flags

About label value

I see. :) Even if nobody is really doing that (overriding the name -- is someone doing that? maybe!) I get your point. We want something predictable. OK!

About label name

I also see that app.kubernetes.io/name is clumsy. On the other hand, app is overused and super generic. Just for the exercise: do we have a better idea than 'app' that is also short/concise? Something that we would like better?


@jgehrcke (Collaborator) commented Jun 2, 2025

Overlapping with what I commented before, with maybe a fresh perspective:

I think this is probably super important, and I want to align our mental models here about problem definition, and solution space.

In general, we probably can describe the problem as: we have two orthogonal templating systems (Helm templates, Go templates), and they are not yet aligned in terms of which labels they apply to resources. So, we want to achieve consistency between both. Is that right? Is that what we're aiming to do here?

We're probably looking for that one command that shows all entities that were created directly or indirectly from this Helm chart. Is that your goal? If yes: that's an important goal, ack (e.g., for finding orphaned resources).

What's the command that in your opinion we'd like to run for achieving this goal?

  namespace: {{ include "nvidia-dra-driver-gpu.namespace" . }}
  labels:
    app: {{ .Chart.Name }}
    {{- include "nvidia-dra-driver-gpu.labels" . | nindent 4 }}
@jgehrcke (Collaborator) commented:

This part of the patch raises two interesting points:

  • include "nvidia-dra-driver-gpu.labels" already includes a label with the Helm chart name. That label's name is app.kubernetes.io/name.

  • We hardcode app to be the chart name -- that is, there is no other choice. By that logic, we could remove one layer of indirection, and rename AppLabelValue to ChartName in Go code.

@ArangoGutierrez (Collaborator, Author) replied:

Umm, yes and no, because the ChartName can be overwritten with Helm flags. What we are after here is a constant and persistent AppLabel that we can use to filter all our components in a cluster. Regardless of how the user has installed them, a constant label gives us a single point to filter for the project components.

@jgehrcke (Collaborator) commented Jun 4, 2025:

I think you gave a response to my first bullet point above (but not to the second? it's fine, this question will come back later).

ChartName can be overwritten with Helm flags. What we are after here is a constant and persistent AppLabel

I think you're saying that we should not use a label value that is or depends on Chart.Name. And I think you just convinced me of that. Because that's dynamic, and not predictable.

However, the patch currently contains: app: {{ .Chart.Name }}.

And now I am confused. I think it shows that it's good to (together) narrow down the exact purpose/goal that we want to achieve here!

@ArangoGutierrez (Collaborator, Author) replied:

You are totally right. What should we call that variable, then? AppName?

A collaborator commented:

.Chart.Name cannot be overridden. It is a property of the chart. That said -- once we move this as an operand under the GPU operator, the .Chart.Name will be gpu-operator -- not sure if that matters.

@ArangoGutierrez (Collaborator, Author) replied:

I guess that means we will have a new entry in the assets folder on the gpu-operator repo. If so, all the assets there have the app: gpu-operator label. But in case a user wants to deploy this standalone, having the label as proposed in this PR still makes sense.

A collaborator commented:

once we move this as an operand under GPU operator, the .Chart.Name will be gpu-operator-- not sure if that matters.

That matters!

@jgehrcke (Collaborator) commented Jun 5, 2025:

.Chart.Name cannot be overridden.

ack -- btw, ref docs: https://helm.sh/docs/chart_template_guide/builtin_objects/

And for the record, I wondered about the helm install ... --generate-name arg which I had used, documented here: https://helm.sh/docs/helm/helm_install/

--generate-name                              generate the name (and omit the NAME parameter)

This is not the chart name (and these docs aren't great).

@ArangoGutierrez (Collaborator, Author) commented:

What's the command that in your opinion we'd like to run for achieving this goal?

The intent of this patch is to make the DRA project behave similarly to the GPU-Operator, which has the app: gpu-operator label on all components;

see https://github.com/NVIDIA/gpu-operator/blob/main/deployments/gpu-operator/templates/operator.yaml#L14

So you can look for the operator like this:

https://github.com/NVIDIA/gpu-operator/blob/c5966a3d79ebe116f1c022e1efcfa5044108d3e9/hack/must-gather.sh#L54C34-L54C52

Given that users might install the GPU-Operator and NVIDIA-DRA-Driver into a custom-named namespace, our official way of requesting debugging info from customers is asking them to run this script https://github.com/NVIDIA/gpu-operator/blob/main/hack/must-gather.sh.

The idea is to have a label we can trust that will always be set to all components of this project, regardless of how or where the users deploy the project.

@ArangoGutierrez ArangoGutierrez requested a review from jgehrcke June 3, 2025 13:46
@ArangoGutierrez (Collaborator, Author) commented:

We're probably looking for that one command that shows all entities that were created directly or indirectly from this Helm chart. Is that your goal? If yes: that's an important goal, ack (e.g., for finding orphaned resources).

Not really, as time has shown, users could deploy this without Helm, or using a very custom Helm deploy command, which would potentially overwrite our labels, as they are variable.

@jgehrcke (Collaborator) commented Jun 4, 2025

Further above, I asked

In the PR description, can you add a command (input and output) that demonstrates the problem you were addressing here?

and

What's the command that in your opinion we'd like to run for achieving this goal?

and we haven't written that down yet. Let's do that! I believe that this is important for us to understand what we're really doing here. We want one command that currently has unexpected output. And after the patch, it has expected output.

(this helps us to understand, for example: do we want to filter only pods, or also other objects? I was assuming that we do not only want to filter pods. In the script that you linked I see we do something like OPERATOR_NAMESPACE=$($K get pods -lapp=gpu-operator ...., and this only looks up pods -- in this case, I hope we can extract more objects than just pods; but let me know if that is overly ambitious)

@ArangoGutierrez ArangoGutierrez changed the title from "Add label app:nvidia-dra-driver-gpu to all project components" to "Add label app.kubernetes.io/name:nvidia-dra-driver-gpu to all project components" Jun 4, 2025
@jgehrcke (Collaborator) commented Jun 5, 2025

This label is of interest only for pods, as objects like the ComputeDomain CRD can easily be retrieved by filtering for its name

Thanks for confirming.

Many thoughts were exchanged here; let me try to summarize what I have understood so far in terms of problem/solution:

(maybe I am wrong! :))

Goal / problem

I think we narrowed down the goal to:

  • A command of the form kubectl get pods -A -l<LABEL_NAME>=<LABEL_VALUE> must be able to identify precisely all pods deployed from this project.

  • That method should work in the future regardless of whether this was installed via Helm or not, and also keep working after this project has been moved under the GPU operator.

Method / solution

Let's pick a custom, but concise and expressive LABEL_NAME. Maybe part-of?

I think as LABEL_VALUE we want to keep using nvidia-dra-driver-gpu.

And then apply this to all pods.

Implementation

As pragmatic as possible (I don't think we need CLI args, or anything dynamic -- just hard-code string literals in the relevant places for now).
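
A minimal sketch of that pragmatic approach, using the label name from the final PR title (part-of, suggested above, would slot in the same way):

# Hard-coded string literals, applied to every pod spec the project generates;
# no CLI args or templating indirection involved.
metadata:
  labels:
    app.kubernetes.io/name: nvidia-dra-driver-gpu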


About this:

users could deploy this without Helm

OK. But this is more of a future-proofing aspect, right? For this PR maybe we should still assume Helm installation. Right?

That's important for potential changes that this PR makes to templates in deployments/helm/nvidia-dra-driver-gpu/templates. If this PR changes anything in there, we'd need to wonder: how do we bring that change to users that don't deploy with Helm? I don't want to think about this today.

@klueska klueska added this to the unscheduled milestone Aug 13, 2025
@klueska klueska added the feature label Aug 13, 2025