Implement Prometheus metrics for gpu-kubelet-plugin, compute-domain-plugin, and dra-controller by shengnuo · Pull Request #964 · kubernetes-sigs/dra-driver-nvidia-gpu

shengnuo · 2026-03-25T16:09:39Z

Fixes #352

Metrics prefix

All DRA Prometheus metrics use the prefix nvidia_gpu_dra (exported names are nvidia_gpu_dra_<name>).

Kubelet Plugins (GPU & ComputeDomain)

| Metric | Type | Labels | Note |
| --- | --- | --- | --- |
| `requests_total` | Counter | `driver`, `operation` | Count of prepare/unprepare requests |
| `request_duration_seconds` | Histogram | `driver`, `operation` | Latency; uses exponential buckets `0.05*2^n, 0<=n<=8` |
| `requests_inflight` | Gauge | `driver`, `operation` | Concurrent prepare/unprepare requests |
| `prepared_devices` | Gauge | `node`, `driver`, `device_type` | Current prepared devices from the checkpoint file |
| `node_prepare_errors_total` | Counter | `driver`, `error_type` | Node prepare failures (plugin-scoped) |
| `node_unprepare_errors_total` | Counter | `driver`, `error_type` | Node unprepare failures |
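For reference, the bucket layout in the `request_duration_seconds` row can be spelled out with a small stdlib-only sketch. The helper names `exponentialBuckets` and `smallestBucket` are made up for illustration and are not taken from this PR; the Prometheus Go client ships a helper of the same shape (`prometheus.ExponentialBuckets(start, factor, count)`), but whether the patch uses it is not shown here.

```go
package main

import (
	"fmt"
	"math"
)

// exponentialBuckets returns `count` upper bounds: start, start*factor,
// start*factor^2, and so on. Hypothetical helper for illustration only.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// smallestBucket returns the smallest upper bound that is >= seconds, or
// +Inf when the observation exceeds every finite bound.
func smallestBucket(bounds []float64, seconds float64) float64 {
	for _, le := range bounds {
		if seconds <= le {
			return le
		}
	}
	return math.Inf(1)
}

func main() {
	// 0.05 * 2^n for 0 <= n <= 8 gives nine finite bounds.
	bounds := exponentialBuckets(0.05, 2, 9)
	fmt.Println(bounds)

	// A 70 ms request is first counted in one of the finite buckets;
	// a 30 s request only shows up in the implicit +Inf bucket.
	fmt.Println(smallestBucket(bounds, 0.07))
	fmt.Println(smallestBucket(bounds, 30))
}
```

With this layout the finite bounds run from 0.05 s to 12.8 s, so any prepare/unprepare call slower than 12.8 s is only visible in the `+Inf` bucket.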

Compute-domain controller

| Metric | Type | Labels | Note |
| --- | --- | --- | --- |
| `compute_domain_info` | GaugeVec | `status` | Expose `ComputeDomain` objects by status, from informer cache |
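One possible reading of "objects by status" is a current count per status value. The sketch below illustrates only that reading and is not the PR's implementation: the status values `Ready` and `NotReady`, the `knownStatuses` list, and the `countByStatus` helper are all assumptions made for this example. The point it shows is that a status nobody currently has should still be reported as 0, so a stale non-zero series does not linger.

```go
package main

import "fmt"

// knownStatuses is a HYPOTHETICAL set of status values, chosen for this
// illustration; the real values are not listed in this PR description.
var knownStatuses = []string{"Ready", "NotReady"}

// countByStatus takes the status of every cached object and returns one
// gauge value per status. Known statuses with no objects are set to 0
// explicitly instead of being left out.
func countByStatus(statuses []string) map[string]int {
	counts := make(map[string]int, len(knownStatuses))
	for _, s := range knownStatuses {
		counts[s] = 0
	}
	for _, s := range statuses {
		counts[s]++
	}
	return counts
}

func main() {
	// Pretend the informer cache currently holds three objects.
	cached := []string{"Ready", "Ready", "NotReady"}
	fmt.Println(countByStatus(cached))

	// After the NotReady object becomes Ready, NotReady must drop to 0.
	cached = []string{"Ready", "Ready", "Ready"}
	fmt.Println(countByStatus(cached))
}
```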

Sample

Sample metrics on the GPU kubelet plugin after preparing and unpreparing a GPU and a MIG device:

# HELP nvidia_gpu_dra_prepared_devices Current number of prepared devices by device type.
# TYPE nvidia_gpu_dra_prepared_devices gauge
nvidia_gpu_dra_prepared_devices{device_type="gpu",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="mig",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="unknown",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="vfio",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
# HELP nvidia_gpu_dra_request_duration_seconds Duration of DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_request_duration_seconds histogram
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.05"} 0
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.1"} 1
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="1.6"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="3.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="6.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="12.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="+Inf"} 2
nvidia_gpu_dra_request_duration_seconds_sum{driver="gpu.nvidia.com",operation="prepare"} 0.19524701700000002
nvidia_gpu_dra_request_duration_seconds_count{driver="gpu.nvidia.com",operation="prepare"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.05"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.1"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="1.6"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="3.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="6.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="12.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="+Inf"} 2
nvidia_gpu_dra_request_duration_seconds_sum{driver="gpu.nvidia.com",operation="unprepare"} 0.004752574
nvidia_gpu_dra_request_duration_seconds_count{driver="gpu.nvidia.com",operation="unprepare"} 2
# HELP nvidia_gpu_dra_requests_inflight Number of in-flight DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_requests_inflight gauge
nvidia_gpu_dra_requests_inflight{driver="gpu.nvidia.com",operation="prepare"} 0
nvidia_gpu_dra_requests_inflight{driver="gpu.nvidia.com",operation="unprepare"} 0
# HELP nvidia_gpu_dra_requests_total Total number of DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_requests_total counter
nvidia_gpu_dra_requests_total{driver="gpu.nvidia.com",operation="prepare"} 2
nvidia_gpu_dra_requests_total{driver="gpu.nvidia.com",operation="unprepare"} 2
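As a reading aid for the sample above: histogram buckets are cumulative, so per-bucket counts come from differencing neighbours, and the mean latency is `_sum / _count`. A minimal sketch, using only numbers copied from the sample; the helper names `perBucket` and `mean` are made up for illustration.

```go
package main

import "fmt"

// perBucket converts cumulative histogram bucket counts (the form used in
// the Prometheus exposition format) into the number of observations that
// fell into each individual bucket.
func perBucket(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

// mean returns sum/count, or 0 when nothing was observed.
func mean(sum float64, count uint64) float64 {
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

func main() {
	// Cumulative counts for operation="prepare", copied from the sample,
	// for le = 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8, +Inf.
	prepare := []uint64{0, 1, 2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(perBucket(prepare))

	// _sum and _count values copied from the sample.
	fmt.Printf("mean prepare latency:   %.4fs\n", mean(0.19524701700000002, 2))
	fmt.Printf("mean unprepare latency: %.4fs\n", mean(0.004752574, 2))
}
```

Read this way, one prepare call took between 0.05 s and 0.1 s and the other between 0.1 s and 0.2 s, while both unprepare calls finished in under 0.05 s.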

copy-pr-bot · 2026-03-25T16:09:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shivamerla

Great work! Left some comments. Thanks.

jgehrcke · 2026-03-30T12:42:57Z

All DRA Prometheus metrics use prefix nvidia_gpu_dra (exported names are nvidia_gpu_dra_<name>).

Iterating on user-facing metric names is really important before a first release. Hence, I'd like to ask a few questions (I am sure there are more):

Does nvidia need to be part of the prefix (also in view of the upcoming donation -- the answer is probably "yes")?
What really is the common prefix for all metrics emitted by anything in this DRA driver? Should this maybe just be nvidia_dra?
If we want "gpu" as part of the prefix: do we want the prefix to be nvidia_dra_gpu (sorting by specificity)?
If we want "gpu" as part of the prefix -- does this apply to CD-related metrics as well (would we see something like nvidia_dra_gpu_cd_<...>?) My gut feeling is that we should have one metric prefix for each plugin; and that could mean we would end up using nvidia_dra_gpu and nvidia_dra_cd -- which then raises the question of how to prefix controller-emitted metrics.

Just thinking out loud! This needs a bit of brainstorming among several people who ask critical questions, and we also need to look at different variants side by side. Bad naming choices will either stick around forever or create quite a lot of migration pain later on.
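To make the side-by-side comparison concrete, here is a throwaway sketch that prints a few of the kubelet plugin metric names under each prefix variant raised in the questions above. The prefixes and metric names are taken from this thread; the `fullName` helper and the choice of which metrics to print are just for illustration.

```go
package main

import "fmt"

// candidatePrefixes lists the prefix variants mentioned in the discussion.
var candidatePrefixes = []string{
	"nvidia_gpu_dra", // currently proposed in the PR description
	"nvidia_dra",
	"nvidia_dra_gpu",
	"nvidia_dra_cd",
}

// metricNames is a subset of the kubelet plugin metrics from the PR table.
var metricNames = []string{
	"requests_total",
	"request_duration_seconds",
	"requests_inflight",
	"prepared_devices",
}

// fullName joins a prefix and a metric name the way exported names are
// described in the PR: <prefix>_<name>.
func fullName(prefix, name string) string {
	return prefix + "_" + name
}

func main() {
	for _, name := range metricNames {
		for _, prefix := range candidatePrefixes {
			fmt.Println(fullName(prefix, name))
		}
		fmt.Println()
	}
}
```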

jgehrcke · 2026-03-30T12:46:19Z

compute_domain_daemon_peer_nodes

What does this really measure and why is this useful? Which labels did you consider adding to this metric, and why?

jgehrcke · 2026-03-30T12:59:42Z

As a general note, let's make sure to read https://prometheus.io/docs/practices/naming :-) (before trying to construct metric names).

compute_domains GaugeVec status Expose ComputeDomain objects by status, from informer cache

Why would this be useful? Let's think about this end to end: let's construct a query-time (dashboarding/alerting) use case that demonstrates value (there may be one, but let's make it obvious).

About "by status": which status do you refer to?

About "Expose ComputeDomain objects": what do you mean by that?

Should this gauge reflect a current count?

compute_domain_cliques_total Gauge (none) Number of ComputeDomainClique CRs in informer cache

Same here: let's discuss the primary use case.

compute_daemons_per_clique

Same here.

When I look at the name compute_daemons, I think it's also rather apparent that we need to think about a CD-related prefix, which should maybe just be cd.

I think we maybe should refrain from adding CD daemon/controller-emitted metrics here in this first patch. The kubelet plugin metrics you have listed all seem to be rather sane, but there are so many questions about the currently proposed metrics emitted by CD daemon/controller. I think this patch may land faster if we keep it focused on the basic plugin metrics.

shivamerla · 2026-04-07T12:10:38Z

@shengnuo Could you share a few examples of the metrics being collected, along with any sample dashboards (if available)?

Signed-off-by: Sheng Lin <shelin@nvidia.com>

shengnuo · 2026-04-08T19:58:26Z

@shivamerla Updated the PR description with sample Prometheus metrics

shivamerla · 2026-04-08T20:41:25Z

/lgtm
/approve

k8s-ci-robot · 2026-04-08T20:41:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shengnuo, shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [shivamerla]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shengnuo force-pushed the prometheus-metrics branch 3 times, most recently from bd4d808 to 5a94361 on March 25, 2026 16:32

visheshtanksale reviewed Mar 25, 2026

Comment thread cmd/gpu-kubelet-plugin/main.go Outdated

Comment thread cmd/gpu-kubelet-plugin/main.go Outdated

Comment thread pkg/metrics/dra_requests.go Outdated

visheshtanksale reviewed Mar 25, 2026

Comment thread cmd/compute-domain-controller/cdclique.go Outdated

visheshtanksale reviewed Mar 25, 2026

Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated

visheshtanksale reviewed Mar 25, 2026

Comment thread pkg/metrics/dra_requests.go Outdated

shengnuo force-pushed the prometheus-metrics branch 6 times, most recently from 2974ec5 to 2d461d3 on March 25, 2026 22:39

shivamerla reviewed Mar 28, 2026

Comment thread pkg/metrics/dra_requests.go

shivamerla reviewed Mar 28, 2026

Comment thread cmd/gpu-kubelet-plugin/driver.go

shivamerla reviewed Mar 28, 2026

Comment thread deployments/helm/nvidia-dra-driver-gpu/values.yaml Outdated

shivamerla reviewed Mar 28, 2026

guptaNswati reviewed Apr 2, 2026

Comment thread cmd/compute-domain-kubelet-plugin/driver.go Outdated

guptaNswati reviewed Apr 2, 2026

Comment thread templates/compute-domain-daemon.tmpl.yaml Outdated

shengnuo force-pushed the prometheus-metrics branch from 2d461d3 to 9001190 on April 2, 2026 19:56

k8s-ci-robot added the labels cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD), and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) Apr 2, 2026

shengnuo force-pushed the prometheus-metrics branch 3 times, most recently from d74cd23 to afebe5b on April 3, 2026 00:03

shengnuo force-pushed the prometheus-metrics branch from afebe5b to 6d5d05e on April 3, 2026 00:08

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 3, 2026

shengnuo changed the title ~~Implement Prometheus metrics for compute-domain-daemon, gpu-kubelet-plugin, compute-domain-plugin, and dra-controller~~ Implement Prometheus metrics for gpu-kubelet-plugin, compute-domain-plugin, and dra-controller Apr 6, 2026

shivamerla reviewed Apr 7, 2026

Comment thread pkg/metrics/computedomain_cluster.go Outdated

shivamerla reviewed Apr 7, 2026

Comment thread pkg/metrics/dra_requests.go Outdated

Add DRA request Prometheus metrics

0dcfe57

Signed-off-by: Sheng Lin <shelin@nvidia.com>

shengnuo force-pushed the prometheus-metrics branch from 6d5d05e to 0dcfe57 on April 8, 2026 19:58

k8s-ci-robot assigned shivamerla Apr 8, 2026

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Apr 8, 2026

k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Apr 8, 2026

k8s-ci-robot merged commit 9fd0326 into kubernetes-sigs:main Apr 8, 2026
10 checks passed

7 participants