
Implement Prometheus metrics for gpu-kubelet-plugin, compute-domain-plugin, and dra-controller#964

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from shengnuo:prometheus-metrics
Apr 8, 2026

Conversation

@shengnuo shengnuo commented Mar 25, 2026

Fixes #352

Metrics prefix

All DRA Prometheus metrics use the prefix nvidia_gpu_dra (exported names are nvidia_gpu_dra_<name>).


Kubelet Plugins (GPU & ComputeDomain)

| Metric | Type | Labels | Note |
| --- | --- | --- | --- |
| requests_total | Counter | driver, operation | Count of prepare/unprepare requests |
| request_duration_seconds | Histogram | driver, operation | Latency; uses exponential buckets 0.05*2^n, 0<=n<=8 |
| requests_inflight | Gauge | driver, operation | Concurrent prepare/unprepare |
| prepared_devices | Gauge | node, driver, device_type | Current prepared devices from the checkpoint file |
| node_prepare_errors_total | Counter | driver, error_type | Node prepare failures (plugin-scoped) |
| node_unprepare_errors_total | Counter | driver, error_type | Node unprepare failures |
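
To make the request_duration_seconds bucket layout concrete, here is a minimal, standard-library-only Go sketch that enumerates the upper bounds 0.05*2^n for 0<=n<=8. The helper name `exponentialBuckets` is hypothetical and written only for illustration. It mirrors the shape of client_golang's `prometheus.ExponentialBuckets(start, factor, count)`, which the PR may or may not call directly; the description does not say.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical helper for illustration only.
// It returns `count` upper bounds, beginning at `start` and multiplying
// by `factor` at each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// start=0.05, factor=2, count=9 corresponds to 0.05*2^n for 0<=n<=8.
	for n, le := range exponentialBuckets(0.05, 2, 9) {
		fmt.Printf("n=%d le=%g\n", n, le)
	}
}
```

The last bound works out to 0.05*2^8 = 12.8 seconds, so any request slower than that lands only in the implicit +Inf bucket.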

Compute-domain controller

| Metric | Type | Labels | Note |
| --- | --- | --- | --- |
| compute_domain_info | GaugeVec | status | Expose ComputeDomain objects by status, from informer cache |

Sample

Sample metrics on the GPU kubelet plugin after preparing and unpreparing a GPU and a MIG device

# HELP nvidia_gpu_dra_prepared_devices Current number of prepared devices by device type.
# TYPE nvidia_gpu_dra_prepared_devices gauge
nvidia_gpu_dra_prepared_devices{device_type="gpu",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="mig",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="unknown",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
nvidia_gpu_dra_prepared_devices{device_type="vfio",driver="gpu.nvidia.com",node="nim-operator-9vg0z43"} 0
# HELP nvidia_gpu_dra_request_duration_seconds Duration of DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_request_duration_seconds histogram
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.05"} 0
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.1"} 1
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="0.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="1.6"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="3.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="6.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="12.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="prepare",le="+Inf"} 2
nvidia_gpu_dra_request_duration_seconds_sum{driver="gpu.nvidia.com",operation="prepare"} 0.19524701700000002
nvidia_gpu_dra_request_duration_seconds_count{driver="gpu.nvidia.com",operation="prepare"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.05"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.1"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="0.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="1.6"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="3.2"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="6.4"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="12.8"} 2
nvidia_gpu_dra_request_duration_seconds_bucket{driver="gpu.nvidia.com",operation="unprepare",le="+Inf"} 2
nvidia_gpu_dra_request_duration_seconds_sum{driver="gpu.nvidia.com",operation="unprepare"} 0.004752574
nvidia_gpu_dra_request_duration_seconds_count{driver="gpu.nvidia.com",operation="unprepare"} 2
# HELP nvidia_gpu_dra_requests_inflight Number of in-flight DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_requests_inflight gauge
nvidia_gpu_dra_requests_inflight{driver="gpu.nvidia.com",operation="prepare"} 0
nvidia_gpu_dra_requests_inflight{driver="gpu.nvidia.com",operation="unprepare"} 0
# HELP nvidia_gpu_dra_requests_total Total number of DRA prepare and unprepare requests.
# TYPE nvidia_gpu_dra_requests_total counter
nvidia_gpu_dra_requests_total{driver="gpu.nvidia.com",operation="prepare"} 2
nvidia_gpu_dra_requests_total{driver="gpu.nvidia.com",operation="unprepare"} 2
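
As a rough reading aid for the histogram sample above, the following Go sketch computes the mean latency from `_sum` and `_count` and estimates a quantile by linear interpolation inside cumulative buckets. This is the same general approach PromQL's `histogram_quantile` takes for classic histograms. The `bucket` type and `estimateQuantile` function are hypothetical names introduced here and are not part of the PR. Only the finite buckets are passed in, and the lowest bucket is assumed to start at 0.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: the number of
// observations with a value <= le. (Hypothetical type for illustration.)
type bucket struct {
	le    float64
	count float64
}

// estimateQuantile interpolates linearly inside the first bucket whose
// cumulative count reaches the target rank. It returns 0 when there are
// no observations. Only finite buckets are expected; the lowest bucket
// is assumed to start at 0.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return 0
	}
	rank := q * total
	lower, prev := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank && b.count > prev {
			return lower + (b.le-lower)*(rank-prev)/(b.count-prev)
		}
		lower, prev = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Cumulative counts copied from the "prepare" sample above.
	prepare := []bucket{
		{0.05, 0}, {0.1, 1}, {0.2, 2}, {0.4, 2}, {0.8, 2},
		{1.6, 2}, {3.2, 2}, {6.4, 2}, {12.8, 2},
	}
	mean := 0.19524701700000002 / 2 // _sum / _count
	fmt.Printf("prepare mean=%.4fs p50~%.3fs\n", mean, estimateQuantile(0.5, prepare))
}
```

With only two observations per operation the estimates are coarse; the sketch is meant to show how the exported series relate to each other, not to characterize real latency.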

copy-pr-bot bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shengnuo shengnuo force-pushed the prometheus-metrics branch 3 times, most recently from bd4d808 to 5a94361 Compare March 25, 2026 16:32
Comment thread cmd/gpu-kubelet-plugin/main.go Outdated
Comment thread cmd/gpu-kubelet-plugin/main.go Outdated
Comment thread pkg/metrics/dra_requests.go Outdated
Comment thread cmd/compute-domain-controller/cdclique.go Outdated
Comment thread cmd/gpu-kubelet-plugin/device_state.go Outdated
Comment thread pkg/metrics/dra_requests.go Outdated
@shengnuo shengnuo force-pushed the prometheus-metrics branch 6 times, most recently from 2974ec5 to 2d461d3 Compare March 25, 2026 22:39
Comment thread pkg/metrics/dra_requests.go
Comment thread cmd/gpu-kubelet-plugin/driver.go
Comment thread deployments/helm/nvidia-dra-driver-gpu/values.yaml Outdated
@shivamerla shivamerla left a comment

Great work! Left some comments. Thanks.

@jgehrcke

All DRA Prometheus metrics use prefix nvidia_gpu_dra (exported names are nvidia_gpu_dra_<name>).

Iterating on user-facing metric names is really important before a first release. Hence, I'd like to ask a few questions (I am sure there are more):

  • Does nvidia need to be part of the prefix (also in view of the upcoming donation -- the answer is probably "yes")?
  • What really is the common prefix for all metrics emitted by anything in this DRA driver? Should this maybe just be nvidia_dra?
  • If we want "gpu" as part of the prefix: do we want the prefix to be nvidia_dra_gpu (sorting by specificity)?
  • If we want "gpu" as part of the prefix -- does this apply to CD-related metrics as well (would we see something like nvidia_dra_gpu_cd_<...>?) My gut feeling is that we should have one metric prefix for each plugin; and that could mean we would end up using nvidia_dra_gpu and nvidia_dra_cd -- which then raises the question of how to prefix controller-emitted metrics.

Just thinking out loud! This needs some brainstorming among several people who ask critical questions, and we also need to look at different variants side by side. Bad naming choices will either stick around forever or create quite a lot of migration pain later on.
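
One way to look at the variants side by side is to expand each candidate prefix from the questions above against a few of the kubelet plugin metric names. The Go sketch below does only that; `fullName` is a hypothetical helper, and the prefix list simply restates the candidates raised in this comment rather than any agreed-upon choice.

```go
package main

import (
	"fmt"
	"strings"
)

// fullName joins a prefix and a metric name with an underscore, the
// usual way Prometheus metric names are composed.
// (Hypothetical helper for illustration.)
func fullName(prefix, name string) string {
	return strings.Join([]string{prefix, name}, "_")
}

func main() {
	// Candidate prefixes mentioned in the discussion; none is final.
	prefixes := []string{"nvidia_gpu_dra", "nvidia_dra", "nvidia_dra_gpu", "nvidia_dra_cd"}
	names := []string{"requests_total", "request_duration_seconds", "prepared_devices"}

	for _, name := range names {
		for _, prefix := range prefixes {
			fmt.Println(fullName(prefix, name))
		}
		fmt.Println()
	}
}
```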

@jgehrcke

compute_domain_daemon_peer_nodes

What does this really measure and why is this useful? Which labels did you consider adding to this metric, and why?

jgehrcke commented Mar 30, 2026

As a general note, let's make sure to read https://prometheus.io/docs/practices/naming :-) (before trying to construct metric names).

compute_domains GaugeVec status Expose ComputeDomain objects by status, from informer cache

Why would this be useful? Let's think about this end to end: let's construct a query-time (dashboarding/alerting) use case that demonstrates value (there may be one, but let's make it obvious).

About "by status": which status do you refer to?

About "Expose ComputeDomain objects": what do you mean by that?

Should this gauge reflect a current count?

compute_domain_cliques_total Gauge (none) Number of ComputeDomainClique CRs in informer cache

Same here: let's discuss the primary use case.

compute_daemons_per_clique

Same here.

When I look at the name compute_daemons, I think it's also rather apparent that we need to think about a CD-related prefix, which should maybe just be cd.

I think we maybe should refrain from adding CD daemon/controller-emitted metrics here in this first patch. The kubelet plugin metrics you have listed all seem to be rather sane, but there are so many questions about the currently proposed metrics emitted by CD daemon/controller. I think this patch may land faster if we keep it focused on the basic plugin metrics.

Comment thread cmd/compute-domain-kubelet-plugin/driver.go Outdated
Comment thread templates/compute-domain-daemon.tmpl.yaml Outdated
@shengnuo shengnuo force-pushed the prometheus-metrics branch from 2d461d3 to 9001190 Compare April 2, 2026 19:56
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 2, 2026
@shengnuo shengnuo force-pushed the prometheus-metrics branch 3 times, most recently from d74cd23 to afebe5b Compare April 3, 2026 00:03
@shengnuo shengnuo force-pushed the prometheus-metrics branch from afebe5b to 6d5d05e Compare April 3, 2026 00:08
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 3, 2026
@shengnuo shengnuo changed the title Implement Prometheus metrics for compute-domain-daemon, gpu-kubelet-plugin, compute-domain-plugin, and dra-controller Implement Prometheus metrics for gpu-kubelet-plugin, compute-domain-plugin, and dra-controller Apr 6, 2026
Comment thread pkg/metrics/computedomain_cluster.go Outdated
Comment thread pkg/metrics/dra_requests.go Outdated
@shivamerla

@shengnuo Could you share a few examples of the metrics being collected, along with any sample dashboards (if available)?

Signed-off-by: Sheng Lin <shelin@nvidia.com>
@shengnuo shengnuo force-pushed the prometheus-metrics branch from 6d5d05e to 0dcfe57 Compare April 8, 2026 19:58
shengnuo commented Apr 8, 2026

@shivamerla Updated the PR description with sample Prometheus metrics

@shivamerla

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 8, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shengnuo, shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2026
@k8s-ci-robot k8s-ci-robot merged commit 9fd0326 into kubernetes-sigs:main Apr 8, 2026
10 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

ComputeDomain: explore exposing Prometheus metrics

7 participants