[WIP] Fix metrics test: nvidia_dra_requests_total may not exist before first use by dims · Pull Request #1028 · kubernetes-sigs/dra-driver-nvidia-gpu

dims · 2026-04-12T15:02:12Z

The nvidia_dra_requests_total Prometheus counter is only registered after the first DRA allocation request. When the metrics smoke test runs immediately after chart install — before any GPU pod has been scheduled — the counter does not appear in the /metrics output. This causes the assert_output --partial "nvidia_dra_requests_total" assertion to fail.

Make the assertion conditional: still validate the counter if it's present, but don't fail when it hasn't been observed yet. The nvidia_dra_prepared_devices gauge is always present and remains the primary assertion.

This is the root cause of the Lambda CI failure in #1025:

not ok 10 GPUs: kubelet-plugin exposes Prometheus metrics in 4384ms

Prow log
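As a rough illustration of the conditional assertion described above, here is a minimal shell sketch. It is not the code in this PR: the helper name `check_metrics` is hypothetical, and the "validate if present" rule (every sample line ends in a non-negative number) is an assumption, since the description does not say what the validation checks. The helper reads the /metrics text on stdin.

```shell
# Hypothetical helper, not part of the repository's test suite.
# Reads Prometheus exposition text on stdin and returns non-zero on failure.
check_metrics() {
  metrics="$(cat)"

  # Primary assertion: the gauge is expected to be present at all times.
  if ! printf '%s\n' "$metrics" | grep -q 'nvidia_dra_prepared_devices'; then
    echo "missing nvidia_dra_prepared_devices" >&2
    return 1
  fi

  # Conditional assertion: collect counter sample lines, if there are any.
  samples="$(printf '%s\n' "$metrics" | grep '^nvidia_dra_requests_total' || true)"
  if [ -n "$samples" ]; then
    # Assumed validation: fail if any sample line does not end in a
    # non-negative number. grep -v selects the lines that do NOT match.
    if printf '%s\n' "$samples" | grep -Evq ' [0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?$'; then
      echo "malformed nvidia_dra_requests_total sample" >&2
      return 1
    fi
  fi
  return 0
}
```

In a test, the output of the `kubectl exec ... curl` call could be piped into `check_metrics`. An absent counter then passes, while a missing gauge still fails.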

…t use The nvidia_dra_requests_total counter is only registered after the first DRA allocation request. When the metrics test runs right after chart install (before any GPU pods), the counter does not appear in the metrics output. Make the assertion conditional so it does not fail when the counter has not been observed yet. The nvidia_dra_prepared_devices gauge is always present and remains the primary assertion.

k8s-ci-robot · 2026-04-12T15:02:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [dims]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dims · 2026-04-12T15:13:07Z

/assign @shivamerla

@shivamerla Should we fix the root cause? This is masking a symptom.

shivamerla · 2026-04-12T17:06:02Z

tests/bats/test_gpu_robustness.bats

  run kubectl exec -n dra-driver-nvidia-gpu "${plugin_pod}" -c gpus -- \
    sh -c 'curl -sf http://localhost:8080/metrics 2>/dev/null || wget -qO- http://localhost:8080/metrics'
  assert_output --partial "nvidia_dra_prepared_devices"
-  assert_output --partial "nvidia_dra_requests_total"


@dims we register all metrics at once here, so they should be reported with a value of 0. I wonder why the test failed only for this metric.

@shivamerla I'd rather land #1029, which is the complete fix, than the mitigation in this PR.

dims · 2026-04-12T20:59:44Z

/hold going to show a better way hopefully!

github-project-automation bot added this to Planning Board: k8s-dra-driver-gpu Apr 12, 2026

github-project-automation bot moved this to Backlog in Planning Board: k8s-dra-driver-gpu Apr 12, 2026

k8s-ci-robot requested review from guptaNswati and klueska April 12, 2026 15:02

k8s-ci-robot added the approved, cncf-cla: yes, and size/XS labels Apr 12, 2026

k8s-ci-robot assigned shivamerla Apr 12, 2026

shivamerla reviewed Apr 12, 2026


k8s-ci-robot added the do-not-merge/hold label Apr 12, 2026

dims mentioned this pull request Apr 12, 2026

Initialize DRA request metrics series at startup #1029

Open

dims changed the title ~~Fix metrics test: nvidia_dra_requests_total may not exist before first use~~ [WIP] Fix metrics test: nvidia_dra_requests_total may not exist before first use Apr 12, 2026

k8s-ci-robot added the do-not-merge/work-in-progress label Apr 12, 2026

dims closed this Apr 12, 2026

github-project-automation bot moved this from Backlog to Closed in Planning Board: k8s-dra-driver-gpu Apr 12, 2026

dims wants to merge 1 commit into kubernetes-sigs:main from dims:worktree-fix-metrics-assert


3 participants
