[WIP] Fix metrics test: nvidia_dra_requests_total may not exist before first use #1028

Closed
dims wants to merge 1 commit into kubernetes-sigs:main from dims:worktree-fix-metrics-assert

Conversation

@dims
Member

@dims dims commented Apr 12, 2026

The nvidia_dra_requests_total Prometheus counter is only registered after the first DRA allocation request. When the metrics smoke test runs immediately after chart install — before any GPU pod has been scheduled — the counter does not appear in the /metrics output. This causes the assert_output --partial "nvidia_dra_requests_total" assertion to fail.

Make the assertion conditional: still validate the counter if it's present, but don't fail when it hasn't been observed yet. The nvidia_dra_prepared_devices gauge is always present and remains the primary assertion.
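
A minimal sketch of what such a conditional check could look like in plain shell. This is an illustration, not the change in this PR: the helper name check_metrics, the op="prepare" label, and the sample values are hypothetical, and the value check only handles plain decimal numbers.

```shell
# Hypothetical helper: takes the scraped /metrics text as its only argument.
check_metrics() {
  metrics="$1"

  # Primary assertion: the gauge must always be present.
  if ! printf '%s\n' "$metrics" | grep -q 'nvidia_dra_prepared_devices'; then
    echo "missing nvidia_dra_prepared_devices" >&2
    return 1
  fi

  # Conditional assertion: validate the counter only if it has been observed.
  if printf '%s\n' "$metrics" | grep -q '^nvidia_dra_requests_total'; then
    # Fail if any counter sample does not end in a plain decimal value.
    if printf '%s\n' "$metrics" | grep '^nvidia_dra_requests_total' \
        | grep -qvE ' [0-9]+(\.[0-9]+)?$'; then
      echo "malformed nvidia_dra_requests_total sample" >&2
      return 1
    fi
  fi
  return 0
}

# Usage with made-up scrape output: the counter is absent, so only the gauge is checked.
check_metrics 'nvidia_dra_prepared_devices 0' && echo "metrics ok"
```

The trade-off raised later in this thread applies: a check like this passes when the counter is missing, so it mitigates the test failure without explaining why the counter is absent.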

This is the root cause of the Lambda CI failure in #1025:

not ok 10 GPUs: kubelet-plugin exposes Prometheus metrics in 4384ms

Prow log

…t use

The nvidia_dra_requests_total counter is only registered after the first
DRA allocation request. When the metrics test runs right after chart
install (before any GPU pods), the counter does not appear in the metrics
output. Make the assertion conditional so it does not fail when the
counter has not been observed yet.

The nvidia_dra_prepared_devices gauge is always present and remains the
primary assertion.
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2026
@dims
Member Author

dims commented Apr 12, 2026

/assign @shivamerla

@shivamerla should we fix the root cause? this is masking a symptom

run kubectl exec -n dra-driver-nvidia-gpu "${plugin_pod}" -c gpus -- \
    sh -c 'curl -sf http://localhost:8080/metrics 2>/dev/null || wget -qO- http://localhost:8080/metrics'
assert_output --partial "nvidia_dra_prepared_devices"
assert_output --partial "nvidia_dra_requests_total"
Contributor


@dims we register all metrics at once here, so they should be reported with a value of 0. I wonder why the test failed only for this metric.
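
One way to narrow this down, sketched under assumptions: in the Prometheus Go client, a labeled counter vector that is registered but has no observed label combinations emits no series at all, while an unlabeled counter or gauge reports 0 right away. Whether nvidia_dra_requests_total is such a vector is not confirmed in this thread. The hypothetical helper metric_state below classifies a metric in scraped text as sampled, declared (HELP/TYPE only), or absent; the sample input is made up.

```shell
# Hypothetical helper: metric_state NAME METRICS_TEXT
metric_state() {
  name="$1"
  metrics="$2"
  # A sample line starts with the metric name followed by '{' or a space.
  if printf '%s\n' "$metrics" | grep -qE "^${name}(\{| )"; then
    echo sampled
  # Otherwise look for metadata lines only.
  elif printf '%s\n' "$metrics" | grep -qE "^# (HELP|TYPE) ${name} "; then
    echo declared
  else
    echo absent
  fi
}

# Made-up scrape output containing only the gauge.
sample='# HELP nvidia_dra_prepared_devices Hypothetical help text.
# TYPE nvidia_dra_prepared_devices gauge
nvidia_dra_prepared_devices 0'

metric_state nvidia_dra_prepared_devices "$sample"   # prints: sampled
metric_state nvidia_dra_requests_total "$sample"     # prints: absent
```

Running this against the real /metrics output right after chart install would show whether the counter is fully absent or merely has no samples yet.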

Member Author


@shivamerla I'd rather land #1029, which is the complete fix, than the mitigation in this PR.

@dims
Member Author

dims commented Apr 12, 2026

/hold going to show a better way hopefully!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 12, 2026
@dims dims changed the title Fix metrics test: nvidia_dra_requests_total may not exist before first use [WIP] Fix metrics test: nvidia_dra_requests_total may not exist before first use Apr 12, 2026
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 12, 2026
@dims dims closed this Apr 12, 2026