Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion tests/bats/test_gpu_robustness.bats
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,12 @@ bats::on_failure() {
run kubectl exec -n dra-driver-nvidia-gpu "${plugin_pod}" -c gpus -- \
sh -c 'curl -sf http://localhost:8080/metrics 2>/dev/null || wget -qO- http://localhost:8080/metrics'
assert_output --partial "nvidia_dra_prepared_devices"
assert_output --partial "nvidia_dra_requests_total"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dims we register all metric at once here, so they should be reported with 0 value. Wondering why the test failed only for this metric.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shivamerla i'd rather land this one which is complete fix #1029

rather than the mitigation in this PR.

# nvidia_dra_requests_total is a counter that only appears after the first
# DRA request. At setup_file time no pods have used a GPU yet, so the metric
# may not be registered. Check for it only if it exists.
if echo "$output" | grep -q "nvidia_dra_requests_total"; then
assert_output --partial "nvidia_dra_requests_total"
fi
}


Expand Down
Loading