tests: enhance debuggability for issue 902 by jgehrcke · Pull Request #922 · kubernetes-sigs/dra-driver-nvidia-gpu

jgehrcke · 2026-03-06T16:33:57Z

The next time #902 happens, we need to capture the resource slice contents.

I've also increased the wait time because, seemingly,

- the health check may pass before a resource slice update has been performed
- in the case of DynamicMIG, it may indeed take a couple of seconds before the first resource slice (RS) update is performed

This can probably be done more robustly (I think improving the liveness probe would be the best way to achieve that). If this change results in fewer failures over the weekend, we have gained knowledge. If we still get a failure over the weekend with this change, we also win something because of the added log detail.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

jgehrcke · 2026-03-06T16:35:13Z

-  local attrs=$(get_device_attrs_from_any_gpu_slice "gpu")
+  run get_device_attrs_from_any_gpu_slice "gpu"
+  assert_success
+  local attrs="$output"


This change (also in the other places) makes the test fail explicitly when get_device_attrs_from_any_gpu_slice fails. Without this change, the failure would be ignored: in local attrs=$(get_device_attrs_from_any_gpu_slice "gpu"), the exit status of the command substitution is replaced by that of local, which is 0.

Uh, here we capture both stderr and stdout into output (that's how run works). That breaks things because that function deliberately emits log output to stderr and payload output to stdout. So we need this instead:

  local attrs
  attrs=$(get_device_attrs_from_any_gpu_slice "gpu")

This fails the test when the command substitution fails, and cleanly directs log output to where it should go.
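To illustrate the two points above outside of the test suite, here is a minimal bash sketch. The function emit_payload is a hypothetical stand-in (not the real get_device_attrs_from_any_gpu_slice); it logs to stderr, prints a payload to stdout, and returns the exit status it is given. The other function names are likewise made up for this illustration.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a helper that logs to stderr and prints its payload
# to stdout. The first argument is the exit status to return (default 0).
emit_payload() {
  echo "log: querying slices" >&2   # log output goes to stderr
  echo "payload"                    # payload output goes to stdout
  return "${1:-0}"
}

masked() {
  # `local` is itself a command; its exit status (0) replaces the status of
  # the command substitution, so the failure goes unnoticed.
  local attrs=$(emit_payload 3 2>/dev/null)
  echo "$?"
}

unmasked() {
  local attrs
  # A plain assignment keeps the exit status of the command substitution.
  attrs=$(emit_payload 3 2>/dev/null)
  echo "$?"
}

merged_streams() {
  # Capturing stderr together with stdout mixes log lines into the payload.
  local out
  out=$(emit_payload 0 2>&1)
  echo "$out"
}

echo "masked=$(masked) unmasked=$(unmasked)"
```

Running this prints masked=0 unmasked=3. With the declaration and the assignment on separate lines, a shell running with errexit (as bats test bodies do) stops at the failing assignment, while the helper's stderr passes through untouched.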

jgehrcke · 2026-03-06T16:39:57Z

Oh, something is wrong with the patch. Will look into that soon.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

jgehrcke · 2026-03-07T14:06:10Z

Okay, let's get this in; on Monday we may already have a strong conclusion. I believe the newly introduced active (dynamic) wait for slices to be created may properly alleviate the race condition. Let's see.
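For readers unfamiliar with the idea, active waiting means polling for a condition with a deadline instead of sleeping for a fixed time. The following is only a sketch of that general pattern, not the code merged in this PR: wait_until and slices_exist are hypothetical names, and the kubectl query is an assumption about how one might check for resource slices.

```shell
#!/usr/bin/env bash

# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or the timeout elapses.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  while true; do
    if "$@"; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      # Report on stderr so that stdout stays free for payload output.
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical probe (assumption): succeeds once at least one ResourceSlice
# object is listed by the cluster.
slices_exist() {
  local count
  count=$(kubectl get resourceslices --no-headers 2>/dev/null | wc -l)
  (( count > 0 ))
}

# Example usage (not executed here):
#   wait_until 60 2 slices_exist
```

Compared to a longer fixed sleep, this returns as soon as the condition holds and produces an explicit timeout message when it never does.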

jgehrcke added 2 commits March 6, 2026 16:22

tests: enhance debuggability for issue 902

0797dba

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: wait more before inspecting resource slice

7ed4ac5

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

tests: dynamically wait for resource slices to pop up

3570640

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

jgehrcke merged commit 3fe6036 into kubernetes-sigs:main Mar 7, 2026
17 checks passed

jgehrcke self-assigned this Mar 9, 2026

jgehrcke added this to the v26.4.0 milestone Mar 9, 2026

jgehrcke added the ci/testing label (issue/PR related to CI and/or testing) Mar 9, 2026

jgehrcke merged 3 commits into kubernetes-sigs:main from jgehrcke:jp/tests-more-debug
