tests: enhance debuggability for issue 902 #922

Merged
jgehrcke merged 3 commits into kubernetes-sigs:main from jgehrcke:jp/tests-more-debug
Mar 7, 2026

Conversation

@jgehrcke jgehrcke commented Mar 6, 2026

The next time #902 happens, we need to capture the resource slice contents.

I've also increased the wait time because, seemingly,

  • the health check may pass before a resource slice update has been performed
  • with DynamicMIG, it may indeed take a couple of seconds before the first resource slice (RS) update is performed

This can probably be done more robustly (improving the liveness probe would, I think, be the best way to achieve that). If this change results in fewer failures over the weekend, we have gained knowledge. If we still get a failure over the weekend with this change, we also win something because of the added log detail.

jgehrcke added 2 commits March 6, 2026 16:22
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
Comment thread tests/bats/test_gpu_basic.bats (outdated)

  - local attrs=$(get_device_attrs_from_any_gpu_slice "gpu")
  + run get_device_attrs_from_any_gpu_slice "gpu"
  + assert_success
  + local attrs="$output"

This change (here and in the other places) makes the test fail explicitly when get_device_attrs_from_any_gpu_slice fails. Without this change, the failure would be ignored because of the sub shell usage in local attrs=$(get_device_attrs_from_any_gpu_slice "gpu"): the exit status of that statement is the one of local, not the one of the sub shell.
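The masking effect described above can be reproduced in isolation. The following is a minimal sketch; failing_helper, combined_status and split_status are hypothetical names used only for illustration and do not come from the test suite.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a helper that prints something and then fails.
failing_helper() {
  echo "payload"
  return 1
}

# Declaration and assignment in one statement:
# $? reflects the exit status of `local`, not of the helper.
combined_status() {
  local attrs=$(failing_helper)
  echo "$?"
}

# Declaration first, assignment second:
# $? reflects the exit status of the command substitution.
split_status() {
  local attrs
  attrs=$(failing_helper)
  echo "$?"
}

combined_status   # prints 0
split_status      # prints 1
```

In a Bats test body, a non-zero status like the one in split_status is what makes the test fail at that line.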


Uh, here we capture both stderr and stdout into output (that's how run works). That breaks things because that function deliberately emits log output to stderr and payload output to stdout. So we need this instead:

  local attrs
  attrs=$(get_device_attrs_from_any_gpu_slice "gpu")

This crashes the test when the sub shell fails, and cleanly directs log output to where it should be.
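To illustrate why mixing the two streams breaks the payload, here is a small sketch. emit_attrs is a hypothetical helper that follows the same convention (logs to stderr, payload to stdout); the 2>&1 redirection only approximates what run does by default when it fills output.

```shell
#!/usr/bin/env bash

# Hypothetical helper: log lines go to stderr, the payload goes to stdout.
emit_attrs() {
  echo "log: querying resource slices" >&2
  echo '{"type":"gpu"}'
}

# Merging both streams: the log line ends up inside the captured value.
merged=$(emit_attrs 2>&1)

# Plain command substitution captures stdout only; stderr passes through.
payload=$(emit_attrs)

printf 'merged:  %s\n' "$merged"
printf 'payload: %s\n' "$payload"
```

Only the second variable holds a value that can be parsed as the payload; the log line still reaches the terminal or the test log via stderr.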


jgehrcke commented Mar 6, 2026

Oh, something is wrong with the patch. Will look into that soon.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>

jgehrcke commented Mar 7, 2026

Okay, let's get this in; on Monday we may already have a strong conclusion. I believe that with the newly introduced active, dynamic waiting for slices to be created, the race condition may be properly alleviated. Let's see.
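The general idea of actively waiting for a condition, instead of sleeping for a fixed time, might look roughly like the sketch below. This is not the code from the patch: wait_for and slices_exist are hypothetical names, and the condition is a stand-in that succeeds on the third call, where a real test would check for the resource slices.

```shell
#!/usr/bin/env bash

# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}

# Stand-in condition for illustration only: succeeds on the third call.
attempts=0
slices_exist() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

wait_for 10 1 slices_exist && echo "condition met after $attempts attempts"
```

Compared with a longer fixed sleep, a loop like this returns as soon as the condition holds and reports a clear timeout message otherwise.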

@jgehrcke jgehrcke merged commit 3fe6036 into kubernetes-sigs:main Mar 7, 2026
17 checks passed
@jgehrcke jgehrcke self-assigned this Mar 9, 2026
@jgehrcke jgehrcke added this to the v26.4.0 milestone Mar 9, 2026
@jgehrcke jgehrcke added the ci/testing issue/PR related to CI and/or testing label Mar 9, 2026