fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection #8226
Open
Contributor
Pull request overview
This PR updates the e2e Kubernetes helper logic to reduce flakiness in node readiness detection by replacing a watch-based implementation with polling-based readiness checks.
Changes:
- Replaced `WaitUntilNodeReady`'s Kubernetes watch logic with `wait.PollUntilContextTimeout` + `Nodes().List()` polling.
- Treated transient `List()` errors as non-fatal to allow retries instead of immediate test failure.
- Removed unused imports (`k8s.io/apimachinery/pkg/watch`, `github.com/stretchr/testify/require`).
fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection

The Kubernetes API server closes watches after a random 5-10 minute timeout. `WaitUntilNodeReady` had no retry logic, so when the watch channel closed before the node appeared, the test immediately failed with "haven't appeared in k8s API server".

Replace the bare watch with `wait.PollUntilContextTimeout` polling every 5s, matching the pattern used by `WaitUntilPodRunningWithRetry`. This also fixes a race where a node added between VMSS creation and watch establishment could be missed entirely.

CSEs for the GPU nodes may take longer to become ready, so it is highly likely that the watch times out before the kubelet can register itself with the kube-apiserver.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `return false, nil` with `continue` when a matched node is NotReady so the loop checks all prefix-matched nodes before retrying, fixing a regression where a capacity>1 VMSS would short-circuit on the first NotReady node and miss an already-Ready sibling
- Use `node.DeepCopy()` for `lastSeenNode` to avoid retaining the full `NodeList` backing array across poll iterations
- Include `err` in the timeout `Fatalf` message for easier diagnosis

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `PollUntilContextTimeout` with `PollUntilContextCancel` to defer the timeout to the caller's context deadline instead of hardcoding 90 minutes
- Fail fast on non-retryable `Forbidden`/`Unauthorized` API errors instead of silently polling until timeout

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
What this PR does / why we need it:
`WaitUntilNodeReady` in `e2e/kube.go` used a bare Kubernetes watch to detect when a node appeared and became ready. The Kubernetes API server closes watches after a random 5-10 minute internal timeout. When this happened, the watch channel closed, the `for range` loop exited, and the function immediately called `t.Fatalf` with no retry. This caused flaky e2e failures, especially for GPU tests (e.g. `Test_Ubuntu2404_GPUA10`) where VMSS creation and CSE execution consume most of the 17-minute `TestTimeoutVMSS` budget, leaving the watch vulnerable to API server timeouts before kubelet can register the node.

This PR replaces the watch with `wait.PollUntilContextTimeout` polling every 5 seconds using `Nodes().List()`, matching the pattern already used by `WaitUntilPodRunningWithRetry` in the same file. This eliminates three issues:

- The API server's watch timeout can no longer abort the wait early.
- A node added before the poll loop starts cannot be missed: `List` cannot miss an existing node.
- `List` errors are treated as non-fatal (`return false, nil`), so the poll retries through transient network blips.

The function signature, logging format, and error messages are preserved. Unused imports (`k8s.io/apimachinery/pkg/watch`, `github.com/stretchr/testify/require`) are removed.