
fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection#8226

Open
surajssd wants to merge 3 commits into main from suraj/fix-node-ready-flake

Conversation

Member

surajssd commented Apr 1, 2026

What this PR does / why we need it:

WaitUntilNodeReady in e2e/kube.go used a bare Kubernetes watch to detect when a node appeared and became ready. The Kubernetes API server closes watches after a random 5-10 minute internal timeout. When this happened, the watch channel closed, the for range loop exited, and the function immediately called t.Fatalf — with no retry. This caused flaky e2e failures, especially for GPU tests (e.g. Test_Ubuntu2404_GPUA10) where VMSS creation and CSE execution consume most of the 17-minute TestTimeoutVMSS budget, leaving the watch vulnerable to API server timeouts before kubelet can register the node.

This PR replaces the watch with wait.PollUntilContextTimeout polling every 5 seconds using Nodes().List(), matching the pattern already used by WaitUntilPodRunningWithRetry in the same file. This eliminates three issues:

  • No retry on watch close: Polling automatically retries on each interval — there's no single connection to break.
  • Race condition: The old watch started from "now" and could miss a node added between VMSS creation completing and the watch being established. List cannot miss an existing node.
  • Transient API errors: List errors are treated as non-fatal (return false, nil), so the poll retries through transient network blips.

The function signature, logging format, and error messages are preserved. Unused imports (k8s.io/apimachinery/pkg/watch, github.com/stretchr/testify/require) are removed.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the e2e Kubernetes helper logic to reduce flakiness in node readiness detection by replacing a watch-based implementation with polling-based readiness checks.

Changes:

  • Replaced WaitUntilNodeReady’s Kubernetes watch logic with wait.PollUntilContextTimeout + Nodes().List() polling.
  • Treated transient List() errors as non-fatal to allow retries instead of immediate test failure.
  • Removed unused imports (k8s.io/apimachinery/pkg/watch, github.com/stretchr/testify/require).

surajssd force-pushed the suraj/fix-node-ready-flake branch from 79dcea7 to e118514 on April 2, 2026 at 21:51
Copilot AI review requested due to automatic review settings, April 3, 2026 20:21
surajssd force-pushed the suraj/fix-node-ready-flake branch from e118514 to c8d994d on April 3, 2026 at 20:21
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

surajssd added 2 commits April 3, 2026 13:41
fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection

The Kubernetes API server closes watches after a random 5-10 minute
timeout. `WaitUntilNodeReady` had no retry logic, so when the watch
channel closed before the node appeared, the test immediately failed
with "haven't appeared in k8s API server".

Replace the bare watch with `wait.PollUntilContextTimeout` polling
every 5s, matching the pattern used by `WaitUntilPodRunningWithRetry`.

This also fixes a race where a node added between VMSS creation and
watch establishment could be missed entirely.

CSE for GPU nodes can take much longer to complete, so it is highly
likely that the watch times out before kubelet can register the node
with the kube-apiserver.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `return false, nil` with `continue` when a matched node is
  NotReady so the loop checks all prefix-matched nodes before retrying,
  fixing a regression where capacity>1 VMSS would short-circuit on the
  first NotReady node and miss an already-Ready sibling
- Use `node.DeepCopy()` for `lastSeenNode` to avoid retaining the full
  `NodeList` backing array across poll iterations
- Include `err` in the timeout `Fatalf` message for easier diagnosis

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `PollUntilContextTimeout` with `PollUntilContextCancel` to
  defer timeout to the caller's context deadline instead of hardcoding
  90 minutes
- Fail fast on non-retryable `Forbidden`/`Unauthorized` API errors
  instead of silently polling until timeout

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>