
fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection#8226

Open
surajssd wants to merge 3 commits into main from suraj/fix-node-ready-flake

Conversation

Member

surajssd commented Apr 1, 2026

What this PR does / why we need it:

WaitUntilNodeReady in e2e/kube.go used a bare Kubernetes watch to detect when a node appeared and became ready. The Kubernetes API server closes watches after a random 5-10 minute internal timeout. When this happened, the watch channel closed, the for range loop exited, and the function immediately called t.Fatalf — with no retry. This caused flaky e2e failures, especially for GPU tests (e.g. Test_Ubuntu2404_GPUA10) where VMSS creation and CSE execution consume most of the 17-minute TestTimeoutVMSS budget, leaving the watch vulnerable to API server timeouts before kubelet can register the node.

This PR replaces the watch with wait.PollUntilContextTimeout polling every 5 seconds using Nodes().List(), matching the pattern already used by WaitUntilPodRunningWithRetry in the same file. This eliminates three issues:

  • No retry on watch close: Polling automatically retries on each interval — there's no single connection to break.
  • Race condition: The old watch started from "now" and could miss a node added between VMSS creation completing and the watch being established. List cannot miss an existing node.
  • Transient API errors: List errors are treated as non-fatal (return false, nil), so the poll retries through transient network blips.

The function signature, logging format, and error messages are preserved. Unused imports (k8s.io/apimachinery/pkg/watch, github.com/stretchr/testify/require) are removed.

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the e2e Kubernetes helper logic to reduce flakiness in node readiness detection by replacing a watch-based implementation with polling-based readiness checks.

Changes:

  • Replaced WaitUntilNodeReady’s Kubernetes watch logic with wait.PollUntilContextTimeout + Nodes().List() polling.
  • Treated transient List() errors as non-fatal to allow retries instead of immediate test failure.
  • Removed unused imports (k8s.io/apimachinery/pkg/watch, github.com/stretchr/testify/require).

surajssd force-pushed the suraj/fix-node-ready-flake branch from 79dcea7 to e118514 on April 2, 2026 at 21:51
Copilot AI review requested due to automatic review settings, April 3, 2026 20:21
surajssd force-pushed the suraj/fix-node-ready-flake branch from e118514 to c8d994d on April 3, 2026 at 20:21
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

surajssd added 2 commits April 3, 2026 13:41
fix(e2e): replace watch with polling in WaitUntilNodeReady to fix flaky node detection

The Kubernetes API server closes watches after a random 5-10 minute
timeout. `WaitUntilNodeReady` had no retry logic, so when the watch
channel closed before the node appeared, the test immediately failed
with "haven't appeared in k8s API server".

Replace the bare watch with `wait.PollUntilContextTimeout` polling
every 5s, matching the pattern used by `WaitUntilPodRunningWithRetry`.

This also fixes a race where a node added between VMSS creation and
watch establishment could be missed entirely.

CSE for GPU nodes can take much longer to complete, so it is highly
likely that the watch times out before kubelet can register the node
with the kube-apiserver.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `return false, nil` with `continue` when a matched node is
  NotReady so the loop checks all prefix-matched nodes before retrying,
  fixing a regression where capacity>1 VMSS would short-circuit on the
  first NotReady node and miss an already-Ready sibling
- Use `node.DeepCopy()` for `lastSeenNode` to avoid retaining the full
  `NodeList` backing array across poll iterations
- Include `err` in the timeout `Fatalf` message for easier diagnosis

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
- Replace `PollUntilContextTimeout` with `PollUntilContextCancel` to
  defer timeout to the caller's context deadline instead of hardcoding
  90 minutes
- Fail fast on non-retryable `Forbidden`/`Unauthorized` API errors
  instead of silently polling until timeout

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>