Fix:flaky CA E2E test "shouldn't trigger additional scale-ups" by yashrajshuklaaa · Pull Request #9465 · kubernetes/autoscaler

yashrajshuklaaa · 2026-04-08T17:50:37Z

Re-enable the test disabled in #9100.

Root cause: unmanagedNodes was computed as nodeCount - status.ready, where nodeCount includes tainted nodes but status.ready from the CA ConfigMap excludes them. Fixed by using isNodeTainted() to count only untainted nodes, matching CA's own view of the cluster.
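The arithmetic in this description can be sketched with made-up numbers. This is only an illustration of the claimed mismatch: the `node` type, the `countUntainted` and `unmanagedNodes` helpers, and the `caReady` value are hypothetical stand-ins, not the code in the E2E test or in CA.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a Kubernetes node.
type node struct {
	name    string
	tainted bool
}

// countUntainted returns how many of the given nodes carry no taint.
func countUntainted(nodes []node) int {
	n := 0
	for _, nd := range nodes {
		if !nd.tainted {
			n++
		}
	}
	return n
}

// unmanagedNodes mirrors the subtraction named in the description:
// nodes seen by the test minus nodes the CA status reports as ready.
func unmanagedNodes(nodeCount, caReady int) int {
	return nodeCount - caReady
}

func main() {
	nodes := []node{{"a", false}, {"b", false}, {"c", true}}
	caReady := 1 // assumed status.ready value, for illustration only

	// Counting every node, tainted or not.
	fmt.Println(unmanagedNodes(len(nodes), caReady)) // prints 2

	// Counting only untainted nodes.
	fmt.Println(unmanagedNodes(countUntainted(nodes), caReady)) // prints 1
}
```

The two results differ only if the node list handed to the subtraction actually contains tainted nodes, so the hypothesis depends on how nodeCount is obtained.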

k8s-ci-robot · 2026-04-08T17:50:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yashrajshuklaaa
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-04-08T17:50:46Z

Welcome @yashrajshuklaaa!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2026-04-08T17:50:47Z

Hi @yashrajshuklaaa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

yashrajshuklaaa · 2026-04-08T17:56:33Z

/release-note-none

yashrajshuklaaa · 2026-04-09T13:35:13Z

hi @vadasambar @feiskyer pls review this

jackfrancis · 2026-04-09T23:36:43Z

/ok-to-test

yashrajshuklaaa · 2026-04-10T15:00:28Z

/retest

Choraden · 2026-04-10T17:52:20Z

Hi @yashrajshuklaaa Thanks for the contribution.

Unfortunately, I don't believe it's right.
Notice that the nodeCount is obtained in the following way:

nodes, err := e2enode.GetReadySchedulableNodes(ctx, c)
framework.ExpectNoError(err)

if !nodeCountSet {
	// Guard the same number of schedulable nodes in every test case.
	nodeCount = len(nodes.Items)
	gomega.Expect(nodes.Items).ToNot(gomega.BeEmpty(), "Initial cluster must have at least one schedulable node")
	nodeCountSet = true
	ginkgo.By(fmt.Sprintf("Captured initial cluster size: %v", nodeCount))
}

If we take a look at the definition of GetReadySchedulableNodes, we will find that the nodes are already filtered with isNodeUntainted:

func GetReadySchedulableNodes(ctx context.Context, c clientset.Interface) (nodes *v1.NodeList, err error) {
	logger := klog.FromContext(ctx)
	nodes, err = checkWaitListSchedulableNodes(ctx, c)
	if err != nil {
		return nil, fmt.Errorf("listing schedulable nodes error: %w", err)
	}
	Filter(nodes, func(node v1.Node) bool {
		return IsNodeSchedulable(logger, &node) && isNodeUntainted(logger, &node)
	})
	if len(nodes.Items) == 0 {
		return nil, fmt.Errorf("there are currently no ready, schedulable nodes in the cluster")
	}
	return nodes, nil
}

So your change should basically be a no-op for the test logic.
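The no-op argument can be sketched as follows: applying the same taint filter to a list that has already been filtered returns the same list. The `node` type and `filterUntainted` helper below are hypothetical simplifications for illustration, not the e2e framework's own code.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for v1.Node.
type node struct {
	name    string
	tainted bool
}

// filterUntainted returns a new slice holding only the untainted nodes.
func filterUntainted(nodes []node) []node {
	out := make([]node, 0, len(nodes))
	for _, n := range nodes {
		if !n.tainted {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	all := []node{{"a", false}, {"b", true}, {"c", false}}

	// First pass: what a pre-filtered node list would look like.
	once := filterUntainted(all)

	// Second pass: filtering again in the caller removes nothing further.
	twice := filterUntainted(once)

	fmt.Println(len(all), len(once), len(twice)) // prints 3 2 2
}
```

Because the second pass leaves the count unchanged, a count derived from it equals the count the test already had.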

AFAIR the flakiness lies in CA sometimes not delivering the ScaleUp Status Update via K8s Event. Feel free to investigate that direction :).
Also FYI we have temporary issues with the E2E tests, so to get more reliable feedback I suggest waiting till #9470 is merged.

yashrajshuklaaa · 2026-04-11T00:32:57Z

Hi @Choraden,
thank you for the thorough review and for pointing that out!

You're absolutely right: I missed that GetReadySchedulableNodes already filters via isNodeUntainted, which makes my change effectively a no-op.
I apologize for the oversight.

I'll dig deeper into the CA ScaleUp Status Update delivery via K8s Events, as you suggested, and will also wait for #9470 to merge before re-running the E2E tests to get cleaner feedback.

I'll update this PR (or open a new one if the fix direction changes significantly) once I have a better understanding of the root cause.
Thanks again for the guidance!

k8s-ci-robot added the do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) and do-not-merge/needs-area labels on Apr 8, 2026

k8s-ci-robot requested review from feiskyer and vadasambar on April 8, 2026 17:50

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Apr 8, 2026

k8s-ci-robot added the area/cluster-autoscaler, size/S (denotes a PR that changes 10-29 lines, ignoring generated files), and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels and removed the do-not-merge/needs-area label on Apr 8, 2026

k8s-ci-robot added the release-note-none label (denotes a PR that doesn't merit a release note) and removed the do-not-merge/release-note-label-needed label on Apr 8, 2026

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label on Apr 9, 2026

yashrajshuklaaa wants to merge 1 commit into kubernetes:master from yashrajshuklaaa:fix/flaky-scale-up-e2e-test-9117
