Fix: flaky CA E2E test "shouldn't trigger additional scale-ups" #9465

Open
yashrajshuklaaa wants to merge 1 commit into kubernetes:master from yashrajshuklaaa:fix/flaky-scale-up-e2e-test-9117

@yashrajshuklaaa

Re-enable the test disabled in #9100

Root cause: unmanagedNodes was computed as nodeCount - status.ready, where nodeCount includes tainted nodes but status.ready from the CA ConfigMap excludes them. Fixed by using isNodeTainted() to count only untainted nodes, matching CA's own view of the cluster.

Signed-off-by: Yashraj Shukla <shuklayashraj68@gmail.com>
@k8s-ci-robot added the do-not-merge/release-note-label-needed and do-not-merge/needs-area labels on Apr 8, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yashrajshuklaaa
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @yashrajshuklaaa!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label on Apr 8, 2026
@k8s-ci-robot
Contributor

Hi @yashrajshuklaaa. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the area/cluster-autoscaler, size/S, and cncf-cla: yes labels and removed the do-not-merge/needs-area label on Apr 8, 2026
@yashrajshuklaaa
Author

/release-note-none

@k8s-ci-robot added the release-note-none label and removed the do-not-merge/release-note-label-needed label on Apr 8, 2026
@yashrajshuklaaa
Author

Hi @vadasambar @feiskyer, please review this.

@jackfrancis
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Apr 9, 2026
@yashrajshuklaaa
Author

/retest

@Choraden
Contributor

Hi @yashrajshuklaaa, thanks for the contribution.

Unfortunately, I don't believe it's right.
Notice that nodeCount is obtained in the following way:

nodes, err := e2enode.GetReadySchedulableNodes(ctx, c)
framework.ExpectNoError(err)

if !nodeCountSet {
	// Guard the same number of schedulable nodes in every test case.
	nodeCount = len(nodes.Items)
	gomega.Expect(nodes.Items).ToNot(gomega.BeEmpty(), "Initial cluster must have at least one schedulable node")
	nodeCountSet = true
	ginkgo.By(fmt.Sprintf("Captured initial cluster size: %v", nodeCount))
}

If we take a look at the definition, we will find that the nodes are already filtered with isNodeUntainted:

func GetReadySchedulableNodes(ctx context.Context, c clientset.Interface) (nodes *v1.NodeList, err error) {
	logger := klog.FromContext(ctx)
	nodes, err = checkWaitListSchedulableNodes(ctx, c)
	if err != nil {
		return nil, fmt.Errorf("listing schedulable nodes error: %w", err)
	}
	Filter(nodes, func(node v1.Node) bool {
		return IsNodeSchedulable(logger, &node) && isNodeUntainted(logger, &node)
	})
	if len(nodes.Items) == 0 {
		return nil, fmt.Errorf("there are currently no ready, schedulable nodes in the cluster")
	}
	return nodes, nil
}

So your change should basically be a no-op for the test logic.

AFAIR the flakiness lies in CA sometimes not delivering the ScaleUp Status Update via K8s Event. Feel free to investigate that direction :).
Also FYI we have temporary issues with the E2E tests, so to get more reliable feedback I suggest waiting till #9470 is merged.

@yashrajshuklaaa
Author

Hi @Choraden,
thank you for the thorough review and for pointing that out!

You're absolutely right: I missed that GetReadySchedulableNodes already filters via isNodeUntainted, which makes my change effectively a no-op.
I apologize for the oversight.

I'll dig deeper into the CA ScaleUp Status Update delivery via K8s Events as you suggested, and I'll wait for #9470 to merge before re-running the E2E tests to get cleaner feedback.

I'll update this PR (or open a new one if the fix direction changes significantly) once I have a better understanding of the root cause.
Thanks again for the guidance!

Labels

area/cluster-autoscaler
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
release-note-none (Denotes a PR that doesn't merit a release note.)
size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)

5 participants