Fix: flaky CA E2E test "shouldn't trigger additional scale-ups" #9465

yashrajshuklaaa wants to merge 1 commit into kubernetes:master
Conversation
Re-enable the test disabled in kubernetes#9100.

Root cause: unmanagedNodes was computed as nodeCount - status.ready, where nodeCount includes tainted nodes but status.ready from the CA ConfigMap excludes them. Fixed by using isNodeTainted() to count only untainted nodes, matching CA's own view of the cluster.

Signed-off-by: Yashraj Shukla <shuklayashraj68@gmail.com>
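A minimal sketch of the computation this description points at, under assumed helper names (the real test's isNodeTainted and surrounding code may differ):

```go
package e2e

import v1 "k8s.io/api/core/v1"

// isNodeTainted is a simplified stand-in for the helper named in the commit
// message; here a node counts as tainted if it carries any taint at all.
func isNodeTainted(node *v1.Node) bool {
	return len(node.Spec.Taints) > 0
}

// unmanagedNodeCount sketches the described fix: compare CA's status.ready
// (which excludes tainted nodes) against the count of untainted nodes,
// instead of the raw node count that previously inflated the result.
func unmanagedNodeCount(nodes []v1.Node, statusReady int) int {
	untainted := 0
	for i := range nodes {
		if !isNodeTainted(&nodes[i]) {
			untainted++
		}
	}
	return untainted - statusReady
}
```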
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yashrajshuklaaa

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @yashrajshuklaaa!
Hi @yashrajshuklaaa. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/release-note-none |
Hi @vadasambar @feiskyer, please review this.
/ok-to-test |
/retest |
Hi @yashrajshuklaaa Thanks for the contribution. Unfortunately, I don't believe it's right.

```go
nodes, err := e2enode.GetReadySchedulableNodes(ctx, c)
framework.ExpectNoError(err)
if !nodeCountSet {
	// Guard the same number of schedulable nodes in every test case.
	nodeCount = len(nodes.Items)
	gomega.Expect(nodes.Items).ToNot(gomega.BeEmpty(), "Initial cluster must have at least one schedulable node")
	nodeCountSet = true
	ginkgo.By(fmt.Sprintf("Captured initial cluster size: %v", nodeCount))
}
```

If we take a look at the definition, we will find that the nodes are already filtered with `isNodeUntainted`:

```go
func GetReadySchedulableNodes(ctx context.Context, c clientset.Interface) (nodes *v1.NodeList, err error) {
	logger := klog.FromContext(ctx)
	nodes, err = checkWaitListSchedulableNodes(ctx, c)
	if err != nil {
		return nil, fmt.Errorf("listing schedulable nodes error: %w", err)
	}
	Filter(nodes, func(node v1.Node) bool {
		return IsNodeSchedulable(logger, &node) && isNodeUntainted(logger, &node)
	})
	if len(nodes.Items) == 0 {
		return nil, fmt.Errorf("there are currently no ready, schedulable nodes in the cluster")
	}
	return nodes, nil
}
```

So your change should basically be a no-op for the test logic. AFAIR the flakiness lies in CA sometimes not delivering the ScaleUp Status Update via a K8s Event. Feel free to investigate that direction :).
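For context on that suggestion, here is a minimal sketch of how a test could poll the API for such an event. It assumes the event reason "TriggeredScaleUp", which cluster-autoscaler commonly attaches to pod scale-up events; the actual test may watch a different object or reason, so treat this as illustrative only.

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// hasScaleUpEvent checks whether any event in the namespace carries the
// assumed "TriggeredScaleUp" reason. A flaky delivery of this event would
// make a test waiting on it time out intermittently.
func hasScaleUpEvent(ctx context.Context, c clientset.Interface, namespace string) (bool, error) {
	events, err := c.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, ev := range events.Items {
		if ev.Reason == "TriggeredScaleUp" {
			return true, nil
		}
	}
	return false, nil
}
```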
Hi @Choraden You're absolutely right, I missed that. I'll dig deeper into the CA ScaleUp Status Update delivery via K8s Events as you suggested, and will also wait for #9470 to merge before re-running the E2E tests to get cleaner feedback. I'll update this PR (or open a new one if the fix direction changes significantly) once I have a better understanding of the root cause.