
Conversation

@ryan-mist (Contributor) commented Dec 19, 2025

Fixes #N/A

Description
We see the following race condition with today's consolidation workflow:

  • Karpenter taints a consolidatable node with karpenter.sh/disrupted:NoSchedule
  • Before kube-scheduler's informer caches are up-to-date with the taint on the node, it binds a pod with the do-not-disrupt annotation to the node.

This PR moves tainting before validation to catch this race condition.

This PR shifts the main intention of validation from detecting cluster churn to catching this race condition. The idea is that consolidateAfter serves a similar purpose to the original intention of validation, since the Node cannot be consolidated for a period of time after a pod is removed (or added). For this reason, we also drop the re-run of the scheduling simulation from validation.
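For illustration, here is a minimal Go sketch of the tainting step. The taint key and effect are the ones named above; `applyDisruptionTaint` and its surrounding logic are hypothetical stand-ins, not Karpenter's actual implementation:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// DisruptedNoScheduleTaint is the taint Karpenter places on a consolidatable
// node so kube-scheduler stops binding new pods to it.
var DisruptedNoScheduleTaint = corev1.Taint{
	Key:    "karpenter.sh/disrupted",
	Effect: corev1.TaintEffectNoSchedule,
}

// applyDisruptionTaint adds the taint if it is not already present. Tainting
// now happens BEFORE validation, so the race window in which the scheduler
// can still bind a do-not-disrupt pod closes as soon as its informer cache
// observes the taint.
func applyDisruptionTaint(node *corev1.Node) {
	for _, t := range node.Spec.Taints {
		if t.MatchTaint(&DisruptedNoScheduleTaint) {
			return // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, DisruptedNoScheduleTaint)
	// A real controller would now patch the Node via the API server and set
	// the corresponding condition on the NodeClaim.
}
```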

Old Disruption Order:

Get Candidates for disruption
If Validating (i.e. {singleNode, multiNode, emptiness}Consolidation)
     Wait for 15 seconds
     Validate Candidates
     Revalidate Command [if not emptinessConsolidation]
Mark Candidates for disruption (add taint to Node and add condition to NodeClaim)
Continue ...

New Disruption Order:

Get Candidates for disruption
Mark Candidates for disruption (add taint to Node and add condition to NodeClaim)
If Validating (i.e. {singleNode, multiNode, emptiness}Consolidation)
     Wait for 5 seconds
     Validate Candidates
Continue ...
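A minimal Go sketch of this reordered flow, assuming hypothetical helpers for each step (this is not the controller's actual code):

```go
package sketch

import (
	"context"
	"time"
)

type candidate struct{ nodeName string }

// Hypothetical stand-ins for the disruption controller's steps.
func getCandidates(ctx context.Context) []candidate                { return nil }
func markCandidates(ctx context.Context, cs []candidate) error     { return nil } // taint Nodes, set NodeClaim conditions
func validateCandidates(ctx context.Context, cs []candidate) error { return nil }

func disrupt(ctx context.Context, validating bool) error {
	candidates := getCandidates(ctx)

	// New order: taint first, so kube-scheduler stops binding pods to the
	// candidates before validation runs.
	if err := markCandidates(ctx, candidates); err != nil {
		return err
	}

	// validating is true for single-node, multi-node, and emptiness consolidation.
	if validating {
		time.Sleep(5 * time.Second) // let scheduler informer caches observe the taint
		if err := validateCandidates(ctx, candidates); err != nil {
			return err // e.g. a do-not-disrupt pod bound during the race window
		}
	}

	// Continue with termination / replacement as before.
	return nil
}
```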

How was this change tested?

  • make presubmit
  • manual testing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 19, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ryan-mist
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 19, 2025
@k8s-ci-robot (Contributor) commented:

Hi @ryan-mist. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 19, 2025
@jonathan-innis (Member) commented:

I'm curious if there's a larger proposal about this, since this seems like a fairly significant change -- won't this cause a bunch of overlaunching and sort of nullify the validation behavior of consolidation (since no more pods will schedule to the node if it's already marked with a NoSchedule taint)?

I could see there being a conversation about using a PreferNoSchedule for this here, but NoSchedule feels like it removes the ability to do proper checking that the consolidation should actually be performed.

FWIW, I feel like if we are making this change, we should just remove the time-based validation altogether since it will be doing effectively nothing at this point.

@jonathan-innis (Member) commented:

Can you link an issue for the race condition you are referring to?

@ryan-mist (Contributor, Author) commented:

It shifts the main intention of validation from detecting cluster churn to catching this race condition. It does still "catch" cluster churn between our decision and the taint application, but this is a smaller window than before. The idea is that consolidateAfter serves a similar purpose to the original intention of validation, since the Node cannot be consolidated for a period of time after a pod is removed (or added). This makes the original validation extra, and not a strictly necessary check.

> FWIW, I feel like if we are making this change, we should just remove the time-based validation altogether since it will be doing effectively nothing at this point.

We may not need to revalidate the command, since the window in which pods can be scheduled on the node (between the original simulation and tainting the Node) is small. We do still need the time-based validation to check that no pods with the do-not-disrupt annotation have scheduled to the Node after we apply the taint.
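For concreteness, a sketch of what that post-taint check might look like, using Karpenter's karpenter.sh/do-not-disrupt annotation; the helper name and its caller are assumptions, not the PR's actual code:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

const doNotDisruptAnnotation = "karpenter.sh/do-not-disrupt"

// hasDoNotDisruptPod reports whether any pod on the node carries the
// do-not-disrupt annotation. If one bound before the taint propagated to the
// scheduler's cache, validation fails and the candidate would be un-tainted.
func hasDoNotDisruptPod(pods []*corev1.Pod) bool {
	for _, p := range pods {
		if p.Annotations[doNotDisruptAnnotation] == "true" {
			return true
		}
	}
	return false
}
```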

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 23, 2025
@coveralls commented:

Pull Request Test Coverage Report for Build 20469504972


  • 118 of 161 (73.29%) changed or added relevant lines in 8 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.02%) to 80.294%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---:|---:|---:|
| pkg/controllers/disruption/multinodeconsolidation.go | 8 | 9 | 88.89% |
| pkg/controllers/disruption/singlenodeconsolidation.go | 10 | 11 | 90.91% |
| pkg/controllers/disruption/emptiness.go | 13 | 15 | 86.67% |
| pkg/controllers/disruption/drift.go | 0 | 3 | 0.0% |
| pkg/controllers/disruption/staticdrift.go | 0 | 3 | 0.0% |
| pkg/controllers/disruption/queue.go | 21 | 30 | 70.0% |
| pkg/controllers/disruption/validation.go | 26 | 37 | 70.27% |
| pkg/controllers/disruption/controller.go | 40 | 53 | 75.47% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---:|---:|
| pkg/controllers/disruption/validation.go | 1 | 83.18% |
| pkg/controllers/disruption/queue.go | 3 | 78.04% |

Totals:

  • Change from base Build 20440177454: -0.02%
  • Covered Lines: 11951
  • Relevant Lines: 14884

💛 - Coveralls
