
fix: don't provision unnecessary capacity for pods which can't move to a new node #2033


Merged: 9 commits merged into kubernetes-sigs:main on Apr 10, 2025

Conversation

@saurav-agarwalla (Contributor) commented Feb 26, 2025

Fixes #1842, #1928, aws/karpenter-provider-aws#7521

Description
This PR takes care of two cases where Karpenter provisions unnecessary capacity today:

  1. Pods with karpenter.sh/do-not-disrupt=true: these pods aren't evictable, so Karpenter shouldn't consider them reschedulable either; their only end states are reaching a terminal phase or being forcibly deleted (after a termination grace period). Treating them as non-reschedulable prevents Karpenter from spinning up and reserving unnecessary capacity for them on new nodes.
  2. Pods which can't be evicted due to fully blocking PDBs: Karpenter doesn't know how long it will take for these pods to be successfully evicted, and in the absence of a termination grace period (TGP) eviction can take a very long time, if it happens at all. When this is the case, Karpenter shouldn't provision unnecessary capacity for these pods either.

This doesn't handle PDBs that only transiently block eviction; more work is needed there because Karpenter can't predict whether an eviction blocker is temporary or permanent. If Karpenter skipped a pod with a transient eviction blocker (say, a temporarily low disruption budget on a PDB), it could end up creating many smaller nodeclaims instead of one large nodeclaim that fits all of those pods, which could cost more overall, especially for customers who don't have consolidation enabled.
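For illustration only (this is not code from the PR), a minimal sketch in Go of the idea behind the two handled cases: skip pods that can't currently leave their node when deciding what needs replacement capacity. The helper names and the blocked callback are hypothetical stand-ins for whatever the real code uses.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// DoNotDisruptAnnotation is the pod annotation referenced in this PR.
const DoNotDisruptAnnotation = "karpenter.sh/do-not-disrupt"

// cannotMoveToNewNode reports whether Karpenter should skip reserving new
// capacity for this pod: either it carries the do-not-disrupt annotation, or
// its eviction is blocked by a fully blocking PDB (that boolean is assumed to
// be computed elsewhere, e.g. from a PDB limits lookup).
func cannotMoveToNewNode(pod *corev1.Pod, blockedByFullyBlockingPDB bool) bool {
	if pod.Annotations[DoNotDisruptAnnotation] == "true" {
		return true
	}
	return blockedByFullyBlockingPDB
}

// filterCurrentlyReschedulable drops pods that cannot move, so provisioning
// does not size nodeclaims for pods that will never land on them.
func filterCurrentlyReschedulable(pods []*corev1.Pod, blocked func(*corev1.Pod) bool) []*corev1.Pod {
	out := make([]*corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if cannotMoveToNewNode(p, blocked(p)) {
			continue
		}
		out = append(out, p)
	}
	return out
}
```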

Without this change:

  • Karpenter will continue to bring up new nodeclaims when the original nodeclaim expires (but can't be terminated because these pods hold up the termination in the absence of a termination grace period)
  • Karpenter will not be able to consolidate nodeclaims nominated for these pods because the pods are never going to move to them

How was this change tested?
Reproduced the scenario where a nodeclaim has pods with karpenter.sh/do-not-disrupt=true and saw that Karpenter was continuously spinning up new nodeclaims after the expiry of the original nodeclaim (even though it was stuck due to these pods). It wasn't able to consolidate new nodeclaims that were nominated for these pods either.

After the change, Karpenter doesn't spin up new nodeclaims for these pods.

Did the same for pods whose eviction was blocked by a PDB.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 26, 2025
@otoupin-nsesi commented Feb 27, 2025

Could we have the same “not reschedulable” behaviour for “problematic” PDBs? They cause the same issues and behave the same way as the do-not-disrupt pods, i.e. they are not truly reschedulable and will never move (without a TGP).

For example, the following PDB:

NAMESPACE     NAME            MIN-AVAILABLE   MAX-UNAVAILABLE   ALLOWED-DISRUPTIONS   CURRENT   DESIRED   EXPECTED
cnpg-system   my-db-primary   1               n/a               0                     1         1         1

is “problematic” in a Karpenter context, causing the same behaviour you are trying to patch, so it would make sense to treat them the same. The downside is that it's a lot more logic than just checking for do-not-disrupt annotations, and maybe it doesn't belong here.

Similarly problematic PDBs: a singleton pod (1 replica, min-available 1), which is likely done on purpose, and misconfigured PDBs, which block in similar ways but by mistake (these could be ignored, since you could argue they should be fixed or detected by policies).
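For illustration only (not from this PR), a rough sketch of how a PDB like the one above could be classified as fully blocking; the heuristic and the exact fields used are assumptions, not Karpenter's actual logic.

```go
package sketch

import (
	policyv1 "k8s.io/api/policy/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// isFullyBlocking is a rough heuristic for PDBs that will never allow an
// eviction on their own: a maxUnavailable of 0 (or "0%"), or a minAvailable
// that already covers every healthy pod while allowed disruptions sit at 0,
// as in the my-db-primary example above.
func isFullyBlocking(pdb *policyv1.PodDisruptionBudget) bool {
	if mu := pdb.Spec.MaxUnavailable; mu != nil {
		if (mu.Type == intstr.Int && mu.IntValue() == 0) || (mu.Type == intstr.String && mu.StrVal == "0%") {
			return true
		}
	}
	if ma := pdb.Spec.MinAvailable; ma != nil && ma.Type == intstr.Int {
		if int32(ma.IntValue()) >= pdb.Status.CurrentHealthy && pdb.Status.DisruptionsAllowed == 0 {
			return true
		}
	}
	return false
}
```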

@saurav-agarwalla (Contributor Author)

That's precisely what we discussed today: #1928 (comment)

I am exploring that option, but I want to keep these changes separate to make them easier to review and justify.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 28, 2025
@saurav-agarwalla saurav-agarwalla changed the title fix: don't mark pods with 'karpenter.sh/do-not-disrupt=true' as reschedulable fix: don't provision unnecessary capacity for pods which can't move to a new node Feb 28, 2025
@saurav-agarwalla (Contributor Author)

Pushed changes to handle this for pods which can't be evicted due to PDB violations as well, since those changes weren't huge.

@coveralls commented Feb 28, 2025

Pull Request Test Coverage Report for Build 14391239438

Details

  • 72 of 77 (93.51%) changed or added relevant lines in 6 files are covered.
  • 12 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.06%) to 81.862%

Changes Missing Coverage                                 Covered Lines   Changed/Added Lines   %
pkg/utils/node/node.go                                   4               6                     66.67%
pkg/controllers/disruption/helpers.go                    10              13                    76.92%

Files with Coverage Reduction                            New Missed Lines   %
pkg/controllers/disruption/helpers.go                    1                  86.71%
pkg/controllers/node/termination/controller.go           2                  72.08%
pkg/test/expectations/expectations.go                    2                  94.62%
pkg/controllers/provisioning/scheduling/preferences.go   7                  88.76%
Totals Coverage Status
Change from base Build 14346744194: -0.06%
Covered Lines: 9848
Relevant Lines: 12030

💛 - Coveralls

@saurav-agarwalla (Contributor Author)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 28, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 19, 2025
// Don't provision capacity for pods which aren't getting evicted due to PDB violation.
// Since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity until
// these pods are evicted is wasteful.
pdbs, err := pdb.NewLimits(ctx, clock, kubeClient)
Member

Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time). I think we can really only consider PDBs where they are persistently blocking (0 unavailable, etc.). We could maybe think about considering a time-based check for persistently blocking PDBs in the future to scope into this (e.g. if a PDB has been fully blocking for more than an hour then it's "stuck")

Contributor Author

> Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time).

Do you mind elaborating on this a little more? The way I understand it, if a pod is terminating due to eviction, we don't consider it IsReschedulable anyway. It's possible that whether a pod is reschedulable can change fairly quickly due to this logic, but I'm not sure that's necessarily a bad thing since things will eventually settle down. This isn't very different from pods that are blocked by the do-not-disrupt annotation until the customer removes the annotation.

We also discussed the time-based approach in one of the community meetings. While it is doable, determining the actual threshold is tricky, and no matter what we choose, some customers could always prefer a shorter time. Ultimately that trade-off is against the downsides of the current approach, so I'm trying to understand those first, and then we can see which approach we choose.
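For illustration only (nothing like this is implemented in this PR), a rough sketch of the time-based idea being discussed above: track how long each PDB has allowed zero disruptions and only treat it as "stuck" past a threshold. All names here are hypothetical.

```go
package sketch

import (
	"time"

	policyv1 "k8s.io/api/policy/v1"
)

// blockingTracker records when each PDB last dropped to zero allowed
// disruptions, so a caller can ask whether it has been "stuck" long enough.
type blockingTracker struct {
	since map[string]time.Time
}

func newBlockingTracker() *blockingTracker {
	return &blockingTracker{since: map[string]time.Time{}}
}

// isStuck reports whether the PDB has allowed zero disruptions continuously
// for at least threshold (e.g. an hour). Short-lived blockages reset the
// clock, which avoids reacting to transient PDB pressure.
func (t *blockingTracker) isStuck(pdb *policyv1.PodDisruptionBudget, now time.Time, threshold time.Duration) bool {
	key := pdb.Namespace + "/" + pdb.Name
	if pdb.Status.DisruptionsAllowed > 0 {
		delete(t.since, key)
		return false
	}
	if _, ok := t.since[key]; !ok {
		t.since[key] = now
	}
	return now.Sub(t.since[key]) >= threshold
}
```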

Contributor Author

As we discussed offline, I have updated the PR to consider only fully blocking PDBs for now, to avoid creating a lot of small nodes, which could happen if we considered all PDBs.

Member

Rather than having to do the PDB check below in both places, is it worthwhile to integrate the PDB check into the pod scheduling utils and into the ReschedulablePods call so that it's standardized?

Member

In particular, I think we need to make sure that we don't consider all of the candidate node's pods for rescheduling if they aren't reschedulable.

Contributor Author

Done - I added the concept of IsCurrentlyReschedulable, which considers the do-not-disrupt annotation and fully blocking PDBs to determine whether a pod is reschedulable at a given point in time. This keeps IsReschedulable untouched, since it's used for other things (like determining whether a node is empty) that shouldn't change as part of this PR.
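For illustration only (the actual signature in the PR may differ), a rough sketch of the IsCurrentlyReschedulable concept described above; the isReschedulable and fullyBlockedByPDB callbacks are hypothetical stand-ins for the existing utility and the PDB limits lookup.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// IsCurrentlyReschedulable sketches the concept described above: a pod is
// only worth reserving new capacity for right now if it is reschedulable in
// general and nothing currently pins it to its node.
func IsCurrentlyReschedulable(pod *corev1.Pod, isReschedulable, fullyBlockedByPDB func(*corev1.Pod) bool) bool {
	if !isReschedulable(pod) {
		return false
	}
	if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
		return false
	}
	return !fullyBlockedByPDB(pod)
}
```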

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 7, 2025
return &Drift{
kubeClient: kubeClient,
cluster: cluster,
provisioner: provisioner,
recorder: recorder,
clock: clock,
Contributor

nit: The ordering is inconsistent

Contributor Author

Ordering of the struct members here? Not sure if there's a convention for that, is there?

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 9, 2025
@jonathan-innis (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, saurav-agarwalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 3cf784a into kubernetes-sigs:main Apr 10, 2025
16 checks passed
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
