
fix: don't provision unnecessary capacity for pods which can't move to a new node #2033


Merged: 9 commits merged into kubernetes-sigs:main on Apr 10, 2025

Conversation

@saurav-agarwalla (Contributor) commented Feb 26, 2025

Fixes #1842, #1928, aws/karpenter-provider-aws#7521

Description
This PR takes care of two cases where Karpenter provisions unnecessary capacity today:

  1. Pods with karpenter.sh/do-not-disrupt=true: these pods aren't evictable, so Karpenter shouldn't consider them reschedulable either; their only end states are reaching a terminal phase or being forcibly deleted (after a termination grace period). Treating them as non-reschedulable prevents Karpenter from spinning up and reserving unnecessary capacity for them on new nodes.
  2. Pods which can't be evicted due to fully blocking PDBs: Karpenter doesn't know how long it will take for these pods to be successfully evicted, and in the absence of a termination grace period (TGP) eviction can take a very long time, if it happens at all. When this is the case, Karpenter shouldn't provision unnecessary capacity for these pods either.

This doesn't handle PDBs that only transiently block eviction; more work is needed there because Karpenter can't predict whether an eviction blocker is temporary or permanent. If Karpenter skipped a pod with a transient eviction blocker (say, a temporarily low disruption budget on a PDB), it could end up creating many smaller nodeclaims instead of one large nodeclaim that fits all of those pods, which could cost more overall, especially for customers who don't have consolidation enabled.
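For illustration only (this is not code from the PR), a minimal sketch in Go of the idea behind the two handled cases: skip pods that can't currently leave their node when deciding what needs replacement capacity. The helper names and the blocked callback are hypothetical stand-ins for whatever the real code uses.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// DoNotDisruptAnnotation is the pod annotation referenced in this PR.
const DoNotDisruptAnnotation = "karpenter.sh/do-not-disrupt"

// cannotMoveToNewNode reports whether Karpenter should skip reserving new
// capacity for this pod: either it carries the do-not-disrupt annotation, or
// its eviction is blocked by a fully blocking PDB (that boolean is assumed to
// be computed elsewhere, e.g. from a PDB limits lookup).
func cannotMoveToNewNode(pod *corev1.Pod, blockedByFullyBlockingPDB bool) bool {
	if pod.Annotations[DoNotDisruptAnnotation] == "true" {
		return true
	}
	return blockedByFullyBlockingPDB
}

// filterCurrentlyReschedulable drops pods that cannot move, so provisioning
// does not size nodeclaims for pods that will never land on them.
func filterCurrentlyReschedulable(pods []*corev1.Pod, blocked func(*corev1.Pod) bool) []*corev1.Pod {
	out := make([]*corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if cannotMoveToNewNode(p, blocked(p)) {
			continue
		}
		out = append(out, p)
	}
	return out
}
```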

Without this change:

  • Karpenter will continue to bring up new nodeclaims when the original nodeclaim expires (but can't be terminated because these pods hold up the termination in the absence of a termination grace period)
  • Karpenter will not be able to consolidate nodeclaims nominated for these pods because the pods are never going to move to them

How was this change tested?
Reproduced the scenario where a nodeclaim has pods with karpenter.sh/do-not-disrupt=true and saw that Karpenter was continuously spinning up new nodeclaims after the expiry of the original nodeclaim (even though it was stuck due to these pods). It wasn't able to consolidate new nodeclaims that were nominated for these pods either.

After the change, Karpenter doesn't spin up new nodeclaims for these pods.

Did the same for pods whose eviction was blocked by a PDB.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 26, 2025
@otoupin-nsesi commented Feb 27, 2025

Could we have the same “not reschedulable” behaviour for “problematic” PDBs? They cause the same issues and behave the same way as the do-not-disrupt pods, i.e. they are not truly reschedulable and will never move (without a TGP).

For example, the following PDB:

NAMESPACE     NAME            MIN-AVAILABLE   MAX-UNAVAILABLE   ALLOWED-DISRUPTIONS   CURRENT   DESIRED   EXPECTED
cnpg-system   my-db-primary   1               n/a               0                     1         1         1

is “problematic” in a Karpenter context, causing the same behaviour you are trying to patch, so it would make sense to treat them the same. The downside is that it's a lot more logic than just checking for do-not-disrupt annotations, and maybe it doesn't belong here.

Similarly problematic PDBs: a singleton pod (1 replica, min-available 1), which is likely done on purpose, and misconfigured PDBs, which block in similar ways but by mistake (these could be ignored, since you could argue they should be fixed or detected by policies).
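For illustration only (not from this PR), a rough sketch of how a PDB like the one above could be classified as fully blocking; the heuristic and the exact fields used are assumptions, not Karpenter's actual logic.

```go
package sketch

import (
	policyv1 "k8s.io/api/policy/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// isFullyBlocking is a rough heuristic for PDBs that will never allow an
// eviction on their own: a maxUnavailable of 0 (or "0%"), or a minAvailable
// that already covers every healthy pod while allowed disruptions sit at 0,
// as in the my-db-primary example above.
func isFullyBlocking(pdb *policyv1.PodDisruptionBudget) bool {
	if mu := pdb.Spec.MaxUnavailable; mu != nil {
		if (mu.Type == intstr.Int && mu.IntValue() == 0) || (mu.Type == intstr.String && mu.StrVal == "0%") {
			return true
		}
	}
	if ma := pdb.Spec.MinAvailable; ma != nil && ma.Type == intstr.Int {
		if int32(ma.IntValue()) >= pdb.Status.CurrentHealthy && pdb.Status.DisruptionsAllowed == 0 {
			return true
		}
	}
	return false
}
```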

@saurav-agarwalla (Contributor Author)

That's precisely what we discussed today: #1928 (comment)

I am exploring that option, but I want to keep these changes separate to make them easier to review and justify.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 28, 2025
@saurav-agarwalla saurav-agarwalla changed the title fix: don't mark pods with 'karpenter.sh/do-not-disrupt=true' as reschedulable fix: don't provision unnecessary capacity for pods which can't move to a new node Feb 28, 2025
@saurav-agarwalla (Contributor Author)

Pushed changes to handle this for pods which can't be evicted due to PDB violations as well, since those changes weren't huge.

@coveralls commented Feb 28, 2025

Pull Request Test Coverage Report for Build 14391239438

Details

  • 72 of 77 (93.51%) changed or added relevant lines in 6 files are covered.
  • 12 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.06%) to 81.862%

Changes Missing Coverage                                 Covered Lines   Changed/Added Lines   %
pkg/utils/node/node.go                                   4               6                     66.67%
pkg/controllers/disruption/helpers.go                    10              13                    76.92%

Files with Coverage Reduction                            New Missed Lines   %
pkg/controllers/disruption/helpers.go                    1                  86.71%
pkg/controllers/node/termination/controller.go           2                  72.08%
pkg/test/expectations/expectations.go                    2                  94.62%
pkg/controllers/provisioning/scheduling/preferences.go   7                  88.76%
Totals Coverage Status
Change from base Build 14346744194: -0.06%
Covered Lines: 9848
Relevant Lines: 12030

💛 - Coveralls

@saurav-agarwalla (Contributor Author)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 28, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 19, 2025
// Don't provision capacity for pods which aren't getting evicted due to PDB violation.
// Since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity until
// these pods are evicted is wasteful.
pdbs, err := pdb.NewLimits(ctx, clock, kubeClient)
Member

Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time). I think we can really only consider PDBs where they are persistently blocking (0 unavailable, etc.). We could maybe think about considering a time-based check for persistently blocking PDBs in the future to scope into this (e.g. if a PDB has been fully blocking for more than an hour then it's "stuck")

Contributor Author

> Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time).

Do you mind elaborating on this a little more? The way I understand it, if a pod is terminating due to eviction, we don't consider it IsReschedulable anyway. It's possible that whether a pod is reschedulable can change fairly quickly due to this logic, but I'm not sure that's necessarily a bad thing since things will eventually settle down. This isn't very different from pods that are blocked by the do-not-disrupt annotation until the customer removes the annotation.

We also discussed the time-based approach in one of the community meetings. While it is doable, determining the actual threshold is tricky, and no matter what we choose, some customers could always prefer a shorter time. Ultimately that trade-off is against the downsides of the current approach, so I'm trying to understand those first, and then we can see which approach we choose.
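For illustration only (nothing like this is implemented in this PR), a rough sketch of the time-based idea being discussed above: track how long each PDB has allowed zero disruptions and only treat it as "stuck" past a threshold. All names here are hypothetical.

```go
package sketch

import (
	"time"

	policyv1 "k8s.io/api/policy/v1"
)

// blockingTracker records when each PDB last dropped to zero allowed
// disruptions, so a caller can ask whether it has been "stuck" long enough.
type blockingTracker struct {
	since map[string]time.Time
}

func newBlockingTracker() *blockingTracker {
	return &blockingTracker{since: map[string]time.Time{}}
}

// isStuck reports whether the PDB has allowed zero disruptions continuously
// for at least threshold (e.g. an hour). Short-lived blockages reset the
// clock, which avoids reacting to transient PDB pressure.
func (t *blockingTracker) isStuck(pdb *policyv1.PodDisruptionBudget, now time.Time, threshold time.Duration) bool {
	key := pdb.Namespace + "/" + pdb.Name
	if pdb.Status.DisruptionsAllowed > 0 {
		delete(t.since, key)
		return false
	}
	if _, ok := t.since[key]; !ok {
		t.since[key] = now
	}
	return now.Sub(t.since[key]) >= threshold
}
```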

Contributor Author

As we discussed offline, I have updated the PR to consider only fully blocking PDBs for now, to avoid creating a lot of small nodes, which could happen if we considered all PDBs.

Member

Rather than having to do the PDB check below in both places, is it worthwhile to integrate the PDB check into the pod scheduling utils and into the ReschedulablePods call so that it's standardized?

Member

In particular, I think we need to make sure that we don't consider all of the candidate node's pods for rescheduling if they aren't reschedulable.

Contributor Author

Done - I added the concept of IsCurrentlyReschedulable, which considers the do-not-disrupt annotation and fully blocking PDBs to determine whether a pod is reschedulable at a given point in time. This keeps IsReschedulable untouched, since it's used for other things (like determining whether a node is empty) that shouldn't change as part of this PR.
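For illustration only (the actual signature in the PR may differ), a rough sketch of the IsCurrentlyReschedulable concept described above; the isReschedulable and fullyBlockedByPDB callbacks are hypothetical stand-ins for the existing utility and the PDB limits lookup.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// IsCurrentlyReschedulable sketches the concept described above: a pod is
// only worth reserving new capacity for right now if it is reschedulable in
// general and nothing currently pins it to its node.
func IsCurrentlyReschedulable(pod *corev1.Pod, isReschedulable, fullyBlockedByPDB func(*corev1.Pod) bool) bool {
	if !isReschedulable(pod) {
		return false
	}
	if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
		return false
	}
	return !fullyBlockedByPDB(pod)
}
```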

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 7, 2025
return &Drift{
kubeClient: kubeClient,
cluster: cluster,
provisioner: provisioner,
recorder: recorder,
clock: clock,
Contributor

nit: The ordering is inconsistent

Contributor Author

Ordering of the struct members here? Not sure if there's a convention for that, is there?

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 9, 2025
@jonathan-innis (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jonathan-innis, saurav-agarwalla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 3cf784a into kubernetes-sigs:main Apr 10, 2025
16 checks passed
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
saurav-agarwalla added a commit to saurav-agarwalla/karpenter that referenced this pull request May 9, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
k8s-ci-robot pushed a commit that referenced this pull request May 15, 2025
