fix: don't provision unnecessary capacity for pods which can't move to a new node #2033
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: saurav-agarwalla

The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from d16e35c to 9117ad4
Could we have the same “not reschedulable” behaviour for “problematic” PDBs? They cause the same issues and exhibit the same behaviour as the `karpenter.sh/do-not-disrupt` annotation. For example, a PDB along the lines of the sketch below (identifiers are illustrative):
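```yaml
# Illustrative sketch (names are hypothetical): a PDB that allows zero
# voluntary disruptions, so the API server rejects every eviction request
# for the selected pods and they can never be drained. A minAvailable equal
# to the replica count behaves the same way.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-all-evictions
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: example
```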
is “problematic” in a Karpenter context, causing the same behaviour you are trying to patch, so it would make sense to treat them the same way. The downside is that this takes a lot more logic than just checking for the annotation. Similar problematic PDBs: a singleton pod (replicas: 1, minAvailable: 1), which is likely done on purpose, and misconfigured PDBs, which block in similar ways but are a mistake (these could arguably be ignored, since they should be fixed or detected by policies).
That's precisely what we discussed today: #1928 (comment). I am exploring that option, but I want to keep these changes separate to make them easier to review and justify.
Pushed changes to handle this for pods whose eviction is blocked by a PDB as well, since those changes weren't large.
Pull Request Test Coverage Report for Build 13596255744

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
/ok-to-test
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fixes #1842, #1928, aws/karpenter-provider-aws#7521
Description
This PR takes care of two cases where Karpenter provisions unnecessary capacity today:

- Pods with `karpenter.sh/do-not-disrupt=true` (a sketch of such a pod follows this list): they aren't evictable, so it makes sense for Karpenter to not consider them reschedulable either, since the only end state for these pods is to reach a terminal state or be forcibly deleted (after a termination grace period). This prevents Karpenter from spinning up and reserving unnecessary capacity for these pods on new nodes.
- Pods whose eviction is blocked by a PDB: they can't be drained for the same practical reason, so Karpenter shouldn't consider them reschedulable either.
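As a concrete illustration (a minimal sketch; the pod name and image are hypothetical), such a pod carries the annotation in its metadata:

```yaml
# Hypothetical pod carrying Karpenter's do-not-disrupt annotation. Karpenter
# will not voluntarily evict this pod, and with this change it also stops
# treating it as reschedulable when simulating replacement capacity.
apiVersion: v1
kind: Pod
metadata:
  name: example-pinned-pod
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:stable
      command: ["sleep", "infinity"]
```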
How was this change tested?
Reproduced the scenario where a nodeclaim has pods with `karpenter.sh/do-not-disrupt=true` and saw that Karpenter was continuously spinning up new nodeclaims after the expiry of the original nodeclaim (even though it was stuck due to these pods). It wasn't able to consolidate the new nodeclaims that were nominated for these pods either.

After the change, Karpenter doesn't spin up new nodeclaims for these pods.
Did the same thing for pods whose eviction was blocked by a PDB; a sketch of that setup follows below.
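For the PDB case, a minimal reproduction (a hedged sketch; all names are illustrative) is a single-replica workload behind a PDB that requires that one replica to stay available, which blocks every eviction:

```yaml
# Hypothetical single-replica workload whose only pod can never be evicted:
# minAvailable: 1 with replicas: 1 leaves zero disruptions allowed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: singleton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: singleton
  template:
    metadata:
      labels:
        app: singleton
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:stable
          command: ["sleep", "infinity"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: singleton-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: singleton
```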
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.