fix: don't provision unnecessary capacity for pods which can't move to a new node #2033
Conversation
Force-pushed from d16e35c to 9117ad4 (Compare)
Could we have the same “not reschedulable” behaviour for “problematic” PDBs? They cause the same issues and have the same behaviour as the do-not-disrupt annotation. For example, a PDB that never allows a disruption is “problematic” in a Karpenter context and causes the same behaviour you are trying to patch, so it would make sense if they were treated the same. The downside is that it’s a lot more logic than just checking for an annotation. Similar problematic PDBs: a singleton pod (replica 1, min-available 1), which is likely done on purpose, and misconfigured PDBs (these block in similar ways but are a mistake; they could be ignored, as you could argue they should be fixed or detected by policies).
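For concreteness, a singleton-style PDB of the kind mentioned above (replica 1, min-available 1) would look roughly like the manifest below. This is illustrative only, not the snippet from the original discussion; the names are made up:

```yaml
# Hypothetical "problematic" PDB: it guards a single-replica Deployment
# with minAvailable: 1, so no voluntary eviction is ever allowed and
# Karpenter can never drain the pod off its node.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: singleton-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: singleton   # assumed label on a Deployment with replicas: 1
```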
That's precisely what we discussed today: #1928 (comment) I am exploring that option. But I want to keep these changes separate to make it easier to review and justify. |
Pushed changes to handle this for pods which aren't getting evicted due to PDB violations as well, since the changes weren't huge. |
Pull Request Test Coverage Report for Build 14391239438 (Coveralls)
/ok-to-test
// Don't provision capacity for pods which aren't getting evicted due to PDB violation.
// Since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity until
// these pods are evicted is wasteful.
pdbs, err := pdb.NewLimits(ctx, clock, kubeClient)
Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time). I think we can really only consider PDBs where they are persistently blocking (0 unavailable, etc.). We could maybe think about considering a time-based check for persistently blocking PDBs in the future to scope into this (e.g. if a PDB has been fully blocking for more than an hour then it's "stuck")
> Just doing a raw check on whether the pod is evictable or not is going to lead to some weird behavior I think -- there are going to be cases where pods are going to be in the middle of evicting off a node and we had considered these same pods in a previous iteration (because the PDBs weren't blocking eviction at that time).

Do you mind elaborating on this a little more? The way I understand it, if a pod is terminating due to eviction, we don't consider it IsReschedulable anyway. It's possible that whether a pod is reschedulable can change fairly quickly under this logic, but I'm not sure that's necessarily a bad thing, since things eventually settle down. This isn't very different from pods being blocked by the do-not-disrupt annotation until the customer removes the annotation.

We also discussed the time-based approach in one of the community meetings. While it's doable, picking the actual value is tricky, and no matter what we choose, some customers will always prefer a shorter time. But that trade-off is ultimately against the downsides of the current approach, so I'm trying to understand those first; then we can decide which approach to take.
As we discussed offline, I have updated the PR to consider fully blocking PDBs only for now to avoid creating a lot of small nodes if we consider all PDBs.
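As a rough illustration of what "fully blocking" means here, the sketch below treats a maxUnavailable of 0 (or "0%"), and a minAvailable of "100%", as PDBs that never allow a voluntary eviction. The types are simplified stand-ins, not Karpenter's actual pdb.Limits implementation:

```go
package main

import "fmt"

// pdbSpec is a hypothetical, simplified stand-in for the PDB fields that
// matter when deciding whether a PDB is "fully blocking".
type pdbSpec struct {
	maxUnavailable *string // e.g. "0", "0%", "1"
	minAvailable   *string // e.g. "100%", "1"
}

// isFullyBlocking reports whether a PDB can never allow a voluntary
// eviction: maxUnavailable of 0 or "0%" always blocks, and minAvailable
// of "100%" blocks as long as the PDB selects any pods.
func isFullyBlocking(s pdbSpec) bool {
	if s.maxUnavailable != nil && (*s.maxUnavailable == "0" || *s.maxUnavailable == "0%") {
		return true
	}
	if s.minAvailable != nil && *s.minAvailable == "100%" {
		return true
	}
	return false
}

func main() {
	zero, one := "0", "1"
	fmt.Println(isFullyBlocking(pdbSpec{maxUnavailable: &zero})) // true
	fmt.Println(isFullyBlocking(pdbSpec{maxUnavailable: &one}))  // false
}
```

Transiently blocking PDBs (e.g. maxUnavailable: 1 with all disruptions currently consumed) deliberately fall outside this check, per the discussion above.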
Rather than having to do the check below with PDBs in both places, is it worthwhile to integrate the PDB check into the pod scheduling utils, as well as into the ReschedulablePods call, so that it's standardized?
Particularly, I think we need to make sure that we don't consider all of the candidate node pods for rescheduling if they aren't reschedulable
Done - I added a concept of IsCurrentlyReschedulable, which considers the do-not-disrupt annotation and fully blocking PDBs to determine whether or not a pod is reschedulable at a given point in time. This was needed to keep IsReschedulable untouched, as that's used for more things, like determining whether a node is empty, which shouldn't change as part of this PR.
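A minimal sketch of the idea behind IsCurrentlyReschedulable, with hypothetical types and field names (the real helper lives in Karpenter's pod utils and consults actual PDB state):

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a pod as Karpenter sees it.
type pod struct {
	annotations  map[string]string
	blockedByPDB bool // true when a fully blocking PDB selects this pod
}

// isCurrentlyReschedulable sketches the concept from the PR: only
// provision capacity for a pod right now if it is reschedulable in
// general AND nothing persistently blocks its eviction.
func isCurrentlyReschedulable(p pod, isReschedulable bool) bool {
	if !isReschedulable {
		return false
	}
	if p.annotations["karpenter.sh/do-not-disrupt"] == "true" {
		return false
	}
	return !p.blockedByPDB
}

func main() {
	blocked := pod{annotations: map[string]string{"karpenter.sh/do-not-disrupt": "true"}}
	free := pod{annotations: map[string]string{}}
	fmt.Println(isCurrentlyReschedulable(blocked, true)) // false
	fmt.Println(isCurrentlyReschedulable(free, true))    // true
}
```

Keeping this separate from IsReschedulable matches the comment above: emptiness checks keep their old semantics, while provisioning gets the stricter point-in-time view.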
pkg/controllers/disruption/drift.go (outdated)
return &Drift{
	kubeClient:  kubeClient,
	cluster:     cluster,
	provisioner: provisioner,
	recorder:    recorder,
	clock:       clock,
nit: The ordering is inconsistent
Ordering of the struct members here? Not sure if there's a convention for that, is there?
// Don't provision capacity for pods which aren't getting evicted due to PDB violation.
// Since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity until
// these pods are evicted is wasteful.
pdbs, err := pdb.NewLimits(ctx, clock, kubeClient)
Rather than having to do the check below with PDBs in both places, is it worthwhile to integrate the PDB check into the pod scheduling utils, as well as into the ReschedulablePods call, so that it's standardized?
// Don't provision capacity for pods which aren't getting evicted due to PDB violation.
// Since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity until
// these pods are evicted is wasteful.
pdbs, err := pdb.NewLimits(ctx, clock, kubeClient)
Particularly, I think we need to make sure that we don't consider all of the candidate node pods for rescheduling if they aren't reschedulable
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jonathan-innis, saurav-agarwalla. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Fixes #1842, #1928, aws/karpenter-provider-aws#7521
Description
This PR takes care of two cases where Karpenter provisions unnecessary capacity today:

- Pods annotated with karpenter.sh/do-not-disrupt=true: they aren't evictable, so it makes sense for Karpenter not to consider them reschedulable either, since the only end state for these pods is to reach a terminal state or be forcibly deleted (after a termination grace period). This prevents Karpenter from spinning up and reserving unnecessary capacity for these pods on new nodes.
- Pods whose eviction is blocked by a fully blocking PDB: since Karpenter doesn't know when these pods will be successfully evicted, spinning up capacity for them until then is wasteful.

This doesn't handle the case where a PDB only transiently blocks eviction; more work is needed there, because Karpenter can't predict whether an eviction blocker is temporary or permanent. If Karpenter ignored every pod with a transient eviction blocker (say, a PDB that currently allows few disruptions), it could end up creating many smaller nodeclaims instead of one large nodeclaim that fits all of these pods, which could cost more overall, especially for customers who don't have consolidation enabled.
Without this change:
How was this change tested?
Reproduced the scenario where a nodeclaim has pods with karpenter.sh/do-not-disrupt=true and saw that Karpenter was continuously spinning up new nodeclaims after the expiry of the original nodeclaim (even though it was stuck due to these pods). It wasn't able to consolidate the new nodeclaims that were nominated for these pods either.
After the change, Karpenter doesn't spin up new nodeclaims for these pods. Did the same thing for pods whose eviction was blocked due to a PDB.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.