
Commit 24e034e (1 parent: 9b6a5e6)

add down scaling and descheduling use case that complies with scheduling
constraints

File tree: 1 file changed, 54 insertions(+), 9 deletions(-)


keps/sig-apps/4563-evacuation-api/README.md

@@ -16,6 +16,7 @@
 - [Story 2](#story-2)
 - [Story 3](#story-3)
 - [Story 4](#story-4)
+- [Story 5](#story-5)
 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
 - [Length Limitations for Pod Annotations and Evacuation Finalizers](#length-limitations-for-pod-annotations-and-evacuation-finalizers)
 - [Risks and Mitigations](#risks-and-mitigations)
@@ -44,6 +45,7 @@
 - [StatefulSet Controller](#statefulset-controller)
 - [DaemonSet and Static Pods](#daemonset-and-static-pods)
 - [HorizontalPodAutoscaler](#horizontalpodautoscaler)
+- [Descheduling and Downscaling](#descheduling-and-downscaling)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -161,21 +163,28 @@ The major issues are:
    possible to disrupt the pods during a high load without experiencing application downtime. If
    the minimum number of pods is 1, PDBs cannot be used without blocking the node drain. This has
    been discussed in issue [kubernetes/kubernetes#93476](https://github.com/kubernetes/kubernetes/issues/93476).
-3. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node
+3. ReplicaSet scale-down does not take inter-pod scheduling constraints into consideration. The
+   current mechanism for choosing which pods to terminate takes only [creation time,
+   node rank](https://github.com/kubernetes/kubernetes/blob/cae35dba5a3060711a2a3f958537003bc74a59c0/pkg/controller/replicaset/replica_set.go#L822-L832),
+   and the [pod-deletion-cost annotation](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost)
+   into account. This is not sufficient, and it can unbalance the pods across the nodes, as
+   described in [kubernetes/kubernetes#124306](https://github.com/kubernetes/kubernetes/issues/124306)
+   and [many other issues](https://github.com/kubernetes/kubernetes/issues/124306#issuecomment-2493091257).
+4. The descheduler does not allow postponing eviction for applications that cannot be evicted
+   immediately. This can result in descheduling an incorrect set of pods. This is outlined in the
+   KEP [kubernetes-sigs/descheduler#1354](https://github.com/kubernetes-sigs/descheduler/pull/1354).
+5. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node
    shutdown. The length of the shutdown is again not application specific and is set cluster-wide
    (optionally by priority) by the cluster admin. This does not take into account
    `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination of
    the application. This has been discussed in issue [kubernetes/kubernetes#75482](https://github.com/kubernetes/kubernetes/issues/75482)
    and in issue [kubernetes-sigs/cluster-api#6158](https://github.com/kubernetes-sigs/cluster-api/issues/6158).
-4. Different pod termination mechanisms are not synchronized with each other. So for example, the
+6. Different pod termination mechanisms are not synchronized with each other. So for example, the
    taint manager may prematurely terminate pods that are currently under Node Graceful Shutdown.
    This can also happen with other mechanism (e.g., different types of evictions). This has been
    discussed in the issue [kubernetes/kubernetes#124448](https://github.com/kubernetes/kubernetes/issues/124448)
    and in the issue [kubernetes/kubernetes#72129](https://github.com/kubernetes/kubernetes/issues/72129).
-5. Descheduler does not allow postponing eviction for applications that are unable to be evicted
-   immediately. This can result in descheduling of incorrect set of pods. This is outlined in the
-   KEP [kubernetes-sigs/descheduler#1354](https://github.com/kubernetes-sigs/descheduler/pull/1354).
-6. [Affinity Based Eviction](https://github.com/kubernetes/enhancements/issues/4328) is an upcoming
+7. [Affinity Based Eviction](https://github.com/kubernetes/enhancements/issues/4328) is an upcoming
    feature that would like to introduce the `RequiredDuringSchedulingRequiredDuringExecution`
    nodeAffinity option to remove pods from nodes that do not match this affinity. The controller
    proposed by this feature would like to use the Evacuation API for the disruption safety and
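Item 3 above concerns the ranking inputs used by today's ReplicaSet scale-down. A minimal sketch of the existing `controller.kubernetes.io/pod-deletion-cost` annotation it mentions follows (the pod name, image, and cost value are illustrative):

```yaml
# Pods with a lower deletion cost are preferred for removal on scale-down.
# This is a per-pod hint only: it carries no knowledge of inter-pod scheduling
# constraints such as topology spread, which is the gap described in item 3.
apiVersion: v1
kind: Pod
metadata:
  name: app-a-7d4f8-xk2lp                      # illustrative name
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"
spec:
  containers:
  - name: app
    image: registry.example/app:1.0            # illustrative image
```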
@@ -331,11 +340,17 @@ be able to identify other evacuators and an order in which they will run.
 
 #### Story 4
 
-As an application owner I want my pods to be scheduled on correct nodes. I want to use the
+As an application owner, I want my pods to be scheduled on the correct nodes. I want to use the
 descheduler or the upcoming Affinity Based Eviction feature to remove pods from incorrect nodes
 and then have the pods scheduled on new ones. I want to do the rescheduling gracefully and be able
 to control the disruption level of my application (even 0% application unavailability).
 
+#### Story 5
+
+As an application owner, I run a large Deployment with pods that utilize TopologySpreadConstraints.
+I want to downscale such a Deployment so that these constraints are preserved and the pods are
+correctly balanced across the nodes.
+
 ### Notes/Constraints/Caveats (Optional)
 
 #### Length Limitations for Pod Annotations and Evacuation Finalizers
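Story 5 above has in mind a Deployment roughly of the following shape (name, replica count, and image are illustrative). The spread constraint is enforced when pods are scheduled, but the ReplicaSet scale-down path applies no comparable logic, which is why the story asks for constraint-aware downscaling:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a                                  # illustrative name
spec:
  replicas: 12
  selector:
    matchLabels:
      app: app-a
  template:
    metadata:
      labels:
        app: app-a
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule       # honored at scheduling time only
        labelSelector:
          matchLabels:
            app: app-a
      containers:
      - name: app
        image: registry.example/app:1.0        # illustrative image
```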
@@ -437,8 +452,8 @@ exists, the instigator should still add itself to the finalizers. The finalizers
 If the evacuation is no longer needed, the instigator should remove itself from the finalizers.
 The evacuation will be then deleted by the evacuation controller. In case the evacuator has set
 `.status.evacuationCancellationPolicy` to `Forbid`, the evacuation process cannot be cancelled, and
-the evacuation controller will wait to delete the pod until the pod has been terminated and removed
-from etcd.
+the evacuation controller will wait to delete the evacuation until the pod has been terminated and
+removed from etcd.
 
 #### Evacuation Instigator Finalizer
 To distinguish between instigator and other finalizers, instigators should use finalizers in the
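To illustrate the behavior described above, an Evacuation whose cancellation has been forbidden might look roughly as follows; the apiVersion, object name, and finalizer string are assumptions made for this sketch, not the schema fixed by the KEP:

```yaml
apiVersion: evacuation.coordination.k8s.io/v1alpha1   # hypothetical group/version
kind: Evacuation
metadata:
  name: app-a-7d4f8-xk2lp                             # hypothetical; matches the evacuated pod
  finalizers:
  - example.com/node-maintenance                      # hypothetical instigator finalizer
status:
  # With Forbid, the evacuation process cannot be cancelled; the evacuation
  # controller deletes this Evacuation only after the pod has been terminated
  # and removed from etcd.
  evacuationCancellationPolicy: Forbid
```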
@@ -1064,6 +1079,36 @@ workload pods with its evacuator class and a priority.
   would be preferred over HPA. Otherwise, HPA might scale less or more than `.spec.maxSurge`.
 - If the HPA scaling logic is preferred, a user could set a higher priority on the HPA object.
 
+#### Descheduling and Downscaling
+
+We can use the Evacuation API to deschedule a set of pods controlled by a Deployment/ReplicaSet.
+This is useful when we want to remove a set of pods from a node, either for node maintenance
+reasons or to rebalance the pods across additional nodes.
+
+If set up correctly, the Deployment controller will first scale up its pods to achieve this. In
+order to support any de/scheduling constraints during downscaling, we should temporarily disable
+immediate upscaling.
+
+HPA downscaling example:
+
+1. Pods of application A are created with the Deployment controller's evacuator annotation (priority 10000).
+2. The Deployment and its pods are controlled/scaled by the HPA. The HPA sets an evacuator
+   annotation (with a higher priority, e.g. 11000) on all of these pods.
+3. A subset of pods from application A is chosen by the HPA to be scaled down. This may be done in
+   collaboration with another component responsible for resolving the scheduling constraints.
+4. The HPA creates Evacuation objects for these chosen pods.
+5. The evacuation controller designates the HPA as the evacuator based on the highest priority. No
+   action (termination) is taken on the pods yet.
+6. The HPA downscales the Deployment workload.
+7. The HPA sets ActiveEvacuatorCompleted to true on its own evacuations.
+8. The evacuation controller designates the Deployment controller as the next evacuator.
+9. The Deployment controller subsequently downscales the underlying ReplicaSet(s).
+10. The ReplicaSet controller deletes the pods to which an Evacuation object has been assigned,
+    preserving the scheduling constraints.
+
+The same can be done by any descheduling controller (instead of an HPA) to rebalance a set of Pods
+so that they comply with de/scheduling rules.
+
 ### Test Plan
 
 [x] I/we understand the owners of the involved components may require updates to
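To make the HPA downscaling walkthrough above concrete: after steps 1-2, a workload pod would carry two evacuator registrations, with the HPA's higher priority winning. The annotation key format and evacuator class names below are assumptions made for this sketch, not the format defined by the KEP:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-a-7d4f8-xk2lp                                             # illustrative pod
  annotations:
    # hypothetical evacuator annotations: <evacuator key>: <priority>
    evacuation.coordination.k8s.io/deployment-controller: "10000"     # set at pod creation (step 1)
    evacuation.coordination.k8s.io/horizontal-pod-autoscaler: "11000" # set by the HPA (step 2)
spec:
  containers:
  - name: app
    image: registry.example/app:1.0                                   # illustrative image
```

In step 4 the HPA then creates an Evacuation for each selected pod; after downscaling the Deployment (step 6) it marks its part of the evacuation complete (step 7), and the evacuation controller hands the remaining work to the lower-priority Deployment controller evacuator, which downscales the ReplicaSet(s) while preserving the scheduling constraints.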
