|
16 | 16 | - [Story 2](#story-2)
|
17 | 17 | - [Story 3](#story-3)
|
18 | 18 | - [Story 4](#story-4)
|
| 19 | + - [Story 5](#story-5) |
19 | 20 | - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
|
20 | 21 | - [Length Limitations for Pod Annotations and Evacuation Finalizers](#length-limitations-for-pod-annotations-and-evacuation-finalizers)
|
21 | 22 | - [Risks and Mitigations](#risks-and-mitigations)
|
|
44 | 45 | - [StatefulSet Controller](#statefulset-controller)
|
45 | 46 | - [DaemonSet and Static Pods](#daemonset-and-static-pods)
|
46 | 47 | - [HorizontalPodAutoscaler](#horizontalpodautoscaler)
|
| 48 | + - [Descheduling and Downscaling](#descheduling-and-downscaling) |
47 | 49 | - [Test Plan](#test-plan)
|
48 | 50 | - [Prerequisite testing updates](#prerequisite-testing-updates)
|
49 | 51 | - [Unit tests](#unit-tests)
|
@@ -161,21 +163,28 @@ The major issues are:
|
161 | 163 | possible to disrupt the pods during a high load without experiencing application downtime. If
|
162 | 164 | the minimum number of pods is 1, PDBs cannot be used without blocking the node drain. This has
|
163 | 165 | been discussed in issue [kubernetes/kubernetes#93476](https://github.com/kubernetes/kubernetes/issues/93476).
|
164 |
| -3. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node |
| 166 | +3. ReplicaSet scale-down does not take inter-pod scheduling constraints into consideration. The |
| 167 | + current mechanism for choosing pods to terminate takes only [creation time, |
| 168 | + node rank](https://github.com/kubernetes/kubernetes/blob/cae35dba5a3060711a2a3f958537003bc74a59c0/pkg/controller/replicaset/replica_set.go#L822-L832), |
| 169 | + and the [pod-deletion-cost annotation](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost) |
| 170 | + into account. This is not sufficient, and it can unbalance the pods across the nodes, as |
| 171 | + described in [kubernetes/kubernetes#124306](https://github.com/kubernetes/kubernetes/issues/124306) |
| 172 | + and [many other issues](https://github.com/kubernetes/kubernetes/issues/124306#issuecomment-2493091257). |
| 173 | +4. The descheduler does not allow postponing eviction for applications that are unable to be evicted |
| 174 | + immediately. This can result in descheduling of an incorrect set of pods. This is outlined in the |
| 175 | + KEP [kubernetes-sigs/descheduler#1354](https://github.com/kubernetes-sigs/descheduler/pull/1354). |
| 176 | +5. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node |
165 | 177 | shutdown. The length of the shutdown is again not application specific and is set cluster-wide
|
166 | 178 | (optionally by priority) by the cluster admin. This does not take into account
|
167 | 179 | `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination of
|
168 | 180 | the application. This has been discussed in issue [kubernetes/kubernetes#75482](https://github.com/kubernetes/kubernetes/issues/75482)
|
169 | 181 | and in issue [kubernetes-sigs/cluster-api#6158](https://github.com/kubernetes-sigs/cluster-api/issues/6158).
|
170 |
| -4. Different pod termination mechanisms are not synchronized with each other. So for example, the |
| 182 | +6. Different pod termination mechanisms are not synchronized with each other. So for example, the |
171 | 183 | taint manager may prematurely terminate pods that are currently under Node Graceful Shutdown.
|
172 | 184 | This can also happen with other mechanism (e.g., different types of evictions). This has been
|
173 | 185 | discussed in the issue [kubernetes/kubernetes#124448](https://github.com/kubernetes/kubernetes/issues/124448)
|
174 | 186 | and in the issue [kubernetes/kubernetes#72129](https://github.com/kubernetes/kubernetes/issues/72129).
|
175 |
| -5. Descheduler does not allow postponing eviction for applications that are unable to be evicted |
176 |
| - immediately. This can result in descheduling of incorrect set of pods. This is outlined in the |
177 |
| - KEP [kubernetes-sigs/descheduler#1354](https://github.com/kubernetes-sigs/descheduler/pull/1354). |
178 |
| -6. [Affinity Based Eviction](https://github.com/kubernetes/enhancements/issues/4328) is an upcoming |
| 187 | +7. [Affinity Based Eviction](https://github.com/kubernetes/enhancements/issues/4328) is an upcoming |
179 | 188 | feature that would like to introduce the `RequiredDuringSchedulingRequiredDuringExecution`
|
180 | 189 | nodeAffinity option to remove pods from nodes that do not match this affinity. The controller
|
181 | 190 | proposed by this feature would like to use the Evacuation API for the disruption safety and
|
@@ -331,11 +340,17 @@ be able to identify other evacuators and an order in which they will run.
|
331 | 340 |
|
332 | 341 | #### Story 4
|
333 | 342 |
|
334 |
| -As an application owner I want my pods to be scheduled on correct nodes. I want to use the |
| 343 | +As an application owner, I want my pods to be scheduled on correct nodes. I want to use the |
335 | 344 | descheduler or the upcoming Affinity Based Eviction feature to remove pods from incorrect nodes
|
336 | 345 | and then have the pods scheduled on new ones. I want to do the rescheduling gracefully and be able
|
337 | 346 | to control the disruption level of my application (even 0% application unavailability).
|
338 | 347 |
|
| 348 | +#### Story 5 |
| 349 | + |
| 350 | +As an application owner, I run a large Deployment with pods that utilize TopologySpreadConstraints. |
| 351 | +I want to downscale such a Deployment so that these constraints are preserved and the pods are |
| 352 | +correctly balanced across the nodes. |
| 353 | + |
339 | 354 | ### Notes/Constraints/Caveats (Optional)
|
340 | 355 |
|
341 | 356 | #### Length Limitations for Pod Annotations and Evacuation Finalizers
|
@@ -437,8 +452,8 @@ exists, the instigator should still add itself to the finalizers. The finalizers
|
437 | 452 | If the evacuation is no longer needed, the instigator should remove itself from the finalizers.
|
438 | 453 | The evacuation will be then deleted by the evacuation controller. In case the evacuator has set
|
439 | 454 | `.status.evacuationCancellationPolicy` to `Forbid`, the evacuation process cannot be cancelled, and
|
440 |
| -the evacuation controller will wait to delete the pod until the pod has been terminated and removed |
441 |
| -from etcd. |
| 455 | +the evacuation controller will wait to delete the evacuation until the pod has been terminated and |
| 456 | +removed from etcd. |
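| | +
| | +A minimal sketch of how such an Evacuation object could look while cancellation is forbidden. The
| | +`apiVersion`, the `spec` fields, and the instigator finalizer name below are illustrative
| | +assumptions only; `.status.evacuationCancellationPolicy` is the field described above.
| | +
| | +```yaml
| | +# Illustrative sketch only; field and finalizer names are assumptions, not the normative API.
| | +apiVersion: evacuation.coordination.k8s.io/v1alpha1
| | +kind: Evacuation
| | +metadata:
| | +  name: my-app-7d4b9c-xk2lp               # conventionally tied to the evacuated pod
| | +  namespace: default
| | +  finalizers:
| | +    - descheduler.example.io/instigator   # hypothetical instigator finalizer
| | +spec:
| | +  podRef:                                 # assumed reference to the pod being evacuated
| | +    name: my-app-7d4b9c-xk2lp
| | +status:
| | +  evacuationCancellationPolicy: Forbid    # set by the evacuator; the evacuation controller
| | +                                          # deletes this object only after the pod has been
| | +                                          # terminated and removed from etcd
| | +```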
442 | 457 |
|
443 | 458 | #### Evacuation Instigator Finalizer
|
444 | 459 | To distinguish between instigator and other finalizers, instigators should use finalizers in the
|
@@ -1064,6 +1079,36 @@ workload pods with its evacuator class and a priority.
|
1064 | 1079 | would be preferred over HPA. Otherwise, HPA might scale less or more than `.spec.maxSurge`.
|
1065 | 1080 | - If the HPA scaling logic is preferred, a user could set a higher priority on the HPA object.
|
1066 | 1081 |
|
| 1082 | +#### Descheduling and Downscaling |
| 1083 | +
|
| 1084 | +We can use the Evacuation API to deschedule a set of pods controlled by a Deployment/ReplicaSet. |
| 1085 | +This is useful when we want to remove a set of pods from a node, either for node maintenance |
| 1086 | +reasons or to rebalance the pods across additional nodes. |
| 1087 | +
|
| 1088 | +If set up correctly, the Deployment controller will first scale up its pods to achieve this. To |
| 1089 | +support any de/scheduling constraints during downscaling, immediate upscaling should be |
| 1090 | +temporarily disabled. |
| 1091 | +
|
| 1092 | +HPA Downscaling example: |
| 1093 | +
|
| 1094 | +1. Pods of application A are created with the Deployment controller's evacuator annotation (priority 10000). |
| 1095 | +2. The Deployment and its pods are controlled/scaled by the HPA. The HPA sets an evacuator |
| 1096 | + annotation (higher priority, e.g. 11000) on all of these pods. |
| 1097 | +3. A subset of pods from application A is chosen by the HPA to be scaled down. This may be done in |
| 1098 | + collaboration with another component responsible for resolving the scheduling constraints. |
| 1099 | +4. The HPA creates Evacuation objects for these chosen pods. |
| 1100 | +5. The evacuation controller designates the HPA as the evacuator based on the highest priority. No |
| 1101 | + action (termination) is taken on the pods yet. |
| 1102 | +6. The HPA downscales the Deployment workload. |
| 1103 | +7. The HPA sets `ActiveEvacuatorCompleted` to true on its own Evacuations. |
| 1104 | +8. The evacuation controller designates the Deployment controller as the next evacuator. |
| 1105 | +9. The Deployment controller subsequently downscales the underlying ReplicaSet(s). |
| 1106 | +10. The ReplicaSet controller deletes the pods to which an Evacuation object has been assigned, |
| 1107 | + preserving the scheduling constraints. |
| 1108 | +
|
| 1109 | +The same can be done by any descheduling controller (instead of an HPA) to rebalance a set of |
| 1110 | +pods so that they comply with de/scheduling rules. |
| 1111 | +
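| | +The sketch below illustrates steps 1-5 of the example above for a single pod: the pod carries two
| | +evacuator annotations (the Deployment controller at priority 10000 and the HPA at 11000), and the
| | +HPA has created an Evacuation for it. The annotation keys, `apiVersion`, finalizer, and field names
| | +are illustrative placeholders, not the exact format defined by this KEP.
| | +
| | +```yaml
| | +# Illustrative sketch of steps 1-5; annotation keys and field names are placeholders.
| | +apiVersion: v1
| | +kind: Pod
| | +metadata:
| | +  name: app-a-5f6d8-abcde
| | +  annotations:
| | +    # step 1: registered for the Deployment/ReplicaSet machinery when the pod is created
| | +    evacuation.coordination.k8s.io/priority_deployment-controller: "10000"
| | +    # step 2: added by the HPA so that it is designated as the evacuator first
| | +    evacuation.coordination.k8s.io/priority_hpa: "11000"
| | +---
| | +# step 4: created by the HPA (acting as the instigator) for a pod chosen for downscaling
| | +apiVersion: evacuation.coordination.k8s.io/v1alpha1
| | +kind: Evacuation
| | +metadata:
| | +  name: app-a-5f6d8-abcde
| | +  finalizers:
| | +    - autoscaling.example.io/hpa-instigator  # hypothetical instigator finalizer
| | +spec:
| | +  podRef:
| | +    name: app-a-5f6d8-abcde
| | +status:
| | +  # step 5: the evacuation controller designates the highest-priority evacuator class (the HPA);
| | +  # no pod is terminated until the HPA reports ActiveEvacuatorCompleted (step 7)
| | +  activeEvacuatorClass: hpa
| | +```
| | +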
|
1067 | 1112 | ### Test Plan
|
1068 | 1113 |
|
1069 | 1114 | [x] I/we understand the owners of the involved components may require updates to
|
|