- [Pod Admission](#pod-admission)
- [Immutability of EvictionRequest Spec Fields](#immutability-of-evictionrequest-spec-fields)
- [EvictionRequest Process](#evictionrequest-process)
+ - [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples)
+ - [Multiple Dynamic Requesters and No EvictionRequest Cancellation](#multiple-dynamic-requesters-and-no-evictionrequest-cancellation)
+ - [Single Dynamic Requester and EvictionRequest Cancellation](#single-dynamic-requester-and-evictionrequest-cancellation)
+ - [Single Dynamic Requester and Forbidden EvictionRequest Cancellation](#single-dynamic-requester-and-forbidden-evictionrequest-cancellation)
- [Follow-up Design Details for Kubernetes Workloads](#follow-up-design-details-for-kubernetes-workloads)
- [ReplicaSet Controller](#replicaset-controller)
- [Deployment Controller](#deployment-controller)
@@ -268,7 +272,8 @@ or the last interceptor (lowest priority) has finished without terminating the p
request controller will attempt to evict the pod using the existing API-initiated eviction.

Multiple requesters can request the eviction of the same pod, and optionally withdraw their request
- in certain scenarios.
+ in certain scenarios
+ ([EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples)).

We can think of EvictionRequest as a managed and safer alternative to eviction.

@@ -330,7 +335,8 @@ coincide with the deletion of the pod (evict or delete call). In some scenarios,
terminated (e.g. by a remote call) if the pod `restartPolicy` allows it, to preserve the pod data
for further processing or debugging.

- The interceptor can also choose whether it handles EvictionRequest cancellation.
+ The interceptor can also choose whether it handles EvictionRequest cancellation. See
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.

We should discourage the creation of preventive EvictionRequests, so that they do not end up as
another PDB. So we should design the API appropriately and also not allow behaviors that do not
@@ -497,7 +503,8 @@ already exists for this pod, the requester should still add itself to the finali
are used for:
- Tracking the requesters of this eviction request intent. This is used for observability and to
handle concurrency for multiple requesters asking for the cancellation. The eviction request can
- be cancelled/deleted once all requesters have asked for the cancellation.
+ be cancelled/deleted once all requesters have asked for the cancellation (see
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details).
- Processing the eviction request result by the requester once the eviction process is complete.

If the eviction is no longer needed, the requester should remove itself from the finalizers of the
@@ -549,14 +556,14 @@ annotation. This annotation is parsed into the `Interceptor` type in the [Evicti
characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`.
- - Other interceptors should set `PRIORITY` according to their own needs (minimum value is 0,
- maximum value is 100000). Higher priorities are selected first by the eviction request
- controller. They can use the `controller` interceptor as a reference point, if they want to be
- run before or after the `controller` interceptor. They can also observe pod annotations and
- detect what other interceptors have been registered for the eviction process. `ROLE` is optional
- and can be used as a signal to other interceptors. The `controller` value is reserved for pod
- controllers, but otherwise there is no guidance on how the third party interceptors should name
- their role.
+ - Other interceptors should set `PRIORITY` according to their own needs (minimum value (lowest
+ priority) is 0, maximum value (highest priority) is 100000). Higher priorities are selected
+ first by the eviction request controller. They can use the `controller` interceptor as a
+ reference point, if they want to be run before or after the `controller` interceptor. They can
+ also observe pod annotations and detect what other interceptors have been registered for the
+ eviction process. `ROLE` is optional and can be used as a signal to other interceptors. The
+ `controller` value is reserved for pod controllers, but otherwise there is no guidance on how
+ the third party interceptors should name their role.
- Priorities `9900-10100` are reserved for interceptors with a class that has the same parent
domain as the controller interceptor. Duplicate priorities are not allowed in this interval.
- The number of interceptor annotations is limited to 30 in the 9900-10100 interval and to 70
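
To make the annotation format above concrete, the following non-normative sketch parses a single
`priority_<CLASS>` annotation into its class, priority and optional role. The `interceptor` struct
and `parseInterceptor` helper are illustrative stand-ins only; the canonical parsing is defined by
the `Interceptor` type referenced above.

```golang
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const annotationPrefix = "interceptor.evictionrequest.coordination.k8s.io/priority_"

// interceptor is an illustrative stand-in; only the fields needed for this sketch are included.
type interceptor struct {
	Class    string // taken from the annotation key suffix
	Priority int    // 0..100000, higher is selected first
	Role     string // optional, e.g. "controller"
}

// parseInterceptor parses one pod annotation of the form
// "interceptor.evictionrequest.coordination.k8s.io/priority_<CLASS>": "<PRIORITY>/<ROLE>".
func parseInterceptor(key, value string) (interceptor, error) {
	if !strings.HasPrefix(key, annotationPrefix) {
		return interceptor{}, fmt.Errorf("not an interceptor annotation: %s", key)
	}
	class := strings.TrimPrefix(key, annotationPrefix)
	if class == "" || len(class) > 63-len("priority_") {
		return interceptor{}, fmt.Errorf("invalid interceptor class %q", class)
	}
	prio, role, _ := strings.Cut(value, "/") // ROLE is optional
	p, err := strconv.Atoi(prio)
	if err != nil || p < 0 || p > 100000 {
		return interceptor{}, fmt.Errorf("priority must be an integer in [0, 100000], got %q", prio)
	}
	return interceptor{Class: class, Priority: p, Role: role}, nil
}

func main() {
	i, _ := parseInterceptor(annotationPrefix+"actor-a.k8s.io", "10000/controller")
	fmt.Printf("%+v\n", i) // {Class:actor-a.k8s.io Priority:10000 Role:controller}
}
```
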
@@ -609,7 +616,8 @@ it may update the status every 3 minutes. The status updates should look as foll
request process of the pod cannot be stopped/cancelled. This will block any DELETE requests on the
EvictionRequest object. If the interceptor supports eviction request cancellation, it should make
sure that this field is set to `Allow`, and it should be aware that the EvictionRequest object can
- be deleted at any time.
+ be deleted at any time. See
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.
- Update `.status.expectedInterceptorFinishTime` if a reasonable estimation can be made of how long
the eviction process will take for the current interceptor. This can be modified later to change
the estimate.
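
As a rough illustration of the status updates above, an interceptor that supports cancellation
could report progress as sketched below. The `evictionRequestStatus` struct and `reportProgress`
helper are simplified stand-ins for the real API types, mirroring the `.status` fields discussed in
this section.

```golang
package main

import "time"

// evictionRequestStatus is a simplified, illustrative stand-in for the status fields discussed
// above; the actual types are defined in the EvictionRequest API section.
type evictionRequestStatus struct {
	EvictionRequestCancellationPolicy string     // "Allow" or "Forbid"
	ExpectedInterceptorFinishTime     *time.Time // optional estimate, may be revised later
	ProgressTimestamp                 time.Time  // refreshed on positive progress
}

// reportProgress is what an interceptor that supports cancellation could do on each status update
// (e.g. every 3 minutes while it is still working on the eviction).
func reportProgress(st *evictionRequestStatus, estimatedRemaining time.Duration) {
	// Keep the request cancellable; the EvictionRequest may be deleted at any time.
	st.EvictionRequestCancellationPolicy = "Allow"
	if estimatedRemaining > 0 {
		finish := time.Now().Add(estimatedRemaining)
		st.ExpectedInterceptorFinishTime = &finish
	}
	st.ProgressTimestamp = time.Now()
}

func main() {
	var st evictionRequestStatus
	reportProgress(&st, 15*time.Minute)
}
```
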
@@ -676,7 +684,7 @@ No attempt will be made to evict pods that are currently terminating.
If the pod eviction fails, e.g. due to a blocking PodDisruptionBudget, the
`.status.failedAPIEvictionCounter` is incremented and the pod is added back to the queue with
exponential backoff (maximum approx. 15 minutes). If there is a positive progress update in the
- `.status.progressTimestamp` of the EvictionRequest, it will cancel the eviction.
+ `.status.progressTimestamp` of the EvictionRequest, it will cancel the API-initiated eviction.

#### Garbage Collection

@@ -695,6 +703,9 @@ For convenience, we will also remove requester finalizers with
`evictionrequest.coordination.k8s.io/` prefix when the eviction request task is complete (points 2
and 3). Other finalizers will still block deletion.

+ For convenience, we will set `.status.evictionRequestCancellationPolicy` back to `Allow` if the
+ value is `Forbid` and the pod has been fully terminated.
+

### EvictionRequest API

```golang
@@ -908,7 +919,11 @@ The pod labels are merged with the EvictionRequest labels (pod labels have a pre
for custom label selectors when observing the eviction requests.

`.status.activeInterceptorClass` should be empty on creation as its selection should be left to the
- eviction request controller.
+ eviction request controller. To strengthen the validation, we should check that only the
+ highest-priority interceptor can be set at the beginning; after that, only the next interceptor
+ can be set, and so on. We can also condition this transition on other fields:
+ `.status.ActiveInterceptorCompleted` should be true, or `.status.ProgressTimestamp` should have
+ exceeded the deadline.

`.status.evictionRequestCancellationPolicy` should be `Allow` on creation, as its resolution should be
left to the eviction request controller.
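
A non-normative sketch of this transition validation, with the interceptor classes assumed to be
pre-sorted from highest to lowest priority and plain values standing in for the real API types
(the exact deadline handling is left to the eviction request controller):

```golang
package main

import (
	"fmt"
	"time"
)

// validTransition sketches the rule described above: the first active interceptor must be the
// highest-priority one, and afterwards only the next interceptor in priority order may become
// active, and only if the current one has completed or has stopped reporting progress within the
// deadline. `ordered` is assumed to be sorted from highest to lowest priority.
func validTransition(ordered []string, oldActive, newActive string,
	activeInterceptorCompleted bool, progressTimestamp time.Time, progressDeadline time.Duration) bool {

	if len(ordered) == 0 {
		return false
	}
	if oldActive == "" {
		// Initial selection: only the highest-priority interceptor is allowed.
		return newActive == ordered[0]
	}
	for i, class := range ordered {
		if class == oldActive {
			if i+1 >= len(ordered) || newActive != ordered[i+1] {
				return false // only the next interceptor may follow
			}
			// The current interceptor must be done, or it must have stalled.
			return activeInterceptorCompleted ||
				time.Since(progressTimestamp) > progressDeadline
		}
	}
	return false
}

func main() {
	ordered := []string{"actor-b.k8s.io", "actor-a.k8s.io"} // higher priority first
	fmt.Println(validTransition(ordered, "", "actor-b.k8s.io", false, time.Time{}, 5*time.Minute))              // true
	fmt.Println(validTransition(ordered, "actor-b.k8s.io", "actor-a.k8s.io", true, time.Now(), 5*time.Minute)) // true
}
```
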
@@ -988,6 +1003,113 @@ The following diagrams describe what the EvictionRequest process will look like
![eviction-request-process](eviction-request-process.svg)


+ ### EvictionRequest Cancellation Examples
+
+ Let's assume there is a single pod p-1 of application P with interceptors A and B:
+
+ ```yaml
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   annotations:
+     interceptor.evictionrequest.coordination.k8s.io/priority_actor-a.k8s.io: "10000/controller"
+     interceptor.evictionrequest.coordination.k8s.io/priority_actor-b.k8s.io: "11000/notifier-with-delay"
+   name: p-1
+ ```
+
+ #### Multiple Dynamic Requesters and No EvictionRequest Cancellation
+
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+ evict it from the node. It sets the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+ EvictionRequest.
+ 3. The descheduling controller notices that the pod p-1 is running in the wrong zone. It wants to
+ create an EvictionRequest (named after the pod's UID) for this pod, but the EvictionRequest
+ already exists. It sets the
+ `requester.evictionrequest.coordination.k8s.io/name_descheduling.avalanche.io` finalizer on the
+ existing EvictionRequest.
+ 4. The eviction request controller designates Actor B as the next interceptor by updating
+ `.status.activeInterceptorClass`.
+ 5. Actor B updates the EvictionRequest status and also sets
+ `.status.evictionRequestCancellationPolicy=Allow`.
+ 6. Actor B begins notifying users of application P that the application will experience
+ a disruption and delays the disruption so that the users can finish their work.
+ 7. The admin changes their mind, cancels the node drain of node Z, and makes it schedulable
+ again.
+ 8. The node drain controller removes the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+ EvictionRequest.
+ 9. The eviction request controller notices the change in finalizers, but there is still a
+ descheduling finalizer, so no action is required.
+ 10. Actor B sets `ActiveInterceptorCompleted=true` on the eviction request of pod p-1, which is
+ ready to be deleted.
+ 11. The eviction request controller designates Actor A as the next interceptor by updating
+ `.status.activeInterceptorClass`.
+ 12. Actor A updates the EvictionRequest status and ensures that
+ `.status.evictionRequestCancellationPolicy=Allow`.
+ 13. Actor A deletes the p-1 pod.
+ 14. The EvictionRequest is garbage collected once the pod terminates, even with the descheduling
+ finalizer present.
+
+ #### Single Dynamic Requester and EvictionRequest Cancellation
+
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+ evict it from the node. It sets the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+ EvictionRequest.
+ 3. The eviction request controller designates Actor B as the next interceptor by updating
+ `.status.activeInterceptorClass`.
+ 4. Actor B updates the EvictionRequest status and also sets
+ `.status.evictionRequestCancellationPolicy=Allow`.
+ 5. Actor B begins notifying users of application P that the application will experience
+ a disruption and delays the disruption so that the users can finish their work.
+ 6. The admin changes their mind, cancels the node drain of node Z, and makes it schedulable
+ again.
+ 7. The node drain controller removes the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+ EvictionRequest.
+ 8. The eviction request controller notices the change in finalizers and deletes (GC) the
+ EvictionRequest as there is no requester present.
+ 9. Actor B can detect the removal of the EvictionRequest object and notify users of application P
+ that the disruption has been cancelled. If it misses the deletion event, no notification
+ will be delivered. To avoid this, Actor B could also have set its own finalizer on the
+ EvictionRequest.
+
+ #### Single Dynamic Requester and Forbidden EvictionRequest Cancellation
+
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+ evict it from the node. It sets the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+ EvictionRequest.
+ 3. The eviction request controller designates Actor B as the next interceptor by updating
+ `.status.activeInterceptorClass`.
+ 4. Actor B updates the EvictionRequest status and also sets
+ `.status.evictionRequestCancellationPolicy=Forbid` to prevent the EvictionRequest from being
+ deleted (enforced by API admission).
+ 5. Actor B begins notifying users of application P that the application will experience
+ a disruption and delays the disruption so that the users can finish their work.
+ 6. The admin changes their mind, cancels the node drain of node Z, and makes it schedulable
+ again.
+ 7. The node drain controller removes the
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+ EvictionRequest.
+ 8. The eviction request controller notices the change in finalizers. Normally it would delete (GC)
+ the EvictionRequest as there is no requester present, but
+ `.status.evictionRequestCancellationPolicy=Forbid` prevents this.
+ 9. Actor B sets `ActiveInterceptorCompleted=true` on the eviction request of pod p-1, which is
+ ready to be deleted.
+ 10. The eviction request controller designates Actor A as the next interceptor by updating
+ `.status.activeInterceptorClass`.
+ 11. Actor A updates the EvictionRequest status and ensures that
+ `.status.evictionRequestCancellationPolicy=Forbid`. Alternatively, it could change the policy to
+ `Allow` at this point if `Forbid` was only needed to keep Actor B's logic atomic.
+ 12. Actor A deletes the p-1 pod.
+ 13. The EvictionRequest is garbage collected once the pod terminates. The eviction request
+ controller first sets `.status.evictionRequestCancellationPolicy=Allow` so that the object can
+ be deleted.
+
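
The cancellation behaviour in the three scenarios above reduces to a simple condition on the
requester finalizers and the cancellation policy. The following non-normative sketch approximates
it (the actual enforcement is split between API admission and the eviction request controller;
`canCancel` is a hypothetical helper):

```golang
package main

import (
	"fmt"
	"strings"
)

const requesterFinalizerPrefix = "requester.evictionrequest.coordination.k8s.io/"

// canCancel sketches when an EvictionRequest may be cancelled (deleted before the eviction
// completes): every requester must have withdrawn (no requester finalizers left) and the active
// interceptor must not have forbidden cancellation. Once the pod has fully terminated, the
// controller flips Forbid back to Allow and strips its own requester finalizers, so completed
// requests remain collectable.
func canCancel(finalizers []string, cancellationPolicy string) bool {
	for _, f := range finalizers {
		if strings.HasPrefix(f, requesterFinalizerPrefix) {
			return false // at least one requester still wants the eviction
		}
	}
	// API admission blocks DELETE while the active interceptor keeps the policy at Forbid.
	return cancellationPolicy != "Forbid"
}

func main() {
	// Scenario 2: the only requester withdrew and the policy is Allow.
	fmt.Println(canCancel(nil, "Allow")) // true
	// Scenario 3: the requester withdrew, but the interceptor set Forbid.
	fmt.Println(canCancel(nil, "Forbid")) // false
	// Scenario 1: a second requester (the descheduler) still holds a finalizer.
	fmt.Println(canCancel([]string{requesterFinalizerPrefix + "name_descheduling.avalanche.io"}, "Allow")) // false
}
```
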

### Follow-up Design Details for Kubernetes Workloads

Kubernetes Workloads should be made aware of the EvictionRequest API to properly support the
@@ -1095,7 +1217,8 @@ disruption for the underlying application. By scaling up first before terminatin
3. The node drain controller creates EvictionRequests for a subset B of pods A to evict them from
a node.
4. The eviction request controller designates the deployment controller as the interceptor based on
- the highest priority. No action (termination) is taken on the pods yet.
+ the highest priority by updating `.status.activeInterceptorClass`. No action (termination) is
+ taken on the pods yet.
5. The deployment controller creates a set of surge pods C to compensate for the future loss of
availability of pods B. The new pods are created by temporarily surging the `.spec.replicas`
count of the underlying replica sets up to the value of the deployment's `maxSurge`.
@@ -1104,7 +1227,8 @@ disruption for the underlying application. By scaling up first before terminatin
8. The deployment controller scales down the surging replica sets back to their original value.
9. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
pods B that are ready to be deleted.
- 10. The eviction request controller designates the replica set controller as the next interceptor.
+ 10. The eviction request controller designates the replica set controller as the next interceptor by
+ updating `.status.activeInterceptorClass`.
11. The replica set controller deletes the pods to which an EvictionRequest object has been
assigned, preserving the availability of the application.

@@ -1194,15 +1318,17 @@ first before terminating the pods.
4. The node drain controller creates an EvictionRequest for the only pod of application W to evict
it from a node.
5. The eviction request controller designates the HPA as the interceptor based on the highest
- priority. No action (termination) is taken on the single pod yet.
+ priority by updating `.status.activeInterceptorClass`. No action (termination) is taken on the
+ single pod yet.
6. The HPA controller creates a single surge pod B to compensate for the future loss of
availability of pod A. The new pod is created by temporarily scaling up the deployment.
7. Pod B is scheduled on a new schedulable node that is not under the node drain.
8. Pod B becomes available.
9. The HPA scales the surging deployment back down to 1 replica.
10. The HPA sets `ActiveInterceptorCompleted=true` on the eviction request of pod A, which is ready
to be deleted.
- 11. The eviction request controller designates the replica set controller as the next interceptor.
+ 11. The eviction request controller designates the replica set controller as the next interceptor by
+ updating `.status.activeInterceptorClass`.
12. The replica set controller deletes the pods to which an EvictionRequest object has been
assigned, preserving the availability of the webserver.

@@ -1230,11 +1356,13 @@ HPA Downscaling example:
priority. No action (termination) is taken on the pods yet.
6. The HPA downscales the Deployment workload.
7. The HPA sets `ActiveInterceptorCompleted=true` on its own eviction requests.
- 8. The eviction request controller designates the deployment controller as the next interceptor.
+ 8. The eviction request controller designates the deployment controller as the next interceptor by
+ updating `.status.activeInterceptorClass`.
9. The deployment controller subsequently scales down the underlying ReplicaSet(s).
10. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
pods that are ready to be deleted.
- 11. The eviction request controller designates the replica set controller as the next interceptor.
+ 11. The eviction request controller designates the replica set controller as the next interceptor by
+ updating `.status.activeInterceptorClass`.
12. The replica set controller deletes the pods to which an EvictionRequest object has been
assigned, preserving the scheduling constraints.

@@ -1772,6 +1900,7 @@ Pros:
- Versatility; users can use any name they see fit.
- `.metadata.generateName` is supported.
- Actors in the system have a greater incentive to use `.spec.podRef`.
+
Cons:
- Name conflict resolution is left up to the users, but as a workaround they can simply generate the
name.