39
39
- [Pod Admission](#pod-admission)
40
40
- [Immutability of EvictionRequest Spec Fields](#immutability-of-evictionrequest-spec-fields)
41
41
- [EvictionRequest Process](#evictionrequest-process)
42
+ - [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples)
43
+ - [Multiple Dynamic Requesters and No EvictionRequest Cancellation](#multiple-dynamic-requesters-and-no-evictionrequest-cancellation)
44
+ - [Single Dynamic Requester and EvictionRequest Cancellation](#single-dynamic-requester-and-evictionrequest-cancellation)
45
+ - [Single Dynamic Requester and Forbidden EvictionRequest Cancellation](#single-dynamic-requester-and-forbidden-evictionrequest-cancellation)
42
46
- [Follow-up Design Details for Kubernetes Workloads](#follow-up-design-details-for-kubernetes-workloads)
43
47
- [ReplicaSet Controller](#replicaset-controller)
44
48
- [Deployment Controller](#deployment-controller)
@@ -330,7 +334,8 @@ coincide with the deletion of the pod (evict or delete call). In some scenarios,
330
334
terminated (e.g. by a remote call) if the pod `restartPolicy` allows it, to preserve the pod data
331
335
for further processing or debugging.
332
336
333
- The interceptor can also choose whether it handles EvictionRequest cancellation.
337
+ The interceptor can also choose whether it handles EvictionRequest cancellation. See
338
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.
334
339
335
340
We should discourage the creation of preventive EvictionRequests, so that they do not end up as
336
341
another PDB. So we should design the API appropriately and also not allow behaviors that do not
@@ -497,7 +502,8 @@ already exists for this pod, the requester should still add itself to the finali
497
502
are used for:
498
503
- Tracking the requesters of this eviction request intent. This is used for observability and to
499
504
handle concurrency for multiple requesters asking for the cancellation. The eviction request can
500
- be cancelled/deleted once all requesters have asked for the cancellation.
505
+ be cancelled/deleted once all requesters have asked for the cancellation (see
506
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details).
501
507
- Processing the eviction request result by the requester once the eviction process is complete.
502
508
503
509
If the eviction is no longer needed, the requester should remove itself from the finalizers of the
@@ -549,14 +555,14 @@ annotation. This annotation is parsed into the `Interceptor` type in the [Evicti
549
555
characters (`63 - len("priority_")`)
550
556
- `PRIORITY` and `ROLE`
551
557
- `controller` should always set `PRIORITY=10000` and `ROLE=controller`.
552
- - Other interceptors should set `PRIORITY` according to their own needs (minimum value is 0,
553
- maximum value is 100000). Higher priorities are selected first by the eviction request
554
- controller. They can use the `controller` interceptor as a reference point, if they want to be
555
- run before or after the `controller` interceptor. They can also observe pod annotations and
556
- detect what other interceptors have been registered for the eviction process. `ROLE` is optional
557
- and can be used as a signal to other interceptors. The `controller` value is reserved for pod
558
- controllers, but otherwise there is no guidance on how the third party interceptors should name
559
- their role.
558
+ - Other interceptors should set `PRIORITY` according to their own needs (minimum value (lowest
559
+ priority) is 0, maximum value (highest priority) is 100000). Higher priorities are selected
560
+ first by the eviction request controller. They can use the `controller` interceptor as a
561
+ reference point, if they want to be run before or after the `controller` interceptor. They can
562
+ also observe pod annotations and detect what other interceptors have been registered for the
563
+ eviction process. `ROLE` is optional and can be used as a signal to other interceptors. The
564
+ `controller` value is reserved for pod controllers, but otherwise there is no guidance on how
565
+ the third party interceptors should name their role.
560
566
- Priorities `9900-10100` are reserved for interceptors with a class that has the same parent
561
567
domain as the controller interceptor. Duplicate priorities are not allowed in this interval.
562
568
- The number of interceptor annotations is limited to 30 in the 9900-10100 interval and to 70
@@ -609,7 +615,8 @@ it may update the status every 3 minutes. The status updates should look as foll
609
615
request process of the pod cannot be stopped/cancelled. This will block any DELETE requests on the
610
616
EvictionRequest object. If the interceptor supports eviction request cancellation, it should make
611
617
sure that this field is set to `Allow`, and it should be aware that the EvictionRequest object can
612
- be deleted at any time.
618
+ be deleted at any time. See
619
+ [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.
613
620
- Update `.status.expectedInterceptorFinishTime` if a reasonable estimation can be made of how long
614
621
the eviction process will take for the current interceptor. This can be modified later to change
615
622
the estimate.
@@ -676,7 +683,7 @@ No attempt will be made to evict pods that are currently terminating.
676
683
If the pod eviction fails, e.g. due to a blocking PodDisruptionBudget, the
677
684
`.status.failedAPIEvictionCounter` is incremented and the pod is added back to the queue with
678
685
exponential backoff (maximum approx. 15 minutes). If there is a positive progress update in the
679
- ` .status.progressTimestamp` of the EvictionRequest, it will cancel the eviction.
686
+ `.status.progressTimestamp` of the EvictionRequest, it will cancel the API-initiated eviction.
680
687
681
688
#### Garbage Collection
682
689
@@ -695,6 +702,9 @@ For convenience, we will also remove requester finalizers with
695
702
`evictionrequest.coordination.k8s.io/` prefix when the eviction request task is complete (points 2
696
703
and 3). Other finalizers will still block deletion.
697
704
705
+ For convenience, we will set `.status.evictionRequestCancellationPolicy` back to `Allow` if the
706
+ value is `Forbid` and the pod has been fully terminated.
707
+
698
708
### EvictionRequest API
699
709
700
710
```golang
@@ -908,7 +918,11 @@ The pod labels are merged with the EvictionRequest labels (pod labels have a pre
908
918
for custom label selectors when observing the eviction requests.
909
919
910
920
`.status.activeInterceptorClass` should be empty on creation, as its selection should be left to the
911
- eviction request controller.
921
+ eviction request controller. To strengthen the validation, we should check that it is possible to
922
+ set only the highest priority interceptor in the beginning. After that it is possible to set only
923
+ the next interceptor and so on. We can also condition this transition according to the other fields.
924
+ `.status.activeInterceptorCompleted` should be true, or `.status.progressTimestamp` should have exceeded the
925
+ deadline.
912
926
913
927
`.status.evictionRequestCancellationPolicy` should be `Allow` on creation, as its resolution should be
914
928
left to the eviction request controller.
@@ -988,6 +1002,113 @@ The following diagrams describe what the EvictionRequest process will look like
988
1002
![eviction-request-process](eviction-request-process.svg)
989
1003
990
1004
1005
+ ### EvictionRequest Cancellation Examples
1006
+
1007
+ Let's assume there is a single pod p-1 of application P with interceptors A and B:
1008
+
1009
+ ```yaml
1010
+ apiVersion: v1
1011
+ kind: Pod
1012
+ metadata:
1013
+   annotations:
1014
+     interceptor.evictionrequest.coordination.k8s.io/priority_actor-a.k8s.io: "10000/controller"
1015
+     interceptor.evictionrequest.coordination.k8s.io/priority_actor-b.k8s.io: "11000/notifier-with-delay"
1016
+   name: p-1
1017
+ ```
1018
+
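
To show why Actor B is selected before Actor A in the examples that follow, here is a minimal sketch that parses the interceptor annotations of pod p-1 and orders them by priority. The `parsedInterceptor` type, the `parseInterceptors` helper, and the tie-break by class name are assumptions made for this illustration; they are not the `Interceptor` type from the EvictionRequest API.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const annotationPrefix = "interceptor.evictionrequest.coordination.k8s.io/priority_"

// parsedInterceptor is a hypothetical holder for one interceptor annotation.
type parsedInterceptor struct {
	Class    string
	Priority int
	Role     string // optional
}

// parseInterceptors extracts interceptors from pod annotations whose value
// has the form "PRIORITY" or "PRIORITY/ROLE" and returns them with the
// highest priority first.
func parseInterceptors(annotations map[string]string) ([]parsedInterceptor, error) {
	var out []parsedInterceptor
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		class := strings.TrimPrefix(key, annotationPrefix)
		prio, role, _ := strings.Cut(strings.TrimSpace(value), "/")
		p, err := strconv.Atoi(prio)
		if err != nil {
			return nil, fmt.Errorf("interceptor %q: invalid priority %q", class, prio)
		}
		if p < 0 || p > 100000 {
			return nil, fmt.Errorf("interceptor %q: priority %d out of range 0-100000", class, p)
		}
		out = append(out, parsedInterceptor{Class: class, Priority: p, Role: role})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		// Assumed tie-break to keep the order deterministic.
		return out[i].Class < out[j].Class
	})
	return out, nil
}

func main() {
	interceptors, err := parseInterceptors(map[string]string{
		annotationPrefix + "actor-a.k8s.io": "10000/controller",
		annotationPrefix + "actor-b.k8s.io": "11000/notifier-with-delay",
		"unrelated.example.com/key":         "ignored",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ic := range interceptors {
		fmt.Printf("%s priority=%d role=%s\n", ic.Class, ic.Priority, ic.Role)
	}
}
```

With the annotations of pod p-1, `actor-b.k8s.io` (priority 11000) sorts ahead of `actor-a.k8s.io` (priority 10000).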
1019
+ #### Multiple Dynamic Requesters and No EvictionRequest Cancellation
1020
+
1021
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
1022
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
1023
+ evict it from a node. It sets the
1024
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
1025
+ EvictionRequest.
1026
+ 3. The descheduling controller notices that the pod p-1 is running in the wrong zone. It wants to
1027
+ create an EvictionRequest (named after the pod's UID) for this pod, but the EvictionRequest
1028
+ already exists. It sets the
1029
+ `requester.evictionrequest.coordination.k8s.io/name_descheduling.avalanche.io` finalizer on the
1030
+ EvictionRequest.
1031
+ 4. The eviction request controller designates Actor B as the next interceptor by updating
1032
+ `.status.activeInterceptorClass`.
1033
+ 5. Actor B updates the EvictionRequest status and also sets
1034
+ `.status.evictionRequestCancellationPolicy=Allow`.
1035
+ 6. Actor B begins notifying users of application P that the application will experience
1036
+ a disruption and delays the disruption so that the users can finish their work.
1037
+ 7. The admin changes his/her mind and cancels the node drain of node Z and makes it schedulable
1038
+ again.
1039
+ 8. The node drain controller removes the
1040
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
1041
+ EvictionRequest.
1042
+ 9. The eviction request controller notices the change in finalizers, but there is still a
1043
+ descheduling finalizer, so no action is required.
1044
+ 10. Actor B sets `ActiveInterceptorCompleted=true` on the eviction request of pod p-1, which is
1045
+ ready to be deleted.
1046
+ 11. The eviction request controller designates Actor A as the next interceptor by updating
1047
+ `.status.activeInterceptorClass`.
1048
+ 12. Actor A updates the EvictionRequest status and ensures that
1049
+ `.status.evictionRequestCancellationPolicy=Allow`.
1050
+ 13. Actor A deletes the p-1 pod.
1051
+ 14. The EvictionRequest is garbage collected once the pod terminates, even with the descheduling
1052
+ finalizer present.
1053
+
1054
+ #### Single Dynamic Requester and EvictionRequest Cancellation
1055
+
1056
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
1057
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
1058
+ evict it from a node. It sets the
1059
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
1060
+ EvictionRequest.
1061
+ 3. The eviction request controller designates Actor B as the next interceptor by updating
1062
+ `.status.activeInterceptorClass`.
1063
+ 4. Actor B updates the EvictionRequest status and also sets
1064
+ `.status.evictionRequestCancellationPolicy=Allow`.
1065
+ 5. Actor B begins notifying users of application P that the application will experience
1066
+ a disruption and delays the disruption so that the users can finish their work.
1067
+ 6. The admin changes his/her mind and cancels the node drain of node Z and makes it schedulable
1068
+ again.
1069
+ 7. The node drain controller removes the
1070
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
1071
+ EvictionRequest.
1072
+ 8. The eviction request controller notices the change in finalizers, and deletes (GC) the
1073
+ EvictionRequest as there is no requester present.
1074
+ 9. Actor B can detect the removal of the EvictionRequest object and notify users of application P
1075
+ that the disruption has been cancelled. If it misses the deletion event, then no notification
1076
+ will be delivered. To avoid this, Actor B had the option of also setting a finalizer on the
1077
+ EvictionRequest.
1078
+
1079
+ #### Single Dynamic Requester and Forbidden EvictionRequest Cancellation
1080
+
1081
+ 1. A node drain controller starts draining a node Z and makes it unschedulable.
1082
+ 2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
1083
+ evict it from a node. It sets the
1084
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
1085
+ EvictionRequest.
1086
+ 3. The eviction request controller designates Actor B as the next interceptor by updating
1087
+ `.status.activeInterceptorClass`.
1088
+ 4. Actor B updates the EvictionRequest status and also sets
1089
+ `.status.evictionRequestCancellationPolicy=Forbid` to prevent the EvictionRequest from being deleted
1090
+ (enforced by API Admission).
1091
+ 5. Actor B begins notifying users of application P that the application will experience
1092
+ a disruption and delays the disruption so that the users can finish their work.
1093
+ 6. The admin changes his/her mind and cancels the node drain of node Z and makes it schedulable
1094
+ again.
1095
+ 7. The node drain controller removes the
1096
+ `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
1097
+ EvictionRequest.
1098
+ 8. The eviction request controller notices the change in finalizers. Normally it should delete (GC)
1099
+ the EvictionRequest as there is no requester present, but
1100
+ `.status.evictionRequestCancellationPolicy=Forbid` prevents this.
1101
+ 9. Actor B sets `ActiveInterceptorCompleted=true` on the eviction request of pod p-1, which is
1102
+ ready to be deleted.
1103
+ 10. The eviction request controller designates Actor A as the next interceptor by updating
1104
+ `.status.activeInterceptorClass`.
1105
+ 11. Actor A updates the EvictionRequest status and ensures that
1106
+ `.status.evictionRequestCancellationPolicy=Forbid`. Alternatively, it could also change it to
1107
+ `Allow` at this point, if `Forbid` was set only to ensure that Actor B's logic is atomic.
1108
+ 12. Actor A deletes the p-1 pod.
1109
+ 13. The EvictionRequest is garbage collected once the pod terminates. This first requires setting
1110
+ `.status.evictionRequestCancellationPolicy=Allow` to allow the object to be deleted.
1111
+
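
The three scenarios above follow one decision pattern, sketched below as a pure function. This is an illustrative reading of the examples, not the controller's actual logic: the `action` type, the `decide` function, and the way inputs are passed in are all hypothetical. Only the requester finalizer prefix and the `Allow`/`Forbid` values come from the text.

```go
package main

import (
	"fmt"
	"strings"
)

const requesterPrefix = "requester.evictionrequest.coordination.k8s.io/"

// action is a hypothetical result type used only in this sketch.
type action string

const (
	actionNone           action = "none"            // requesters remain, keep going
	actionCancel         action = "cancel"          // no requesters, delete the EvictionRequest
	actionBlocked        action = "blocked"         // no requesters, but cancellation is forbidden
	actionGarbageCollect action = "garbage-collect" // pod terminated, clean up regardless
)

// decide mirrors the examples: a terminated pod always leads to garbage
// collection (after the policy is reset to Allow); otherwise cancellation
// needs zero requester finalizers and an Allow policy.
func decide(finalizers []string, policy string, podTerminated bool) action {
	if podTerminated {
		return actionGarbageCollect
	}
	for _, f := range finalizers {
		if strings.HasPrefix(f, requesterPrefix) {
			return actionNone
		}
	}
	if policy == "Forbid" {
		return actionBlocked
	}
	return actionCancel
}

func main() {
	descheduling := requesterPrefix + "name_descheduling.avalanche.io"

	// State right after the node drain controller removes its finalizer.
	fmt.Println(decide([]string{descheduling}, "Allow", false)) // multiple requesters
	fmt.Println(decide(nil, "Allow", false))                    // single requester
	fmt.Println(decide(nil, "Forbid", false))                   // forbidden cancellation
	fmt.Println(decide([]string{descheduling}, "Allow", true))  // pod has terminated
}
```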
991
1112
### Follow-up Design Details for Kubernetes Workloads
992
1113
993
1114
Kubernetes Workloads should be made aware of the EvictionRequest API to properly support the
@@ -1095,7 +1216,8 @@ disruption for the underlying application. By scaling up first before terminatin
1095
1216
3. The node drain controller creates EvictionRequests for a subset B of pods A to evict them from
1096
1217
a node.
1097
1218
4. The eviction request controller designates the deployment controller as the interceptor based on
1098
- the highest priority. No action (termination) is taken on the pods yet.
1219
+ the highest priority by updating `.status.activeInterceptorClass`. No action (termination) is
1220
+ taken on the pods yet.
1099
1221
5. The deployment controller creates a set of surge pods C to compensate for the future loss of
1100
1222
availability of pods B. The new pods are created by temporarily surging the `.spec.replicas`
1101
1223
count of the underlying replica sets up to the value of deployments `maxSurge`.
@@ -1104,7 +1226,8 @@ disruption for the underlying application. By scaling up first before terminatin
1104
1226
8. The deployment controller scales down the surging replica sets back to their original value.
1105
1227
9. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
1106
1228
pods B that are ready to be deleted.
1107
- 10 . The eviction request controller designates the replica set controller as the next interceptor.
1229
+ 10. The eviction request controller designates the replica set controller as the next interceptor by
1230
+ updating `.status.activeInterceptorClass`.
1108
1231
11. The replica set controller deletes the pods to which an EvictionRequest object has been
1109
1232
assigned, preserving the availability of the application.
1110
1233
@@ -1194,15 +1317,17 @@ first before terminating the pods.
1194
1317
4. The node drain controller creates an EvictionRequest for the only pod of application W to evict
1195
1318
it from a node.
1196
1319
5. The eviction request controller designates the HPA as the interceptor based on the highest
1197
- priority. No action (termination) is taken on the single pod yet.
1320
+ priority by updating `.status.activeInterceptorClass`. No action (termination) is taken on the
1321
+ single pod yet.
1198
1322
6. The HPA controller creates a single surge pod B to compensate for the future loss of
1199
1323
availability of pod A. The new pod is created by temporarily scaling up the deployment.
1200
1324
7. Pod B is scheduled on a new schedulable node that is not under the node drain.
1201
1325
8. Pod B becomes available.
1202
1326
9. The HPA scales the surging deployment back down to 1 replica.
1203
1327
10. The HPA sets `ActiveInterceptorCompleted=true` on the eviction requests of pod A, which is ready
1204
1328
to be deleted.
1205
- 11 . The eviction request controller designates the replica set controller as the next interceptor.
1329
+ 11. The eviction request controller designates the replica set controller as the next interceptor by
1330
+ updating `.status.activeInterceptorClass`.
1206
1331
12. The replica set controller deletes the pods to which an EvictionRequest object has been
1207
1332
assigned, preserving the availability of the webserver.
1208
1333
@@ -1230,11 +1355,13 @@ HPA Downscaling example:
1230
1355
priority. No action (termination) is taken on the pods yet.
1231
1356
6. The HPA downscales the Deployment workload.
1232
1357
7. The HPA sets `ActiveInterceptorCompleted=true` on its own eviction requests.
1233
- 8 . The eviction request controller designates the deployment controller as the next interceptor.
1358
+ 8. The eviction request controller designates the deployment controller as the next interceptor by
1359
+ updating `.status.activeInterceptorClass`.
1234
1360
9. The deployment controller subsequently scales down the underlying ReplicaSet(s).
1235
1361
10. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
1236
1362
pods that are ready to be deleted.
1237
- 11 . The eviction request controller designates the replica set controller as the next interceptor.
1363
+ 11. The eviction request controller designates the replica set controller as the next interceptor by
1364
+ updating `.status.activeInterceptorClass`.
1238
1365
12. The replica set controller deletes the pods to which an EvictionRequest object has been
1239
1366
assigned, preserving the scheduling constraints.
1240
1367
@@ -1772,6 +1899,7 @@ Pros:
1772
1899
- Versatility; users can use any name they see fit.
1773
1900
- `.metadata.generateName` is supported.
1774
1901
- Actors in the system have a greater incentive to use `.spec.podRef`.
1902
+
1775
1903
Cons:
1776
1904
- Name conflict resolution is left up to the users, but as a workaround they can simply generate the
1777
1905
name.