Commit 9df74ad

add EvictionRequest Cancellation Examples section + fixes
1 parent 921e1df commit 9df74ad

1 file changed: +147 -19 lines changed

keps/sig-apps/4563-eviction-request-api/README.md

@@ -39,6 +39,10 @@
 - [Pod Admission](#pod-admission)
 - [Immutability of EvictionRequest Spec Fields](#immutability-of-evictionrequest-spec-fields)
 - [EvictionRequest Process](#evictionrequest-process)
+- [EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples)
+- [Multiple Dynamic Requesters and No EvictionRequest Cancellation](#multiple-dynamic-requesters-and-no-evictionrequest-cancellation)
+- [Single Dynamic Requester and EvictionRequest Cancellation](#single-dynamic-requester-and-evictionrequest-cancellation)
+- [Single Dynamic Requester and Forbidden EvictionRequest Cancellation](#single-dynamic-requester-and-forbidden-evictionrequest-cancellation)
 - [Follow-up Design Details for Kubernetes Workloads](#follow-up-design-details-for-kubernetes-workloads)
 - [ReplicaSet Controller](#replicaset-controller)
 - [Deployment Controller](#deployment-controller)
@@ -330,7 +334,8 @@ coincide with the deletion of the pod (evict or delete call). In some scenarios,
 terminated (e.g. by a remote call) if the pod `restartPolicy` allows it, to preserve the pod data
 for further processing or debugging.
 
-The interceptor can also choose whether it handles EvictionRequest cancellation.
+The interceptor can also choose whether it handles EvictionRequest cancellation. See
+[EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.
 
 We should discourage the creation of preventive EvictionRequests, so that they do not end up as
 another PDB. So we should design the API appropriately and also not allow behaviors that do not
@@ -497,7 +502,8 @@ already exists for this pod, the requester should still add itself to the finali
 are used for:
 - Tracking the requesters of this eviction request intent. This is used for observability and to
 handle concurrency for multiple requesters asking for the cancellation. The eviction request can
-be cancelled/deleted once all requesters have asked for the cancellation.
+be cancelled/deleted once all requesters have asked for the cancellation (see
+[EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details).
 - Processing the eviction request result by the requester once the eviction process is complete.
 
 If the eviction is no longer needed, the requester should remove itself from the finalizers of the
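The finalizer bookkeeping described above can be sketched in Go, the language of the KEP's API definitions. This is a minimal sketch under stated assumptions, not the KEP's implementation. The `requester.evictionrequest.coordination.k8s.io/` prefix is taken from the finalizer names in the cancellation examples added by this commit. The `EvictionRequest` struct and the `requesters`/`canBeCancelled` helpers are hypothetical simplifications. In this sketch, an EvictionRequest counts as cancellable only after every requester has removed its finalizer and `.status.evictionRequestCancellationPolicy` is `Allow`.

```go
package main

import (
	"fmt"
	"strings"
)

// requesterPrefix is taken from the finalizer names in the cancellation examples.
const requesterPrefix = "requester.evictionrequest.coordination.k8s.io/"

// EvictionRequest is a hypothetical, heavily simplified stand-in for the API object.
type EvictionRequest struct {
	Finalizers         []string
	CancellationPolicy string // .status.evictionRequestCancellationPolicy: "Allow" or "Forbid"
}

// requesters returns the requester finalizers that are still present.
func requesters(er EvictionRequest) []string {
	var out []string
	for _, f := range er.Finalizers {
		if strings.HasPrefix(f, requesterPrefix) {
			out = append(out, f)
		}
	}
	return out
}

// canBeCancelled reports whether the controller may delete (cancel) the EvictionRequest:
// every requester has withdrawn and cancellation is allowed. Non-requester finalizers
// are ignored here, although they can still delay the final removal of the object.
func canBeCancelled(er EvictionRequest) bool {
	return len(requesters(er)) == 0 && er.CancellationPolicy == "Allow"
}

func main() {
	er := EvictionRequest{
		Finalizers: []string{
			requesterPrefix + "name_nodemaintenance.k8s.io",
			requesterPrefix + "name_descheduling.avalanche.io",
		},
		CancellationPolicy: "Allow",
	}
	er.Finalizers = er.Finalizers[1:] // the node drain controller withdraws
	fmt.Println(canBeCancelled(er))   // false: the descheduler is still a requester
	er.Finalizers = er.Finalizers[1:] // the descheduler withdraws as well
	fmt.Println(canBeCancelled(er))   // true
}
```

This mirrors the first cancellation example. The drain controller's withdrawal alone does not cancel the request, because the descheduling requester is still registered.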
@@ -549,14 +555,14 @@ annotation. This annotation is parsed into the `Interceptor` type in the [Evicti
 characters (`63 - len("priority_")`)
 - `PRIORITY` and `ROLE`
 - `controller` should always set a `PRIORITY=10000` and `ROLE=controller`.
-- Other interceptors should set `PRIORITY` according to their own needs (minimum value is 0,
-maximum value is 100000). Higher priorities are selected first by the eviction request
-controller. They can use the `controller` interceptor as a reference point, if they want to be
-run before or after the `controller` interceptor. They can also observe pod annotations and
-detect what other interceptors have been registered for the eviction process. `ROLE` is optional
-and can be used as a signal to other interceptors. The `controller` value is reserved for pod
-controllers, but otherwise there is no guidance on how the third party interceptors should name
-their role.
+- Other interceptors should set `PRIORITY` according to their own needs (minimum value (lowest
+priority) is 0, maximum value (highest priority) is 100000). Higher priorities are selected
+first by the eviction request controller. They can use the `controller` interceptor as a
+reference point if they want to run before or after the `controller` interceptor. They can
+also observe pod annotations and detect what other interceptors have been registered for the
+eviction process. `ROLE` is optional and can be used as a signal to other interceptors. The
+`controller` value is reserved for pod controllers, but otherwise there is no guidance on how
+third-party interceptors should name their role.
 - Priorities `9900-10100` are reserved for interceptors with a class that has the same parent
 domain as the controller interceptor. Duplicate priorities are not allowed in this interval.
 - The number of interceptor annotations is limited to 30 in the 9900-10100 interval and to 70
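The following rough Go sketch shows how the eviction request controller might read the interceptor annotations and order them by priority. It makes several assumptions. The annotation key prefix and the `PRIORITY/ROLE` value format come from pod p-1 in the cancellation examples added by this commit. The `Interceptor` type and `parseInterceptors` function are hypothetical. The class-length check assumes that the `63 - len("priority_")` limit applies to the class name. The reserved 9900-10100 interval and the annotation-count limits are not enforced here.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// annotationPrefix matches the pod annotations used in the cancellation examples.
const annotationPrefix = "interceptor.evictionrequest.coordination.k8s.io/priority_"

// Interceptor is a hypothetical parsed form of a single interceptor annotation.
type Interceptor struct {
	Class    string
	Priority int
	Role     string
}

// parseInterceptors reads interceptor annotations and returns them ordered from the
// highest to the lowest priority (ties broken by class name for a stable order).
func parseInterceptors(annotations map[string]string) ([]Interceptor, error) {
	var out []Interceptor
	for key, value := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		class := strings.TrimPrefix(key, annotationPrefix)
		// Assumed: the class name is limited to 63 - len("priority_") characters.
		if class == "" || len(class) > 63-len("priority_") {
			return nil, fmt.Errorf("invalid interceptor class %q", class)
		}
		// The value is PRIORITY/ROLE; ROLE is optional.
		prioStr, role, _ := strings.Cut(value, "/")
		prio, err := strconv.Atoi(prioStr)
		if err != nil || prio < 0 || prio > 100000 {
			return nil, fmt.Errorf("invalid priority %q for interceptor %q", prioStr, class)
		}
		out = append(out, Interceptor{Class: class, Priority: prio, Role: role})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].Class < out[j].Class
	})
	return out, nil
}

func main() {
	// Annotations of pod p-1 from the cancellation examples.
	annotations := map[string]string{
		annotationPrefix + "actor-a.k8s.io": "10000/controller",
		annotationPrefix + "actor-b.k8s.io": "11000/notifier-with-delay",
	}
	interceptors, err := parseInterceptors(annotations)
	if err != nil {
		panic(err)
	}
	for _, ic := range interceptors {
		fmt.Println(ic.Priority, ic.Class, ic.Role)
	}
	// 11000 actor-b.k8s.io notifier-with-delay
	// 10000 actor-a.k8s.io controller
}
```

Actor B (11000) is ordered before Actor A (10000). This matches the order in which the cancellation examples designate them. The tie-break by class name is only a choice made in this sketch so that map iteration order does not leak into the result.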
@@ -609,7 +615,8 @@ it may update the status every 3 minutes. The status updates should look as foll
 request process of the pod cannot be stopped/cancelled. This will block any DELETE requests on the
 EvictionRequest object. If the interceptor supports eviction request cancellation, it should make
 sure that this field is set to `Allow`, and it should be aware that the EvictionRequest object can
-be deleted at any time.
+be deleted at any time. See
+[EvictionRequest Cancellation Examples](#evictionrequest-cancellation-examples) for details.
 - Update `.status.expectedInterceptorFinishTime` if a reasonable estimation can be made of how long
 the eviction process will take for the current interceptor. This can be modified later to change
 the estimate.
@@ -676,7 +683,7 @@ No attempt will be made to evict pods that are currently terminating.
 If the pod eviction fails, e.g. due to a blocking PodDisruptionBudget, the
 `.status.failedAPIEvictionCounter` is incremented and the pod is added back to the queue with
 exponential backoff (maximum approx. 15 minutes). If there is a positive progress update in the
-`.status.progressTimestamp` of the EvictionRequest, it will cancel the eviction.
+`.status.progressTimestamp` of the EvictionRequest, it will cancel the API-initiated eviction.
 
 #### Garbage Collection
 
@@ -695,6 +702,9 @@ For convenience, we will also remove requester finalizers with
 `evictionrequest.coordination.k8s.io/` prefix when the eviction request task is complete (points 2
 and 3). Other finalizers will still block deletion.
 
+For convenience, we will set `.status.evictionRequestCancellationPolicy` back to `Allow` if the
+value is `Forbid` and the pod has been fully terminated.
+
 ### EvictionRequest API
 
 ```golang
@@ -908,7 +918,11 @@ The pod labels are merged with the EvictionRequest labels (pod labels have a pre
 for custom label selectors when observing the eviction requests.
 
 `.status.activeInterceptorClass` should be empty on creation as its selection should be left on the
-eviction request controller.
+eviction request controller. To strengthen the validation, we should check that only the highest
+priority interceptor can be set initially. After that, only the next interceptor in priority order
+can be set, and so on. We can also make each transition conditional on other fields: either
+`.status.activeInterceptorCompleted` should be true or `.status.progressTimestamp` should have
+exceeded the deadline.
 
 `.status.evictionRequestCancellationPolicy` should be `Allow` on creation, as its resolution should be
 left to the eviction request controller.
@@ -988,6 +1002,113 @@ The following diagrams describe what the EvictionRequest process will look like
 ![eviction-request-process](eviction-request-process.svg)
 
 
+### EvictionRequest Cancellation Examples
+
+Let's assume there is a single pod p-1 of application P with interceptors Actor A and Actor B:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  annotations:
+    interceptor.evictionrequest.coordination.k8s.io/priority_actor-a.k8s.io: "10000/controller"
+    interceptor.evictionrequest.coordination.k8s.io/priority_actor-b.k8s.io: "11000/notifier-with-delay"
+  name: p-1
+```
+
+#### Multiple Dynamic Requesters and No EvictionRequest Cancellation
+
+1. A node drain controller starts draining node Z and makes it unschedulable.
+2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+evict it from node Z. It sets the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+EvictionRequest.
+3. The descheduling controller notices that pod p-1 is running in the wrong zone. It wants to
+create an EvictionRequest (named after the pod's UID) for this pod, but the EvictionRequest
+already exists. It therefore adds the
+`requester.evictionrequest.coordination.k8s.io/name_descheduling.avalanche.io` finalizer to the
+existing EvictionRequest.
+4. The eviction request controller designates Actor B, the highest-priority interceptor, as the
+active interceptor by updating `.status.activeInterceptorClass`.
+5. Actor B updates the EvictionRequest status and also sets
+`.status.evictionRequestCancellationPolicy=Allow`.
+6. Actor B begins notifying users of application P that the application will experience
+a disruption and delays the disruption so that the users can finish their work.
+7. The admin changes their mind, cancels the drain of node Z, and makes it schedulable
+again.
+8. The node drain controller removes the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+EvictionRequest.
+9. The eviction request controller notices the change in finalizers, but the descheduling
+finalizer is still present, so no action is required.
+10. Actor B sets `ActiveInterceptorCompleted=true` on the EvictionRequest of pod p-1, which is
+now ready to be deleted.
+11. The eviction request controller designates Actor A as the next interceptor by updating
+`.status.activeInterceptorClass`.
+12. Actor A updates the EvictionRequest status and ensures that
+`.status.evictionRequestCancellationPolicy=Allow`.
+13. Actor A deletes pod p-1.
+14. The EvictionRequest is garbage collected once pod p-1 terminates, even with the descheduling
+finalizer present.
+
+#### Single Dynamic Requester and EvictionRequest Cancellation
+
+1. A node drain controller starts draining node Z and makes it unschedulable.
+2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+evict it from node Z. It sets the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+EvictionRequest.
+3. The eviction request controller designates Actor B, the highest-priority interceptor, as the
+active interceptor by updating `.status.activeInterceptorClass`.
+4. Actor B updates the EvictionRequest status and also sets
+`.status.evictionRequestCancellationPolicy=Allow`.
+5. Actor B begins notifying users of application P that the application will experience
+a disruption and delays the disruption so that the users can finish their work.
+6. The admin changes their mind, cancels the drain of node Z, and makes it schedulable
+again.
+7. The node drain controller removes the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+EvictionRequest.
+8. The eviction request controller notices the change in finalizers and deletes (GC) the
+EvictionRequest because no requester is present.
+9. Actor B can detect the removal of the EvictionRequest object and notify users of application P
+that the disruption has been cancelled. If it misses the deletion event, no notification
+will be delivered. To avoid this, Actor B could also have set its own finalizer on the
+EvictionRequest.
+
+#### Single Dynamic Requester and Forbidden EvictionRequest Cancellation
+
+1. A node drain controller starts draining node Z and makes it unschedulable.
+2. The node drain controller creates an EvictionRequest for the only pod p-1 of application P to
+evict it from node Z. It sets the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer on the
+EvictionRequest.
+3. The eviction request controller designates Actor B, the highest-priority interceptor, as the
+active interceptor by updating `.status.activeInterceptorClass`.
+4. Actor B updates the EvictionRequest status and also sets
+`.status.evictionRequestCancellationPolicy=Forbid` to prevent the deletion of the EvictionRequest
+(enforced by API admission).
+5. Actor B begins notifying users of application P that the application will experience
+a disruption and delays the disruption so that the users can finish their work.
+6. The admin changes their mind, cancels the drain of node Z, and makes it schedulable
+again.
+7. The node drain controller removes the
+`requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io` finalizer from the
+EvictionRequest.
+8. The eviction request controller notices the change in finalizers. Normally it would delete (GC)
+the EvictionRequest because no requester is present, but
+`.status.evictionRequestCancellationPolicy=Forbid` prevents this.
+9. Actor B sets `ActiveInterceptorCompleted=true` on the EvictionRequest of pod p-1, which is
+now ready to be deleted.
+10. The eviction request controller designates Actor A as the next interceptor by updating
+`.status.activeInterceptorClass`.
+11. Actor A updates the EvictionRequest status and ensures that
+`.status.evictionRequestCancellationPolicy=Forbid`. Alternatively, it could change it to
+`Allow` at this point if `Forbid` was only set to make Actor B's logic atomic.
+12. Actor A deletes pod p-1.
+13. The EvictionRequest is garbage collected once pod p-1 terminates. The eviction request
+controller first sets `.status.evictionRequestCancellationPolicy=Allow` so the object can be deleted.
+
 ### Follow-up Design Details for Kubernetes Workloads
 
 Kubernetes Workloads should be made aware of the EvictionRequest API to properly support the
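The Forbidden EvictionRequest Cancellation example and the new garbage collection rule (resetting `Forbid` to `Allow` once the pod has fully terminated) can be replayed with a small in-memory Go model. This is a hypothetical sketch, not the eviction request controller. The `EvictionRequest` struct and `reconcile` method are invented for illustration. The sketch also assumes that pod termination counts as completion of the eviction request task, so requester finalizers are dropped at that point.

```go
package main

import "fmt"

// Policy mirrors the values of .status.evictionRequestCancellationPolicy.
type Policy string

const (
	Allow  Policy = "Allow"
	Forbid Policy = "Forbid"
)

// EvictionRequest is a hypothetical in-memory model used only for this walk-through.
type EvictionRequest struct {
	Requesters    map[string]bool // requester finalizers that are still present
	Policy        Policy          // .status.evictionRequestCancellationPolicy
	PodTerminated bool            // whether pod p-1 has fully terminated
	Deleted       bool            // whether the object has been deleted (GC)
}

// reconcile sketches how the eviction request controller reacts to changes.
func (er *EvictionRequest) reconcile() {
	if er.Deleted {
		return
	}
	if er.PodTerminated {
		// GC rules: requester finalizers are removed and Forbid is reset to Allow.
		er.Requesters = map[string]bool{}
		er.Policy = Allow
	}
	// Deletion happens only without requesters and only while cancellation is allowed.
	if len(er.Requesters) == 0 && er.Policy == Allow {
		er.Deleted = true
	}
}

const drain = "requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io"

func main() {
	// Steps 2 and 4: the drain controller requests the eviction, Actor B forbids cancellation.
	er := &EvictionRequest{Requesters: map[string]bool{drain: true}, Policy: Forbid}

	// Steps 7 and 8: the drain is cancelled, but Forbid keeps the object alive.
	delete(er.Requesters, drain)
	er.reconcile()
	fmt.Println("deleted after drain cancellation:", er.Deleted) // false

	// Steps 12 and 13: Actor A deletes p-1; once it terminates, the object is GC'd.
	er.PodTerminated = true
	er.reconcile()
	fmt.Println("deleted after pod termination:", er.Deleted, "policy:", er.Policy) // true Allow
}
```

Starting from `Policy: Allow` instead reproduces the second example. The first `reconcile` call deletes the EvictionRequest as soon as the drain finalizer is removed.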
@@ -1095,7 +1216,8 @@ disruption for the underlying application. By scaling up first before terminatin
 3. The node drain controller creates an EvictionRequests for a subset B of pods A to evict them from
 a node.
 4. The eviction request controller designates the deployment controller as the interceptor based on
-the highest priority. No action (termination) is taken on the pods yet.
+the highest priority by updating `.status.activeInterceptorClass`. No action (termination) is
+taken on the pods yet.
 5. The deployment controller creates a set of surge pods C to compensate for the future loss of
 availability of pods B. The new pods are created by temporarily surging the `.spec.replicas`
 count of the underlying replica sets up to the value of deployments `maxSurge`.
@@ -1104,7 +1226,8 @@ disruption for the underlying application. By scaling up first before terminatin
 8. The deployment controller scales down the surging replica sets back to their original value.
 9. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
 pods B that are ready to be deleted.
-10. The eviction request controller designates the replica set controller as the next interceptor.
+10. The eviction request controller designates the replica set controller as the next interceptor by
+updating `.status.activeInterceptorClass`.
 11. The replica set controller deletes the pods to which an EvictionRequest object has been
 assigned, preserving the availability of the application.
 
@@ -1194,15 +1317,17 @@ first before terminating the pods.
 4. The node drain controller creates an EvictionRequest for the only pod of application W to evict
 it from a node.
 5. The eviction request controller designates the HPA as the interceptor based on the highest
-priority. No action (termination) is taken on the single pod yet.
+priority by updating `.status.activeInterceptorClass`. No action (termination) is taken on the
+single pod yet.
 6. The HPA controller creates a single surge pod B to compensate for the future loss of
 availability of pod A. The new pod is created by temporarily scaling up the deployment.
 7. Pod B is scheduled on a new schedulable node that is not under the node drain.
 8. Pod B becomes available.
 9. The HPA scales the surging deployment back down to 1 replica.
 10. The HPA sets `ActiveInterceptorCompleted=true` on the eviction requests of pod A, which is ready
 to be deleted.
-11. The eviction request controller designates the replica set controller as the next interceptor.
+11. The eviction request controller designates the replica set controller as the next interceptor by
+updating `.status.activeInterceptorClass`.
 12. The replica set controller deletes the pods to which an EvictionRequest object has been
 assigned, preserving the availability of the webserver.
 
@@ -1230,11 +1355,13 @@ HPA Downscaling example:
 priority. No action (termination) is taken on the pods yet.
 6. The HPA downscales the Deployment workload.
 7. The HPA sets `ActiveInterceptorCompleted=true` on its own eviction requests.
-8. The eviction request controller designates the deployment controller as the next interceptor.
+8. The eviction request controller designates the deployment controller as the next interceptor by
+updating `.status.activeInterceptorClass`.
 9. The deployment controller subsequently scales down the underlying ReplicaSet(s).
 10. The deployment controller sets `ActiveInterceptorCompleted=true` on the eviction requests of
 pods that are ready to be deleted.
-11. The eviction request controller designates the replica set controller as the next interceptor.
+11. The eviction request controller designates the replica set controller as the next interceptor by
+updating `.status.activeInterceptorClass`.
 12. The replica set controller deletes the pods to which an EvictionRequest object has been
 assigned, preserving the scheduling constraints.
 
@@ -1772,6 +1899,7 @@ Pros:
 - Versatility; users can use any name they see fit.
 - `.metadata.generateName` is supported.
 - Actors in the system have a greater incentive to use `.spec.podRef`.
+
 Cons:
 - Name conflict resolution is left up to the users, but as a workaround they can simply generate the
 name.
