
Commit a9cb79d

jmdeal authored and edibble21 committed
docs: update pod level controls for TGP (aws#7710)
1 parent 0e14e94 commit a9cb79d

5 files changed (+340 −173 lines)

website/content/en/docs/concepts/disruption.md

+72 −39
@@ -70,18 +70,14 @@ Automated graceful methods can be rate limited through [NodePool Disruption Bud
 * Nodes can be removed as their workloads will run on other nodes in the cluster.
 * Nodes can be replaced with lower priced variants due to a change in the workloads.
 * [**Drift**]({{<ref "#drift" >}}): Karpenter will mark nodes as drifted and disrupt nodes that have drifted from their desired specification. See [Drift]({{<ref "#drift" >}}) to see which fields are considered.
-* [**Interruption**]({{<ref "#interruption" >}}): Karpenter will watch for upcoming interruption events that could affect your nodes (health events, spot interruption, etc.) and will taint, drain, and terminate the node(s) ahead of the event to reduce workload disruption.
 
 {{% alert title="Defaults" color="secondary" %}}
-Disruption is configured through the NodePool's disruption block by the `consolidationPolicy`, and `consolidateAfter` fields. `expireAfter` can also be used to control disruption. Karpenter will configure these fields with the following values by default if they are not set:
+Disruption is configured through the NodePool's disruption block by the `consolidationPolicy` and `consolidateAfter` fields. Karpenter will configure these fields with the following values by default if they are not set:
 
 ```yaml
 spec:
   disruption:
     consolidationPolicy: WhenEmptyOrUnderutilized
-  template:
-    spec:
-      expireAfter: 720h
 ```
 {{% /alert %}}
 
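For reference, a minimal sketch of a NodePool disruption block that sets both fields explicitly (the `consolidateAfter` value here is illustrative, not a default):

```yaml
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m  # illustrative: wait 1 minute after a node becomes consolidatable
```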
@@ -169,10 +165,22 @@ Karpenter will add the `Drifted` status condition on NodeClaims if the NodeClaim
 
 ## Automated Forceful Methods
 
-Automated forceful methods will begin draining nodes as soon as the condition is met. Note that these methods blow past NodePool Disruption Budgets, and do not wait for a pre-spin replacement node to be healthy for the pods to reschedule, unlike the graceful methods mentioned above. Use Pod Disruption Budgets and `do-not-disrupt` on your nodes to rate-limit the speed at which your applications are disrupted.
+Automated forceful methods will begin draining nodes as soon as the condition is met.
+Unlike the graceful methods mentioned above, these methods cannot be rate-limited using [NodePool Disruption Budgets](#nodepool-disruption-budgets), and they do not wait for a pre-spun replacement node to become healthy before pods are rescheduled.
+Pod disruption budgets may be used to rate-limit application disruption.
 
 ### Expiration
-Karpenter will disrupt nodes as soon as they're expired after they've lived for the duration of the NodePool's `spec.template.spec.expireAfter`. You can use expiration to periodically recycle nodes due to security concern.
+
+A node is expired once its lifetime exceeds the duration set in the owning NodeClaim's `spec.expireAfter` field.
+Changes to `spec.template.spec.expireAfter` on the owning NodePool will not update the field for existing NodeClaims - it will induce NodeClaim drift, and the replacements will have the updated value.
+Expiration can be used, in conjunction with [`terminationGracePeriod`](#termination-grace-period), to enforce a maximum Node lifetime.
+By default, `expireAfter` is set to `720h` (30 days).
+
+{{% alert title="Warning" color="warning" %}}
+Misconfigured PDBs and pods with the `karpenter.sh/do-not-disrupt` annotation may block draining indefinitely.
+For this reason, it is not recommended to set `expireAfter` without also setting `terminationGracePeriod` **if** your cluster has pods with the `karpenter.sh/do-not-disrupt` annotation.
+Doing so can result in partially drained nodes stuck in the cluster, driving up cluster cost and potentially requiring manual intervention to resolve.
+{{% /alert %}}
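As a sketch, a NodePool that overrides the default lifetime might set the field like this (the value is illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      expireAfter: 168h  # illustrative: recycle nodes weekly; copied to each NodeClaim's spec.expireAfter
```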
 
 ### Interruption
 
@@ -197,13 +205,13 @@ Karpenter enables this feature by watching an SQS queue which receives critical
 
 To enable interruption handling, configure the `--interruption-queue` CLI argument with the name of the interruption queue provisioned to handle interruption events.
 
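For example, when installing Karpenter with its Helm chart, the queue is typically wired in through the `settings.interruptionQueue` value, which translates to this CLI argument (the queue name below is illustrative):

```yaml
# Helm values.yaml sketch - becomes the --interruption-queue CLI argument
settings:
  interruptionQueue: Karpenter-my-cluster  # illustrative queue name
```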
 ### Node Auto Repair
 
 <i class="fa-solid fa-circle-info"></i> <b>Feature State: </b> Karpenter v1.1.0 [alpha]({{<ref "../reference/settings#feature-gates" >}})
 
 Node Auto Repair is a feature that automatically identifies and replaces unhealthy nodes in your cluster, helping to maintain overall cluster health. Nodes can experience various types of failures affecting their hardware, file systems, or container environments. These failures may be surfaced through node conditions such as network unavailability, disk pressure, memory pressure, or other conditions reported by node diagnostic agents. When Karpenter detects these unhealthy conditions, it automatically replaces the affected nodes based on cloud provider-defined repair policies. Once a node has been in an unhealthy state beyond its configured toleration duration, Karpenter will forcefully terminate the node and its corresponding NodeClaim, bypassing the standard drain and grace period procedures to ensure swift replacement of problematic nodes. To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of nodes in a NodePool are unhealthy, and for standalone NodeClaims, it evaluates this threshold against all nodes in the cluster. This ensures your cluster remains in a healthy state with minimal manual intervention, even in scenarios where normal node termination procedures might be impacted by the node's unhealthy state.
 
 To enable Node Auto Repair:
 1. Ensure you have a [Node Monitoring Agent](https://docs.aws.amazon.com/en_us/eks/latest/userguide/node-health.html) deployed or any agent that will add status conditions to nodes that are supported (e.g., Node Problem Detector)
 2. Enable the feature flag: `NodeRepair=true`
 3. Node AutoRepair will automatically terminate nodes when they have unhealthy status conditions based on your cloud provider's repair policies
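Assuming the Helm chart's `settings.featureGates` block is used to set feature gates, enabling it might look like this sketch (confirm the exact key against the feature gates reference):

```yaml
# Helm values.yaml sketch - enables the NodeRepair feature gate
settings:
  featureGates:
    nodeRepair: true
```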
@@ -214,36 +222,58 @@ Karpenter monitors nodes for the following node status conditions when initiatin
 
 #### Kubelet Node Conditions
 
 | Type  | Status  | Toleration Duration |
 | ----- | ------- | ------------------- |
 | Ready | False   | 30 minutes          |
 | Ready | Unknown | 30 minutes          |
 
 #### Node Monitoring Agent Conditions
 
 | Type                     | Status | Toleration Duration |
 | ------------------------ | ------ | ------------------- |
 | AcceleratedHardwareReady | False  | 10 minutes          |
 | StorageReady             | False  | 30 minutes          |
 | NetworkingReady          | False  | 30 minutes          |
 | KernelReady              | False  | 30 minutes          |
 | ContainerRuntimeReady    | False  | 30 minutes          |
 
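For context, an unhealthy condition as it might appear in a Node's status, matching the table above (a sketch; the reason and timestamp are illustrative):

```yaml
# Sketch: a Node status condition that would trigger repair after its toleration duration
status:
  conditions:
  - type: NetworkingReady
    status: "False"
    reason: InterfaceNotReady  # illustrative reason from a node monitoring agent
    lastTransitionTime: "2025-01-01T00:00:00Z"
```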
 To enable the feature flag, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates" >}}).
 
 ## Controls
 
 ### TerminationGracePeriod
 
-You can set a NodePool's `terminationGracePeriod` through the `spec.template.spec.terminationGracePeriod` field. This field defines the duration of time that a node can be draining before it's forcibly deleted. A node begins draining when it's deleted. Pods will be deleted preemptively based on its TerminationGracePeriodSeconds before this terminationGracePeriod ends to give as much time to cleanup as possible. Note that if your pod's terminationGracePeriodSeconds is larger than this terminationGracePeriod, Karpenter may forcibly delete the pod before it has its full terminationGracePeriod to cleanup.
+To configure a maximum termination duration, `terminationGracePeriod` should be used.
+It is configured through a NodePool's [`spec.template.spec.terminationGracePeriod`]({{<ref "../concepts/nodepools/#spectemplatespecterminationgraceperiod" >}}) field, and is persisted to created NodeClaims (`spec.terminationGracePeriod`).
+Changes to the [`spec.template.spec.terminationGracePeriod`]({{<ref "../concepts/nodepools/#spectemplatespecterminationgraceperiod" >}}) field on the NodePool will not result in a change for existing NodeClaims - it will induce NodeClaim drift, and the replacements will have the updated `terminationGracePeriod`.
 
-This is especially useful in combination with `nodepool.spec.template.spec.expireAfter` to define an absolute maximum on the lifetime of a node, where a node is deleted at `expireAfter` and finishes draining within the `terminationGracePeriod` thereafter. Pods blocking eviction like PDBs and do-not-disrupt will block full draining until the `terminationGracePeriod` is reached.
+Once a node is disrupted, via either a [graceful](#automated-graceful-methods) or [forceful](#automated-forceful-methods) disruption method, Karpenter will begin draining the node.
+At this point, the countdown for `terminationGracePeriod` begins.
+Once the `terminationGracePeriod` elapses, remaining pods will be forcibly deleted and the underlying instance will be terminated.
+A node may be terminated before the `terminationGracePeriod` has elapsed if all disruptable pods have been drained.
+
+In conjunction with `expireAfter`, `terminationGracePeriod` can be used to enforce an absolute maximum node lifetime.
+The node will begin to drain once its `expireAfter` has elapsed, and it will be forcibly terminated once its `terminationGracePeriod` has elapsed, making the maximum node lifetime the sum of the two fields.
+
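As a sketch, a NodePool enforcing an absolute maximum node lifetime of 30 days plus a bounded one-hour drain might look like this (values are illustrative):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      expireAfter: 720h           # node begins draining after 30 days
      terminationGracePeriod: 1h  # and is forcibly terminated at most 1 hour later
```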
+Additionally, configuring `terminationGracePeriod` changes the eligibility criteria for disruption via `Drift`.
+When configured, a node may be disrupted via drift even if there are pods with blocking PDBs or the `karpenter.sh/do-not-disrupt` annotation scheduled to it.
+This enables cluster administrators to ensure crucial updates (e.g. AMI updates addressing CVEs) can't be blocked by misconfigured applications.
+
+{{% alert title="Warning" color="warning" %}}
+To ensure that the `terminationGracePeriodSeconds` value for draining pods is respected, pods will be preemptively deleted before the Node's `terminationGracePeriod` has elapsed.
+This includes pods with blocking [pod disruption budgets](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) or the [`karpenter.sh/do-not-disrupt` annotation]({{<ref "#pod-level-controls" >}}).
 
-For instance, a NodeClaim with `terminationGracePeriod` set to `1h` and an `expireAfter` set to `23h` will begin draining after it's lived for `23h`. Let's say a `do-not-disrupt` pod has `TerminationGracePeriodSeconds` set to `300` seconds. If the node hasn't been fully drained after `55m`, Karpenter will delete the pod to allow it's full `terminationGracePeriodSeconds` to cleanup. If no pods are blocking draining, Karpenter will cleanup the node as soon as the node is fully drained, rather than waiting for the NodeClaim's `terminationGracePeriod` to finish.
+Consider the following example: a Node with a 1 hour `terminationGracePeriod` has been disrupted and begins to drain.
+A pod with the `karpenter.sh/do-not-disrupt` annotation and a 300 second (5 minute) `terminationGracePeriodSeconds` is scheduled to it.
+If the pod is still running 55 minutes after the Node begins to drain, the pod will be deleted to ensure its `terminationGracePeriodSeconds` value is respected.
+
+If a pod's `terminationGracePeriodSeconds` value exceeds the `terminationGracePeriod` of the Node it is scheduled to, Karpenter will prioritize the Node's `terminationGracePeriod`.
+The pod will be deleted as soon as the Node begins to drain, and it will not receive its full `terminationGracePeriodSeconds`.
+{{% /alert %}}
 
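A sketch of the pod from this example (names and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: long-running-job  # illustrative name
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  terminationGracePeriodSeconds: 300  # deleted ~55m into the node's 1h drain window
  containers:
  - name: main
    image: public.ecr.aws/docker/library/busybox:latest  # illustrative image
    command: ["sleep", "infinity"]
```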
 ### NodePool Disruption Budgets
 
 You can rate limit Karpenter's disruption through the NodePool's `spec.disruption.budgets`. If undefined, Karpenter will default to one budget with `nodes: 10%`. Budgets will consider nodes that are actively being deleted for any reason, and will only block Karpenter from disrupting nodes voluntarily through drift, emptiness, and consolidation. Note that NodePool Disruption Budgets do not prevent Karpenter from terminating expired nodes.
 
 #### Reasons
 Karpenter allows specifying if a budget applies to any of `Drifted`, `Underutilized`, or `Empty`. When a budget has no reasons, it's assumed that it applies to all reasons. When calculating allowed disruptions for a given reason, Karpenter will take the minimum of the budgets that have listed the reason or have left reasons undefined.
@@ -256,29 +286,26 @@ If the budget is configured with a percentage value, such as `20%`, Karpenter wi
 For example, the following NodePool with three budgets defines the following requirements:
 - The first budget will only allow 20% of nodes owned by that NodePool to be disrupted if it's empty or drifted. For instance, if there were 19 nodes owned by the NodePool, 4 empty or drifted nodes could be disrupted, rounding up from `19 * .2 = 3.8`.
 - The second budget acts as a ceiling to the previous budget, only allowing 5 disruptions when there are more than 25 nodes.
 - The last budget only blocks disruptions during the first 10 minutes of the day, where 0 disruptions are allowed, only applying to underutilized nodes.
 
 ```yaml
 apiVersion: karpenter.sh/v1
 kind: NodePool
 metadata:
   name: default
 spec:
-  template:
-    spec:
-      expireAfter: 720h # 30 * 24h = 720h
   disruption:
     consolidationPolicy: WhenEmptyOrUnderutilized
     budgets:
     - nodes: "20%"
       reasons:
       - "Empty"
       - "Drifted"
     - nodes: "5"
     - nodes: "0"
       schedule: "@daily"
       duration: 10m
       reasons:
       - "Underutilized"
 ```
 
@@ -307,8 +334,18 @@ Duration and Schedule must be defined together. When omitted, the budget is alwa
 
 ### Pod-Level Controls
 
-You can block Karpenter from voluntarily choosing to disrupt certain pods by setting the `karpenter.sh/do-not-disrupt: "true"` annotation on the pod. This is useful for pods that you want to run from start to finish without disruption. By opting pods out of this disruption, you are telling Karpenter that it should not voluntarily remove a node containing this pod.
-
+You can block Karpenter from voluntarily disrupting and draining pods by adding the `karpenter.sh/do-not-disrupt: "true"` annotation to the pod.
+You can treat this annotation as a single-pod, permanently blocking PDB.
+This has the following consequences:
+- Nodes with `karpenter.sh/do-not-disrupt` pods will be excluded from [Consolidation]({{<ref "#consolidation" >}}), and conditionally excluded from [Drift]({{<ref "#drift" >}}).
+  - If the Node's owning NodeClaim has a [`terminationGracePeriod`]({{<ref "#terminationgraceperiod" >}}) configured, it will still be eligible for disruption via drift.
+- Like pods with a blocking PDB, pods with the `karpenter.sh/do-not-disrupt` annotation will **not** be gracefully evicted by the [Termination Controller]({{<ref "#terminationcontroller" >}}).
+  Karpenter will not be able to complete termination of the node until one of the following conditions is met:
+  - All pods with the `karpenter.sh/do-not-disrupt` annotation are removed.
+  - All pods with the `karpenter.sh/do-not-disrupt` annotation have entered a [terminal phase](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) (`Succeeded` or `Failed`).
+  - The owning NodeClaim's [`terminationGracePeriod`]({{<ref "#terminationgraceperiod" >}}) has elapsed.
+
+This is useful for pods that you want to run from start to finish without disruption.
 Examples of pods that you might want to opt-out of disruption include an interactive game that you don't want to interrupt or a long batch job (such as you might have with machine learning) that would need to start over if it were interrupted.
 
 ```yaml
@@ -322,20 +359,16 @@ spec:
 ```
 
 {{% alert title="Note" color="primary" %}}
-This annotation will be ignored for [terminating pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) and [terminal pods](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase) (Failed/Succeeded).
-{{% /alert %}}
-
-Examples of voluntary node removal that will be prevented by this annotation include:
-- [Consolidation]({{<ref "#consolidation" >}})
-- [Drift]({{<ref "#drift" >}})
-
-{{% alert title="Note" color="primary" %}}
-Voluntary node removal does not include [Interruption]({{<ref "#interruption" >}}) or manual deletion initiated through `kubectl delete node`. Both of these are considered involuntary events, since node removal cannot be delayed.
+The `karpenter.sh/do-not-disrupt` annotation does **not** exclude nodes from the forceful disruption methods: [Expiration]({{<ref "#expiration" >}}), [Interruption]({{<ref "#interruption" >}}), [Node Repair]({{<ref "#node-repair" >}}), and manual deletion (e.g. `kubectl delete node ...`).
+While both interruption and node repair have implicit upper bounds on termination time, expiration and manual termination do not.
+Manual intervention may be required to unblock node termination, by removing pods with the `karpenter.sh/do-not-disrupt` annotation.
+For this reason, it is not recommended to use the `karpenter.sh/do-not-disrupt` annotation with `expireAfter` **if** you have not also configured `terminationGracePeriod`.
 {{% /alert %}}
 
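To make the "single-pod, permanently blocking PDB" analogy above concrete, the annotation behaves roughly like this hypothetical PDB scoped to a single pod (names and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: block-my-pod  # illustrative name
spec:
  maxUnavailable: 0   # never allow a voluntary eviction
  selector:
    matchLabels:
      app: my-app     # illustrative label selecting the single pod
```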
### Node-Level Controls
 
-You can block Karpenter from voluntarily choosing to disrupt certain nodes by setting the `karpenter.sh/do-not-disrupt: "true"` annotation on the node. This will prevent disruption actions on the node.
+You can block Karpenter from voluntarily choosing to disrupt certain nodes by setting the `karpenter.sh/do-not-disrupt: "true"` annotation on the node.
+This will prevent voluntary disruption actions against the node.
 
 ```yaml
 apiVersion: v1
 kind: Node
 metadata:
   annotations:
     karpenter.sh/do-not-disrupt: "true"
 ```
