Commit 4a997d9

docs: Add a section Node Auto Repair (#7622)
1 parent 143e8c2 commit 4a997d9

1 file changed, +34 -1 lines changed

website/content/en/preview/concepts/disruption.md

Lines changed: 34 additions & 1 deletion
@@ -162,7 +162,6 @@ Behavioral Fields are treated as over-arching settings on the NodePool to dictat
 
 Read the [Drift Design](https://github.com/aws/karpenter-core/blob/main/designs/drift.md) for more.
 
-To enable the drift feature flag, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates" >}}).
 
 Karpenter will add the `Drifted` status condition on NodeClaims if the NodeClaim is drifted from its owning NodePool. Karpenter will also remove the `Drifted` status condition if either:
 1. The `Drift` feature gate is not enabled but the NodeClaim is drifted, Karpenter will remove the status condition.
@@ -198,6 +197,40 @@ Karpenter enables this feature by watching an SQS queue which receives critical
 
 To enable interruption handling, configure the `--interruption-queue` CLI argument with the name of the interruption queue provisioned to handle interruption events.
 
+### Node Auto Repair
+
+<i class="fa-solid fa-circle-info"></i> <b>Feature State: </b> Karpenter v1.1.0 [alpha]({{<ref "../reference/settings#feature-gates" >}})
+
+Node Auto Repair is a feature that automatically identifies and replaces unhealthy nodes in your cluster, helping to maintain overall cluster health. Nodes can experience various types of failures affecting their hardware, file systems, or container environments. These failures may be surfaced through node conditions such as network unavailability, disk pressure, memory pressure, or other conditions reported by node diagnostic agents. When Karpenter detects these unhealthy conditions, it automatically replaces the affected nodes based on cloud provider-defined repair policies. Once a node has been in an unhealthy state beyond its configured toleration duration, Karpenter will forcefully terminate the node and its corresponding NodeClaim, bypassing the standard drain and grace period procedures to ensure swift replacement of problematic nodes. To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of nodes in a NodePool are unhealthy, and for standalone NodeClaims, it evaluates this threshold against all nodes in the cluster. This ensures your cluster remains in a healthy state with minimal manual intervention, even in scenarios where normal node termination procedures might be impacted by the node's unhealthy state.
+
+To enable Node Auto Repair:
+1. Ensure you have a [Node Monitoring Agent](https://docs.aws.amazon.com/en_us/eks/latest/userguide/node-health.html) deployed or any agent that will add status conditions to nodes that are supported (e.g., Node Problem Detector)
+2. Enable the feature flag: `NodeRepair=true`
+3. Node AutoRepair will automatically terminate nodes when they have unhealthy status conditions based on your cloud provider's repair policies
+
+
+Karpenter monitors nodes for the following node status conditions when initiating repair actions:
+
+
+#### Kubelet Node Conditions
+
+| Type | Status | Toleration Duration |
+| ------ | ------------- | ------------------- |
+| Ready | False | 30 minutes |
+| Ready | Unknown | 30 minutes |
+
+#### Node Monitoring Agent Conditions
+
+| Type | Status | Toleration Duration |
+| ------------------------ | ------------| --------------------- |
+| AcceleratedHardwareReady | False | 10 minutes |
+| StorageReady | False | 30 minutes |
+| NetworkingReady | False | 30 minutes |
+| KernelReady | False | 30 minutes |
+| ContainerRuntimeReady | False | 30 minutes |
+
+To enable the drift feature flag, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates" >}}).
+
 ## Controls
 
 ### TerminationGracePeriod
