You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: website/content/en/preview/concepts/disruption.md
+34-1Lines changed: 34 additions & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -162,7 +162,6 @@ Behavioral Fields are treated as over-arching settings on the NodePool to dictat
162
162
163
163
Read the [Drift Design](https://github.com/aws/karpenter-core/blob/main/designs/drift.md) for more.
164
164
165
-
To enable the drift feature flag, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates">}}).
166
165
167
166
Karpenter will add the `Drifted` status condition on NodeClaims if the NodeClaim is drifted from its owning NodePool. Karpenter will also remove the `Drifted` status condition if either:
168
167
1. The `Drift` feature gate is not enabled but the NodeClaim is drifted, Karpenter will remove the status condition.
@@ -198,6 +197,40 @@ Karpenter enables this feature by watching an SQS queue which receives critical
198
197
199
198
To enable interruption handling, configure the `--interruption-queue` CLI argument with the name of the interruption queue provisioned to handle interruption events.
Node Auto Repair is a feature that automatically identifies and replaces unhealthy nodes in your cluster, helping to maintain overall cluster health. Nodes can experience various types of failures affecting their hardware, file systems, or container environments. These failures may be surfaced through node conditions such as network unavailability, disk pressure, memory pressure, or other conditions reported by node diagnostic agents. When Karpenter detects these unhealthy conditions, it automatically replaces the affected nodes based on cloud provider-defined repair policies. Once a node has been in an unhealthy state beyond its configured toleration duration, Karpenter will forcefully terminate the node and its corresponding NodeClaim, bypassing the standard drain and grace period procedures to ensure swift replacement of problematic nodes. To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of nodes in a NodePool are unhealthy, and for standalone NodeClaims, it evaluates this threshold against all nodes in the cluster. This ensures your cluster remains in a healthy state with minimal manual intervention, even in scenarios where normal node termination procedures might be impacted by the node's unhealthy state.
205
+
206
+
To enable Node Auto Repair:
207
+
1. Ensure you have a [Node Monitoring Agent](https://docs.aws.amazon.com/en_us/eks/latest/userguide/node-health.html) deployed or any agent that will add status conditions to nodes that are supported (e.g., Node Problem Detector)
208
+
2. Enable the feature flag: `NodeRepair=true`
209
+
3. Node AutoRepair will automatically terminate nodes when they have unhealthy status conditions based on your cloud provider's repair policies
210
+
211
+
212
+
Karpenter monitors nodes for the following node status conditions when initiating repair actions:
0 commit comments