docs: Add a section Node Auto Repair (#7622)

engedaam · web-flow · commit 4a997d98ede3 · 2025-01-24T14:53:42.000-08:00
diff --git a/website/content/en/preview/concepts/disruption.md b/website/content/en/preview/concepts/disruption.md
@@ -162,7 +162,6 @@ Behavioral Fields are treated as over-arching settings on the NodePool to dictat
 
 Read the [Drift Design](https://github.com/aws/karpenter-core/blob/main/designs/drift.md) for more.
 
-To enable the drift feature flag, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates" >}}).
 
 Karpenter will add the `Drifted` status condition on NodeClaims if the NodeClaim is drifted from its owning NodePool. Karpenter will also remove the `Drifted` status condition if either:
 1. The `Drift` feature gate is not enabled but the NodeClaim is drifted, Karpenter will remove the status condition.
@@ -198,6 +197,40 @@ Karpenter enables this feature by watching an SQS queue which receives critical
 
 To enable interruption handling, configure the `--interruption-queue` CLI argument with the name of the interruption queue provisioned to handle interruption events.
 
+### Node Auto Repair 
+
+<i class="fa-solid fa-circle-info"></i> <b>Feature State: </b> Karpenter v1.1.0 [alpha]({{<ref "../reference/settings#feature-gates" >}})
+
+Node Auto Repair automatically identifies and replaces unhealthy nodes in your cluster, helping to maintain overall cluster health. Nodes can experience failures affecting their hardware, file systems, or container environments. These failures may be surfaced through node conditions such as network unavailability, disk pressure, memory pressure, or other conditions reported by node diagnostic agents. When Karpenter detects these unhealthy conditions, it replaces the affected nodes according to repair policies defined by the cloud provider. Once a node has remained unhealthy beyond the toleration duration configured for that condition, Karpenter forcefully terminates the node and its corresponding NodeClaim, bypassing the standard drain and grace period procedures so that problematic nodes are replaced quickly. To prevent cascading failures, Karpenter includes safety mechanisms: it will not perform repairs if more than 20% of the nodes in a NodePool are unhealthy, and for standalone NodeClaims it evaluates this threshold against all nodes in the cluster. This keeps your cluster healthy with minimal manual intervention, even when the node's unhealthy state would interfere with normal termination procedures.
+
+To enable Node Auto Repair: 
+  1.  Ensure you have a [Node Monitoring Agent](https://docs.aws.amazon.com/en_us/eks/latest/userguide/node-health.html) deployed, or any other agent that adds supported status conditions to nodes (e.g., Node Problem Detector)
+  2.  Enable the feature flag: `NodeRepair=true`
+  3.  Once enabled, Node Auto Repair automatically terminates nodes that have unhealthy status conditions, based on your cloud provider's repair policies
+
+
+Karpenter initiates a repair action when a node has held one of the following status conditions for longer than its toleration duration:
+
+
+#### Kubelet Node Conditions
+
+| Type  | Status  | Toleration Duration |
+| ----- | ------- | ------------------- |
+| Ready | False   | 30 minutes          |
+| Ready | Unknown | 30 minutes          |
+
+#### Node Monitoring Agent Conditions
+
+| Type                     | Status | Toleration Duration |
+| ------------------------ | ------ | ------------------- |
+| AcceleratedHardwareReady | False  | 10 minutes          |
+| StorageReady             | False  | 30 minutes          |
+| NetworkingReady          | False  | 30 minutes          |
+| KernelReady              | False  | 30 minutes          |
+| ContainerRuntimeReady    | False  | 30 minutes          |
+
+To enable the `NodeRepair` feature gate, refer to the [Feature Gates]({{<ref "../reference/settings#feature-gates" >}}).
+
 ## Controls
 
 ### TerminationGracePeriod