feat: add configurable node repair unhealthy threshold #2739
Conversation
Welcome @clark42!
Hi @clark42. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clark42

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Instead of introducing additional configuration surface area, would it make sense to provide opinionated default unhealthy thresholds based on cluster size, for example small (<10 nodes), medium (<50 nodes), and large (<100 nodes) clusters? The specific node counts here are illustrative rather than prescriptive; the underlying idea is that different cluster scales may warrant different threshold categories. Related to that, should this behavior be defined at the cluster level rather than per-node-pool? These thresholds were originally introduced in node repair as a safety mechanism to reactively block repairs during large-scale outages. Exposing this value as a configurable knob, especially at the node-pool level, would require users to reason about and tune it for each pool independently, which could make overall node repair behavior harder to reason about across the cluster.
Thanks for the feedback @engedaam! I completely understand the safety concern: the 20% threshold exists as a guardrail to prevent cascading failures during large-scale outages, and exposing it as a configurable knob could lead users to inadvertently weaken this protection. Addressing your points:

1. The threshold is already per-NodePool, not cluster-level. Looking at the current implementation in `pkg/controllers/node/health/controller.go`, the hardcoded 20% is already evaluated against each NodePool's nodes rather than cluster-wide, so making it configurable does not change the scope at which it applies.

2. Documentation to mitigate the risk. I'd be happy to add documentation explaining the purpose of the threshold and the trade-offs of lowering it. This way, users who need flexibility (small pools, specific use-cases) can tune it, while the documentation ensures they understand the trade-offs.

3. Context: small NodePools and the AWS approach. Our use-case: a 4-node pool where 2 nodes became unhealthy. With 20% (rounding to 0-1 node), auto-repair was blocked for 3+ hours. The safety mechanism designed for large-scale outages prevented recovery on a small pool. AWS recently introduced similar configurability for managed node groups with `maxUnhealthyNodeThresholdCount` and `maxUnhealthyNodeThresholdPercentage`.

Alternative if full configurability is still a concern: I'm also open to implementing automatic thresholds based on NodePool size (as you suggested), for example small/medium/large buckets with progressively stricter percentages. This would preserve safety for large pools while allowing small pools to recover, without requiring user configuration. Let me know how you'd like to proceed!
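For concreteness, a minimal Go sketch of what such size-based defaults could look like; the bucket boundaries and percentages below are hypothetical illustrations for discussion, not values proposed in this PR:

```go
package health

import "k8s.io/apimachinery/pkg/util/intstr"

// defaultUnhealthyThreshold returns a size-based repair threshold for a
// NodePool with the given node count. The bucket boundaries and percentages
// below are hypothetical examples, not values proposed in this PR.
func defaultUnhealthyThreshold(nodeCount int) intstr.IntOrString {
	switch {
	case nodeCount < 10:
		// Small pools: a single unhealthy node can already exceed 20%, so allow more headroom.
		return intstr.FromString("50%")
	case nodeCount < 50:
		return intstr.FromString("35%")
	default:
		// Large pools keep the existing conservative default.
		return intstr.FromString("20%")
	}
}
```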
Add NodeRepairUnhealthyThreshold field to NodePool's Disruption spec, allowing operators to configure the maximum percentage or count of unhealthy nodes before node auto repair is blocked.

This addresses the issue where the hardcoded 20% threshold is too restrictive for small NodePools (3-5 nodes), causing node auto repair to be blocked when only 1-2 nodes become unhealthy. Similar to AWS Managed Node Groups' maxUnhealthyNodeThresholdCount and maxUnhealthyNodeThresholdPercentage configuration options.

The field accepts either:
- A percentage (e.g., "50%")
- An absolute count (e.g., "2")

Default remains "20%" for backward compatibility.

Fixes kubernetes-sigs#2134
Add tests covering:
- Custom percentage threshold on NodePool (50%)
- Blocking repair when exceeding percentage threshold (25%)
- Absolute count threshold (3)
- Blocking repair when exceeding count threshold (2)
- Default 20% behavior when threshold not configured
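A hedged sketch of how these scenarios might be written as a table-driven Go test; the node counts chosen for each case, the direct use of `intstr` helpers, and the round-down assumption are illustrative rather than the PR's actual test code:

```go
package health_test

import (
	"testing"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// TestUnhealthyThresholdResolution is an illustrative table-driven sketch of
// the scenarios listed above; the totals and unhealthy counts per case are
// assumptions for illustration.
func TestUnhealthyThresholdResolution(t *testing.T) {
	cases := []struct {
		name        string
		threshold   string
		total       int
		unhealthy   int
		wantAllowed bool
	}{
		{"custom percentage permits repair", "50%", 4, 2, true},
		{"exceeding percentage blocks repair", "25%", 4, 2, false},
		{"absolute count permits repair", "3", 10, 3, true},
		{"exceeding count blocks repair", "2", 10, 3, false},
		{"default 20% blocks repair on a small pool", "20%", 4, 2, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			parsed := intstr.Parse(tc.threshold)
			// Round percentages down, as assumed throughout this sketch.
			allowed, err := intstr.GetScaledValueFromIntOrPercent(&parsed, tc.total, false)
			if err != nil {
				t.Fatal(err)
			}
			if got := tc.unhealthy <= allowed; got != tc.wantAllowed {
				t.Errorf("threshold %q, %d/%d unhealthy: got allowed=%v, want %v",
					tc.threshold, tc.unhealthy, tc.total, got, tc.wantAllowed)
			}
		})
	}
}
```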
8055965 to 7e11847
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What does this PR do?
Adds a new `nodeRepairUnhealthyThreshold` field to the `NodePool.spec.disruption` configuration, allowing operators to configure the maximum percentage or count of unhealthy nodes before node auto repair is blocked.

Why is this needed?
The current hardcoded 20% threshold is too restrictive for small NodePools (3-5 nodes). When 2 out of 4 nodes become unhealthy (50%), node auto repair is blocked with the message "more than 20% nodes are unhealthy in the nodepool", leaving the cluster in a degraded state requiring manual intervention.
Real-world use case
In our EKS cluster with Karpenter, we experienced an incident where 2 out of 4 nodes went into NotReady state simultaneously. Despite having Node Auto Repair enabled (`nodeRepair: true`), Karpenter refused to repair these nodes because 50% exceeds the 20% threshold. The event showed: "more than 20% nodes are unhealthy in the nodepool".
This left us with a degraded cluster for hours until manual intervention.
How does it work?
New field in NodePool spec
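A minimal sketch of how such a field might be declared on the `Disruption` struct in `pkg/apis/v1/nodepool.go`; the string type, the validation pattern, and the kubebuilder markers below are assumptions, not the PR's exact code:

```go
// Package v1 sketch: only the new field is shown; the Disruption struct's
// existing fields are omitted for brevity.
package v1

type Disruption struct {
	// NodeRepairUnhealthyThreshold is the maximum share of unhealthy nodes,
	// expressed as a percentage ("50%") or an absolute count ("2"), beyond
	// which node auto repair is blocked for this NodePool.
	// +kubebuilder:validation:Pattern=`^((100|[0-9]{1,2})%|[0-9]+)$`
	// +kubebuilder:default:="20%"
	// +optional
	NodeRepairUnhealthyThreshold string `json:"nodeRepairUnhealthyThreshold,omitempty"`
}
```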
Accepted values

- `"20%"`, `"50%"`, `"100%"` (string with `%` suffix)
- `"1"`, `"2"`, `"5"` (string representing an integer)
Default behavior

Default remains `"20%"` for backward compatibility. Existing NodePools without this field will behave exactly as before.

Comparison with AWS Managed Node Groups
AWS EKS Managed Node Groups offer similar configurability:
- `maxUnhealthyNodeThresholdCount`
- `maxUnhealthyNodeThresholdPercentage`

See: https://docs.aws.amazon.com/eks/latest/userguide/node-health.html
This PR brings similar flexibility to Karpenter.
Changes
- Added `NodeRepairUnhealthyThreshold` field to the `Disruption` struct in `pkg/apis/v1/nodepool.go`
- Updated `pkg/controllers/node/health/controller.go` to use the NodePool's configured threshold instead of the hardcoded value
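A rough sketch of the gating logic the controller change might implement; the helper name, the fallback to the `20%` default, and the round-down behaviour are assumptions for illustration, not the PR's actual code in `pkg/controllers/node/health/controller.go`:

```go
package health

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// defaultNodeRepairUnhealthyThreshold mirrors the documented default.
const defaultNodeRepairUnhealthyThreshold = "20%"

// repairAllowed is a hypothetical helper: node auto repair proceeds only while
// the number of unhealthy nodes is at or below the NodePool's resolved threshold.
func repairAllowed(threshold string, unhealthy, total int) (bool, error) {
	if threshold == "" {
		// NodePools that don't set the field keep the pre-existing behaviour.
		threshold = defaultNodeRepairUnhealthyThreshold
	}
	parsed := intstr.Parse(threshold)
	allowed, err := intstr.GetScaledValueFromIntOrPercent(&parsed, total, false)
	if err != nil {
		return false, fmt.Errorf("resolving nodeRepairUnhealthyThreshold %q: %w", threshold, err)
	}
	return unhealthy <= allowed, nil
}
```

Under these assumptions, the 4-node example above resolves to `repairAllowed("20%", 2, 4) == false` (blocked) and `repairAllowed("50%", 2, 4) == true` (repair proceeds).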
Testing

Related Issues
Fixes #2134
Checklist