Conversation

@clark42 clark42 commented Dec 23, 2025

What does this PR do?

Adds a new nodeRepairUnhealthyThreshold field to the NodePool.spec.disruption configuration, allowing operators to configure the maximum percentage or absolute count of unhealthy nodes a NodePool may have before node auto repair is blocked.

Why is this needed?

The current hardcoded 20% threshold is too restrictive for small NodePools (3-5 nodes). When 2 out of 4 nodes become unhealthy (50%), node auto repair is blocked with the message "more than 20% nodes are unhealthy in the nodepool", leaving the cluster in a degraded state requiring manual intervention.

Real-world use case

In our EKS cluster with Karpenter, we experienced an incident where 2 out of 4 nodes went into NotReady state simultaneously. Despite having Node Auto Repair enabled (nodeRepair: true), Karpenter refused to repair these nodes because 50% exceeds the 20% threshold.

The event showed:

NodeRepairBlocked: more than 20% nodes are unhealthy in the nodepool

This left us with a degraded cluster for hours until manual intervention.

How does it work?

New field in NodePool spec

apiVersion: karpenter.sh/v1
kind: NodePool
spec:
  disruption:
    nodeRepairUnhealthyThreshold: "50%"  # or absolute count like "2"

Accepted values

  • Percentage: "20%", "50%", "100%" (string with % suffix)
  • Absolute count: "1", "2", "5" (string representing integer)
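
As an illustration (not code from this PR), a threshold string like this can be resolved into an allowed unhealthy-node count with the intstr helpers from k8s.io/apimachinery that Karpenter already uses for disruption budgets; the allowedUnhealthy function below is hypothetical:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// allowedUnhealthy resolves a threshold such as "50%" or "2" into the
// maximum number of unhealthy nodes tolerated in a pool of totalNodes.
func allowedUnhealthy(threshold string, totalNodes int) (int, error) {
	// intstr.Parse treats "2" as an integer and "50%" as a string percentage.
	val := intstr.Parse(threshold)
	// With roundUp=false, percentages truncate: "20%" of 4 nodes allows 0.
	return intstr.GetScaledValueFromIntOrPercent(&val, totalNodes, false)
}

func main() {
	for _, t := range []string{"20%", "50%", "2"} {
		n, _ := allowedUnhealthy(t, 4)
		fmt.Printf("threshold %q on 4 nodes -> %d unhealthy allowed\n", t, n)
	}
}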

Default behavior

Default remains "20%" for backward compatibility. Existing NodePools without this field will behave exactly as before.

Comparison with AWS Managed Node Groups

AWS EKS Managed Node Groups offer similar configurability:

  • maxUnhealthyNodeThresholdCount
  • maxUnhealthyNodeThresholdPercentage

See: https://docs.aws.amazon.com/eks/latest/userguide/node-health.html

This PR brings similar flexibility to Karpenter.

Changes

  • Added NodeRepairUnhealthyThreshold field to Disruption struct in pkg/apis/v1/nodepool.go
  • Modified pkg/controllers/node/health/controller.go to use the NodePool's configured threshold instead of the hardcoded value
  • Updated event messages to show the actual configured threshold
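
For reference, the new field might look roughly like the sketch below; the kubebuilder markers and doc comment are assumptions modeled on Karpenter's existing budget fields, not this PR's exact diff:

// Excerpt sketch for pkg/apis/v1/nodepool.go.
type Disruption struct {
	// ... existing fields (ConsolidationPolicy, ConsolidateAfter, Budgets) ...

	// NodeRepairUnhealthyThreshold is the maximum count ("2") or
	// percentage ("50%") of unhealthy nodes tolerated before node auto
	// repair is blocked for this NodePool. Defaults to "20%".
	// +kubebuilder:validation:Pattern=`^((100|[0-9]{1,2})%|[0-9]+)$`
	// +kubebuilder:default:="20%"
	// +optional
	NodeRepairUnhealthyThreshold string `json:"nodeRepairUnhealthyThreshold,omitempty"`
}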

Testing

  • Unit tests (to be added)
  • Manual testing on EKS cluster

Related Issues

Fixes #2134

Checklist

  • Code compiles correctly
  • Backward compatible (default is still 20%)
  • Documentation updates
  • Unit tests

linux-foundation-easycla bot commented Dec 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot

Welcome @clark42!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 23, 2025
@k8s-ci-robot

Hi @clark42. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Dec 23, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clark42
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 23, 2025
@engedaam

Instead of introducing additional configuration surface area, would it make sense to provide opinionated default unhealthy thresholds based on cluster size, for example, small (<10 nodes), medium (<50 nodes), and large (<100 nodes) clusters? The specific node counts here are illustrative rather than prescriptive; the underlying idea is that different cluster scales may warrant different threshold categories.

Related to that, should this behavior be defined at the cluster level rather than per-node-pool?

These thresholds were originally introduced in node repair as a safety mechanism to reactively block repairs during large-scale outages. Exposing this value as a configurable knob, especially at the node-pool level, would require users to reason about and tune it for each pool independently, which could make overall node repair behavior harder to understand and reason about across the cluster.

clark42 commented Dec 30, 2025

Thanks for the feedback @engedaam!

I completely understand the safety concern: the 20% threshold exists as a guardrail to prevent cascading failures during large-scale outages, and exposing it as a configurable knob could lead users to inadvertently weaken this protection.

Addressing your points:

1. The threshold is already per-NodePool, not cluster-level

Looking at the current implementation in controller.go, the 20% threshold is already evaluated per NodePool (via isNodePoolHealthy), not at the cluster level. This PR doesn't change that scope; it only makes the hardcoded value configurable.
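
Concretely, the gate could consume the configured value along these lines (a simplified sketch; nodeRepairAllowed and the empty-string fallback are illustrative, not the controller's actual code):

package health

import (
	"k8s.io/apimachinery/pkg/util/intstr"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// nodeRepairAllowed reports whether repairs may proceed for this NodePool,
// comparing its unhealthy-node count against the configured threshold.
func nodeRepairAllowed(nodePool *v1.NodePool, unhealthy, total int) (bool, error) {
	threshold := nodePool.Spec.Disruption.NodeRepairUnhealthyThreshold
	if threshold == "" {
		threshold = "20%" // fall back to today's hardcoded default
	}
	val := intstr.Parse(threshold)
	allowed, err := intstr.GetScaledValueFromIntOrPercent(&val, total, false)
	if err != nil {
		return false, err
	}
	// The comparison is scoped to this NodePool's nodes, not the cluster.
	return unhealthy <= allowed, nil
}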

2. Documentation to mitigate the risk

I'd be happy to add documentation explaining:

  • The purpose of this threshold as a safety mechanism
  • The risk of cascading failures when setting a higher value
  • Recommended values based on NodePool size and criticality
  • A clear warning that increasing this value should only be done with full understanding of the implications

This way, users who need flexibility (small pools, specific use-cases) can tune it, while the documentation ensures they understand the trade-offs.

3. Context: small NodePools and AWS approach

Our use-case: a 4-node pool where 2 nodes became unhealthy. With a 20% threshold (0.8 nodes for a 4-node pool, effectively 0-1 nodes depending on rounding), auto-repair was blocked for over 3 hours. A safety mechanism designed for large-scale outages prevented recovery on a small pool.

AWS recently introduced similar configurability for managed node groups with maxUnhealthyNodeThresholdCount and maxUnhealthyNodeThresholdPercentage (docs), suggesting this need for tunability is recognized.

Alternative if full configurability is still a concern:

I'm also open to implementing automatic thresholds based on NodePool size (as you suggested), for example:

  • A minimum floor like max(20%, 2 nodes)
  • Or tiered thresholds: small pools (<10 nodes) → 50%, medium (<50) → 30%, large → 20%

This would preserve safety for large pools while allowing small pools to recover, without requiring user configuration.
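
Either variant would be a small pure function; the sketch below uses the illustrative numbers from this comment and hypothetical helper names:

package health

// defaultUnhealthyThreshold picks a tiered default threshold by pool size:
// small pools (<10 nodes) -> 50%, medium (<50) -> 30%, large -> 20%.
func defaultUnhealthyThreshold(totalNodes int) string {
	switch {
	case totalNodes < 10:
		return "50%"
	case totalNodes < 50:
		return "30%"
	default:
		return "20%"
	}
}

// allowedWithFloor is the floor variant: tolerate max(20% of pool, 2 nodes).
func allowedWithFloor(totalNodes int) int {
	if pct := totalNodes * 20 / 100; pct > 2 {
		return pct
	}
	return 2
}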

Let me know how you'd like to proceed!

Add NodeRepairUnhealthyThreshold field to NodePool's Disruption spec,
allowing operators to configure the maximum percentage or count of
unhealthy nodes before node auto repair is blocked.

This addresses the issue where the hardcoded 20% threshold is too
restrictive for small NodePools (3-5 nodes), causing node auto repair
to be blocked when only 1-2 nodes become unhealthy.

Similar to AWS Managed Node Groups' maxUnhealthyNodeThresholdCount
and maxUnhealthyNodeThresholdPercentage configuration options.

The field accepts either:
- A percentage (e.g., "50%")
- An absolute count (e.g., "2")

Default remains "20%" for backward compatibility.

Fixes kubernetes-sigs#2134

Add tests covering:
- Custom percentage threshold on NodePool (50%)
- Blocking repair when exceeding percentage threshold (25%)
- Absolute count threshold (3)
- Blocking repair when exceeding count threshold (2)
- Default 20% behavior when threshold not configured
@clark42 clark42 force-pushed the feature/configurable-node-repair-threshold branch from 8055965 to 7e11847 Compare December 30, 2025 19:30
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2026
@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
