Conversation

@clark42 clark42 commented Dec 23, 2025

What does this PR do?

Adds a new nodeRepairUnhealthyThreshold field to the NodePool.spec.disruption configuration, allowing operators to configure the maximum percentage or absolute count of unhealthy nodes a NodePool may have before node auto repair is blocked.

Why is this needed?

The current hardcoded 20% threshold is too restrictive for small NodePools (3-5 nodes). When 2 out of 4 nodes become unhealthy (50%), node auto repair is blocked with the message "more than 20% nodes are unhealthy in the nodepool", leaving the cluster in a degraded state requiring manual intervention.

Real-world use case

In our EKS cluster with Karpenter, we experienced an incident where 2 out of 4 nodes went into NotReady state simultaneously. Despite having Node Auto Repair enabled (nodeRepair: true), Karpenter refused to repair these nodes because 50% exceeds the 20% threshold.

The event showed:

NodeRepairBlocked: more than 20% nodes are unhealthy in the nodepool

This left us with a degraded cluster for hours until manual intervention.

How does it work?

New field in NodePool spec

apiVersion: karpenter.sh/v1
kind: NodePool
spec:
  disruption:
    nodeRepairUnhealthyThreshold: "50%"  # or absolute count like "2"

Accepted values

  • Percentage: "20%", "50%", "100%" (string with % suffix)
  • Absolute count: "1", "2", "5" (string representing integer)
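
As an illustration (not code from this PR), a threshold string like this can be resolved into an allowed unhealthy-node count with the intstr helpers from k8s.io/apimachinery that Karpenter already uses for disruption budgets; the allowedUnhealthy function below is hypothetical:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// allowedUnhealthy resolves a threshold such as "50%" or "2" into the
// maximum number of unhealthy nodes tolerated in a pool of totalNodes.
func allowedUnhealthy(threshold string, totalNodes int) (int, error) {
	// intstr.Parse treats "2" as an integer and "50%" as a string percentage.
	val := intstr.Parse(threshold)
	// With roundUp=false, percentages truncate: "20%" of 4 nodes allows 0.
	return intstr.GetScaledValueFromIntOrPercent(&val, totalNodes, false)
}

func main() {
	for _, t := range []string{"20%", "50%", "2"} {
		n, _ := allowedUnhealthy(t, 4)
		fmt.Printf("threshold %q on 4 nodes -> %d unhealthy allowed\n", t, n)
	}
}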

Default behavior

Default remains "20%" for backward compatibility. Existing NodePools without this field will behave exactly as before.

Comparison with AWS Managed Node Groups

AWS EKS Managed Node Groups offer similar configurability:

  • maxUnhealthyNodeThresholdCount
  • maxUnhealthyNodeThresholdPercentage

See: https://docs.aws.amazon.com/eks/latest/userguide/node-health.html

This PR brings similar flexibility to Karpenter.

Changes

  • Added NodeRepairUnhealthyThreshold field to Disruption struct in pkg/apis/v1/nodepool.go
  • Modified pkg/controllers/node/health/controller.go to use the NodePool's configured threshold instead of the hardcoded value
  • Updated event messages to show the actual configured threshold
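
For reference, the new field might look roughly like the sketch below; the kubebuilder markers and doc comment are assumptions modeled on Karpenter's existing budget fields, not this PR's exact diff:

// Excerpt sketch for pkg/apis/v1/nodepool.go.
type Disruption struct {
	// ... existing fields (ConsolidationPolicy, ConsolidateAfter, Budgets) ...

	// NodeRepairUnhealthyThreshold is the maximum count ("2") or
	// percentage ("50%") of unhealthy nodes tolerated before node auto
	// repair is blocked for this NodePool. Defaults to "20%".
	// +kubebuilder:validation:Pattern=`^((100|[0-9]{1,2})%|[0-9]+)$`
	// +kubebuilder:default:="20%"
	// +optional
	NodeRepairUnhealthyThreshold string `json:"nodeRepairUnhealthyThreshold,omitempty"`
}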

Testing

  • Unit tests (to be added)
  • Manual testing on EKS cluster

Related Issues

Fixes #2134

Checklist

  • Code compiles correctly
  • Backward compatible (default is still 20%)
  • Documentation updates
  • Unit tests

linux-foundation-easycla bot commented Dec 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot

Welcome @clark42!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 23, 2025
@k8s-ci-robot

Hi @clark42. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Dec 23, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clark42
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 23, 2025
@engedaam

Instead of introducing additional configuration surface area, would it make sense to provide opinionated default unhealthy thresholds based on cluster size, for example, small (<10 nodes), medium (<50 nodes), and large (<100 nodes) clusters? The specific node counts here are illustrative rather than prescriptive; the underlying idea is that different cluster scales may warrant different threshold categories.

Related to that, should this behavior be defined at the cluster level rather than per-node-pool?

These thresholds were originally introduced in node repair as a safety mechanism to reactively block repairs during large-scale outages. Exposing this value as a configurable knob, especially at the node-pool level, would require users to reason about and tune it for each pool independently, which could make overall node repair behavior harder to understand and reason about across the cluster.

clark42 commented Dec 30, 2025

Thanks for the feedback @engedaam!

I completely understand the safety concern: the 20% threshold exists as a guardrail to prevent cascading failures during large-scale outages, and exposing it as a configurable knob could lead users to inadvertently weaken this protection.

Addressing your points:

1. The threshold is already per-NodePool, not cluster-level

Looking at the current implementation in controller.go, the 20% threshold is already evaluated per NodePool (via isNodePoolHealthy), not at the cluster level. This PR doesn't change that scope; it only makes the hardcoded value configurable.
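
Concretely, the gate could consume the configured value along these lines (a simplified sketch; nodeRepairAllowed and the empty-string fallback are illustrative, not the controller's actual code):

package health

import (
	"k8s.io/apimachinery/pkg/util/intstr"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// nodeRepairAllowed reports whether repairs may proceed for this NodePool,
// comparing its unhealthy-node count against the configured threshold.
func nodeRepairAllowed(nodePool *v1.NodePool, unhealthy, total int) (bool, error) {
	threshold := nodePool.Spec.Disruption.NodeRepairUnhealthyThreshold
	if threshold == "" {
		threshold = "20%" // fall back to today's hardcoded default
	}
	val := intstr.Parse(threshold)
	allowed, err := intstr.GetScaledValueFromIntOrPercent(&val, total, false)
	if err != nil {
		return false, err
	}
	// The comparison is scoped to this NodePool's nodes, not the cluster.
	return unhealthy <= allowed, nil
}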

2. Documentation to mitigate the risk

I'd be happy to add documentation explaining:

  • The purpose of this threshold as a safety mechanism
  • The risk of cascading failures when setting a higher value
  • Recommended values based on NodePool size and criticality
  • A clear warning that increasing this value should only be done with full understanding of the implications

This way, users who need flexibility (small pools, specific use-cases) can tune it, while the documentation ensures they understand the trade-offs.

3. Context: small NodePools and AWS approach

Our use-case: a 4-node pool where 2 nodes became unhealthy. With a 20% threshold (0.8 nodes for a 4-node pool, effectively 0-1 nodes depending on rounding), auto-repair was blocked for over 3 hours. A safety mechanism designed for large-scale outages prevented recovery on a small pool.

AWS recently introduced similar configurability for managed node groups with maxUnhealthyNodeThresholdCount and maxUnhealthyNodeThresholdPercentage (docs), suggesting this need for tunability is recognized.

Alternative if full configurability is still a concern:

I'm also open to implementing automatic thresholds based on NodePool size (as you suggested), for example:

  • A minimum floor like max(20%, 2 nodes)
  • Or tiered thresholds: small pools (<10 nodes) → 50%, medium (<50) → 30%, large → 20%

This would preserve safety for large pools while allowing small pools to recover, without requiring user configuration.
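
Either variant would be a small pure function; the sketch below uses the illustrative numbers from this comment and hypothetical helper names:

package health

// defaultUnhealthyThreshold picks a tiered default threshold by pool size:
// small pools (<10 nodes) -> 50%, medium (<50) -> 30%, large -> 20%.
func defaultUnhealthyThreshold(totalNodes int) string {
	switch {
	case totalNodes < 10:
		return "50%"
	case totalNodes < 50:
		return "30%"
	default:
		return "20%"
	}
}

// allowedWithFloor is the floor variant: tolerate max(20% of pool, 2 nodes).
func allowedWithFloor(totalNodes int) int {
	if pct := totalNodes * 20 / 100; pct > 2 {
		return pct
	}
	return 2
}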

Let me know how you'd like to proceed!

Add NodeRepairUnhealthyThreshold field to NodePool's Disruption spec,
allowing operators to configure the maximum percentage or count of
unhealthy nodes before node auto repair is blocked.

This addresses the issue where the hardcoded 20% threshold is too
restrictive for small NodePools (3-5 nodes), causing node auto repair
to be blocked when only 1-2 nodes become unhealthy.

Similar to AWS Managed Node Groups' maxUnhealthyNodeThresholdCount
and maxUnhealthyNodeThresholdPercentage configuration options.

The field accepts either:
- A percentage (e.g., "50%")
- An absolute count (e.g., "2")

Default remains "20%" for backward compatibility.

Fixes kubernetes-sigs#2134

Add tests covering:
- Custom percentage threshold on NodePool (50%)
- Blocking repair when exceeding percentage threshold (25%)
- Absolute count threshold (3)
- Blocking repair when exceeding count threshold (2)
- Default 20% behavior when threshold not configured
@clark42 clark42 force-pushed the feature/configurable-node-repair-threshold branch from 8055965 to 7e11847 Compare December 30, 2025 19:30
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2026
@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
