
Conversation

@gangavh1008

@gangavh1008 gangavh1008 commented Dec 8, 2025

feat: implement consolidationGracePeriod to prevent consolidation churn - issue 7146

Fixes #7146

Description
This PR introduces the consolidationGracePeriod feature to address excessive node churn caused by consolidation cycles. The feature makes nodes "invisible" to the consolidation process for a configurable duration after any pod event (add/remove), preventing both source and destination churn.

Problem

With the existing consolidateAfter mechanism, a problematic consolidation cycle emerges:

  • Karpenter consolidates Node_A, moving pods to Node_B
  • Node_B receives pods → lastPodEventTime updates → Node_B resets its consolidateAfter timer
  • Node_B becomes "unconsolidatable" while other stable nodes become targets
  • Cycle repeats with another stable node being consolidated
  • Result: Constant node churn, with nodes running only 5-10 minutes before disruption

The core issue is that receiving pods from consolidation makes a node unconsolidatable, creating a feedback loop where the destination of one consolidation becomes protected while source nodes become targets.

Solution
New field in the NodePool Disruption spec:

spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    consolidationGracePeriod: 5m                    # NEW: Node invisibility duration

How it works:

When consolidationGracePeriod is configured:

  • Any pod event (add or remove) on a node updates lastPodEventTime
  • For the duration of consolidationGracePeriod after lastPodEventTime, the node is invisible to consolidation:
    • Cannot be a source (won't be disrupted)
    • Cannot be a destination (won't receive pods during consolidation simulation)
  • The timer resets on every pod event
  • After the grace period expires, normal consolidation rules apply
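
For illustration, the check described above could look roughly like the following Go sketch. The helper name matches the one added in this PR, but the exact signature, the NillableDuration handling, and reading lastPodEventTime from the NodeClaim status are assumptions here rather than the PR's actual code.

package disruption

import (
	"k8s.io/utils/clock"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// IsWithinConsolidationGracePeriod reports whether a node is still "invisible"
// to consolidation, i.e. less than consolidationGracePeriod has elapsed since
// its last pod add/remove event.
func IsWithinConsolidationGracePeriod(clk clock.Clock, nodeClaim *v1.NodeClaim, nodePool *v1.NodePool) bool {
	gp := nodePool.Spec.Disruption.ConsolidationGracePeriod
	if gp == nil || gp.Duration == nil {
		return false // feature not configured: the node is always visible
	}
	lastPodEvent := nodeClaim.Status.LastPodEventTime.Time // updated on every pod event
	if lastPodEvent.IsZero() {
		return false
	}
	// The node stays invisible until lastPodEventTime + consolidationGracePeriod.
	return clk.Now().Before(lastPodEvent.Add(*gp.Duration))
}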

Changes

API:
Added ConsolidationGracePeriod field to Disruption struct in pkg/apis/v1/nodepool.go
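
For reference, a minimal sketch of what the field addition might look like; the neighboring fields are abbreviated and the type is assumed to mirror ConsolidateAfter's NillableDuration, so the actual PR may differ.

// Excerpt of the Disruption struct in pkg/apis/v1/nodepool.go (other existing
// fields such as ConsolidationPolicy and Budgets are elided).
type Disruption struct {
	// ConsolidateAfter is the existing wait after the last pod event before a
	// node may be considered consolidatable.
	ConsolidateAfter NillableDuration `json:"consolidateAfter"`

	// ConsolidationGracePeriod, when set, keeps a node invisible to
	// consolidation (as source or destination) for this duration after its
	// last pod event. Optional; omitting it preserves existing behavior.
	// +optional
	ConsolidationGracePeriod *NillableDuration `json:"consolidationGracePeriod,omitempty"`
}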

Disruption Logic:

  • pkg/controllers/disruption/helpers.go:
    • Added IsWithinConsolidationGracePeriod() helper function
    • Modified GetCandidates() to filter out nodes within the grace period (source filtering)
    • Modified SimulateScheduling() to exclude nodes within the grace period from destinations (see the sketch below)
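
As a sketch of what that filtering could look like, reusing the helper sketched above: the standalone function, the nodePoolFor lookup, and the use of samber/lo are illustrative simplifications, with the real PR wiring this logic into GetCandidates() and SimulateScheduling() instead.

package disruption

import (
	"github.com/samber/lo"
	"k8s.io/utils/clock"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// filterOutGracePeriodNodes drops NodeClaims that are still inside their
// NodePool's consolidationGracePeriod, so they are neither offered as
// disruption sources nor used as scheduling destinations. nodePoolFor stands
// in for the nodePoolMap lookup passed through the controllers in this PR.
func filterOutGracePeriodNodes(clk clock.Clock, nodeClaims []*v1.NodeClaim, nodePoolFor func(*v1.NodeClaim) *v1.NodePool) []*v1.NodeClaim {
	return lo.Filter(nodeClaims, func(nc *v1.NodeClaim, _ int) bool {
		np := nodePoolFor(nc)
		// Unknown NodePool: leave the node visible and let existing logic decide.
		return np == nil || !IsWithinConsolidationGracePeriod(clk, nc, np)
	})
}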

Updated Controllers:

  • pkg/controllers/disruption/consolidation.go: Pass nodePoolMap and clock to simulation
  • pkg/controllers/disruption/validation.go: Pass nodePoolMap and clock to simulation
  • pkg/controllers/disruption/drift.go: Pass nodePoolMap and clock to simulation
  • pkg/controllers/disruption/controller.go: Updated NewMethods signature

CRDs:

  • Updated pkg/apis/crds/karpenter.sh_nodepools.yaml
  • Updated kwok/charts/crds/karpenter.sh_nodepools.yaml

Documentation:

  • designs/use-on-consolidation-after.md: Design document
  • designs/use-on-consolidation-after-analysis.md: Critical analysis
  • designs/consolidationGracePeriod-test-evidence.md: Test evidence with Karpenter logs

How was this change tested?

Unit Tests:

  • All existing tests pass (232 disruption tests, 47 nodeclaim disruption tests)
  • Verified compilation with go build ./...

EKS Integration Testing:

  • Deployed custom Karpenter image to EKS cluster

  • Tested with consolidationGracePeriod: 90s

  • Verified 3 scenarios:
    ✅ New node protected during grace period, visible after expiration
    ✅ Timer resets on each pod event (add/remove)
    ✅ Multiple operations with grace period protecting nodes during activity

Migration Path

  • No migration required: Feature is opt-in via new optional field

  • Existing NodePools: Continue to work exactly as before

  • New NodePools: Can opt-in by setting consolidationGracePeriod

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gangavh1008
Once this PR has been reviewed and has the lgtm label, please assign tzneal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @gangavh1008!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 8, 2025
@k8s-ci-robot
Contributor

Hi @gangavh1008. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 8, 2025
        - WhenEmpty
        - WhenEmptyOrUnderutilized
        type: string
      useOnConsolidationAfter:

This name is a bit confusing, would you consider changing it?

Author

Sure, will change it. Thank you for the review.

Author

Changed the name to consolidationGracePeriod

          When replicas is set, UseOnConsolidationAfter is simply ignored
        pattern: ^(([0-9]+(s|m|h))+|Never)$
        type: string
      useOnConsolidationUtilizationThreshold:

From a prior discussion with the maintainers about my proposed way to fix this, there was a suggestion to focus on cost rather than utilization, since that's what Karpenter optimizes for, and adding a utilization gate would be at odds with the core consolidation logic.

For example, in AWS an m5.xlarge instance may be cheaper than an m8.xlarge, and I'm sure there are other cases where an instance type with more resources could end up being cheaper than a smaller one. There's also the scenario of reserved instances, or otherwise special rates on specific instance types, such that it could be preferable to run at lower utilization levels.

Author

Thank you for the insightful feedback. You raise a valid point that deserves careful consideration.
The Core Concern
You're correct that Karpenter's consolidation logic optimizes for cost, not utilization. The utilization threshold in this PR could conflict with cost optimization in scenarios like:

  • Instance type pricing inversions: m5.xlarge might be cheaper than m8.xlarge
  • Reserved/Savings Plans: Pre-purchased capacity should be used even at low utilization
  • Spot pricing variations: A larger spot instance might be cheaper than a smaller on-demand one

Proposed Solutions
Cost-Aware Protection
Instead of a utilization threshold, protect nodes that are already cost-optimal - meaning Karpenter's consolidation algorithm determined no cheaper alternative exists:

spec:
  disruption:
    useOnConsolidationAfter: 1h
    # Remove utilization threshold entirely
    # Protection applies when node is stable AND cost-optimal

I will change the feature implementation to this approach and submit a commit.

Author

I've removed the utilization threshold entirely. The feature is now simplified to:

spec:
  disruption:
    consolidateAfter: 30s
    consolidationGracePeriod: 1h  # Simple grace period, no utilization check

The protection logic is now:

  • Node becomes consolidatable (stable for consolidateAfter)
  • Consolidation evaluates the node using its normal cost-based algorithm
  • If consolidation finds a cheaper option → CONSOLIDATE ✅
  • If no cheaper option → Grace period applied (prevents re-evaluation for consolidationGracePeriod)

The feature doesn't try to be smarter than Karpenter's consolidation algorithm. It simply provides a cooldown to prevent churn from repeated re-evaluation.
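
For concreteness, a rough Go sketch of that flow; findCheaperOption and startGracePeriod are hypothetical hooks standing in for Karpenter's cost-based evaluation and for whatever bookkeeping records the cooldown, so this is not the PR's literal code.

package disruption

import (
	"time"

	"k8s.io/utils/clock"
)

// evaluateWithCooldown runs the normal cost-based consolidation check first,
// and only when no cheaper option exists does it start the grace period so the
// node is not re-evaluated until consolidationGracePeriod elapses.
func evaluateWithCooldown(
	clk clock.Clock,
	nodeName string,
	findCheaperOption func(string) (bool, error), // hypothetical: cost-based evaluation
	startGracePeriod func(string, time.Time), // hypothetical: records cooldown start
) (bool, error) {
	cheaper, err := findCheaperOption(nodeName)
	if err != nil {
		return false, err
	}
	if cheaper {
		return true, nil // cheaper replacement found: consolidate as usual
	}
	startGracePeriod(nodeName, clk.Now()) // no cheaper option: cool down re-evaluation
	return false, nil
}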


Author

@ellistarn, please review at your convenience. Thank you.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 10, 2025
@ellistarn
Contributor

Hey @gangavh1008. I'd suggest coming to alignment with the maintainers in an issue before moving on to implementation.

@gangavh1008
Author

Hey @gangavh1008. I'd suggest coming to alignment with the maintainers in an issue before moving on to implementation.

Hi @ellistarn, I agree. I went ahead with the implementation, prioritizing internal requirements.

Requesting maintainers to share feedback; happy to incorporate the changes.

@ellistarn
Contributor

For context, we're trying to think more broadly about this consolidation space, and a bunch of key stakeholders are about to head out for the holidays. We want to do better here and agree this is a problem -- I am not sure this is the right approach in the specifics.

@gangavh1008
Author

For context, we're trying to think more broadly about this consolidation space, and a bunch of key stakeholders are about to head out for the holidays. We want to do better here and agree this is a problem -- I am not sure this is the right approach in the specifics.

@ellistarn, sure. Thank you for going through the approach. As you suggested, I'll put together a Google doc in the issue to build consensus on the design with the maintainers.
