Skip to content

Conversation

@moko-poi
Copy link
Contributor

fix: update disruption metrics even without candidates

Fixes #2344

Description

The karpenter_nodepools_allowed_disruptions metric was only being updated when disruption candidates existed. This caused the metric to report stale values in several scenarios:

  • When no nodes are eligible for disruption
  • When schedule-based budgets change (e.g., entering/exiting disruption windows)
  • When NodePools are deleted but their metrics remain

Root Cause:
The BuildDisruptionBudgetMapping() function was called after checking if candidates exist. When len(candidates) == 0, the function returned early without updating metrics.

Solution:
Move BuildDisruptionBudgetMapping() before the candidate check. This ensures metrics are updated on every reconcile loop (every 10 seconds), regardless of whether disruption candidates exist.

Impact:

  • Metrics now accurately reflect schedule-based budget restrictions in real-time
  • Deleted NodePools naturally stop being reported (only active pools are updated)
  • No breaking changes - existing behavior with candidates is preserved
  • Minimal performance impact - same function calls, just reordered

How was this change tested?

  • Code compiles successfully: go build ./pkg/controllers/disruption/...
  • No new test failures - existing tests pass as candidate behavior is unchanged
  • Manual verification: metric update logic is triggered for all disruption reasons during reconcile, independent of candidate existence
  • Logical validation: BuildDisruptionBudgetMapping() is side-effect free except for metrics/events, so moving it earlier is safe

Example Scenario Fixed:

spec:
  disruption:
    budgets:
    - duration: 12h00m0s
      nodes: "0"
      schedule: 0 10 * * mon-fri  # Block disruption 10:00-22:00 weekdays
    - nodes: "2"

Before: Metric shows 2 even during blocked window
After: Metric correctly shows 0 during 10:00-22:00 and 2 outside that window

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi
Once this PR has been reviewed and has the lgtm label, please assign maciekpytel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 20, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @moko-poi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 20, 2025
@coveralls
Copy link

Pull Request Test Coverage Report for Build 20390985201

Details

  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.005%) to 80.325%

Files with Coverage Reduction New Missed Lines %
pkg/controllers/disruption/drift.go 2 88.0%
pkg/controllers/provisioning/scheduling/nodeclaim.go 3 89.63%
Totals Coverage Status
Change from base Build 20384548641: -0.005%
Covered Lines: 11966
Relevant Lines: 14897

💛 - Coveralls

if len(candidates) == 0 {
return false, nil
}
// Always build disruption budget mapping to ensure metrics are up-to-date,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is the right refactor here. Because the metrics are entirely based on cluster state, I think I would prefer if we move the metrics outside of BuildDisruptionBudgetMapping and into the nodepool or node metrics controller. That ensures the metrics will always be emitted no matter the changes to consolidation or disruption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DerekFrank
You're absolutely right. I've implemented the refactor following your suggestion. I moved the metrics emission out of BuildDisruptionBudgetMapping and into the nodepool metrics controller, which ensures the metrics are always updated regardless of whether disruption candidates exist or consolidation logic changes.

Specifically, I added cluster state and clock dependencies to the nodepool metrics controller and implemented an updateDisruptionBudgetMetrics method. This method is called for each NodePool during the Reconcile loop, counts nodes directly from cluster state, evaluates schedule-based budgets using the clock, and sets the metrics accordingly. Since the controller requeues every 5 minutes, the metrics are updated periodically independent of the disruption controller's execution.

This change makes the metrics logic completely independent from the disruption controller logic, making it more robust. Thank you for the architectural improvement suggestion.

b8227fb

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 30, 2025
@moko-poi moko-poi requested a review from DerekFrank December 30, 2025 07:45
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 1, 2026
@moko-poi moko-poi force-pushed the fix/metrics-update-without-candidates-2344 branch from b8227fb to a561523 Compare January 1, 2026 09:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 1, 2026
Gracefully handle cases where NodeClaim reconciliation happens before
NodePool informer sync by:
- Using %w error wrapping to preserve NotFound status
- Downgrading NotFound errors to V(1) info logs
- Allowing automatic retry via reconcile loop

Prevents test failures from transient informer synchronization lag.
@moko-poi moko-poi force-pushed the fix/metrics-update-without-candidates-2344 branch from 76c2021 to a07fd4c Compare January 2, 2026 01:26
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

karpenter_nodepools_allowed_disruptions metrics is mismatching actual budget schedule or schedule is not working

4 participants