fix: update disruption metrics even without candidates #2728
if len(candidates) == 0 {
	return false, nil
}
// Always build disruption budget mapping to ensure metrics are up-to-date,
I'm not sure this is the right refactor here. Because the metrics are entirely based on cluster state, I think I would prefer if we move the metrics outside of BuildDisruptionBudgetMapping and into the nodepool or node metrics controller. That ensures the metrics will always be emitted no matter the changes to consolidation or disruption.
@DerekFrank
You're absolutely right. I've implemented the refactor following your suggestion. I moved the metrics emission out of BuildDisruptionBudgetMapping and into the nodepool metrics controller, which ensures the metrics are always updated regardless of whether disruption candidates exist or consolidation logic changes.
Specifically, I added cluster state and clock dependencies to the nodepool metrics controller and implemented an updateDisruptionBudgetMetrics method. This method is called for each NodePool during the Reconcile loop, counts nodes directly from cluster state, evaluates schedule-based budgets using the clock, and sets the metrics accordingly. Since the controller requeues every 5 minutes, the metrics are updated periodically independent of the disruption controller's execution.
This change makes the metrics logic completely independent from the disruption controller logic, making it more robust. Thank you for the architectural improvement suggestion.
b8227fb to a561523
Gracefully handle cases where NodeClaim reconciliation happens before NodePool informer sync by:
- Using %w error wrapping to preserve NotFound status
- Downgrading NotFound errors to V(1) info logs
- Allowing automatic retry via reconcile loop

Prevents test failures from transient informer synchronization lag.
76c2021 to a07fd4c
Fixes #2344
Description
The karpenter_nodepools_allowed_disruptions metric was only being updated when disruption candidates existed. This caused the metric to report stale values in several scenarios.

Root Cause:
The BuildDisruptionBudgetMapping() function was called after checking if candidates exist. When len(candidates) == 0, the function returned early without updating metrics.

Solution:
Move BuildDisruptionBudgetMapping() before the candidate check. This ensures metrics are updated on every reconcile loop (every 10 seconds), regardless of whether disruption candidates exist.

Impact:
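The effect of the reordering described under Solution can be sketched in a few lines of Go. Everything here is a hypothetical stand-in: `gauge` replaces the real Prometheus metric, and `buildBudgetMapping` only imitates the metric side effect of BuildDisruptionBudgetMapping(). The sketch compares the two orderings when there are no candidates.

```go
package main

import "fmt"

// gauge is a hypothetical stand-in for a Prometheus gauge. It only remembers
// the last value that was set.
type gauge struct {
	value int
}

func (g *gauge) Set(v int) { g.value = v }

// buildBudgetMapping imitates the relevant side effect of
// BuildDisruptionBudgetMapping(): it publishes the allowed disruptions.
func buildBudgetMapping(g *gauge, allowed int) int {
	g.Set(allowed)
	return allowed
}

// reconcileBefore mirrors the original ordering: the early return on an empty
// candidate list skips the metric update entirely.
func reconcileBefore(g *gauge, candidates []string, allowed int) bool {
	if len(candidates) == 0 {
		return false
	}
	buildBudgetMapping(g, allowed)
	return true
}

// reconcileAfter mirrors the proposed ordering: the metric is refreshed
// first, then the candidate check decides whether to continue.
func reconcileAfter(g *gauge, candidates []string, allowed int) bool {
	buildBudgetMapping(g, allowed)
	return len(candidates) > 0
}

func main() {
	// Both gauges start at 2; the budget has since dropped to 0 and there
	// are no disruption candidates.
	stale := &gauge{value: 2}
	reconcileBefore(stale, nil, 0)

	fresh := &gauge{value: 2}
	reconcileAfter(fresh, nil, 0)

	fmt.Println(stale.value, fresh.value) // prints 2 0
}
```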
How was this change tested?
- go build ./pkg/controllers/disruption/...
- BuildDisruptionBudgetMapping() is side-effect free except for metrics/events, so moving it earlier is safe

Example Scenario Fixed:
Before: Metric shows 2 even during blocked window
After: Metric correctly shows 0 during 10:00-22:00 and 2 outside that window

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.