-
Notifications
You must be signed in to change notification settings - Fork 408
feat: Add prometheus metrics for nodepool disruption decisions performed, active disruptions #2707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add prometheus metrics for nodepool disruption decisions performed, active disruptions #2707
Conversation
…ctive disruptions Signed-off-by: Cameron McAvoy <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cnmcavoy The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Pull Request Test Coverage Report for Build 20144144647Details
💛 - Coveralls |
| ConsolidationTypeLabel: cmd.ConsolidationType(), | ||
| }) | ||
| } | ||
| DecisionsPerformedTotal.Inc(map[string]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just add a label to this metric instead of adding a new metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow - are you thinking of using a label on DecisionsPerformedTotal? What about when a command includes multiple nodepools? The signature of the tags is map[string]string, it doesn't take an array of tags. Can we concat the nodepool names together? Open to ideas.
| NodePoolAllowedDisruptions.Set(float64(allowedDisruptions), map[string]string{ | ||
| metrics.NodePoolLabel: nodePool.Name, metrics.ReasonLabel: string(reason), | ||
| }) | ||
| NodePoolActiveDisruptions.Set(float64(disrupting[nodePool.Name]), map[string]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does technically include not ready nodes as well as ones that are disrupting. Does this number have use outside of knowing the end result of:
lo.Max([]int{allowedDisruptions - disrupting[nodePool.Name], 0})?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is intentional, the goal of this PR is to expose Karpenter's state to help operators understand it's decision making process.
Fixes N/A
Description
Adds two new prometheus metrics:
karpenter_nodepools_decisionskarpenter_nodepools_active_disruptionskarpenter_nodepools_decisionstracks the same actions askarpenter_voluntary_disruptions_decisions_total, but with an extra dimension for the nodepool. As a result, the total count ofkarpenter_nodepools_decisionsmay be greater thankarpenter_voluntary_disruptions_decisions_totalwhen 1 decision spans more than 1 nodepool.karpenter_nodepools_active_disruptionstracks the active number of terminating nodeclaims in each nodepool, which is useful to compare against the existing metrickarpenter_nodepools_allowed_disruptions.This change allows us to track how disrupted individual nodepools are, and measure the consolidation in individual nodepools over time.
How was this change tested?
make presubmitand tested in our clusters.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.