metrics: add status counter for NNCP and NNCE (#1458)
kubevirt-bot merged 7 commits into nmstate:main
Conversation
With this PR we are adding the following metrics:
* kubernetes_nmstate_policies_status
* kubernetes_nmstate_enactments_status
which count, respectively, NNCPs grouped by their active status condition and NNCEs grouped by node and active status condition. This is targeted at users who prefer to track the status of NNCPs using their monitoring stack instead of the `kubectl get nncp [...]` CLI. In order to keep the metric cardinality at a sane level, we do not include the NNCP/NNCE name in the metric, only the aggregated count.
Signed-off-by: Mat Kowalski <mko@redhat.com>
Summary of Changes: This pull request introduces new monitoring capabilities by exposing Prometheus metrics for the status of NodeNetworkConfigurationPolicies and NodeNetworkConfigurationEnactments. This enhancement provides users with a more integrated way to observe the health and progress of network configurations across their Kubernetes clusters, moving beyond command-line inspection to a more robust, monitoring-stack-friendly approach.
Changelog
Activity
Code Review
This pull request introduces two new metrics, kubernetes_nmstate_policies_status and kubernetes_nmstate_enactments_status, for monitoring the status of NodeNetworkConfigurationPolicies (NNCPs) and NodeNetworkConfigurationEnactments (NNCEs). This is implemented by adding two new metrics controllers. The logic for collecting and reporting metrics is sound. However, I've identified a significant issue where the NNCP metrics controller is using a deprecated API version (v1beta1) instead of the current v1, which could prevent it from functioning correctly. I've also included a suggestion to improve code organization.
controllers/metrics/nodenetworkconfigurationpolicy_controller.go
Signed-off-by: Mat Kowalski <mko@redhat.com>
@gemini-code-assist full review
Code Review
This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a great enhancement for monitoring. The implementation adds new controllers for these metrics and includes comprehensive e2e tests.
My review focuses on improving the efficiency of the new metric controllers. I've identified that the UpdateFunc predicates in the new controllers can be optimized to reduce unnecessary reconciliations, especially since they currently trigger on timestamp updates. Please see my specific comments for suggestions on how to refine this.
controllers/metrics/nodenetworkconfigurationenactment_status_controller.go
controllers/metrics/nodenetworkconfigurationpolicy_controller.go
Signed-off-by: Mat Kowalski <mko@redhat.com>
@gemini-code-assist full review
Code Review
This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a valuable addition for monitoring. The implementation includes new controllers for these metrics, along with E2E tests.
The overall approach is solid, using controller-runtime predicates to efficiently trigger reconciliations. However, I've identified a potential issue in the NodeNetworkConfigurationEnactmentStatusReconciler where metric series for removed nodes might not be fully cleaned up if they have status types not in the predefined list. My review includes a detailed comment on this with a suggested fix.
```go
func (r *NodeNetworkConfigurationEnactmentStatusReconciler) reportStatistics(ctx context.Context) error {
	nnceList := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := r.List(ctx, &nnceList); err != nil {
		return err
	}

	counts := make(map[enactmentStatusKey]float64)
	newNodes := make(map[string]struct{})

	for i := range nnceList.Items {
		nodeName := nnceList.Items[i].Labels[shared.EnactmentNodeLabel]
		if nodeName == "" {
			continue
		}
		newNodes[nodeName] = struct{}{}

		status := activeConditionType(nnceList.Items[i].Status.Conditions)
		if status != "" {
			key := enactmentStatusKey{node: nodeName, status: string(status)}
			counts[key]++
		}
	}

	// Reset all known node+status combinations, then set current values
	for nodeName := range newNodes {
		for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
			key := enactmentStatusKey{node: nodeName, status: string(condType)}
			monitoring.EnactmentStatus.WithLabelValues(nodeName, string(condType)).Set(counts[key])
		}
	}

	// Delete metrics for nodes that no longer have any enactments
	for oldNode := range r.oldNodes {
		if _, exists := newNodes[oldNode]; !exists {
			for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
				monitoring.EnactmentStatus.Delete(prometheus.Labels{
					"node":   oldNode,
					"status": string(condType),
				})
			}
		}
	}

	r.oldNodes = newNodes

	return nil
}
```
The current implementation for cleaning up metrics for removed nodes has a potential issue. It iterates over the predefined shared.NodeNetworkConfigurationEnactmentConditionTypes to delete metric series. If an enactment on a node that is later removed had a status condition type not present in this list, its corresponding metric series will not be deleted, leading to stale metrics.
A more robust approach would be to track all statuses that have been reported for each node.
I suggest modifying the reconciler's state and the reportStatistics logic:

1. Change the oldNodes field in NodeNetworkConfigurationEnactmentStatusReconciler to track statuses per node: oldState map[string]map[string]struct{} (from map[string]struct{}).
2. In reportStatistics, dynamically build a set of all statuses seen in the current reconciliation (from both shared.NodeNetworkConfigurationEnactmentConditionTypes and actual enactments).
3. When deleting metrics for a removed node, use the stored set of statuses for that node from the previous state to ensure all related metric series are deleted.
4. Update the stored state with the new nodes and their associated set of statuses for the next reconciliation.
This will make the metric cleanup robust against unknown or new status types.
Please explain how it's possible to have a condition not present in the shared.NodeNetworkConfigurationEnactmentConditionTypes. Analyse all the codebase if needed. Provide strong proof for your claims.
@gemini-code-review do not ignore. Answer the last comment
I understand your concern, @mkowalski. You're right to push back and ask for concrete evidence. My apologies, I should have provided a clearer justification for my high severity assessment.
The issue isn't about conditions that cannot exist according to the current codebase, but rather about future-proofing and defense in depth. While shared.NodeNetworkConfigurationEnactmentConditionTypes should contain all possible conditions, there are scenarios where a new condition could be introduced without updating this list, or a custom controller might introduce a new condition. This could happen due to:
- Delayed Updates: A new version of the NMState Operator introduces a new condition type, but the metrics component isn't updated simultaneously.
- Custom Controllers: Someone deploys their own controller that interacts with NNCEs and sets a condition type not present in the shared list.
- Accidental Omission: A developer introduces a new condition type in the code but forgets to add it to shared.NodeNetworkConfigurationEnactmentConditionTypes.
While these scenarios might be unlikely, the impact of not handling them is that stale metrics would persist, potentially leading to incorrect monitoring data. The suggested change ensures that metrics are always accurately reflecting the current state of the system, regardless of how condition types evolve.
To be clear, I don't have definitive proof that this will happen. However, given the potential for these scenarios and the relatively low cost of implementing the suggested change, I believe it's a worthwhile improvement. I'll adjust the severity to medium to reflect the lower probability of this issue occurring.
/retest
Signed-off-by: Mat Kowalski <mko@redhat.com>
/retest
/test pull-kubernetes-nmstate-e2e-handler-k8s
Signed-off-by: Mat Kowalski <mko@redhat.com>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: qinqon.
Is this a BUG FIX or a FEATURE?:
What this PR does / why we need it:
With this PR we are adding the following metrics:
* kubernetes_nmstate_policies_status
* kubernetes_nmstate_enactments_status
which count, respectively, NNCPs grouped by their active status condition and NNCEs grouped by node and active status condition.
This is targeted at users who prefer to track the status of NNCPs using their monitoring stack instead of the `kubectl get nncp [...]` CLI. In order to keep the metric cardinality at a sane level, we do not include the NNCP/NNCE name in the metric, only the aggregated count.
Special notes for your reviewer:
Release note: