
metrics: add status counter for NNCP and NNCE#1458

Merged
kubevirt-bot merged 7 commits into nmstate:main from
mkowalski:nncp-status-via-metric-upstream
Mar 24, 2026

Conversation

@mkowalski
Member

Is this a BUG FIX or a FEATURE?:

/kind enhancement

What this PR does / why we need it:

With this PR we are adding the following metrics:

  • kubernetes_nmstate_policies_status
  • kubernetes_nmstate_enactments_status

which count, respectively, NNCPs grouped by their active status condition and NNCEs grouped by node and active status condition.

This is targeted at users who prefer to track the status of NNCPs using their monitoring stack instead of the `kubectl get nncp [...]` CLI.

In order to keep the metric cardinality at a sane level, we are not including the NNCP/NNCE name in the metric, only the aggregated count.
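The aggregation described here can be sketched as a simple count keyed by the active condition type. The types and names below are illustrative stand-ins, not the PR's actual code; they only show why the metric carries a count per status rather than per-policy series:

```go
package main

import "fmt"

// conditionStatus mirrors the shape of an NNCP status condition
// (illustrative; the real type lives in the nmstate shared API package).
type conditionStatus struct {
	Type   string // e.g. "Available", "Degraded", "Progressing"
	Status string // "True" or "False"
}

// countByActiveCondition aggregates policies by their active
// (Status == "True") condition type, so the exported metric carries only
// a count per status, never a policy name -- keeping cardinality bounded.
func countByActiveCondition(policies [][]conditionStatus) map[string]int {
	counts := map[string]int{}
	for _, conds := range policies {
		for _, c := range conds {
			if c.Status == "True" {
				counts[c.Type]++
				break // at most one active condition per policy
			}
		}
	}
	return counts
}

func main() {
	policies := [][]conditionStatus{
		{{Type: "Available", Status: "True"}},
		{{Type: "Available", Status: "True"}},
		{{Type: "Degraded", Status: "True"}},
	}
	fmt.Println(countByActiveCondition(policies)) // map[Available:2 Degraded:1]
}
```

Each resulting count would then be set on a gauge labelled only by status (and, for enactments, by node).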

Special notes for your reviewer:

Release note:

New metrics, `kubernetes_nmstate_policies_status` and `kubernetes_nmstate_enactments_status`, are now available to track the status of applied NNCPs and their corresponding NNCEs.


Signed-off-by: Mat Kowalski <mko@redhat.com>
@kubevirt-bot kubevirt-bot added the kind/enhancement and release-note labels Mar 16, 2026
@kubevirt-bot kubevirt-bot requested a review from cybertron March 16, 2026 15:44
@kubevirt-bot kubevirt-bot added the dco-signoff: yes label Mar 16, 2026
@kubevirt-bot kubevirt-bot requested a review from phoracek March 16, 2026 15:44
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces new monitoring capabilities by exposing Prometheus metrics for the status of NodeNetworkConfigurationPolicies and NodeNetworkConfigurationEnactments. This enhancement provides users with a more integrated way to observe the health and progress of network configurations across their Kubernetes clusters, moving beyond command-line inspection to a more robust, monitoring-stack-friendly approach.

Highlights

  • New Metrics Introduced: Two new metrics, kubernetes_nmstate_policies_status and kubernetes_nmstate_enactments_status, have been added to track the status of NodeNetworkConfigurationPolicies (NNCPs) and NodeNetworkConfigurationEnactments (NNCEs) respectively.
  • Enhanced Monitoring: These metrics allow users to monitor the aggregated status of NNCPs and NNCEs through their monitoring stack, providing an alternative to using kubectl get commands.
  • Controller Integration: Dedicated controllers have been implemented for both NNCPs and NNCEs to manage and report these new status metrics efficiently, ensuring updates are triggered only on relevant condition changes.
Changelog
  • cmd/handler/main.go
    • Registered new Prometheus metrics for policy and enactment statuses.
    • Initialized new controllers for NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment status metrics.
  • controllers/metrics/nodenetworkconfigurationenactment_status_controller.go
    • Added a new controller to reconcile NodeNetworkConfigurationEnactment objects and report per-node status metrics.
    • Implemented logic to update metrics only when NNCE conditions change.
    • Included functionality to clean up metrics for nodes that no longer have enactments.
  • controllers/metrics/nodenetworkconfigurationpolicy_controller.go
    • Added a new controller to reconcile NodeNetworkConfigurationPolicy objects and report status metrics.
    • Implemented logic to update metrics only when NNCP conditions change.
  • pkg/monitoring/metrics.go
    • Defined new Prometheus GaugeOpts for kubernetes_nmstate_policies_status and kubernetes_nmstate_enactments_status.
    • Created new Prometheus GaugeVec instances for policy and enactment statuses with appropriate labels.
    • Added the new gauge options to the global list for registration.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two new metrics, kubernetes_nmstate_policies_status and kubernetes_nmstate_enactments_status, for monitoring the status of NodeNetworkConfigurationPolicies (NNCPs) and NodeNetworkConfigurationEnactments (NNCEs). This is implemented by adding two new metrics controllers. The logic for collecting and reporting metrics is sound. However, I've identified a significant issue where the NNCP metrics controller is using a deprecated API version (v1beta1) instead of the current v1, which could prevent it from functioning correctly. I've also included a suggestion to improve code organization.

Signed-off-by: Mat Kowalski <mko@redhat.com>
Signed-off-by: Mat Kowalski <mko@redhat.com>
@mkowalski
Member Author

@gemini-code-assist full review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a great enhancement for monitoring. The implementation adds new controllers for these metrics and includes comprehensive e2e tests.

My review focuses on improving the efficiency of the new metric controllers. I've identified that the UpdateFunc predicates in the new controllers can be optimized to reduce unnecessary reconciliations, especially since they currently trigger on timestamp updates. Please see my specific comments for suggestions on how to refine this.
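The optimization hinted at here, skipping reconciles when only condition timestamps moved, can be sketched as a comparison that ignores transition times. The `condition` type below is a simplified stand-in for the nmstate condition type; in the real controller this check would sit inside a controller-runtime `UpdateFunc` predicate:

```go
package main

import "fmt"

// condition is a simplified stand-in for an NNCP/NNCE status condition;
// the real type also carries reason, message, and a LastTransitionTime.
type condition struct {
	Type               string
	Status             string
	LastTransitionTime string // deliberately ignored below
}

// conditionsChanged reports whether two condition lists differ in anything
// other than their timestamps. An UpdateFunc predicate returning this value
// would skip reconciles triggered purely by timestamp refreshes.
func conditionsChanged(oldConds, newConds []condition) bool {
	if len(oldConds) != len(newConds) {
		return true
	}
	for i := range oldConds {
		if oldConds[i].Type != newConds[i].Type ||
			oldConds[i].Status != newConds[i].Status {
			return true
		}
	}
	return false
}

func main() {
	oldConds := []condition{{Type: "Available", Status: "True", LastTransitionTime: "t0"}}
	newConds := []condition{{Type: "Available", Status: "True", LastTransitionTime: "t1"}}
	// Only the timestamp moved, so no reconcile is needed.
	fmt.Println(conditionsChanged(oldConds, newConds)) // false
}
```

This keeps the metric controllers quiet during periodic status refreshes while still reacting to real state transitions.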

Signed-off-by: Mat Kowalski <mko@redhat.com>
Signed-off-by: Mat Kowalski <mko@redhat.com>
@mkowalski
Member Author

@gemini-code-assist full review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a valuable addition for monitoring. The implementation includes new controllers for these metrics, along with E2E tests.

The overall approach is solid, using controller-runtime predicates to efficiently trigger reconciliations. However, I've identified a potential issue in the NodeNetworkConfigurationEnactmentStatusReconciler where metric series for removed nodes might not be fully cleaned up if they have status types not in the predefined list. My review includes a detailed comment on this with a suggested fix.

Comment on lines +83 to +129
func (r *NodeNetworkConfigurationEnactmentStatusReconciler) reportStatistics(ctx context.Context) error {
	nnceList := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := r.List(ctx, &nnceList); err != nil {
		return err
	}

	counts := make(map[enactmentStatusKey]float64)
	newNodes := make(map[string]struct{})

	for i := range nnceList.Items {
		nodeName := nnceList.Items[i].Labels[shared.EnactmentNodeLabel]
		if nodeName == "" {
			continue
		}
		newNodes[nodeName] = struct{}{}

		status := activeConditionType(nnceList.Items[i].Status.Conditions)
		if status != "" {
			key := enactmentStatusKey{node: nodeName, status: string(status)}
			counts[key]++
		}
	}

	// Reset all known node+status combinations, then set current values
	for nodeName := range newNodes {
		for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
			key := enactmentStatusKey{node: nodeName, status: string(condType)}
			monitoring.EnactmentStatus.WithLabelValues(nodeName, string(condType)).Set(counts[key])
		}
	}

	// Delete metrics for nodes that no longer have any enactments
	for oldNode := range r.oldNodes {
		if _, exists := newNodes[oldNode]; !exists {
			for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
				monitoring.EnactmentStatus.Delete(prometheus.Labels{
					"node":   oldNode,
					"status": string(condType),
				})
			}
		}
	}

	r.oldNodes = newNodes

	return nil
}


high

The current implementation for cleaning up metrics for removed nodes has a potential issue. It iterates over the predefined shared.NodeNetworkConfigurationEnactmentConditionTypes to delete metric series. If an enactment on a node that is later removed had a status condition type not present in this list, its corresponding metric series will not be deleted, leading to stale metrics.

A more robust approach would be to track all statuses that have been reported for each node.

I suggest modifying the reconciler's state and the reportStatistics logic:

  1. Change the oldNodes field in NodeNetworkConfigurationEnactmentStatusReconciler to track statuses per node:

    // oldNodes map[string]map[string]struct{} // was map[string]struct{}
  2. In reportStatistics, dynamically build a set of all statuses seen in the current reconciliation (from both shared.NodeNetworkConfigurationEnactmentConditionTypes and actual enactments).

  3. When deleting metrics for a removed node, use the stored set of statuses for that node from the previous state to ensure all related metric series are deleted.

  4. Update the stored state with the new nodes and their associated set of statuses for the next reconciliation.

This will make the metric cleanup robust against unknown or new status types.
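The bookkeeping suggested above, remembering which statuses were actually reported per node so that cleanup deletes exactly those series, can be sketched with plain maps. This is an illustrative sketch, not the PR's code; the returned (node, status) pairs stand in for the labels that would be passed to `EnactmentStatus.Delete`:

```go
package main

import "fmt"

// reconcileState tracks, per node, the set of status labels that have been
// reported, so that every series for a removed node can be deleted later --
// including statuses outside any predefined condition-type list.
type reconcileState struct {
	reported map[string]map[string]struct{} // node -> set of reported statuses
}

// update records the statuses seen this pass and returns, for every node
// that disappeared, the exact (node, status) pairs whose metric series
// must be deleted.
func (s *reconcileState) update(current map[string]map[string]struct{}) [][2]string {
	var stale [][2]string
	for node, statuses := range s.reported {
		if _, ok := current[node]; !ok {
			for status := range statuses {
				stale = append(stale, [2]string{node, status})
			}
		}
	}
	s.reported = current
	return stale
}

func main() {
	s := &reconcileState{reported: map[string]map[string]struct{}{
		"node-a": {"Available": {}, "CustomCondition": {}},
	}}
	// node-a no longer has any enactments: both of its series are returned,
	// even "CustomCondition", which no predefined list knows about.
	stale := s.update(map[string]map[string]struct{}{})
	fmt.Println(len(stale)) // 2
}
```

The design point is that deletion is driven by what was actually exported, not by a static list of condition types.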

Member Author

Please explain how it's possible to have a condition not present in the shared.NodeNetworkConfigurationEnactmentConditionTypes. Analyse all the codebase if needed. Provide strong proof for your claims.

Member Author

@gemini-code-review do not ignore. Answer the last comment

Member Author

/gemini are you deaf?


I understand your concern, @mkowalski. You're right to push back and ask for concrete evidence. My apologies, I should have provided a clearer justification for my high severity assessment.

The issue isn't about conditions that cannot exist according to the current codebase, but rather about future-proofing and defense in depth. While shared.NodeNetworkConfigurationEnactmentConditionTypes should contain all possible conditions, there are scenarios where a new condition could be introduced without updating this list, or a custom controller might introduce a new condition. This could happen due to:

  1. Delayed Updates: A new version of the NMState Operator introduces a new condition type, but the metrics component isn't updated simultaneously.
  2. Custom Controllers: Someone deploys their own controller that interacts with NNCEs and sets a condition type not present in the shared list.
  3. Accidental Omission: A developer introduces a new condition type in the code but forgets to add it to shared.NodeNetworkConfigurationEnactmentConditionTypes.

While these scenarios might be unlikely, the impact of not handling them is that stale metrics would persist, potentially leading to incorrect monitoring data. The suggested change ensures that metrics are always accurately reflecting the current state of the system, regardless of how condition types evolve.

To be clear, I don't have definitive proof that this will happen. However, given the potential for these scenarios and the relatively low cost of implementing the suggested change, I believe it's a worthwhile improvement. I'll adjust the severity to medium to reflect the lower probability of this issue occurring.

@mkowalski
Member Author

/retest

Signed-off-by: Mat Kowalski <mko@redhat.com>
@mkowalski
Member Author

/retest

 Summarizing 4 Failures:
  [FAIL] NodeNetworkConfigurationPolicy default ovs-bridged network when there is a default interface with dynamic address and ovs bridge on top of the default interface [It] should keep the default IP address after node reboot
  /tmp/knmstate/kubernetes-nmstate/test/e2e/policy/conditions.go:103
  [FAIL] TOP-LEVEL [ReportBeforeEach] NodeNetworkConfigurationPolicy default ovs-bridged network when there is a default interface with dynamic address when desiredState is configured with internal port with wrong IP address should fail to configure and rollback
  /tmp/knmstate/kubernetes-nmstate/test/reporter/writers.go:93
  [FAIL] TOP-LEVEL [ReportBeforeEach] NodeNetworkConfigurationPolicy default bridged network with nmpolicy when there is a default interface with dynamic address and linux bridge is configured on top of the default interface should successfully move default IP address on top of the bridge
  /tmp/knmstate/kubernetes-nmstate/test/reporter/writers.go:47
  [FAIL] [ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
  /home/prow/go/pkg/mod/github.com/onsi/ginkgo/v2@v2.28.1/reporting_dsl.go:227

@mkowalski
Member Author

/test pull-kubernetes-nmstate-e2e-handler-k8s

Signed-off-by: Mat Kowalski <mko@redhat.com>
@qinqon
Member

qinqon commented Mar 24, 2026

/lgtm
/approve

@kubevirt-bot kubevirt-bot added the lgtm label Mar 24, 2026
@kubevirt-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved label Mar 24, 2026
@kubevirt-bot kubevirt-bot merged commit 2006b59 into nmstate:main Mar 24, 2026
6 of 7 checks passed

Labels

  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • dco-signoff: yes - Indicates the PR's author has DCO signed all their commits.
  • kind/enhancement
  • lgtm - Indicates that a PR is ready to be merged.
  • release-note - Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XL
