metrics: add status counter for NNCP and NNCE (#1458)
kubevirt-bot merged 7 commits into nmstate:main
Conversation
With this PR we are adding the following metrics:
* kubernetes_nmstate_policies_status
* kubernetes_nmstate_enactments_status
which count, respectively, NNCPs grouped by their active status condition and NNCEs grouped by node and active status condition. This is targeted at users who prefer to track the status of NNCPs using their monitoring stack instead of the `kubectl get nncp [...]` CLI. In order to keep the metric cardinality at a sane level, we do not include the NNCP/NNCE name in the metric, only the aggregated count.
Signed-off-by: Mat Kowalski <mko@redhat.com>
Summary of Changes: This pull request introduces new monitoring capabilities by exposing Prometheus metrics for the status of NodeNetworkConfigurationPolicies and NodeNetworkConfigurationEnactments. This enhancement provides users with a more integrated way to observe the health and progress of network configurations across their Kubernetes clusters, moving beyond command-line inspection to a more robust, monitoring-stack-friendly approach.
Changelog
Activity
Code Review
This pull request introduces two new metrics, kubernetes_nmstate_policies_status and kubernetes_nmstate_enactments_status, for monitoring the status of NodeNetworkConfigurationPolicies (NNCPs) and NodeNetworkConfigurationEnactments (NNCEs). This is implemented by adding two new metrics controllers. The logic for collecting and reporting metrics is sound. However, I've identified a significant issue where the NNCP metrics controller is using a deprecated API version (v1beta1) instead of the current v1, which could prevent it from functioning correctly. I've also included a suggestion to improve code organization.
controllers/metrics/nodenetworkconfigurationpolicy_controller.go
Signed-off-by: Mat Kowalski <mko@redhat.com>
@gemini-code-assist full review
Code Review
This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a great enhancement for monitoring. The implementation adds new controllers for these metrics and includes comprehensive e2e tests.
My review focuses on improving the efficiency of the new metric controllers. I've identified that the UpdateFunc predicates in the new controllers can be optimized to reduce unnecessary reconciliations, especially since they currently trigger on timestamp updates. Please see my specific comments for suggestions on how to refine this.
controllers/metrics/nodenetworkconfigurationenactment_status_controller.go
controllers/metrics/nodenetworkconfigurationpolicy_controller.go
Signed-off-by: Mat Kowalski <mko@redhat.com>
@gemini-code-assist full review
Code Review
This pull request introduces new metrics to track the status of NodeNetworkConfigurationPolicy and NodeNetworkConfigurationEnactment resources, which is a valuable addition for monitoring. The implementation includes new controllers for these metrics, along with E2E tests.
The overall approach is solid, using controller-runtime predicates to efficiently trigger reconciliations. However, I've identified a potential issue in the NodeNetworkConfigurationEnactmentStatusReconciler where metric series for removed nodes might not be fully cleaned up if they have status types not in the predefined list. My review includes a detailed comment on this with a suggested fix.
```go
func (r *NodeNetworkConfigurationEnactmentStatusReconciler) reportStatistics(ctx context.Context) error {
	nnceList := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := r.List(ctx, &nnceList); err != nil {
		return err
	}

	counts := make(map[enactmentStatusKey]float64)
	newNodes := make(map[string]struct{})

	for i := range nnceList.Items {
		nodeName := nnceList.Items[i].Labels[shared.EnactmentNodeLabel]
		if nodeName == "" {
			continue
		}
		newNodes[nodeName] = struct{}{}

		status := activeConditionType(nnceList.Items[i].Status.Conditions)
		if status != "" {
			key := enactmentStatusKey{node: nodeName, status: string(status)}
			counts[key]++
		}
	}

	// Reset all known node+status combinations, then set current values
	for nodeName := range newNodes {
		for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
			key := enactmentStatusKey{node: nodeName, status: string(condType)}
			monitoring.EnactmentStatus.WithLabelValues(nodeName, string(condType)).Set(counts[key])
		}
	}

	// Delete metrics for nodes that no longer have any enactments
	for oldNode := range r.oldNodes {
		if _, exists := newNodes[oldNode]; !exists {
			for _, condType := range shared.NodeNetworkConfigurationEnactmentConditionTypes {
				monitoring.EnactmentStatus.Delete(prometheus.Labels{
					"node":   oldNode,
					"status": string(condType),
				})
			}
		}
	}

	r.oldNodes = newNodes

	return nil
}
```
The current implementation for cleaning up metrics for removed nodes has a potential issue. It iterates over the predefined shared.NodeNetworkConfigurationEnactmentConditionTypes to delete metric series. If an enactment on a node that is later removed had a status condition type not present in this list, its corresponding metric series will not be deleted, leading to stale metrics.
A more robust approach would be to track all statuses that have been reported for each node.
I suggest modifying the reconciler's state and the reportStatistics logic:

1. Change the oldNodes field in NodeNetworkConfigurationEnactmentStatusReconciler to track statuses per node: oldState map[string]map[string]struct{} (from map[string]struct{}).
2. In reportStatistics, dynamically build a set of all statuses seen in the current reconciliation (from both shared.NodeNetworkConfigurationEnactmentConditionTypes and actual enactments).
3. When deleting metrics for a removed node, use the stored set of statuses for that node from the previous state to ensure all related metric series are deleted.
4. Update the stored state with the new nodes and their associated set of statuses for the next reconciliation.
This will make the metric cleanup robust against unknown or new status types.
Please explain how it's possible to have a condition not present in the shared.NodeNetworkConfigurationEnactmentConditionTypes. Analyse all the codebase if needed. Provide strong proof for your claims.
@gemini-code-review do not ignore. Answer the last comment
I understand your concern, @mkowalski. You're right to push back and ask for concrete evidence. My apologies, I should have provided a clearer justification for my high severity assessment.
The issue isn't about conditions that cannot exist according to the current codebase, but rather about future-proofing and defense in depth. While shared.NodeNetworkConfigurationEnactmentConditionTypes should contain all possible conditions, there are scenarios where a new condition could be introduced without updating this list, or a custom controller might introduce a new condition. This could happen due to:
- Delayed Updates: A new version of the NMState Operator introduces a new condition type, but the metrics component isn't updated simultaneously.
- Custom Controllers: Someone deploys their own controller that interacts with NNCEs and sets a condition type not present in the shared list.
- Accidental Omission: A developer introduces a new condition type in the code but forgets to add it to shared.NodeNetworkConfigurationEnactmentConditionTypes.
While these scenarios might be unlikely, the impact of not handling them is that stale metrics would persist, potentially leading to incorrect monitoring data. The suggested change ensures that metrics are always accurately reflecting the current state of the system, regardless of how condition types evolve.
To be clear, I don't have definitive proof that this will happen. However, given the potential for these scenarios and the relatively low cost of implementing the suggested change, I believe it's a worthwhile improvement. I'll adjust the severity to medium to reflect the lower probability of this issue occurring.
/retest
Signed-off-by: Mat Kowalski <mko@redhat.com>
/retest
/test pull-kubernetes-nmstate-e2e-handler-k8s
Signed-off-by: Mat Kowalski <mko@redhat.com>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: qinqon.
Is this a BUG FIX or a FEATURE?:
What this PR does / why we need it:
With this PR we are adding the following metrics:
* kubernetes_nmstate_policies_status
* kubernetes_nmstate_enactments_status
which count, respectively, NNCPs grouped by their active status condition and NNCEs grouped by node and active status condition.
This is targeted at users who prefer to track the status of NNCPs using their monitoring stack instead of the `kubectl get nncp [...]` CLI. In order to keep the metric cardinality at a sane level, we do not include the NNCP/NNCE name in the metric, only the aggregated count.
Special notes for your reviewer:
Release note: