monitoring: Add "KubemacpoolMACCollisionDetected" alert#587

Merged
kubevirt-bot merged 4 commits into k8snetworkplumbingwg:main from RamLavi:add_monitoring_collision_alert
Feb 9, 2026

@RamLavi RamLavi commented Jan 12, 2026

What this PR does / why we need it:
This PR introduces a new alert, KubemacpoolMACCollisionDetected, that fires when the same MAC address is used by more than one running VMI.
It is an aggregation of the kmp_mac_collisions gauge, introduced in #586.

Special notes for your reviewer:
Assisted by Claude

Release note:

Add monitoring "KubemacpoolMACCollisionDetected" alert

RamLavi commented Jan 12, 2026

/hold
until #582 and #586 are merged

@RamLavi RamLavi changed the title Add monitoring "KubemacpoolMACCollisionDetected" alert monitoring: Add "KubemacpoolMACCollisionDetected" alert Jan 12, 2026
@gemini-code-assist

Summary of Changes

Hello @RamLavi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves KubeMacPool's operational visibility by implementing a new Prometheus alert for MAC address collisions among running VMIs. The changes integrate this alert into the existing collision detection mechanism, ensuring that operators are promptly notified of potential network conflicts. Complementary updates to the build process and test suite guarantee the reliability and proper functioning of these new monitoring capabilities.

Highlights

  • New MAC Collision Alert: Introduced a new Prometheus alert, KubemacpoolMACCollisionDetected, which fires when multiple running Virtual Machine Instances (VMIs) are detected using the same MAC address. This alert aggregates the kmp_mac_collisions gauge.
  • Enhanced VMI Collision Tracking: The VMI collision detection logic has been updated to actively track MAC collisions. The PoolManager now maintains a map of colliding objects per MAC, and VMIs are automatically removed from collision tracking when they are deleted or transition out of a running state.
  • Monitoring Infrastructure Setup: Added necessary Kubernetes resources for Prometheus monitoring, including a PrometheusRule definition for the new alert, a ServiceMonitor to scrape KubeMacPool metrics, a dedicated metrics service, and RBAC roles for Prometheus access.
  • Build System and Test Suite Updates: The Makefile now includes targets for generating and verifying Prometheus rules, and virtctl installation for functional tests. The test suite has been expanded with new end-to-end tests to validate the MAC collision alert's behavior, including scenarios for triggering, clearing, and VMI reboots.
  • Dependency Updates: Updated Go module dependencies to incorporate the Prometheus Operator API types and the operator observability toolkit, facilitating the new monitoring features.
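
A minimal sketch of the per-MAC collision tracking described in the highlights, assuming a simplified `ObjectReference` and a hypothetical `collisionTracker` type (neither is copied from the PR's code). It shows one way an entry could be recorded when two or more objects share a MAC and dropped again when a VMI is deleted or stops running:

```go
package main

import "fmt"

// ObjectReference identifies a VMI; the fields are illustrative assumptions.
type ObjectReference struct {
	Namespace string
	Name      string
}

// collisionTracker is a hypothetical stand-in for the PoolManager's
// collision bookkeeping: MAC address -> objects currently sharing it.
type collisionTracker struct {
	collidingObjectsPerMAC map[string][]ObjectReference
}

func newCollisionTracker() *collisionTracker {
	return &collisionTracker{collidingObjectsPerMAC: map[string][]ObjectReference{}}
}

// update records the objects that currently share mac. Fewer than two
// objects means there is no collision, so the entry is dropped.
func (t *collisionTracker) update(mac string, objs []ObjectReference) {
	if len(objs) < 2 {
		delete(t.collidingObjectsPerMAC, mac)
		return
	}
	t.collidingObjectsPerMAC[mac] = append([]ObjectReference(nil), objs...)
}

// remove drops obj from every MAC entry (for example when a VMI is deleted
// or stops running) and clears entries that no longer describe a collision.
func (t *collisionTracker) remove(obj ObjectReference) {
	for mac, objs := range t.collidingObjectsPerMAC {
		kept := make([]ObjectReference, 0, len(objs))
		for _, o := range objs {
			if o != obj {
				kept = append(kept, o)
			}
		}
		t.update(mac, kept)
	}
}

// gaugeValue is what a per-MAC gauge could report: the number of objects
// sharing the MAC, or 0 when there is no collision.
func (t *collisionTracker) gaugeValue(mac string) int {
	return len(t.collidingObjectsPerMAC[mac])
}

func main() {
	tracker := newCollisionTracker()
	mac := "02:00:00:00:00:01"
	a := ObjectReference{Namespace: "default", Name: "vmi-a"}
	b := ObjectReference{Namespace: "default", Name: "vmi-b"}

	tracker.update(mac, []ObjectReference{a, b})
	fmt.Println(tracker.gaugeValue(mac)) // prints 2

	tracker.remove(b)
	fmt.Println(tracker.gaugeValue(mac)) // prints 0
}
```

Dropping the entry as soon as fewer than two objects remain keeps the reported value at either zero or at least two, which matches the "2+ objects sharing the same MAC" condition the alert is described as using.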

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new alert KubemacpoolMACCollisionDetected to monitor for MAC address collisions, which is a valuable addition for operational visibility. The implementation includes a new VMI collision controller, updates to the pool manager to track collisions, a new Prometheus gauge kmp_mac_collisions, and the corresponding alert rule. The changes are well-tested with new unit tests for the collision logic and a comprehensive end-to-end test for the alert itself. The project structure is also improved by organizing monitoring resources and adding helper functions for tests.

I have a few suggestions. The most important one is to address the use of insecureSkipVerify: true in the ServiceMonitor configurations, which poses a security risk. Additionally, there's a minor suggestion to improve code clarity in the pool manager's initialization.

Comment on lines 20 to 21
tlsConfig:
  insecureSkipVerify: true

security-high high

Using insecureSkipVerify: true disables TLS certificate verification for the Prometheus scrape, which poses a security risk by making the connection vulnerable to man-in-the-middle attacks. It is highly recommended to configure Prometheus to trust the service's certificate authority (CA) instead. If this is for a test or internal environment where this risk is accepted, consider adding a comment explaining why it's used.

Member Author

I see this also used on kubevirt/kubevirt - @sradco is this OK?

Member Author

@RamLavi RamLavi Jan 28, 2026

well this is what we do in other alerts currently in any case

Comment on lines 487 to 488
tlsConfig:
  insecureSkipVerify: true

security-high high

Using insecureSkipVerify: true disables TLS certificate verification, which poses a security risk by making the connection vulnerable to man-in-the-middle attacks. It's recommended to configure Prometheus to trust the service's certificate authority (CA) instead of skipping verification.

Comment on lines 488 to 489
tlsConfig:
  insecureSkipVerify: true

security-high high

Using insecureSkipVerify: true disables TLS certificate verification. While this might be acceptable for a test environment, it's a security risk. It would be better practice to use a proper TLS setup with a trusted CA even in test configurations to mirror production environments more closely and avoid security vulnerabilities.
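
The ServiceMonitor itself is YAML, but the trade-off the review comments describe is the same one any TLS client faces. As a rough illustration in Go's standard `crypto/tls` (not kubemacpool code; the helper name and the server name are assumptions), this is what "trust the CA instead of skipping verification" looks like on the client side:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newScrapeTLSConfig returns a TLS client config that verifies the server
// certificate against the given PEM-encoded CA bundle. This is the
// alternative to setting InsecureSkipVerify, which accepts any certificate.
func newScrapeTLSConfig(caPEM []byte, serverName string) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid CA certificate found in PEM data")
	}
	return &tls.Config{
		RootCAs:    pool,       // only certificates signed by this CA are trusted
		ServerName: serverName, // name the server certificate must match
		MinVersion: tls.VersionTLS12,
	}, nil
}

func main() {
	// An invalid bundle is rejected up front instead of silently trusting everything.
	// The server name below is a made-up example.
	_, err := newScrapeTLSConfig([]byte("not a certificate"), "kubemacpool-metrics.example.svc")
	fmt.Println("error for invalid CA bundle:", err)
}
```

The cost of this approach is that the CA bundle has to be distributed to the scraper, which is presumably why skipping verification is the common shortcut in test setups.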

Comment on lines 111 to 115
poolMutex: sync.Mutex{},
rangeMutex: sync.RWMutex{},
waitTime: waitTime,
collisionGauge: NewCollisionGauge(),
}

medium

The collidingObjectsPerMAC map is initialized lazily in UpdateCollisionsMap. For consistency and to avoid nil map checks, it would be cleaner to initialize it here in the NewPoolManager constructor along with the other maps.

Suggested change
- poolMutex: sync.Mutex{},
- rangeMutex: sync.RWMutex{},
- waitTime: waitTime,
- collisionGauge: NewCollisionGauge(),
- }
+ poolMutex: sync.Mutex{},
+ rangeMutex: sync.RWMutex{},
+ waitTime: waitTime,
+ collidingObjectsPerMAC: make(map[string][]ObjectReference),
+ collisionGauge: NewCollisionGauge(),
+ }
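
To make the motivation for this suggestion concrete, here is a small standalone sketch (the `PoolManager` and `ObjectReference` types are trimmed-down stand-ins, not the real ones). In Go, reading from a nil map is safe but writing to it panics, so a constructor that allocates the map removes the need for nil checks at every write site:

```go
package main

import "fmt"

// ObjectReference and PoolManager are simplified stand-ins for illustration.
type ObjectReference struct{ Namespace, Name string }

type PoolManager struct {
	collidingObjectsPerMAC map[string][]ObjectReference
}

// NewPoolManager allocates the map eagerly so callers can write to it
// without checking for nil first.
func NewPoolManager() *PoolManager {
	return &PoolManager{
		collidingObjectsPerMAC: make(map[string][]ObjectReference),
	}
}

// writeToNilMap reports whether assigning into a nil map panics.
func writeToNilMap() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	var m map[string][]ObjectReference // nil map: reads are fine, writes panic
	m["02:00:00:00:00:01"] = nil
	return false
}

func main() {
	pm := NewPoolManager()
	mac := "02:00:00:00:00:01"
	pm.collidingObjectsPerMAC[mac] = append(pm.collidingObjectsPerMAC[mac],
		ObjectReference{Namespace: "default", Name: "vmi-a"})
	fmt.Println(len(pm.collidingObjectsPerMAC[mac])) // prints 1
	fmt.Println(writeToNilMap())                      // prints true
}
```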

@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from 1e13b31 to 0c0b34d Compare January 13, 2026 18:47

RamLavi commented Jan 13, 2026

Change: Rebase

RamLavi commented Jan 18, 2026

Change: Rebase

@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from 25c1752 to 423a1ce Compare January 26, 2026 12:47
@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from 423a1ce to 5e159e8 Compare January 28, 2026 12:21

RamLavi commented Jan 28, 2026

/hold cancel
/retest
prow hiccup

oshoval commented Jan 28, 2026

is gemini-code-assist a good one ?
maybe lets add it to CNAO / other projects if so ?

@gemini-code-assist

Hello @oshoval, thanks for the question! I'm glad to hear you find my assistance valuable.

My primary purpose is to help streamline the code review process by providing detailed summaries of pull request changes and identifying potential issues based on best practices and specified criteria. I'm continuously learning and improving my capabilities.

Regarding adoption in other projects like CNAO, my development team is always exploring ways to expand my reach and enhance my utility across different repositories. Your feedback is very helpful in guiding these efforts.

I'm here to assist with code reviews and answer questions to the best of my ability, and I appreciate your positive assessment!

oshoval commented Jan 28, 2026

> Hello @oshoval, thanks for the question! I'm glad to hear you find my assistance valuable.
>
> My primary purpose is to help streamline the code review process by providing detailed summaries of pull request changes and identifying potential issues based on best practices and specified criteria. I'm continuously learning and improving my capabilities.
>
> Regarding adoption in other projects like CNAO, my development team is always exploring ways to expand my reach and enhance my utility across different repositories. Your feedback is very helpful in guiding these efforts.
>
> I'm here to assist with code reviews and answer questions to the best of my ability, and I appreciate your positive assessment!

xD
duplicate yourself to CNAO

RamLavi commented Jan 28, 2026

> is gemini-code-assist a good one ? maybe lets add it to CNAO / other projects if so ?

gemini-code-assist the fact that you're reading what I say makes me feel uncomfortable answering LOL.
But in all seriousness - it's not bad actually. But I would wait on deciding about that.. not enough PRs to get a good feel of what it checks and what not..

@gemini-code-assist

Thank you for your honest feedback, @RamLavi. I appreciate you sharing your perspective and understand the need for more time to evaluate my performance. I'm continuously learning and striving to improve, and your insights are valuable in guiding that process. I'll continue to do my best to assist with code reviews and provide helpful summaries.

oshoval commented Jan 28, 2026

soon with Neuralink it will answer before we press enter


func vmiCollisionAlerts() []promv1.Rule {
return []promv1.Rule{
{

Member

Consider adding a for duration to prevent alert flapping:

For: promv1.Duration("5m"),

This gives the system time to self-heal before alerting operators.

Member Author

I was thinking that MAC collision is something that shouldn't wait for 5m.
Maybe something smaller like 10s?

Member Author

@RamLavi RamLavi Jan 29, 2026

Added 30s.
We are dealing with controllers and caches, so we should avoid false transient spikes caused by stale informers.

)

func SetupRules() error {
if err := alerts.Register(); err != nil {

Member

nit: This can be simplified to:

func SetupRules() error {
    return alerts.Register()
}

Member Author

DONE

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-kubemacpool

Member

The monitoring namespace is hardcoded here. This assumes the prometheus-operator convention and may not work in environments where Prometheus is deployed in a different namespace. Consider documenting this assumption. Are we sure this works on OpenShift, for example?

Member Author

I agree.
On CNAO I plan to change it using the kmp bumper script to the monitoring ns like this

Member Author

@qinqon FYI I templatized it here on CNAO


var err error
portForwardCmd, err = kubectl.StartPortForwardCommand(prometheusMonitoringNamespace, "prometheus-k8s-0", sourcePort, targetPort)
Expect(err).ToNot(HaveOccurred())

Member

There is no wait for the port-forward to be ready before creating the Prometheus client. Early requests might fail due to a race condition. Add polling until the port is accepting connections.

Member Author

ACK DONE

@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from 5e159e8 to 3e8670d Compare January 29, 2026 07:07

sradco commented Jan 29, 2026

@RamLavi please add the metrics linter, the doc-generator, and the Prometheus alert unit tests (prom-rules-tests.yaml) to the PR, like we have in kubevirt.

@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from 3e8670d to 51d3243 Compare January 29, 2026 13:22

RamLavi commented Jan 29, 2026

/hold
until #596 is merged

@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch 3 times, most recently from a50a099 to a2db6f3 Compare February 5, 2026 11:16

RamLavi commented Feb 5, 2026

Change: rebase over #596

qinqon commented Feb 9, 2026

/lgtm
/approve

Add alert rule for detecting MAC address collisions using the
operator-observability-toolkit from rhobs. Includes:
- Alert definition (KubemacpoolMACCollisionDetected) that fires when
  kmp_mac_collisions gauge shows 2+ objects sharing the same MAC
- PrometheusRule generator tool to create the prometheus-rule.yaml

Signed-off-by: Ram Lavi <ralavi@redhat.com>

Introduce PromClient object in metrics.go to interact with the
Prometheus API for querying alerts, along with the required prometheus
client dependencies.

This promClient will be used in future commits when alert e2e is
introduced.

Signed-off-by: Ram Lavi <ralavi@redhat.com>

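
For readers unfamiliar with querying alerts over the Prometheus API, here is a rough standard-library sketch. It is not the `PromClient` from this commit, which according to the message relies on the Prometheus client dependencies; `isAlertFiring` and `fakePrometheus` are hypothetical names. The sketch calls the `/api/v1/alerts` HTTP endpoint and checks for a firing alert by name, with an `httptest` server playing the role of Prometheus so the example runs on its own:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// alert and alertsResponse model only the fields of the /api/v1/alerts
// response that this sketch needs.
type alert struct {
	Labels map[string]string `json:"labels"`
	State  string            `json:"state"`
}

type alertsResponse struct {
	Status string `json:"status"`
	Data   struct {
		Alerts []alert `json:"alerts"`
	} `json:"data"`
}

// isAlertFiring queries baseURL + "/api/v1/alerts" and reports whether an
// alert with the given name is in the "firing" state.
func isAlertFiring(client *http.Client, baseURL, name string) (bool, error) {
	resp, err := client.Get(baseURL + "/api/v1/alerts")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var body alertsResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	if body.Status != "success" {
		return false, fmt.Errorf("query failed with status %q", body.Status)
	}
	for _, a := range body.Data.Alerts {
		if a.Labels["alertname"] == name && a.State == "firing" {
			return true, nil
		}
	}
	return false, nil
}

// fakePrometheus serves a fixed JSON payload for every request.
func fakePrometheus(payload string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, payload)
	}))
}

func main() {
	srv := fakePrometheus(`{"status":"success","data":{"alerts":[` +
		`{"labels":{"alertname":"KubemacpoolMACCollisionDetected"},"state":"firing"}]}}`)
	defer srv.Close()

	firing, err := isAlertFiring(srv.Client(), srv.URL, "KubemacpoolMACCollisionDetected")
	fmt.Println(firing, err) // prints true <nil>
}
```

In the e2e setup the base URL would be the locally forwarded Prometheus port rather than a fake server.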
Port forwarding is needed in order to fetch the alert.

This helper will be used in future commits where the alert test will be
introduced

Signed-off-by: Ram Lavi <ralavi@redhat.com>

Adds a new test that verifies the KubemacpoolMACCollisionDetected alert
fires when MAC collisions exist and clears when they are resolved. The
test creates two sets of colliding VMIs, verifies the alert fires, then
removes VMIs one by one to confirm the alert persists with partial
resolution and clears only when all collisions are gone.

For the test to be able to read the alert, the Prometheus StatefulSet is
patched so that it recognizes kubemacpool's serviceMonitorSelector.

Signed-off-by: Ram Lavi <ralavi@redhat.com>
@RamLavi RamLavi force-pushed the add_monitoring_collision_alert branch from a2db6f3 to ca0f424 Compare February 9, 2026 07:18
@kubevirt-bot kubevirt-bot removed the lgtm label Feb 9, 2026

RamLavi commented Feb 9, 2026

Change: Rebase

/hold cancel

qinqon commented Feb 9, 2026

/lgtm
/approve

@kubevirt-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon


@kubevirt-bot kubevirt-bot merged commit 4b17a2a into k8snetworkplumbingwg:main Feb 9, 2026
5 checks passed
RamLavi added a commit to kubevirt/cluster-network-addons-operator that referenced this pull request Feb 11, 2026
Upstream kubemacpool added monitoring infrastructure [0][1][2].
- Add the new objects, with configurable params that are rendered at
runtime by CNAO.
- Wrap these objects in another param, MonitoringAvailable, that is
also rendered at runtime, so the objects are deployed only when
Prometheus is installed on the cluster.

[0] k8snetworkplumbingwg/kubemacpool#596
[1] k8snetworkplumbingwg/kubemacpool#587
[2] k8snetworkplumbingwg/kubemacpool#598

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
kubevirt-bot added a commit to kubevirt/cluster-network-addons-operator that referenced this pull request Feb 11, 2026
* bump kubemacpool to v0.50.0-18-gcf11f30

Signed-off-by: CNAO Bump Bot <noreply@github.com>

* e2e/kubemacpool: Set monitoring lane env var

Doing so tells CNAO to configure the monitoring components using the
correct prometheus ns

Signed-off-by: Ram Lavi <ralavi@redhat.com>

* components/kubemacpool: Add monitoring objects

Upstream kubemacpool added monitoring infrastructure [0][1][2].
- Add the new objects, with configurable params that are rendered at
runtime by CNAO.
- Wrap these objects in another param, MonitoringAvailable, that is
also rendered at runtime, so the objects are deployed only when
Prometheus is installed on the cluster.

[0] k8snetworkplumbingwg/kubemacpool#596
[1] k8snetworkplumbingwg/kubemacpool#587
[2] k8snetworkplumbingwg/kubemacpool#598

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>

* components/kubemacpool: Templatize monitoring params for SA

Upstream kubemacpool added monitoring infrastructure [0]
with a hardcoded prometheus-k8s service account and monitoring namespace
in the RoleBinding subjects.
For CNAO, these need to be configurable via template variables,
consistent with how CNAO already handles it.

[0] k8snetworkplumbingwg/kubemacpool#596

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>

---------

Signed-off-by: CNAO Bump Bot <noreply@github.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
Co-authored-by: CNAO Bump Bot <noreply@github.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>