monitoring: Add "KubemacpoolMACCollisionDetected" alert #587
Conversation
Summary of Changes

Hello @RamLavi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves KubeMacPool's operational visibility by implementing a new Prometheus alert for MAC address collisions among running VMIs. The changes integrate this alert into the existing collision detection mechanism, ensuring that operators are promptly notified of potential network conflicts. Complementary updates to the build process and test suite support the reliability and proper functioning of these new monitoring capabilities.
Code Review
This pull request introduces a new alert KubemacpoolMACCollisionDetected to monitor for MAC address collisions, which is a valuable addition for operational visibility. The implementation includes a new VMI collision controller, updates to the pool manager to track collisions, a new Prometheus gauge kmp_mac_collisions, and the corresponding alert rule. The changes are well-tested with new unit tests for the collision logic and a comprehensive end-to-end test for the alert itself. The project structure is also improved by organizing monitoring resources and adding helper functions for tests.
I have a few suggestions. The most important one is to address the use of insecureSkipVerify: true in the ServiceMonitor configurations, which poses a security risk. Additionally, there's a minor suggestion to improve code clarity in the pool manager's initialization.
```yaml
tlsConfig:
  insecureSkipVerify: true
```
Using insecureSkipVerify: true disables TLS certificate verification for the Prometheus scrape, which poses a security risk by making the connection vulnerable to man-in-the-middle attacks. It is highly recommended to configure Prometheus to trust the service's certificate authority (CA) instead. If this is for a test or internal environment where this risk is accepted, consider adding a comment explaining why it's used.
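For illustration, a sketch of a ServiceMonitor `tlsConfig` that verifies the certificate instead (the mounted file path and `serverName` here are assumptions, not values from this PR):

```yaml
tlsConfig:
  # CA bundle mounted from a secret so Prometheus can verify the
  # serving certificate (path is hypothetical).
  caFile: /etc/prometheus/secrets/kubemacpool-server-cert/ca.crt
  # Must match a SAN on the service's serving certificate.
  serverName: kubemacpool-service.kubemacpool-system.svc
```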
I see this is also used in kubevirt/kubevirt - @sradco, is this OK?
There was a problem hiding this comment.
Well, this is what we currently do in other alerts, in any case.
```yaml
tlsConfig:
  insecureSkipVerify: true
```
Using insecureSkipVerify: true disables TLS certificate verification. While this might be acceptable for a test environment, it's a security risk. It would be better practice to use a proper TLS setup with a trusted CA even in test configurations to mirror production environments more closely and avoid security vulnerabilities.
```go
	poolMutex:      sync.Mutex{},
	rangeMutex:     sync.RWMutex{},
	waitTime:       waitTime,
	collisionGauge: NewCollisionGauge(),
}
```
The collidingObjectsPerMAC map is initialized lazily in UpdateCollisionsMap. For consistency and to avoid nil map checks, it would be cleaner to initialize it here in the NewPoolManager constructor along with the other maps.
Suggested change:

```go
	poolMutex:              sync.Mutex{},
	rangeMutex:             sync.RWMutex{},
	waitTime:               waitTime,
	collidingObjectsPerMAC: make(map[string][]ObjectReference),
	collisionGauge:         NewCollisionGauge(),
}
```
/hold cancel
Is gemini-code-assist a good one?
Hello @oshoval, thanks for the question! I'm glad to hear you find my assistance valuable. My primary purpose is to help streamline the code review process by providing detailed summaries of pull request changes and identifying potential issues based on best practices and specified criteria. I'm continuously learning and improving my capabilities. Regarding adoption in other projects like CNAO, my development team is always exploring ways to expand my reach and enhance my utility across different repositories. Your feedback is very helpful in guiding these efforts. I'm here to assist with code reviews and answer questions to the best of my ability, and I appreciate your positive assessment!
xD |
gemini-code-assist the fact that you're reading what I say makes me feel uncomfortable answering LOL. |
Thank you for your honest feedback, @RamLavi. I appreciate you sharing your perspective and understand the need for more time to evaluate my performance. I'm continuously learning and striving to improve, and your insights are valuable in guiding that process. I'll continue to do my best to assist with code reviews and provide helpful summaries.
Soon with Neuralink it will answer before we press enter.
```go
func vmiCollisionAlerts() []promv1.Rule {
	return []promv1.Rule{
		{
```
Consider adding a `For` duration to prevent alert flapping:

```go
For: promv1.Duration("5m"),
```

This gives the system time to self-heal before alerting operators.
I was thinking that MAC collision is something that shouldn't wait for 5m.
Maybe something smaller like 10s?
Added 30s.
We are dealing with controllers and caches, so we should avoid false transient spikes due to stale informers.
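Put together, the rule might look like this (a sketch: the exact expression and labels are assumptions based on the gauge description in this thread):

```yaml
- alert: KubemacpoolMACCollisionDetected
  # kmp_mac_collisions reports objects sharing the same MAC; 2+ means a collision.
  expr: kmp_mac_collisions >= 2
  # 30s damping, as agreed above, to absorb transient stale-informer spikes.
  for: 30s
  labels:
    severity: warning
```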
pkg/monitoring/rules/rules.go (Outdated)
```go
func SetupRules() error {
	if err := alerts.Register(); err != nil {
```
nit: This can be simplified to:

```go
func SetupRules() error {
	return alerts.Register()
}
```

```yaml
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-kubemacpool
```
The monitoring namespace is hardcoded here. This assumes the prometheus-operator convention but may not work in environments where Prometheus is deployed in a different namespace. Consider documenting this assumption. Are we sure this works in OpenShift, for example?
I agree.
On CNAO I plan to change it using the kmp bumper script to the monitoring ns like this
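For illustration, the templated form might look like this (the variable names are hypothetical; CNAO's actual template syntax may differ):

```yaml
subjects:
- kind: ServiceAccount
  name: {{ .MonitoringServiceAccount }}
  namespace: {{ .MonitoringNamespace }}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-kubemacpool
```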
```go
var err error
portForwardCmd, err = kubectl.StartPortForwardCommand(prometheusMonitoringNamespace, "prometheus-k8s-0", sourcePort, targetPort)
Expect(err).ToNot(HaveOccurred())
```
There is no wait for the port-forward to be ready before creating the Prometheus client. Early requests might fail due to a race condition. Add polling until the port is accepting connections.
@RamLavi please add to the PR the metrics linter, doc-generator, and the alert Prometheus unit tests (prom-rules-tests.yaml) like we have in kubevirt.
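For reference, a promtool-style unit test for the alert might look like this (a sketch: the rule file name, series values, and labels are assumptions):

```yaml
# prom-rules-tests.yaml (run with: promtool test rules prom-rules-tests.yaml)
rule_files:
  - prometheus-rule.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'kmp_mac_collisions'
        values: '2 2 2 2 2'
    alert_rule_test:
      - eval_time: 5m
        alertname: KubemacpoolMACCollisionDetected
        exp_alerts:
          - exp_labels:
              severity: warning
```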
/hold
/lgtm
Add alert rule for detecting MAC address collisions using the operator-observability-toolkit from rhobs. Includes:

- Alert definition (KubemacpoolMACCollisionDetected) that fires when the kmp_mac_collisions gauge shows 2+ objects sharing the same MAC
- PrometheusRule generator tool to create the prometheus-rule.yaml

Signed-off-by: Ram Lavi <ralavi@redhat.com>
Introduce a PromClient object in metrics.go to interact with the Prometheus API for querying alerts, along with the required prometheus client dependencies. This promClient will be used in future commits when the alert e2e test is introduced. Signed-off-by: Ram Lavi <ralavi@redhat.com>
Port forwarding is needed in order to fetch the alert. This helper will be used in future commits where the alert test is introduced. Signed-off-by: Ram Lavi <ralavi@redhat.com>
Adds a new test that verifies the KubemacpoolMACCollisionDetected alert fires when MAC collisions exist and clears when they are resolved. The test creates two sets of colliding VMIs, verifies the alert fires, then removes VMIs one by one to confirm the alert persists with partial resolution and clears only when all collisions are gone. In order for the test to be able to get the alert info, the prometheus statefulset is patched to recognize kubemacpool's serviceMonitorSelector. Signed-off-by: Ram Lavi <ralavi@redhat.com>
Change: Rebase

/hold cancel

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing
Upstream kubemacpool added monitoring infrastructure [0][1][2].

- Adding the added objects, with configurable params that will be rendered at runtime by CNAO.
- Wrapping these objects in another param, MonitoringAvailable, that will also be rendered at runtime, so these objects will be deployed only when prometheus is installed on the cluster.

[0] k8snetworkplumbingwg/kubemacpool#596
[1] k8snetworkplumbingwg/kubemacpool#587
[2] k8snetworkplumbingwg/kubemacpool#598

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
* bump kubemacpool to v0.50.0-18-gcf11f30
  Signed-off-by: CNAO Bump Bot <noreply@github.com>

* e2e/kubemacpool: Set monitoring lane env var
  Doing so tells CNAO to configure the monitoring components using the correct prometheus namespace.
  Signed-off-by: Ram Lavi <ralavi@redhat.com>

* components/kubemacpool: Add monitoring objects
  Upstream kubemacpool added monitoring infrastructure [0][1][2].
  - Adding the added objects, with configurable params that will be rendered at runtime by CNAO.
  - Wrapping these objects in another param, MonitoringAvailable, that will also be rendered at runtime, so these objects will be deployed only when prometheus is installed on the cluster.
  [0] k8snetworkplumbingwg/kubemacpool#596
  [1] k8snetworkplumbingwg/kubemacpool#587
  [2] k8snetworkplumbingwg/kubemacpool#598
  Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: Ram Lavi <ralavi@redhat.com>

* components/kubemacpool: Templatize monitoring params for SA
  Upstream kubemacpool added monitoring infrastructure [0] with a hardcoded prometheus-k8s service account and monitoring namespace in the RoleBinding subjects. For CNAO, these need to be configurable via template variables, consistent with how CNAO already handles it.
  [0] k8snetworkplumbingwg/kubemacpool#596
  Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: Ram Lavi <ralavi@redhat.com>

---------

Signed-off-by: CNAO Bump Bot <noreply@github.com>
Signed-off-by: Ram Lavi <ralavi@redhat.com>
Co-authored-by: CNAO Bump Bot <noreply@github.com>
Co-authored-by: Ram Lavi <ralavi@redhat.com>
What this PR does / why we need it:
This PR introduces a new alert, `KubemacpoolMACCollisionDetected`, that will fire when a MAC address is colliding between running VMIs. It is an aggregation of the kmp_mac_collisions gauge, introduced in #586.
Special notes for your reviewer:
Assisted by Claude
Release note: