Skip to content

Add Prometheus metrics and alerts for server monitoring#718

Merged
xkonni merged 3 commits into
mainfrom
server_prom_metrics
Mar 11, 2026
Merged

Add Prometheus metrics and alerts for server monitoring#718
xkonni merged 3 commits into
mainfrom
server_prom_metrics

Conversation

@stefanhipfel

@stefanhipfel stefanhipfel commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Proposed Changes

Add Prometheus metrics and alerts for server monitoring

Summary by CodeRabbit

Release Notes

  • New Features

    • Prometheus metrics now track server state, power state, conditions, and reconciliation status
    • Added Prometheus alerting rules for server monitoring
  • Documentation

    • New observability documentation with metrics guide and quick-start access instructions
  • Tests

    • Added metrics validation test suite
  • Chores

    • Updated dependencies to support metrics collection

@github-actions github-actions Bot added size/XL documentation Improvements or additions to documentation labels Mar 2, 2026
@stefanhipfel stefanhipfel marked this pull request as ready for review March 3, 2026 08:59
@stefanhipfel stefanhipfel requested a review from a team as a code owner March 3, 2026 08:59
@coderabbitai

coderabbitai Bot commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR introduces Prometheus observability for the metal-operator by adding custom metrics collection for server state and reconciliation events. It includes a new ServerStateCollector that aggregates server status metrics, integrates reconciliation tracking into the controller loop, adds alert rules, promotes Prometheus dependencies to direct requirements, and provides comprehensive documentation.

Changes

Cohort / File(s) Summary
Metrics Core Implementation
internal/metrics/server_metrics.go, internal/controller/server_controller.go, cmd/main.go
Introduces ServerReconciliationTotal counter for tracking reconciliation outcomes and ServerStateCollector that implements prometheus.Collector to emit aggregated server state, power state, and condition metrics. Integrates metric recording into the reconciliation loop and registers the collector at startup.
Metrics Testing
internal/metrics/server_metrics_test.go
Comprehensive test suite validating ServerReconciliationTotal counter increments and ServerStateCollector metrics emission across various server states, power states, conditions, and edge cases using Ginkgo/Gomega and fake Kubernetes client.
Dependencies
go.mod
Promotes github.com/prometheus/client_golang and github.com/prometheus/client_model from indirect to direct dependencies; adds github.com/kylelemons/godebug as indirect dependency.
Alerting Configuration
config/prometheus/server_alerts.yaml
Adds 88 lines of Prometheus alert rules for monitoring server state and reconciliation metrics.
Documentation & Configuration
docs/observability/README.md, docs/observability/metrics.md, docs/.vitepress/config.mts
Adds observability documentation detailing metrics access, available metrics with example PromQL queries, alerting setup, and Grafana dashboards; updates documentation sidebar with Observability section.

Sequence Diagram

sequenceDiagram
    participant Prometheus
    participant Collector as ServerStateCollector
    participant Registry as Metrics Registry
    participant API as Kubernetes API

    Prometheus->>Registry: Scrape metrics endpoint
    Registry->>Collector: Collect() invoked
    Collector->>API: List all Server objects (5s timeout)
    API-->>Collector: Server list
    Collector->>Collector: Aggregate by State, PowerState, Conditions
    Collector-->>Registry: Emit gauge metrics<br/>(metal_server_state,<br/>metal_server_power_state,<br/>metal_server_condition_status)
    Registry-->>Prometheus: Return metrics
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested reviewers

  • damyan
  • xkonni
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is significantly incomplete. It only repeats the title without providing details about changes, motivation, or fixes. The template requires multiple bullet points under 'Proposed Changes' and a 'Fixes' reference. Expand the description with specific bullet points detailing the changes (metrics added, alerts configured, documentation), motivation/rationale, and any related issue references following the provided template.
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title clearly and concisely summarizes the main change: adding Prometheus metrics and alerts for server monitoring, which aligns with all major file additions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch server_prom_metrics

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
internal/metrics/server_metrics_test.go (1)

93-131: Collector tests should assert metric labels and values, not just metric presence.

metricCount > 0 allows incorrect aggregations to pass unnoticed. Please assert exact expected values for key label sets (e.g., state=Available, state=Reserved, power_state=On/Off, and condition status counts).

Also applies to: 133-180, 182-220

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/metrics/server_metrics_test.go` around lines 93 - 131, The test
currently only checks that collector.Collect(ch) emitted some metrics; update
the It block so it decodes the emitted prometheus.Metric objects and asserts
exact label/value pairs for the server-state and power_state metrics: after
calling collector.Collect(ch) iterate the channel, convert each
prometheus.Metric to a dto.Metric (or use prometheus/testutil to gather
metrics), and assert the counts for labels state="Available" == 2 and
state="Reserved" == 1, and similarly assert expected power_state and condition
label/value combinations; reference the existing variables collector, ch,
servers and fakeClient when locating where to add these assertions so the test
verifies specific label sets/values rather than only metricCount > 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/prometheus/server_alerts.yaml`:
- Around line 67-73: The ServerReconciliationFailureSpike alert computes a
per-error-type ratio because both numerator and denominator keep the result
label; fix it by aggregating away the result label so you get an overall failure
ratio: use sum(rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]))
divided by sum(rate(metal_server_reconciliation_total[5m])) (optionally with a
by(<labels>) grouping if you want per-instance/job scope) — update the alert
expression for alert: ServerReconciliationFailureSpike and ensure
metal_server_reconciliation_total's result label is not preserved in the
division.
- Around line 13-15: Replace the fragile equality check that yields empty
vectors by treating a missing metric as zero or explicitly alerting on absence:
update the NoAvailableServers and LowAvailableServerCapacity alerts (alert names
NoAvailableServers and LowAvailableServerCapacity) to use either
sum(metal_server_state{state="Available"} or on() vector(0)) < 1 (or similar
threshold) so absent series are treated as 0, or use
absent_over_time(metal_server_state{state="Available"}[5m]) if you want to fire
when the metric is missing; change the expr fields accordingly for both alerts.

In `@docs/metrics.md`:
- Around line 51-56: The fenced code blocks showing Prometheus-style metrics
(e.g., metal_server_state, metal_server_power_state,
metal_server_condition_status, metal_server_reconciliation_total) are missing
language tags and trigger MD040; update each block by adding a language
identifier (use "text") to the opening backticks for the blocks containing those
metric examples so they read ```text ... ``` to satisfy the linter at the blocks
around the examples for metal_server_state, metal_server_power_state,
metal_server_condition_status, and metal_server_reconciliation_total.

In `@internal/metrics/server_metrics.go`:
- Around line 81-86: In Collect(), replace the unbounded context.Background()
used for the Kubernetes API call with a cancellable context that has a short
timeout (e.g., 3–10s) to avoid blocking Prometheus scrapes; create the context
via context.WithTimeout(context.Background(), timeout), defer the cancel(), and
pass that ctx into c.Client.List(...) (update imports to include time if needed)
so List is always bounded and will return promptly on API stalls.
- Around line 139-157: The loop emitting condition metrics is using `for status
:= range statusMap` which iterates keys only (dropping counts) and compares
`strings.ToLower(status)` against
`string(metav1.ConditionTrue)`/`ConditionFalse` (capitalized), causing the
switch to never match and value to remain -1; fix by iterating both key and
value (e.g., `for status, count := range statusMap`) and use a case-insensitive
comparison or normalize both sides consistently (either compare
`strings.ToLower(status)` to
`strings.ToLower(string(metav1.ConditionTrue))`/`strings.ToLower(string(metav1.ConditionFalse))`
or compare `status` directly), then emit `count` (as the metric value) via
`prometheus.MustNewConstMetric` with `c.conditionDesc` so the actual server
counts and correct numeric condition values (1/0) are reported instead of -1.

---

Nitpick comments:
In `@internal/metrics/server_metrics_test.go`:
- Around line 93-131: The test currently only checks that collector.Collect(ch)
emitted some metrics; update the It block so it decodes the emitted
prometheus.Metric objects and asserts exact label/value pairs for the
server-state and power_state metrics: after calling collector.Collect(ch)
iterate the channel, convert each prometheus.Metric to a dto.Metric (or use
prometheus/testutil to gather metrics), and assert the counts for labels
state="Available" == 2 and state="Reserved" == 1, and similarly assert expected
power_state and condition label/value combinations; reference the existing
variables collector, ch, servers and fakeClient when locating where to add these
assertions so the test verifies specific label sets/values rather than only
metricCount > 0.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c05e8e and c9a6ab8.

📒 Files selected for processing (8)
  • README.md
  • cmd/main.go
  • config/prometheus/server_alerts.yaml
  • docs/metrics.md
  • go.mod
  • internal/controller/server_controller.go
  • internal/metrics/server_metrics.go
  • internal/metrics/server_metrics_test.go

Comment thread config/prometheus/server_alerts.yaml
Comment thread config/prometheus/server_alerts.yaml
Comment thread docs/metrics.md Outdated
Comment thread internal/metrics/server_metrics.go Outdated
Comment thread internal/metrics/server_metrics.go Outdated
- Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy.
- Add new alert for missing server metrics.
- Improve documentation for server condition status and example values.
- Implement a timeout for metrics collection to prevent hanging scrapes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/prometheus/server_alerts.yaml`:
- Around line 67-69: The ServerMetricsMissing alert currently uses the
expression absent(metal_server_state{state="Available"}) which triggers when no
"Available" time series exist (including valid zero-available states); update
the alert's expr to absent(metal_server_state) so it only fires when the entire
metal_server_state metric is missing; modify the ServerMetricsMissing alert
definition (alert: ServerMetricsMissing) to replace the expr accordingly and
keep the existing for: 5m and other labels/annotations intact.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9a6ab8 and 06b2146.

📒 Files selected for processing (3)
  • config/prometheus/server_alerts.yaml
  • docs/metrics.md
  • internal/metrics/server_metrics.go

Comment thread config/prometheus/server_alerts.yaml
Comment on lines +176 to +235
## Alerting Rules

Example PrometheusRule resource:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: metal-operator-server-alerts
namespace: metal-operator-system
spec:
groups:
- name: metal_operator_servers
interval: 30s
rules:
- alert: NoAvailableServers
expr: metal_server_state{state="Available"} == 0
for: 5m
annotations:
summary: "No available servers in the fleet"
description: "All servers are either Reserved, in Maintenance, or in Error state"
labels:
severity: warning

- alert: ServersInErrorState
expr: metal_server_state{state="Error"} > 0
for: 2m
annotations:
summary: "Servers are in Error state"
description: "{{ $value }} server(s) are in Error state and require attention"
labels:
severity: critical

- alert: ServersPoweringOnTooLong
expr: metal_server_power_state{power_state="PoweringOn"} > 0
for: 10m
annotations:
summary: "Servers stuck in PoweringOn state"
description: "{{ $value }} server(s) have been in PoweringOn state for over 10 minutes"
labels:
severity: warning

- alert: HighReconciliationErrorRate
expr: rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]) > 0.1
for: 5m
annotations:
summary: "High server reconciliation error rate"
description: "Server reconciliation errors are occurring at {{ $value | humanize }} per second"
labels:
severity: warning

- alert: LowAvailableServerCapacity
expr: metal_server_state{state="Available"} < 2
for: 5m
annotations:
summary: "Low available server capacity"
description: "Only {{ $value }} server(s) are available"
labels:
severity: warning
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Alerting example is stale versus the shipped rules.

The example still shows the old NoAvailableServers/LowAvailableServerCapacity expressions and omits newer alerts, so copy/paste users will get outdated behavior.

🛠️ Suggested doc sync
-    - alert: NoAvailableServers
-      expr: metal_server_state{state="Available"} == 0
+    - alert: NoAvailableServers
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1
@@
-    - alert: LowAvailableServerCapacity
-      expr: metal_server_state{state="Available"} < 2
+    - alert: LowAvailableServerCapacity
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2

Comment thread internal/metrics/server_metrics_test.go Outdated
})

It("should count servers by state", func() {
ctx := context.Background()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use SpecContext in the func().

- name: metal_operator_servers
interval: 30s
rules:
- alert: NoAvailableServers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want this rule? As the goal will be to use all available servers, this alert will always fire. @coderabbitai what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat config/prometheus/server_alerts.yaml

Length of output: 2978


@afritzler, you raise a very valid point. The NoAvailableServers alert as currently defined will indeed fire whenever the fleet is fully utilized — which is the desired operational state. The alert's own description already betrays the ambiguity: it lists "Reserved" (healthy, in-use) alongside "Maintenance" and "Error" (genuinely problematic) as possible causes.

A few specific issues:

  1. False positives in healthy operation: When all servers are Reserved, this alert fires, yet nothing is wrong.
  2. Redundancy with LowAvailableServerCapacity: That alert already fires at < 2 available servers, which is a strict superset of the zero case. NoAvailableServers adds no new signal.
  3. No distinction between "all used" vs. "all broken": The genuinely alarming scenario — zero available servers because they are all in Error/Maintenance — is already partially covered by ServersInErrorState.

Suggested options:

  • Option A (Simplest): Remove NoAvailableServers entirely and rely on LowAvailableServerCapacity + ServersInErrorState to cover the relevant failure modes.

  • Option B: Reformulate it to only fire when zero servers are available and zero are reserved (meaning the fleet isn't productively in use at all):

    expr: |
      sum(metal_server_state{state="Available"} or on() vector(0)) < 1
      AND
      sum(metal_server_state{state="Reserved"} or on() vector(0)) == 0

    This would correctly distinguish between "full utilization" and "fleet is completely idle but nothing is available," which is the more actionable signal.

Option A is likely sufficient given the existing alert coverage, but Option B could be useful if operators specifically want to detect a completely idle and empty fleet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhipfel I think option B is actually what we want.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

- name: metal_operator_servers
interval: 30s
rules:
- alert: NoAvailableServers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -0,0 +1,354 @@
# Prometheus Metrics

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we open top level in the docs a section for observability and have a subsection for metrics.

Comment thread docs/metrics.md Outdated
Comment on lines +349 to +354
## References

- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)
- [Kubernetes Monitoring Guide](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)
- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
- [Controller-Runtime Metrics](https://book.kubebuilder.io/reference/metrics.html)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove those references here.

Comment thread internal/metrics/server_metrics_test.go Outdated
})

It("should count servers by power state", func() {
ctx := context.Background()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Comment thread internal/metrics/server_metrics_test.go Outdated
Comment thread internal/metrics/server_metrics_test.go Outdated
Comment thread README.md Outdated
kubectl apply -f https://raw.githubusercontent.com/<org>/metal-operator/<tag or branch>/dist/install.yaml
```

## Monitoring

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the monitoring part out of the README and only rely on the information provided in the docs folder.

@afritzler afritzler added the enhancement New feature or request label Mar 5, 2026
@stefanhipfel stefanhipfel requested a review from afritzler March 6, 2026 10:03

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
docs/.vitepress/config.mts (1)

88-94: ⚠️ Potential issue | 🟡 Minor

Expose the overview page from the new Observability group.

This PR adds docs/observability/README.md, but the sidebar only links to /observability/metrics. That leaves the section landing page without a sidebar entry.

🛠️ Suggested sidebar tweak
       {
         text: 'Observability',
         collapsed: false,
         items: [
+          { text: 'Overview', link: '/observability/' },
           { text: 'Metrics', link: '/observability/metrics' },
         ]
       },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/.vitepress/config.mts` around lines 88 - 94, The Observability section
in the sidebar object (the object with text 'Observability' and collapsed:
false) currently only lists the Metrics item; add an entry for the
overview/landing page by inserting an item with text 'Overview' (or
'Observability') and link '/observability/' (or '/observability') into the items
array before the Metrics entry so the newly added docs/observability/README.md
is reachable from the sidebar.
docs/observability/metrics.md (1)

191-229: ⚠️ Potential issue | 🟡 Minor

Make the availability alerts zero-safe.

These expressions stop matching when metal_server_state{state="Available"} is absent, so the example misses the exact “zero available servers” case. internal/metrics/server_metrics_test.go Lines 80-91 already shows the collector can emit no series at all.

🛠️ Suggested doc fix
-    - alert: NoAvailableServers
-      expr: metal_server_state{state="Available"} == 0
+    - alert: NoAvailableServers
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1
@@
-    - alert: LowAvailableServerCapacity
-      expr: metal_server_state{state="Available"} < 2
+    - alert: LowAvailableServerCapacity
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2
In PromQL alerting, why does `metric{label="value"} == 0` not fire when the series is absent, and what is the recommended pattern to treat a missing series as zero?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/observability/metrics.md` around lines 191 - 229, The alerts that check
metal_server_state{state="Available"} (e.g., NoAvailableServers and
LowAvailableServerCapacity) are not zero-safe because the expression stops
matching when the series is absent; update those rules to treat a missing series
as zero by OR-ing with absent(...) (for example, change the expr to something
like: (metal_server_state{state="Available"} == 0) or
absent(metal_server_state{state="Available"})) so the alert fires when the
metric is missing as well.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/metrics/server_metrics_test.go`:
- Around line 259-275: The test currently only asserts no panic when calling
collector.Collect, but must also verify that no metric series is emitted with
empty state labels; after calling collector.Collect(ch) and closing ch, drain
the channel and inspect each prometheus.Metric produced by the collector (using
the existing ch variable and collector.Collect call) to convert metrics to their
dto.MetricFamily or use prometheus/testutil.ToFloat64 where appropriate, then
assert that none of the metrics contain label pairs with Name "state" or
"power_state" with an empty Value (i.e., ensure label.Value != ""). Keep the
existing created server (server with empty Status) and add expectations that no
metric has state=="" and no metric has power_state=="" to enforce bounded-label
behavior.

---

Duplicate comments:
In `@docs/.vitepress/config.mts`:
- Around line 88-94: The Observability section in the sidebar object (the object
with text 'Observability' and collapsed: false) currently only lists the Metrics
item; add an entry for the overview/landing page by inserting an item with text
'Overview' (or 'Observability') and link '/observability/' (or '/observability')
into the items array before the Metrics entry so the newly added
docs/observability/README.md is reachable from the sidebar.

In `@docs/observability/metrics.md`:
- Around line 191-229: The alerts that check
metal_server_state{state="Available"} (e.g., NoAvailableServers and
LowAvailableServerCapacity) are not zero-safe because the expression stops
matching when the series is absent; update those rules to treat a missing series
as zero by OR-ing with absent(...) (for example, change the expr to something
like: (metal_server_state{state="Available"} == 0) or
absent(metal_server_state{state="Available"})) so the alert fires when the
metric is missing as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38ce2386-8ddb-4088-9dc2-4527a612e560

📥 Commits

Reviewing files that changed from the base of the PR and between 06b2146 and d49ab7a.

📒 Files selected for processing (6)
  • config/prometheus/server_alerts.yaml
  • docs/.vitepress/config.mts
  • docs/observability/README.md
  • docs/observability/metrics.md
  • go.mod
  • internal/metrics/server_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/prometheus/server_alerts.yaml
  • go.mod

Comment on lines +259 to +275
It("should handle servers without state gracefully", func(ctx SpecContext) {
// Create server without state set
server := metalv1alpha1.Server{
ObjectMeta: metav1.ObjectMeta{Name: "server1"},
Spec: metalv1alpha1.ServerSpec{SystemUUID: "uuid1"},
Status: metalv1alpha1.ServerStatus{},
}

Expect(fakeClient.Create(ctx, &server)).To(Succeed())

// Collect metrics - should not panic
ch := make(chan prometheus.Metric, 100)
Expect(func() {
collector.Collect(ch)
close(ch)
}).NotTo(Panic())
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assert the empty-label case, not just the panic case.

This spec still passes if the collector starts exporting state="" or power_state="". Please inspect the collected metrics and assert that empty status fields do not create any series, otherwise the bounded-label behavior stays untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/metrics/server_metrics_test.go` around lines 259 - 275, The test
currently only asserts no panic when calling collector.Collect, but must also
verify that no metric series is emitted with empty state labels; after calling
collector.Collect(ch) and closing ch, drain the channel and inspect each
prometheus.Metric produced by the collector (using the existing ch variable and
collector.Collect call) to convert metrics to their dto.MetricFamily or use
prometheus/testutil.ToFloat64 where appropriate, then assert that none of the
metrics contain label pairs with Name "state" or "power_state" with an empty
Value (i.e., ensure label.Value != ""). Keep the existing created server (server
with empty Status) and add expectations that no metric has state=="" and no
metric has power_state=="" to enforce bounded-label behavior.

@xkonni xkonni left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with minor question

interval: 30s
rules:
- alert: NoAvailableServers
expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1 AND sum(metal_server_state{state="Reserved"} or on() vector(0)) == 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why < 1 on the first and == 0 on the second?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the same in the end

@xkonni xkonni dismissed afritzler’s stale review March 11, 2026 10:48

all required changes were implemented.

@xkonni xkonni merged commit ab64dae into main Mar 11, 2026
16 of 17 checks passed
@xkonni xkonni deleted the server_prom_metrics branch March 11, 2026 10:49
@github-project-automation github-project-automation Bot moved this to Done in Roadmap Mar 11, 2026
davidgrun pushed a commit that referenced this pull request Mar 18, 2026
* Add Prometheus metrics and alerts for server monitoring

* Enhance Prometheus alerts and metrics for server monitoring

- Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy.
- Add new alert for missing server metrics.
- Improve documentation for server condition status and example values.
- Implement a timeout for metrics collection to prevent hanging scrapes.

* Address PR #718 review comments
afritzler added a commit that referenced this pull request Mar 19, 2026
…tory function (#733)

* Implement CheckBMCPendingComponentUpgrade for vendor-specific handling and integrate checks in BIOS and BMC upgrade processes

* Remove duplicate linter exclusions for 'dupl' and 'lll' in bmc path

* Fix linter exclusions for 'dupl' and 'lll' in bmc path

* Refactor CheckBMCPendingComponentUpgrade to use ComponentType for type safety and improve clarity

* Enhance CheckBMCPendingComponentUpgrade to validate component types and clarify behavior for local providers. Add better comment for AI BOT

* add missing chart selector labels to registry service (#726)

* Add Prometheus metrics and alerts for server monitoring (#718)

* Add Prometheus metrics and alerts for server monitoring

* Enhance Prometheus alerts and metrics for server monitoring

- Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy.
- Add new alert for missing server metrics.
- Improve documentation for server condition status and example values.
- Implement a timeout for metrics collection to prevent hanging scrapes.

* Address PR #718 review comments

* Add Redfish event subscription management for servers (#494)

- Implement event and metrics subscription handling
- Add subscription links to ServerStatus
- Integrate with BMC interface for CreateEventSubscription/DeleteEventSubscription
- Add proper error handling and logging
- Include SPDX license headers
- Fix lint issues and improve code quality

This change enables automatic subscription to Redfish events and metrics
reports for managed servers, improving observability and monitoring capabilities.

* Bump docker/setup-buildx-action from 3 to 4 (#732)

Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/setup-qemu-action from 3 to 4 (#731)

Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-qemu-action/releases)
- [Commits](docker/setup-qemu-action@v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-qemu-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/metadata-action from 5 to 6 (#730)

Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@v5...v6)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump golang from 1.26.0 to 1.26.1 (#729)

Bumps golang from 1.26.0 to 1.26.1.

---
updated-dependencies:
- dependency-name: golang
  dependency-version: 1.26.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/build-push-action from 6 to 7 (#728)

Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6...v7)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump dompurify from 3.2.6 to 3.3.3 (#734)

Bumps [dompurify](https://github.com/cure53/DOMPurify) from 3.2.6 to 3.3.3.
- [Release notes](https://github.com/cure53/DOMPurify/releases)
- [Commits](cure53/DOMPurify@3.2.6...3.3.3)

---
updated-dependencies:
- dependency-name: dompurify
  dependency-version: 3.3.3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump docker/login-action from 3 to 4 (#727)

Bumps [docker/login-action](https://github.com/docker/login-action) from 3 to 4.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](docker/login-action@v3...v4)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Tidy go.mod (#736)

* Bump golang.org/x/crypto from 0.48.0 to 0.49.0 (#738)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.48.0 to 0.49.0.
- [Commits](golang/crypto@v0.48.0...v0.49.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.49.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Refactor CheckBMCPendingComponentUpgrade methods to simplify signature and clarify vendor-specific behavior for redfish kube and local.

Adjust the comment in redfish.go to explain what the function CheckBMCPendingComponentUpgrade is doing

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Alan Sergeant <alan.sergeant@sap.com>
Co-authored-by: Stefan Hipfel <stefan.hipfel@sap.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andreas Fritzler <andreas.fritzler@sap.com>
@stefanhipfel stefanhipfel mentioned this pull request Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/metal-automation documentation Improvements or additions to documentation enhancement New feature or request highlight size/XL

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants