Add Prometheus metrics and alerts for server monitoring by stefanhipfel · Pull Request #718 · ironcore-dev/metal-operator

stefanhipfel · 2026-03-02T18:06:39Z

Proposed Changes

Add Prometheus metrics and alerts for server monitoring

Summary by CodeRabbit

Release Notes

New Features
- Prometheus metrics now track server state, power state, conditions, and reconciliation status
- Added Prometheus alerting rules for server monitoring
Documentation
- New observability documentation with metrics guide and quick-start access instructions
Tests
- Added metrics validation test suite
Chores
- Updated dependencies to support metrics collection

coderabbitai · 2026-03-03T09:09:16Z

Walkthrough

This PR introduces Prometheus observability for the metal-operator by adding custom metrics collection for server state and reconciliation events. It includes a new ServerStateCollector that aggregates server status metrics, integrates reconciliation tracking into the controller loop, adds alert rules, promotes Prometheus dependencies to direct requirements, and provides comprehensive documentation.

Changes

Cohort / File(s)	Summary
Metrics Core Implementation `internal/metrics/server_metrics.go`, `internal/controller/server_controller.go`, `cmd/main.go`	Introduces ServerReconciliationTotal counter for tracking reconciliation outcomes and ServerStateCollector that implements prometheus.Collector to emit aggregated server state, power state, and condition metrics. Integrates metric recording into the reconciliation loop and registers the collector at startup.
Metrics Testing `internal/metrics/server_metrics_test.go`	Comprehensive test suite validating ServerReconciliationTotal counter increments and ServerStateCollector metrics emission across various server states, power states, conditions, and edge cases using Ginkgo/Gomega and fake Kubernetes client.
Dependencies `go.mod`	Promotes `github.com/prometheus/client_golang` and `github.com/prometheus/client_model` from indirect to direct dependencies; adds `github.com/kylelemons/godebug` as indirect dependency.
Alerting Configuration `config/prometheus/server_alerts.yaml`	Adds 88 lines of Prometheus alert rules for monitoring server state and reconciliation metrics.
Documentation & Configuration `docs/observability/README.md`, `docs/observability/metrics.md`, `docs/.vitepress/config.mts`	Adds observability documentation detailing metrics access, available metrics with example PromQL queries, alerting setup, and Grafana dashboards; updates documentation sidebar with Observability section.

Sequence Diagram

sequenceDiagram
    participant Prometheus
    participant Collector as ServerStateCollector
    participant Registry as Metrics Registry
    participant API as Kubernetes API

    Prometheus->>Registry: Scrape metrics endpoint
    Registry->>Collector: Collect() invoked
    Collector->>API: List all Server objects (5s timeout)
    API-->>Collector: Server list
    Collector->>Collector: Aggregate by State, PowerState, Conditions
    Collector-->>Registry: Emit gauge metrics<br/>(metal_server_state,<br/>metal_server_power_state,<br/>metal_server_condition_status)
    Registry-->>Prometheus: Return metrics

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested reviewers

damyan
xkonni

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The pull request description is significantly incomplete. It only repeats the title without providing details about changes, motivation, or fixes. The template requires multiple bullet points under 'Proposed Changes' and a 'Fixes' reference.	Expand the description with specific bullet points detailing the changes (metrics added, alerts configured, documentation), motivation/rationale, and any related issue references following the provided template.
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title clearly and concisely summarizes the main change: adding Prometheus metrics and alerts for server monitoring, which aligns with all major file additions.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch server_prom_metrics

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

internal/metrics/server_metrics_test.go (1)
93-131: Collector tests should assert metric labels and values, not just metric presence.

metricCount > 0 allows incorrect aggregations to pass unnoticed. Please assert exact expected values for key label sets (e.g., state=Available, state=Reserved, power_state=On/Off, and condition status counts).

Also applies to: 133-180, 182-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/metrics/server_metrics_test.go` around lines 93 - 131, The test
currently only checks that collector.Collect(ch) emitted some metrics; update
the It block so it decodes the emitted prometheus.Metric objects and asserts
exact label/value pairs for the server-state and power_state metrics: after
calling collector.Collect(ch) iterate the channel, convert each
prometheus.Metric to a dto.Metric (or use prometheus/testutil to gather
metrics), and assert the counts for labels state="Available" == 2 and
state="Reserved" == 1, and similarly assert expected power_state and condition
label/value combinations; reference the existing variables collector, ch,
servers and fakeClient when locating where to add these assertions so the test
verifies specific label sets/values rather than only metricCount > 0.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/prometheus/server_alerts.yaml`:
- Around line 67-73: The ServerReconciliationFailureSpike alert computes a
per-error-type ratio because both numerator and denominator keep the result
label; fix it by aggregating away the result label so you get an overall failure
ratio: use sum(rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]))
divided by sum(rate(metal_server_reconciliation_total[5m])) (optionally with a
by(<labels>) grouping if you want per-instance/job scope) — update the alert
expression for alert: ServerReconciliationFailureSpike and ensure
metal_server_reconciliation_total's result label is not preserved in the
division.
- Around line 13-15: Replace the fragile equality check that yields empty
vectors by treating a missing metric as zero or explicitly alerting on absence:
update the NoAvailableServers and LowAvailableServerCapacity alerts (alert names
NoAvailableServers and LowAvailableServerCapacity) to use either
sum(metal_server_state{state="Available"} or on() vector(0)) < 1 (or similar
threshold) so absent series are treated as 0, or use
absent_over_time(metal_server_state{state="Available"}[5m]) if you want to fire
when the metric is missing; change the expr fields accordingly for both alerts.

In `@docs/metrics.md`:
- Around line 51-56: The fenced code blocks showing Prometheus-style metrics
(e.g., metal_server_state, metal_server_power_state,
metal_server_condition_status, metal_server_reconciliation_total) are missing
language tags and trigger MD040; update each block by adding a language
identifier (use "text") to the opening backticks for the blocks containing those
metric examples so they read ```text ... ``` to satisfy the linter at the blocks
around the examples for metal_server_state, metal_server_power_state,
metal_server_condition_status, and metal_server_reconciliation_total.

In `@internal/metrics/server_metrics.go`:
- Around line 81-86: In Collect(), replace the unbounded context.Background()
used for the Kubernetes API call with a cancellable context that has a short
timeout (e.g., 3–10s) to avoid blocking Prometheus scrapes; create the context
via context.WithTimeout(context.Background(), timeout), defer the cancel(), and
pass that ctx into c.Client.List(...) (update imports to include time if needed)
so List is always bounded and will return promptly on API stalls.
- Around line 139-157: The loop emitting condition metrics is using `for status
:= range statusMap` which iterates keys only (dropping counts) and compares
`strings.ToLower(status)` against
`string(metav1.ConditionTrue)`/`ConditionFalse` (capitalized), causing the
switch to never match and value to remain -1; fix by iterating both key and
value (e.g., `for status, count := range statusMap`) and use a case-insensitive
comparison or normalize both sides consistently (either compare
`strings.ToLower(status)` to
`strings.ToLower(string(metav1.ConditionTrue))`/`strings.ToLower(string(metav1.ConditionFalse))`
or compare `status` directly), then emit `count` (as the metric value) via
`prometheus.MustNewConstMetric` with `c.conditionDesc` so the actual server
counts and correct numeric condition values (1/0) are reported instead of -1.

---

Nitpick comments:
In `@internal/metrics/server_metrics_test.go`:
- Around line 93-131: The test currently only checks that collector.Collect(ch)
emitted some metrics; update the It block so it decodes the emitted
prometheus.Metric objects and asserts exact label/value pairs for the
server-state and power_state metrics: after calling collector.Collect(ch)
iterate the channel, convert each prometheus.Metric to a dto.Metric (or use
prometheus/testutil to gather metrics), and assert the counts for labels
state="Available" == 2 and state="Reserved" == 1, and similarly assert expected
power_state and condition label/value combinations; reference the existing
variables collector, ch, servers and fakeClient when locating where to add these
assertions so the test verifies specific label sets/values rather than only
metricCount > 0.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c05e8e and c9a6ab8.

📒 Files selected for processing (8)

README.md
cmd/main.go
config/prometheus/server_alerts.yaml
docs/metrics.md
go.mod
internal/controller/server_controller.go
internal/metrics/server_metrics.go
internal/metrics/server_metrics_test.go

- Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy. - Add new alert for missing server metrics. - Improve documentation for server condition status and example values. - Implement a timeout for metrics collection to prevent hanging scrapes.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config/prometheus/server_alerts.yaml`:
- Around line 67-69: The ServerMetricsMissing alert currently uses the
expression absent(metal_server_state{state="Available"}) which triggers when no
"Available" time series exist (including valid zero-available states); update
the alert's expr to absent(metal_server_state) so it only fires when the entire
metal_server_state metric is missing; modify the ServerMetricsMissing alert
definition (alert: ServerMetricsMissing) to replace the expr accordingly and
keep the existing for: 5m and other labels/annotations intact.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9a6ab8 and 06b2146.

📒 Files selected for processing (3)

config/prometheus/server_alerts.yaml
docs/metrics.md
internal/metrics/server_metrics.go

coderabbitai · 2026-03-03T12:27:33Z

+## Alerting Rules
+
+Example PrometheusRule resource:
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: metal-operator-server-alerts
+  namespace: metal-operator-system
+spec:
+  groups:
+  - name: metal_operator_servers
+    interval: 30s
+    rules:
+    - alert: NoAvailableServers
+      expr: metal_server_state{state="Available"} == 0
+      for: 5m
+      annotations:
+        summary: "No available servers in the fleet"
+        description: "All servers are either Reserved, in Maintenance, or in Error state"
+      labels:
+        severity: warning
+
+    - alert: ServersInErrorState
+      expr: metal_server_state{state="Error"} > 0
+      for: 2m
+      annotations:
+        summary: "Servers are in Error state"
+        description: "{{ $value }} server(s) are in Error state and require attention"
+      labels:
+        severity: critical
+
+    - alert: ServersPoweringOnTooLong
+      expr: metal_server_power_state{power_state="PoweringOn"} > 0
+      for: 10m
+      annotations:
+        summary: "Servers stuck in PoweringOn state"
+        description: "{{ $value }} server(s) have been in PoweringOn state for over 10 minutes"
+      labels:
+        severity: warning
+
+    - alert: HighReconciliationErrorRate
+      expr: rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]) > 0.1
+      for: 5m
+      annotations:
+        summary: "High server reconciliation error rate"
+        description: "Server reconciliation errors are occurring at {{ $value | humanize }} per second"
+      labels:
+        severity: warning
+
+    - alert: LowAvailableServerCapacity
+      expr: metal_server_state{state="Available"} < 2
+      for: 5m
+      annotations:
+        summary: "Low available server capacity"
+        description: "Only {{ $value }} server(s) are available"
+      labels:
+        severity: warning
+```


⚠️ Potential issue | 🟡 Minor

Alerting example is stale versus the shipped rules.

The example still shows the old NoAvailableServers/LowAvailableServerCapacity expressions and omits newer alerts, so copy/paste users will get outdated behavior.

🛠️ Suggested doc sync

- - alert: NoAvailableServers - expr: metal_server_state{state="Available"} == 0 + - alert: NoAvailableServers + expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1 @@ - - alert: LowAvailableServerCapacity - expr: metal_server_state{state="Available"} < 2 + - alert: LowAvailableServerCapacity + expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2

afritzler · 2026-03-03T09:08:33Z

+	})
+
+	It("should count servers by state", func() {
+		ctx := context.Background()


Please use SpecContext in the func().

afritzler · 2026-03-05T05:29:26Z

+  - name: metal_operator_servers
+    interval: 30s
+    rules:
+    - alert: NoAvailableServers


Do we really want this rule? As the goal will be to use all available servers, this alert will always fire. @coderabbitai what do you think?

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash cat config/prometheus/server_alerts.yaml

Length of output: 2978

@afritzler, you raise a very valid point. The NoAvailableServers alert as currently defined will indeed fire whenever the fleet is fully utilized — which is the desired operational state. The alert's own description already betrays the ambiguity: it lists "Reserved" (healthy, in-use) alongside "Maintenance" and "Error" (genuinely problematic) as possible causes.

A few specific issues:

False positives in healthy operation: When all servers are Reserved, this alert fires, yet nothing is wrong.

Redundancy with LowAvailableServerCapacity: That alert already fires at < 2 available servers, which is a strict superset of the zero case. NoAvailableServers adds no new signal.

No distinction between "all used" vs. "all broken": The genuinely alarming scenario — zero available servers because they are all in Error/Maintenance — is already partially covered by ServersInErrorState.

Suggested options:

Option A (Simplest): Remove NoAvailableServers entirely and rely on LowAvailableServerCapacity + ServersInErrorState to cover the relevant failure modes.

Option B: Reformulate it to only fire when zero servers are available and zero are reserved (meaning the fleet isn't productively in use at all):

expr: | sum(metal_server_state{state="Available"} or on() vector(0)) < 1 AND sum(metal_server_state{state="Reserved"} or on() vector(0)) == 0

This would correctly distinguish between "full utilization" and "fleet is completely idle but nothing is available," which is the more actionable signal.

Option A is likely sufficient given the existing alert coverage, but Option B could be useful if operators specifically want to detect a completely idle and empty fleet.

@stefanhipfel I think option B is actually what we want.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

afritzler · 2026-03-05T05:30:55Z

+  - name: metal_operator_servers
+    interval: 30s
+    rules:
+    - alert: NoAvailableServers


Same as above.

afritzler · 2026-03-05T05:31:46Z

@@ -0,0 +1,354 @@
+# Prometheus Metrics


Can we open top level in the docs a section for observability and have a subsection for metrics.

afritzler · 2026-03-05T05:32:35Z

+## References
+
+- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)
+- [Kubernetes Monitoring Guide](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)
+- [Prometheus Operator Documentation](https://prometheus-operator.dev/)
+- [Controller-Runtime Metrics](https://book.kubebuilder.io/reference/metrics.html)


I would remove those references here.

afritzler · 2026-03-05T05:34:50Z

+	})
+
+	It("should count servers by power state", func() {
+		ctx := context.Background()


afritzler · 2026-03-05T05:36:35Z

 kubectl apply -f https://raw.githubusercontent.com/<org>/metal-operator/<tag or branch>/dist/install.yaml
 ```

+## Monitoring


I would keep the monitoring part out of the README and only rely on the information provided in the docs folder.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

docs/.vitepress/config.mts (1)

88-94: ⚠️ Potential issue | 🟡 Minor

Expose the overview page from the new Observability group.

This PR adds docs/observability/README.md, but the sidebar only links to /observability/metrics. That leaves the section landing page without a sidebar entry.

🛠️ Suggested sidebar tweak

       {
         text: 'Observability',
         collapsed: false,
         items: [
+          { text: 'Overview', link: '/observability/' },
           { text: 'Metrics', link: '/observability/metrics' },
         ]
       },

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/.vitepress/config.mts` around lines 88 - 94, The Observability section
in the sidebar object (the object with text 'Observability' and collapsed:
false) currently only lists the Metrics item; add an entry for the
overview/landing page by inserting an item with text 'Overview' (or
'Observability') and link '/observability/' (or '/observability') into the items
array before the Metrics entry so the newly added docs/observability/README.md
is reachable from the sidebar.

docs/observability/metrics.md (1)

191-229: ⚠️ Potential issue | 🟡 Minor

Make the availability alerts zero-safe.

These expressions stop matching when metal_server_state{state="Available"} is absent, so the example misses the exact “zero available servers” case. internal/metrics/server_metrics_test.go Lines 80-91 already shows the collector can emit no series at all.

🛠️ Suggested doc fix

-    - alert: NoAvailableServers
-      expr: metal_server_state{state="Available"} == 0
+    - alert: NoAvailableServers
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1
@@
-    - alert: LowAvailableServerCapacity
-      expr: metal_server_state{state="Available"} < 2
+    - alert: LowAvailableServerCapacity
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2

In PromQL alerting, why does `metric{label="value"} == 0` not fire when the series is absent, and what is the recommended pattern to treat a missing series as zero?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/observability/metrics.md` around lines 191 - 229, The alerts that check
metal_server_state{state="Available"} (e.g., NoAvailableServers and
LowAvailableServerCapacity) are not zero-safe because the expression stops
matching when the series is absent; update those rules to treat a missing series
as zero by OR-ing with absent(...) (for example, change the expr to something
like: (metal_server_state{state="Available"} == 0) or
absent(metal_server_state{state="Available"})) so the alert fires when the
metric is missing as well.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/metrics/server_metrics_test.go`:
- Around line 259-275: The test currently only asserts no panic when calling
collector.Collect, but must also verify that no metric series is emitted with
empty state labels; after calling collector.Collect(ch) and closing ch, drain
the channel and inspect each prometheus.Metric produced by the collector (using
the existing ch variable and collector.Collect call) to convert metrics to their
dto.MetricFamily or use prometheus/testutil.ToFloat64 where appropriate, then
assert that none of the metrics contain label pairs with Name "state" or
"power_state" with an empty Value (i.e., ensure label.Value != ""). Keep the
existing created server (server with empty Status) and add expectations that no
metric has state=="" and no metric has power_state=="" to enforce bounded-label
behavior.

---

Duplicate comments:
In `@docs/.vitepress/config.mts`:
- Around line 88-94: The Observability section in the sidebar object (the object
with text 'Observability' and collapsed: false) currently only lists the Metrics
item; add an entry for the overview/landing page by inserting an item with text
'Overview' (or 'Observability') and link '/observability/' (or '/observability')
into the items array before the Metrics entry so the newly added
docs/observability/README.md is reachable from the sidebar.

In `@docs/observability/metrics.md`:
- Around line 191-229: The alerts that check
metal_server_state{state="Available"} (e.g., NoAvailableServers and
LowAvailableServerCapacity) are not zero-safe because the expression stops
matching when the series is absent; update those rules to treat a missing series
as zero by OR-ing with absent(...) (for example, change the expr to something
like: (metal_server_state{state="Available"} == 0) or
absent(metal_server_state{state="Available"})) so the alert fires when the
metric is missing as well.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38ce2386-8ddb-4088-9dc2-4527a612e560

📥 Commits

Reviewing files that changed from the base of the PR and between 06b2146 and d49ab7a.

📒 Files selected for processing (6)

config/prometheus/server_alerts.yaml
docs/.vitepress/config.mts
docs/observability/README.md
docs/observability/metrics.md
go.mod
internal/metrics/server_metrics_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

config/prometheus/server_alerts.yaml
go.mod

coderabbitai · 2026-03-06T10:09:05Z

+	It("should handle servers without state gracefully", func(ctx SpecContext) {
+		// Create server without state set
+		server := metalv1alpha1.Server{
+			ObjectMeta: metav1.ObjectMeta{Name: "server1"},
+			Spec:       metalv1alpha1.ServerSpec{SystemUUID: "uuid1"},
+			Status:     metalv1alpha1.ServerStatus{},
+		}
+
+		Expect(fakeClient.Create(ctx, &server)).To(Succeed())
+
+		// Collect metrics - should not panic
+		ch := make(chan prometheus.Metric, 100)
+		Expect(func() {
+			collector.Collect(ch)
+			close(ch)
+		}).NotTo(Panic())
+	})


⚠️ Potential issue | 🟡 Minor

Assert the empty-label case, not just the panic case.

This spec still passes if the collector starts exporting state="" or power_state="". Please inspect the collected metrics and assert that empty status fields do not create any series, otherwise the bounded-label behavior stays untested.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal/metrics/server_metrics_test.go` around lines 259 - 275, The test currently only asserts no panic when calling collector.Collect, but must also verify that no metric series is emitted with empty state labels; after calling collector.Collect(ch) and closing ch, drain the channel and inspect each prometheus.Metric produced by the collector (using the existing ch variable and collector.Collect call) to convert metrics to their dto.MetricFamily or use prometheus/testutil.ToFloat64 where appropriate, then assert that none of the metrics contain label pairs with Name "state" or "power_state" with an empty Value (i.e., ensure label.Value != ""). Keep the existing created server (server with empty Status) and add expectations that no metric has state=="" and no metric has power_state=="" to enforce bounded-label behavior.

xkonni

approved with minor question

xkonni · 2026-03-06T14:36:50Z

+    interval: 30s
+    rules:
+    - alert: NoAvailableServers
+      expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1 AND sum(metal_server_state{state="Reserved"} or on() vector(0)) == 0


why < 1 on the first and == 0 on the second?

it is the same in the end

all required changes were implemented.

* Add Prometheus metrics and alerts for server monitoring * Enhance Prometheus alerts and metrics for server monitoring - Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy. - Add new alert for missing server metrics. - Improve documentation for server condition status and example values. - Implement a timeout for metrics collection to prevent hanging scrapes. * Address PR #718 review comments

…tory function (#733) * Implement CheckBMCPendingComponentUpgrade for vendor-specific handling and integrate checks in BIOS and BMC upgrade processes * Remove duplicate linter exclusions for 'dupl' and 'lll' in bmc path * Fix linter exclusions for 'dupl' and 'lll' in bmc path * Refactor CheckBMCPendingComponentUpgrade to use ComponentType for type safety and improve clarity * Enhance CheckBMCPendingComponentUpgrade to validate component types and clarify behavior for local providers. Add better comment for AI BOT * add missing chart selector labels to registry service (#726) * Add Prometheus metrics and alerts for server monitoring (#718) * Add Prometheus metrics and alerts for server monitoring * Enhance Prometheus alerts and metrics for server monitoring - Update alert expressions for NoAvailableServers and LowAvailableServerCapacity to use sum for better accuracy. - Add new alert for missing server metrics. - Improve documentation for server condition status and example values. - Implement a timeout for metrics collection to prevent hanging scrapes. * Address PR #718 review comments * Add Redfish event subscription management for servers (#494) - Implement event and metrics subscription handling - Add subscription links to ServerStatus - Integrate with BMC interface for CreateEventSubscription/DeleteEventSubscription - Add proper error handling and logging - Include SPDX license headers - Fix lint issues and improve code quality This change enables automatic subscription to Redfish events and metrics reports for managed servers, improving observability and monitoring capabilities. * Bump docker/setup-buildx-action from 3 to 4 (#732) Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4. - [Release notes](https://github.com/docker/setup-buildx-action/releases) - [Commits](docker/setup-buildx-action@v3...v4) --- updated-dependencies: - dependency-name: docker/setup-buildx-action dependency-version: '4' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker/setup-qemu-action from 3 to 4 (#731) Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 3 to 4. - [Release notes](https://github.com/docker/setup-qemu-action/releases) - [Commits](docker/setup-qemu-action@v3...v4) --- updated-dependencies: - dependency-name: docker/setup-qemu-action dependency-version: '4' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker/metadata-action from 5 to 6 (#730) Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6. - [Release notes](https://github.com/docker/metadata-action/releases) - [Commits](docker/metadata-action@v5...v6) --- updated-dependencies: - dependency-name: docker/metadata-action dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump golang from 1.26.0 to 1.26.1 (#729) Bumps golang from 1.26.0 to 1.26.1. --- updated-dependencies: - dependency-name: golang dependency-version: 1.26.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker/build-push-action from 6 to 7 (#728) Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@v6...v7) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump dompurify from 3.2.6 to 3.3.3 (#734) Bumps [dompurify](https://github.com/cure53/DOMPurify) from 3.2.6 to 3.3.3. - [Release notes](https://github.com/cure53/DOMPurify/releases) - [Commits](cure53/DOMPurify@3.2.6...3.3.3) --- updated-dependencies: - dependency-name: dompurify dependency-version: 3.3.3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump docker/login-action from 3 to 4 (#727) Bumps [docker/login-action](https://github.com/docker/login-action) from 3 to 4. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](docker/login-action@v3...v4) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: '4' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Tidy go.mod (#736) * Bump golang.org/x/crypto from 0.48.0 to 0.49.0 (#738) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.48.0 to 0.49.0. - [Commits](golang/crypto@v0.48.0...v0.49.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-version: 0.49.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Refactor CheckBMCPendingComponentUpgrade methods to simplify signature and clarify vendor-specific behavior for redfish kube and local. Adjust the comment in redfish.go to explain what the function CheckBMCPendingComponentUpgrade is doing --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Alan Sergeant <alan.sergeant@sap.com> Co-authored-by: Stefan Hipfel <stefan.hipfel@sap.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andreas Fritzler <andreas.fritzler@sap.com>

Add Prometheus metrics and alerts for server monitoring

c9a6ab8

github-project-automation Bot added this to Roadmap Mar 2, 2026

github-actions Bot added size/XL documentation Improvements or additions to documentation labels Mar 2, 2026

hardikdr added the area/metal-automation label Mar 3, 2026

stefanhipfel marked this pull request as ready for review March 3, 2026 08:59

stefanhipfel requested a review from a team as a code owner March 3, 2026 08:59

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

Comment thread config/prometheus/server_alerts.yaml

Comment thread config/prometheus/server_alerts.yaml

Comment thread docs/metrics.md Outdated

Comment thread internal/metrics/server_metrics.go Outdated

Comment thread internal/metrics/server_metrics.go Outdated

coderabbitai Bot reviewed Mar 3, 2026

View reviewed changes

afritzler previously requested changes Mar 5, 2026

View reviewed changes

afritzler added the enhancement New feature or request label Mar 5, 2026

Address PR #718 review comments

d49ab7a

stefanhipfel requested a review from afritzler March 6, 2026 10:03

coderabbitai Bot reviewed Mar 6, 2026

View reviewed changes

xkonni approved these changes Mar 6, 2026

View reviewed changes

xkonni merged commit ab64dae into main Mar 11, 2026
16 of 17 checks passed

xkonni deleted the server_prom_metrics branch March 11, 2026 10:49

github-project-automation Bot moved this to Done in Roadmap Mar 11, 2026

coderabbitai Bot mentioned this pull request Mar 18, 2026

Implement vendor-specific BMC/BIOS upgrade checks with firmware inventory function #733

Merged

afritzler added the highlight label Apr 13, 2026

coderabbitai Bot mentioned this pull request Apr 17, 2026

Support BMC monitoring without relying on event subscriptions #813

Open

8 tasks

stefanhipfel mentioned this pull request Apr 21, 2026

Prometheus Metrics #474

Closed

This was referenced Apr 22, 2026

Improve network check metrics #821

Closed

Add Server Label to metrics #839

Merged

coderabbitai Bot mentioned this pull request May 8, 2026

Add example Perses dashboards deployable via Kustomize and Helm #872

Open

5 tasks

coderabbitai Bot mentioned this pull request May 14, 2026

Add configurable label enrichment to Redfish telemetry metrics #893

Open

This was referenced May 26, 2026

Make BMC sensor metric collection configurable #910

Open

Expose GPU/accelerator information in server inventory #912

Open

Conversation

stefanhipfel commented Mar 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed Changes

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

xkonni left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

stefanhipfel commented Mar 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 3, 2026 •

edited

Loading