Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,27 @@ Users can just run kubectl apply -f <URL for YAML BUNDLE> to install the project
kubectl apply -f https://raw.githubusercontent.com/<org>/metal-operator/<tag or branch>/dist/install.yaml
```

## Monitoring

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the monitoring part out of the README and only rely on the information provided in the docs folder.


The metal-operator exposes custom Prometheus metrics for monitoring server state, power operations, and reconciliation performance. Metrics are available at the `/metrics` endpoint and include:

- **Server State Distribution** - Count of servers by state (Available, Reserved, Error, etc.)
- **Server Power State** - Count of servers by power state (On, Off, PoweringOn, etc.)
- **Server Conditions** - Health status of server conditions (Ready, Discovered, etc.)
- **Reconciliation Metrics** - Success/error counts for reconciliation operations

For detailed metrics documentation, example queries, and alerting rules, see [docs/metrics.md](docs/metrics.md).

### Quick Example

```bash
# Port-forward to metrics endpoint
kubectl -n metal-operator-system port-forward deployment/metal-operator-controller-manager 8443:8443

# Query server metrics
curl -k https://localhost:8443/metrics | grep metal_server
```

## Contributing

**NOTE:** Run `make help` for more information on all potential `make` targets
Expand Down
7 changes: 7 additions & 0 deletions cmd/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ import (
"sigs.k8s.io/controller-runtime/pkg/certwatcher"
"sigs.k8s.io/controller-runtime/pkg/healthz"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
Expand All @@ -38,6 +39,7 @@ import (
metalv1alpha1 "github.com/ironcore-dev/metal-operator/api/v1alpha1"
"github.com/ironcore-dev/metal-operator/internal/api/macdb"
"github.com/ironcore-dev/metal-operator/internal/controller"
metalmetrics "github.com/ironcore-dev/metal-operator/internal/metrics"
"github.com/ironcore-dev/metal-operator/internal/registry"
// +kubebuilder:scaffold:imports
)
Expand Down Expand Up @@ -321,6 +323,11 @@ func main() { // nolint: gocyclo
os.Exit(1)
}

// Register custom Prometheus metrics collector for server states
serverCollector := metalmetrics.NewServerStateCollector(mgr.GetClient())
ctrlmetrics.Registry.MustRegister(serverCollector)
setupLog.Info("Registered custom server metrics collector")

if err = (&controller.EndpointReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
Expand Down
88 changes: 88 additions & 0 deletions config/prometheus/server_alerts.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: metal-operator-server-alerts
namespace: metal-operator-system
labels:
control-plane: controller-manager
spec:
groups:
- name: metal_operator_servers
interval: 30s
rules:
- alert: NoAvailableServers

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want this rule? As the goal will be to use all available servers, this alert will always fire. @coderabbitai what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat config/prometheus/server_alerts.yaml

Length of output: 2978


@afritzler, you raise a very valid point. The NoAvailableServers alert as currently defined will indeed fire whenever the fleet is fully utilized — which is the desired operational state. The alert's own description already betrays the ambiguity: it lists "Reserved" (healthy, in-use) alongside "Maintenance" and "Error" (genuinely problematic) as possible causes.

A few specific issues:

  1. False positives in healthy operation: When all servers are Reserved, this alert fires, yet nothing is wrong.
  2. Redundancy with LowAvailableServerCapacity: That alert already fires at < 2 available servers, which is a strict superset of the zero case. NoAvailableServers adds no new signal.
  3. No distinction between "all used" vs. "all broken": The genuinely alarming scenario — zero available servers because they are all in Error/Maintenance — is already partially covered by ServersInErrorState.

Suggested options:

  • Option A (Simplest): Remove NoAvailableServers entirely and rely on LowAvailableServerCapacity + ServersInErrorState to cover the relevant failure modes.

  • Option B: Reformulate it to only fire when zero servers are available and zero are reserved (meaning the fleet isn't productively in use at all):

    expr: |
      sum(metal_server_state{state="Available"} or on() vector(0)) < 1
      AND
      sum(metal_server_state{state="Reserved"} or on() vector(0)) == 0

    This would correctly distinguish between "full utilization" and "fleet is completely idle but nothing is available," which is the more actionable signal.

Option A is likely sufficient given the existing alert coverage, but Option B could be useful if operators specifically want to detect a completely idle and empty fleet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhipfel I think option B is actually what we want.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1
for: 5m
Comment thread
coderabbitai[bot] marked this conversation as resolved.
annotations:
summary: "No available servers in the fleet"
description: "All servers are either Reserved, in Maintenance, or in Error state"
labels:
severity: warning

- alert: ServersInErrorState
expr: metal_server_state{state="Error"} > 0
for: 2m
annotations:
summary: "Servers are in Error state"
description: "{{ $value }} server(s) are in Error state and require attention"
labels:
severity: critical

- alert: ServersPoweringOnTooLong
expr: metal_server_power_state{power_state="PoweringOn"} > 0
for: 10m
annotations:
summary: "Servers stuck in PoweringOn state"
description: "{{ $value }} server(s) have been in PoweringOn state for over 10 minutes"
labels:
severity: warning

- alert: ServersPoweringOffTooLong
expr: metal_server_power_state{power_state="PoweringOff"} > 0
for: 10m
annotations:
summary: "Servers stuck in PoweringOff state"
description: "{{ $value }} server(s) have been in PoweringOff state for over 10 minutes"
labels:
severity: warning

- alert: HighReconciliationErrorRate
expr: rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]) > 0.1
for: 5m
annotations:
summary: "High server reconciliation error rate"
description: "Server reconciliation errors are occurring at {{ $value | humanize }} per second"
labels:
severity: warning

- alert: LowAvailableServerCapacity
expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2
for: 5m
annotations:
summary: "Low available server capacity"
description: "Only {{ $value }} server(s) are available"
labels:
severity: warning

- alert: ServerMetricsMissing
expr: absent(metal_server_state{state="Available"})
for: 5m
Comment thread
coderabbitai[bot] marked this conversation as resolved.
annotations:
summary: "Server metrics are not being collected"
description: "The metal-operator metrics endpoint is not reporting server state metrics"
labels:
severity: critical

- alert: ServerReconciliationFailureSpike
expr: |
(
sum(rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]))
/
sum(rate(metal_server_reconciliation_total[5m]))
) > 0.5
Comment thread
coderabbitai[bot] marked this conversation as resolved.
for: 10m
annotations:
summary: "High rate of server reconciliation failures"
description: "More than 50% of server reconciliations are failing"
labels:
severity: critical
Loading
Loading