Add Prometheus metrics and alerts for server monitoring #718
Changes from 2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: PrometheusRule | ||
| metadata: | ||
| name: metal-operator-server-alerts | ||
| namespace: metal-operator-system | ||
| labels: | ||
| control-plane: controller-manager | ||
| spec: | ||
| groups: | ||
| - name: metal_operator_servers | ||
| interval: 30s | ||
| rules: | ||
| - alert: NoAvailableServers | ||
|
Member
Do we really want this rule? As the goal will be to use all available servers, this alert will always fire. @coderabbitai what do you think?
Contributor
A few specific issues:
Suggested options:
Option A is likely sufficient given the existing alert coverage, but Option B could be useful if operators specifically want to detect a completely idle and empty fleet.
Member
@stefanhipfel I think option B is actually what we want.
|
||
| expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 1 | ||
| for: 5m | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| annotations: | ||
| summary: "No available servers in the fleet" | ||
| description: "All servers are either Reserved, in Maintenance, or in Error state" | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: ServersInErrorState | ||
| expr: metal_server_state{state="Error"} > 0 | ||
| for: 2m | ||
| annotations: | ||
| summary: "Servers are in Error state" | ||
| description: "{{ $value }} server(s) are in Error state and require attention" | ||
| labels: | ||
| severity: critical | ||
|
|
||
| - alert: ServersPoweringOnTooLong | ||
| expr: metal_server_power_state{power_state="PoweringOn"} > 0 | ||
| for: 10m | ||
| annotations: | ||
| summary: "Servers stuck in PoweringOn state" | ||
| description: "{{ $value }} server(s) have been in PoweringOn state for over 10 minutes" | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: ServersPoweringOffTooLong | ||
| expr: metal_server_power_state{power_state="PoweringOff"} > 0 | ||
| for: 10m | ||
| annotations: | ||
| summary: "Servers stuck in PoweringOff state" | ||
| description: "{{ $value }} server(s) have been in PoweringOff state for over 10 minutes" | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: HighReconciliationErrorRate | ||
| expr: rate(metal_server_reconciliation_total{result=~"error_.*"}[5m]) > 0.1 | ||
| for: 5m | ||
| annotations: | ||
| summary: "High server reconciliation error rate" | ||
| description: "Server reconciliation errors are occurring at {{ $value | humanize }} per second" | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: LowAvailableServerCapacity | ||
| expr: sum(metal_server_state{state="Available"} or on() vector(0)) < 2 | ||
| for: 5m | ||
| annotations: | ||
| summary: "Low available server capacity" | ||
| description: "Only {{ $value }} server(s) are available" | ||
| labels: | ||
| severity: warning | ||
|
|
||
| - alert: ServerMetricsMissing | ||
| expr: absent(metal_server_state{state="Available"}) | ||
| for: 5m | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| annotations: | ||
| summary: "Server metrics are not being collected" | ||
| description: "The metal-operator metrics endpoint is not reporting server state metrics" | ||
| labels: | ||
| severity: critical | ||
|
|
||
| - alert: ServerReconciliationFailureSpike | ||
| expr: | | ||
| ( | ||
| sum(rate(metal_server_reconciliation_total{result=~"error_.*"}[5m])) | ||
| / | ||
| sum(rate(metal_server_reconciliation_total[5m])) | ||
| ) > 0.5 | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| for: 10m | ||
| annotations: | ||
| summary: "High rate of server reconciliation failures" | ||
| description: "More than 50% of server reconciliations are failing" | ||
| labels: | ||
| severity: critical | ||
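
The rules in this diff could be exercised offline with promtool's rule unit tests. The following is a minimal sketch and not part of this PR. It assumes the `spec.groups` section has been copied into a plain Prometheus rule file under the hypothetical name `server-alerts.rules.yaml` (promtool reads plain rule files, not the `PrometheusRule` custom resource), and that `metal_server_state{state="Error"}` is a gauge holding the number of servers in that state.

```yaml
# server-alerts.test.yaml (hypothetical file name)
rule_files:
  - server-alerts.rules.yaml   # assumed: plain rule file containing the groups from spec.groups

evaluation_interval: 30s

tests:
  - interval: 30s
    input_series:
      # One server stays in Error state for the whole test window (0m to 5m).
      - series: 'metal_server_state{state="Error"}'
        values: '1+0x10'
    alert_rule_test:
      # After 1m the alert is still pending (for: 2m), so nothing should be firing.
      - eval_time: 1m
        alertname: ServersInErrorState
        exp_alerts: []
      # After 3m the "for" duration has elapsed and the alert should be firing.
      - eval_time: 3m
        alertname: ServersInErrorState
        exp_alerts:
          - exp_labels:
              severity: critical
              state: Error
            exp_annotations:
              summary: "Servers are in Error state"
              description: "1 server(s) are in Error state and require attention"
```

Under these assumptions the test would be run with `promtool test rules server-alerts.test.yaml`. The same pattern could cover the always-firing concern raised for `NoAvailableServers` by feeding an input series in which no server is `Available` and checking whether the alert fires.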
I would keep the monitoring part out of the README and only rely on the information provided in the `docs` folder.