Which component does this relate to?
BMC management - specifically the MetricReport event subscription and metric collection system
What is the reason for this feature request or change?
Operators need control over which BMC sensor metrics are collected because:
- Different vendor naming conventions: BMCs from different vendors (Dell, HPE, Lenovo, Supermicro) use different sensor names for similar metrics. For example, CPU temperature might be named "TempCPU1" (Dell), "CPU1Temp" (HPE), or "Processor1Temperature" (Lenovo).
- Metric volume and cost: BMCs can send hundreds of sensor readings (temperatures, voltages, fan speeds, power metrics, etc.), but operators may only need specific metrics for their monitoring dashboards. Collecting unnecessary metrics increases Prometheus storage costs and metric cardinality.
- No filtering mechanism: Currently all sensor data received via Redfish MetricReport events is collected and exposed as Prometheus metrics, with no way to include/exclude specific sensors.
- Lack of normalization: There's no way to map vendor-specific sensor names to common normalized names, making it harder to create vendor-agnostic Grafana dashboards and alerts.
Describe the feature
Add configuration options to the BMC controller to control sensor metric collection:
- Sensor filtering: Define allowlist/denylist patterns for sensor MetricIDs (e.g., include only CPU temperature and fan speed, exclude voltage sensors)
- Sensor name mapping: Map vendor-specific sensor names to normalized metric names (e.g., map Dell's "TempCPU1" and HPE's "CPU1Temp" to a common "cpu_1_temperature")
- Flexible configuration: Support both per-BMC configuration and global defaults
This would allow operators to:
- Reduce unnecessary metric collection and storage costs
- Create consistent metric names across different hardware vendors
- Focus monitoring on relevant sensors for their use case
Proposed API or behavior changes
Three potential approaches (open for discussion):
Option A: BMC CRD Spec Field
Add a metricsConfig field to the BMC CRD:
apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  protocol:
    name: Redfish
    port: 443
  # New metricsConfig field
  metricsConfig:
    sensorFilters:
      include:
        - "CPU*Temp*"   # Match CPU temperature sensors
        - "Fan*Speed"   # Match fan speed sensors
        - "Power*"      # Match power metrics
      exclude:
        - "*Voltage*"   # Exclude all voltage sensors
        - "*Debug*"     # Exclude debug metrics
    sensorMappings:
      # Normalize vendor-specific names
      "TempCPU1": "cpu_1_temperature"
      "CPU1Temp": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "CPU2Temp": "cpu_2_temperature"
Pros: Simple, per-BMC configuration, easy to understand
Cons: Repetitive if many BMCs need the same config
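To make the include/exclude semantics concrete, here is a minimal Go sketch of how such a filter could be evaluated. It is not existing operator code: the SensorFilters type, the Allowed method, the "exclude wins, empty include allows everything" rule, and the use of shell-style globs via the standard library's path.Match are all assumptions for discussion.

```go
package main

import (
	"fmt"
	"path"
)

// SensorFilters is a hypothetical Go shape for spec.metricsConfig.sensorFilters.
type SensorFilters struct {
	Include []string
	Exclude []string
}

// matchesAny reports whether id matches at least one glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchesAny(patterns []string, id string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, id); err == nil && ok {
			return true
		}
	}
	return false
}

// Allowed applies the exclude list first, then the include list.
// Assumption: an empty include list means "allow everything not excluded".
func (f SensorFilters) Allowed(metricID string) bool {
	if matchesAny(f.Exclude, metricID) {
		return false
	}
	if len(f.Include) == 0 {
		return true
	}
	return matchesAny(f.Include, metricID)
}

func main() {
	f := SensorFilters{
		Include: []string{"CPU*Temp*", "Fan*Speed", "Power*"},
		Exclude: []string{"*Voltage*", "*Debug*"},
	}
	for _, id := range []string{"CPU1Temp", "TempCPU1", "Fan1Speed", "FanSpeed1"} {
		fmt.Printf("%-10s allowed=%v\n", id, f.Allowed(id))
	}
}
```

Under these glob semantics the pattern "CPU*Temp*" matches "CPU1Temp" but not "TempCPU1", and "Fan*Speed" does not match "FanSpeed1". Patterns would therefore need to cover each vendor's naming convention, or filtering could run on the normalized names after mapping. Which of the two is preferable is open for discussion.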
Option B: ConfigMap Reference
Reference a ConfigMap for shared configuration:
apiVersion: v1
kind: ConfigMap
metadata:
  name: bmc-metrics-config
  namespace: metal-operator-system
data:
  filters.yaml: |
    include:
      - "CPU*Temp*"
      - "Fan*Speed"
      - "Power*"
    exclude:
      - "*Voltage*"
      - "*Debug*"
  mappings.yaml: |
    # Vendor-specific mappings
    dell:
      "TempCPU1": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "FanSpeed1": "fan_1_speed"
    hpe:
      "CPU1Temp": "cpu_1_temperature"
      "CPU2Temp": "cpu_2_temperature"
      "Fan1Speed": "fan_1_speed"
    lenovo:
      "Processor1Temperature": "cpu_1_temperature"
      "Processor2Temperature": "cpu_2_temperature"
      "SystemFan1Speed": "fan_1_speed"
---
apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  # Reference shared config
  metricsConfigRef:
    name: bmc-metrics-config
Pros: Shared config across BMCs, vendor-aware mappings, easier to maintain
Cons: More complex, requires ConfigMap management
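The vendor-aware lookup implied by mappings.yaml could look roughly like the following Go sketch. The VendorMappings type and Normalize method are hypothetical names, the vendor key is assumed to be known to the caller (for example from the BMC's manufacturer field), and unmapped sensors are assumed to keep their original name. The sample data is copied from the ConfigMap in Option B; YAML parsing is omitted.

```go
package main

import "fmt"

// VendorMappings mirrors the proposed mappings.yaml layout:
// vendor -> vendor-specific sensor name -> normalized name.
type VendorMappings map[string]map[string]string

// Normalize returns the normalized name for a vendor's sensor.
// Assumption: sensors without a mapping pass through unchanged.
func (m VendorMappings) Normalize(vendor, sensor string) string {
	if normalized, ok := m[vendor][sensor]; ok {
		return normalized
	}
	return sensor
}

func main() {
	mappings := VendorMappings{
		"dell":   {"TempCPU1": "cpu_1_temperature", "FanSpeed1": "fan_1_speed"},
		"hpe":    {"CPU1Temp": "cpu_1_temperature", "Fan1Speed": "fan_1_speed"},
		"lenovo": {"Processor1Temperature": "cpu_1_temperature"},
	}
	fmt.Println(mappings.Normalize("dell", "TempCPU1"))
	fmt.Println(mappings.Normalize("hpe", "CPU1Temp"))
	fmt.Println(mappings.Normalize("lenovo", "SystemFan2Speed"))
}
```

Keying by vendor avoids collisions if two vendors ever use the same sensor name for different things, at the cost of requiring the vendor to be resolved before the lookup.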
Option C: Operator-Level Configuration
Global defaults via operator command-line flags or operator ConfigMap, with per-BMC overrides:
# Operator flags
--metrics-sensor-include="CPU*Temp*,Fan*Speed"
--metrics-sensor-exclude="*Voltage*"
--metrics-config-map="default/bmc-metrics-config"
Pros: Simple defaults for all BMCs, opt-in overrides
Cons: Less flexible for heterogeneous environments
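One way the global defaults and per-BMC overrides could interact is sketched below in Go. The helper names splitPatterns and Effective are hypothetical, and the precedence rule shown (a per-BMC config replaces the global defaults wholesale rather than being merged with them) is only one possible choice, not a settled design.

```go
package main

import (
	"fmt"
	"strings"
)

// SensorFilters holds include/exclude glob patterns (hypothetical type).
type SensorFilters struct {
	Include []string
	Exclude []string
}

// splitPatterns parses a comma-separated flag value such as
// "CPU*Temp*,Fan*Speed" into individual patterns, skipping empty items.
func splitPatterns(value string) []string {
	var patterns []string
	for _, p := range strings.Split(value, ",") {
		if p = strings.TrimSpace(p); p != "" {
			patterns = append(patterns, p)
		}
	}
	return patterns
}

// Effective returns the per-BMC override when one is set and the
// operator-level defaults otherwise.
// Assumption: an override replaces the defaults entirely.
func Effective(global SensorFilters, override *SensorFilters) SensorFilters {
	if override != nil {
		return *override
	}
	return global
}

func main() {
	// Values as they would arrive from the proposed operator flags.
	global := SensorFilters{
		Include: splitPatterns("CPU*Temp*,Fan*Speed"),
		Exclude: splitPatterns("*Voltage*"),
	}
	perBMC := &SensorFilters{Include: []string{"Power*"}}

	fmt.Printf("no override:   %+v\n", Effective(global, nil))
	fmt.Printf("with override: %+v\n", Effective(global, perBMC))
}
```

A merge strategy (for example, appending per-BMC excludes to the global ones) would be an alternative; wholesale replacement is simpler to reason about but forces an override to restate any defaults it wants to keep.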
Alternatives considered
1. Prometheus Recording Rules
Filter/rename metrics using Prometheus recording rules after collection.
Rejected because: This doesn't reduce metric cardinality at the source. All metrics are still collected, transmitted, stored, and queried before being filtered.
2. Prometheus Scrape Config Filtering
Use metric_relabel_configs in Prometheus to drop unwanted metrics.
Rejected because: Same issue - metrics are still collected and exposed by the operator, just dropped at scrape time.
Additional context
Current Implementation
- Metric collection: internal/serverevents/metrics.go - RedfishEventCollector processes MetricReport events
- Event server: internal/serverevents/server.go - HTTP server receives /serverevents/metricsreport/:hostname requests
- Vendor implementations: bmc/redfish_dell.go, bmc/redfish_hpe.go, bmc/redfish_lenovo.go, bmc/redfish_supermicro.go
- Prometheus metrics: redfish_monitor_reading (sensor values), redfish_event_alert_total (alert counts)
Current Behavior
- All MetricReport events are processed without filtering
- Sensor names are derived from MetricID + MetricProperty fields
- Simple type guessing (e.g., "temp" in MetricID → "Temperature" type)
- Metrics exposed with labels: hostname, metric_id, type, unit, origin_context
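For readers unfamiliar with the type guessing mentioned above, the following Go snippet illustrates the general idea of substring-based guessing. It is an illustration only and not the code in internal/serverevents/metrics.go: the function name guessType, the case-insensitive comparison, and the "Unknown" fallback are assumptions; only the "temp" → "Temperature" rule comes from this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// guessType derives a coarse sensor type from a MetricID by substring match.
// Only the "temp" rule is taken from the issue text; the fallback value and
// the case-insensitive match are assumptions made for this sketch.
func guessType(metricID string) string {
	if strings.Contains(strings.ToLower(metricID), "temp") {
		return "Temperature"
	}
	return "Unknown"
}

func main() {
	for _, id := range []string{"TempCPU1", "CPU1Temp", "Processor1Temperature", "Fan1Speed"} {
		fmt.Printf("%-22s type=%s\n", id, guessType(id))
	}
}
```

Guessing of this kind depends on vendor naming, which is part of why an explicit mapping to normalized names is being proposed.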