Make BMC sensor metric collection configurable #910

@stefanhipfel

Which component does this relate to?

BMC management - specifically the MetricReport event subscription and metric collection system

What is the reason for this feature request or change?

Operators need control over which BMC sensor metrics are collected because:

  • Different vendor naming conventions: BMCs from different vendors (Dell, HPE, Lenovo, Supermicro) use different sensor names for similar metrics. For example, CPU temperature might be named "TempCPU1" (Dell), "CPU1Temp" (HPE), or "Processor1Temperature" (Lenovo).

  • Metric volume and cost: BMCs can send hundreds of sensor readings (temperatures, voltages, fan speeds, power metrics, etc.), but operators may only need specific metrics for their monitoring dashboards. Collecting unnecessary metrics increases Prometheus storage costs and metric cardinality.

  • No filtering mechanism: Currently, all sensor data received via Redfish MetricReport events is collected and exposed as Prometheus metrics; there is no way to include or exclude specific sensors.

  • Lack of normalization: There's no way to map vendor-specific sensor names to common normalized names, making it harder to create vendor-agnostic Grafana dashboards and alerts.

Describe the feature

Add configuration options to the BMC controller to control sensor metric collection:

  1. Sensor filtering: Define allowlist/denylist patterns for sensor MetricIDs (e.g., include only CPU temperature and fan speed sensors, and exclude voltage sensors).

  2. Sensor name mapping: Map vendor-specific sensor names to normalized metric names (e.g., map Dell's "TempCPU1" and HPE's "CPU1Temp" to a common "cpu_1_temperature")

  3. Flexible configuration: Support both per-BMC configuration and global defaults

This would allow operators to:

  • Reduce unnecessary metric collection and storage costs
  • Create consistent metric names across different hardware vendors
  • Focus monitoring on relevant sensors for their use case
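
As a rough illustration of how filtering and name mapping could fit together, here is a minimal Go sketch. It is not taken from the operator: the `SensorFilter` type, the `Allowed` and `Normalize` helpers, and the choice of `path.Match` glob semantics are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"path"
)

// SensorFilter is a hypothetical type for this sketch; it is not part of the operator.
type SensorFilter struct {
	Include []string // glob patterns; an empty list means "include everything"
	Exclude []string // glob patterns; a match here always wins over Include
}

// matchAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// Allowed applies the denylist first, then the allowlist.
func (f SensorFilter) Allowed(metricID string) bool {
	if matchAny(f.Exclude, metricID) {
		return false
	}
	return len(f.Include) == 0 || matchAny(f.Include, metricID)
}

// Normalize returns the mapped name, or the original name if no mapping exists.
func Normalize(mappings map[string]string, metricID string) string {
	if n, ok := mappings[metricID]; ok {
		return n
	}
	return metricID
}

func main() {
	f := SensorFilter{
		Include: []string{"CPU*Temp*", "Fan*Speed"},
		Exclude: []string{"*Voltage*"},
	}
	m := map[string]string{"CPU1Temp": "cpu_1_temperature"}
	for _, id := range []string{"CPU1Temp", "Fan1Speed", "CPU1Voltage", "TempCPU1"} {
		if f.Allowed(id) {
			fmt.Println("collect", Normalize(m, id))
		} else {
			fmt.Println("skip", id)
		}
	}
}
```

Under these glob semantics, a pattern such as "CPU*Temp*" matches "CPU1Temp" but not "TempCPU1", so include patterns would need one variant per vendor naming scheme unless filtering runs on the normalized name instead. Which order to apply (filter then map, or map then filter) is an open design question.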

Proposed API or behavior changes

Three potential approaches (open for discussion):

Option A: BMC CRD Spec Field

Add a metricsConfig field to the BMC CRD:

apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  protocol:
    name: Redfish
    port: 443
  
  # New metricsConfig field
  metricsConfig:
    sensorFilters:
      include:
        - "CPU*Temp*"      # Match CPU temperature sensors
        - "Fan*Speed"      # Match fan speed sensors
        - "Power*"         # Match power metrics
      exclude:
        - "*Voltage*"      # Exclude all voltage sensors
        - "*Debug*"        # Exclude debug metrics
    
    sensorMappings:
      # Normalize vendor-specific names
      "TempCPU1": "cpu_1_temperature"
      "CPU1Temp": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "CPU2Temp": "cpu_2_temperature"

Pros: Simple, per-BMC configuration, easy to understand
Cons: Repetitive if many BMCs need the same config
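
For discussion, the field above might translate into Go API types roughly as follows. This is a sketch only: the type names, JSON tags, and the `parseMetricsConfig` helper are assumptions, and the real CRD types would live alongside the existing BMC spec with the usual code generation markers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SensorFilters is a hypothetical type mirroring metricsConfig.sensorFilters.
type SensorFilters struct {
	Include []string `json:"include,omitempty"`
	Exclude []string `json:"exclude,omitempty"`
}

// MetricsConfig is a hypothetical type mirroring the proposed metricsConfig field.
type MetricsConfig struct {
	SensorFilters  *SensorFilters    `json:"sensorFilters,omitempty"`
	SensorMappings map[string]string `json:"sensorMappings,omitempty"`
}

// parseMetricsConfig decodes the JSON form of a metricsConfig object.
func parseMetricsConfig(data []byte) (MetricsConfig, error) {
	var cfg MetricsConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{
	  "sensorFilters": {"include": ["CPU*Temp*"], "exclude": ["*Voltage*"]},
	  "sensorMappings": {"TempCPU1": "cpu_1_temperature"}
	}`)
	cfg, err := parseMetricsConfig(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(cfg.SensorFilters.Include, cfg.SensorMappings["TempCPU1"])
}
```

Making `SensorFilters` a pointer lets an unset field be distinguished from an explicitly empty one, which could matter if per-BMC settings later override global defaults.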

Option B: ConfigMap Reference

Reference a ConfigMap for shared configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: bmc-metrics-config
  namespace: metal-operator-system
data:
  filters.yaml: |
    include:
      - "CPU*Temp*"
      - "Fan*Speed"
      - "Power*"
    exclude:
      - "*Voltage*"
      - "*Debug*"
  
  mappings.yaml: |
    # Vendor-specific mappings
    dell:
      "TempCPU1": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "FanSpeed1": "fan_1_speed"
    
    hpe:
      "CPU1Temp": "cpu_1_temperature"
      "CPU2Temp": "cpu_2_temperature"
      "Fan1Speed": "fan_1_speed"
    
    lenovo:
      "Processor1Temperature": "cpu_1_temperature"
      "Processor2Temperature": "cpu_2_temperature"
      "SystemFan1Speed": "fan_1_speed"
---
apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  
  # Reference shared config
  metricsConfigRef:
    name: bmc-metrics-config

Pros: Shared config across BMCs, vendor-aware mappings, easier to maintain
Cons: More complex, requires ConfigMap management
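
One way the vendor-keyed mappings could be consulted once mappings.yaml has been parsed is sketched below. The `VendorMappings` type and `Resolve` method are hypothetical, and the sketch assumes the operator can derive a short vendor key such as "dell" or "hpe" for each BMC; how that key is obtained is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// VendorMappings is a hypothetical in-memory form of mappings.yaml:
// vendor key -> (vendor sensor name -> normalized name).
type VendorMappings map[string]map[string]string

// Resolve returns the normalized name for a vendor's sensor, or the
// original sensor name when the vendor or the sensor is not mapped.
func (v VendorMappings) Resolve(vendor, sensor string) string {
	if m, ok := v[strings.ToLower(vendor)]; ok {
		if n, ok := m[sensor]; ok {
			return n
		}
	}
	return sensor
}

func main() {
	mappings := VendorMappings{
		"dell":   {"TempCPU1": "cpu_1_temperature", "FanSpeed1": "fan_1_speed"},
		"hpe":    {"CPU1Temp": "cpu_1_temperature", "Fan1Speed": "fan_1_speed"},
		"lenovo": {"Processor1Temperature": "cpu_1_temperature"},
	}
	fmt.Println(mappings.Resolve("Dell", "TempCPU1"))
	fmt.Println(mappings.Resolve("hpe", "CPU1Temp"))
	fmt.Println(mappings.Resolve("supermicro", "CPU1 Temp"))
}
```

Falling back to the original name keeps unmapped sensors visible rather than silently dropping them; dropping is left to the filters.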

Option C: Operator-Level Configuration

Global defaults via operator command-line flags or operator ConfigMap, with per-BMC overrides:

# Operator flags
--metrics-sensor-include="CPU*Temp*,Fan*Speed"
--metrics-sensor-exclude="*Voltage*"
--metrics-config-map="default/bmc-metrics-config"

Pros: Simple defaults for all BMCs, opt-in overrides
Cons: Less flexible for heterogeneous environments
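
A minimal sketch of how the flags above could be parsed and combined with a per-BMC override, using only the standard `flag` package. The `Filters` and `OperatorDefaults` types and the "override replaces defaults entirely" rule are assumptions for illustration; merging the two lists would be an equally valid policy.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// Filters is a hypothetical include/exclude pair for this sketch.
type Filters struct {
	Include []string
	Exclude []string
}

// OperatorDefaults is a hypothetical holder for the operator-level settings.
type OperatorDefaults struct {
	Filters   Filters
	ConfigMap string // "namespace/name" of an optional shared ConfigMap
}

// splitList splits a comma-separated flag value and drops empty entries.
func splitList(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// parseDefaults parses the three proposed flags from args.
func parseDefaults(args []string) (OperatorDefaults, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	include := fs.String("metrics-sensor-include", "", "comma-separated include globs")
	exclude := fs.String("metrics-sensor-exclude", "", "comma-separated exclude globs")
	cm := fs.String("metrics-config-map", "", "namespace/name of a shared ConfigMap")
	if err := fs.Parse(args); err != nil {
		return OperatorDefaults{}, err
	}
	return OperatorDefaults{
		Filters:   Filters{Include: splitList(*include), Exclude: splitList(*exclude)},
		ConfigMap: *cm,
	}, nil
}

// effective returns the per-BMC override when one is set, else the defaults.
func effective(defaults Filters, override *Filters) Filters {
	if override != nil {
		return *override
	}
	return defaults
}

func main() {
	d, err := parseDefaults([]string{
		"--metrics-sensor-include=CPU*Temp*,Fan*Speed",
		"--metrics-sensor-exclude=*Voltage*",
	})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println(effective(d.Filters, nil))
	fmt.Println(effective(d.Filters, &Filters{Include: []string{"Power*"}}))
}
```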

Alternatives considered

1. Prometheus Recording Rules

Filter/rename metrics using Prometheus recording rules after collection.

Rejected because: Recording rules add new series instead of removing the originals, so this doesn't reduce metric cardinality at the source. All metrics are still collected, transmitted, and stored.

2. Prometheus Scrape Config Filtering

Use metric_relabel_configs in Prometheus to drop unwanted metrics.

Rejected because: Same issue - metrics are still collected and exposed by the operator, just dropped at scrape time.

Additional context

Current Implementation

  • Metric collection: internal/serverevents/metrics.go - RedfishEventCollector processes MetricReport events
  • Event server: internal/serverevents/server.go - HTTP server receives /serverevents/metricsreport/:hostname requests
  • Vendor implementations: bmc/redfish_dell.go, bmc/redfish_hpe.go, bmc/redfish_lenovo.go, bmc/redfish_supermicro.go
  • Prometheus metrics: redfish_monitor_reading (sensor values), redfish_event_alert_total (alert counts)

Current Behavior

  • All MetricReport events are processed without filtering
  • Sensor names are derived from MetricID + MetricProperty fields
  • The sensor type is guessed from substrings of the MetricID (e.g., "temp" in MetricID → "Temperature" type)
  • Metrics exposed with labels: hostname, metric_id, type, unit, origin_context
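
To make the type-guessing behavior concrete, the following is an illustrative sketch and not the code in internal/serverevents/metrics.go. Only the "temp" → "Temperature" rule comes from the description above; the other rules, the lowercasing, the "Unknown" fallback, and the `guessType` name are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// guessType is a hypothetical stand-in for substring-based type guessing.
func guessType(metricID string) string {
	id := strings.ToLower(metricID)
	switch {
	case strings.Contains(id, "temp"): // rule mentioned in the issue
		return "Temperature"
	case strings.Contains(id, "fan"): // assumed rule
		return "Fan"
	case strings.Contains(id, "volt"): // assumed rule
		return "Voltage"
	default:
		return "Unknown" // assumed fallback
	}
}

func main() {
	for _, id := range []string{"TempCPU1", "Processor1Temperature", "SystemFan1Speed", "PSU1Input"} {
		fmt.Printf("%s -> %s\n", id, guessType(id))
	}
}
```

Guessing of this kind depends on each vendor's naming, which is one reason an explicit mapping to normalized names could make the type label more reliable.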

Labels: enhancement (New feature or request)
