Make BMC sensor metric collection configurable #910

## Which component does this relate to?

BMC management, specifically the MetricReport event subscription and metric collection system.

## What is the reason for this feature request or change?

Operators need control over which BMC sensor metrics are collected because:

- **Different vendor naming conventions**: BMCs from different vendors (Dell, HPE, Lenovo, Supermicro) use different sensor names for similar metrics. For example, CPU temperature might be named "TempCPU1" (Dell), "CPU1Temp" (HPE), or "Processor1Temperature" (Lenovo).

- **Metric volume and cost**: BMCs can send hundreds of sensor readings (temperatures, voltages, fan speeds, power metrics, etc.), but operators may only need specific metrics for their monitoring dashboards. Collecting unnecessary metrics increases Prometheus storage costs and metric cardinality.

- **No filtering mechanism**: Currently all sensor data received via Redfish MetricReport events is collected and exposed as Prometheus metrics, with no way to include/exclude specific sensors.

- **Lack of normalization**: There's no way to map vendor-specific sensor names to common normalized names, making it harder to create vendor-agnostic Grafana dashboards and alerts.

## Describe the feature

Add configuration options to the BMC controller to control sensor metric collection:

1. **Sensor filtering**: Define allowlist/denylist patterns for sensor MetricIDs (e.g., include only CPU temperature and fan speed, exclude voltage sensors)

2. **Sensor name mapping**: Map vendor-specific sensor names to normalized metric names (e.g., map Dell's "TempCPU1" and HPE's "CPU1Temp" to a common "cpu_1_temperature")

3. **Flexible configuration**: Support both per-BMC configuration and global defaults

This would allow operators to:
- Reduce unnecessary metric collection and storage costs
- Create consistent metric names across different hardware vendors
- Focus monitoring on relevant sensors for their use case
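
To make the intended filtering semantics concrete, here is a minimal sketch of how an allowlist/denylist check could work. It is not existing operator code: `SensorFilter`, `matchAny`, and `Allowed` are hypothetical names. It assumes that patterns are shell-style globs matched against the whole MetricID, that an empty include list means "collect everything", and that an exclude match always wins.

```go
package main

import (
	"fmt"
	"path"
)

// SensorFilter is a hypothetical type, not part of the existing operator.
type SensorFilter struct {
	Include []string // glob patterns; an empty list means "include everything" (assumption)
	Exclude []string // glob patterns; a match here always wins (assumption)
}

// matchAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching.
func matchAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// Allowed decides whether a sensor with the given MetricID should be collected.
func (f SensorFilter) Allowed(metricID string) bool {
	if matchAny(f.Exclude, metricID) {
		return false
	}
	return len(f.Include) == 0 || matchAny(f.Include, metricID)
}

func main() {
	f := SensorFilter{
		Include: []string{"CPU*Temp*", "Fan*Speed"},
		Exclude: []string{"*Voltage*"},
	}
	for _, id := range []string{"CPU1Temp", "Fan1Speed", "CPU1Voltage", "TempCPU1"} {
		fmt.Printf("%-12s collected=%v\n", id, f.Allowed(id))
	}
}
```

With whole-name matching, `CPU*Temp*` matches "CPU1Temp" but not "TempCPU1", so an include list would need one pattern per vendor naming scheme. Note also that `path.Match` does not let `*` cross a `/`, which matters if a MetricID can contain slashes.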

## Proposed API or behavior changes

Three potential approaches (open for discussion):

### Option A: BMC CRD Spec Field

Add a `metricsConfig` field to the BMC CRD:

```yaml
apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  protocol:
    name: Redfish
    port: 443
  
  # New metricsConfig field
  metricsConfig:
    sensorFilters:
      include:
        - "CPU*Temp*"      # Match CPU temperature sensors named like CPU1Temp
        - "TempCPU*"       # Match CPU temperature sensors named like TempCPU1
        - "Fan*Speed"      # Match fan speed sensors
        - "Power*"         # Match power metrics
      exclude:
        - "*Voltage*"      # Exclude all voltage sensors
        - "*Debug*"        # Exclude debug metrics
    
    sensorMappings:
      # Normalize vendor-specific names
      "TempCPU1": "cpu_1_temperature"
      "CPU1Temp": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "CPU2Temp": "cpu_2_temperature"
```

**Pros**: Simple, per-BMC configuration, easy to understand
**Cons**: Repetitive if many BMCs need the same config
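
For discussion, this is roughly what the corresponding Go API types could look like. All type, field, and function names here are assumptions derived from the YAML above, not existing code in the repository. The sketch decodes a JSON equivalent of the `metricsConfig` fragment and applies the name mapping, falling back to the raw sensor name when no mapping is configured.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SensorFilters and MetricsConfig are hypothetical types mirroring the
// proposed metricsConfig YAML; they do not exist in the operator today.
type SensorFilters struct {
	Include []string `json:"include,omitempty"`
	Exclude []string `json:"exclude,omitempty"`
}

type MetricsConfig struct {
	SensorFilters  *SensorFilters    `json:"sensorFilters,omitempty"`
	SensorMappings map[string]string `json:"sensorMappings,omitempty"`
}

// NormalizedName returns the mapped name for a raw sensor name,
// or the raw name unchanged when no mapping is configured.
func (c *MetricsConfig) NormalizedName(raw string) string {
	if c != nil {
		if mapped, ok := c.SensorMappings[raw]; ok {
			return mapped
		}
	}
	return raw
}

// parseMetricsConfig decodes the JSON form of a metricsConfig block.
func parseMetricsConfig(data []byte) (*MetricsConfig, error) {
	var cfg MetricsConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := parseMetricsConfig([]byte(`{
		"sensorFilters": {"include": ["CPU*Temp*"], "exclude": ["*Voltage*"]},
		"sensorMappings": {"TempCPU1": "cpu_1_temperature", "CPU1Temp": "cpu_1_temperature"}
	}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.NormalizedName("TempCPU1"))  // mapped name
	fmt.Println(cfg.NormalizedName("Fan1Speed")) // no mapping, raw name is kept
}
```

Making `sensorFilters` a pointer and both fields optional would let a BMC without a `metricsConfig` keep the current collect-everything behavior.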

### Option B: ConfigMap Reference

Reference a ConfigMap for shared configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bmc-metrics-config
  namespace: metal-operator-system
data:
  filters.yaml: |
    include:
      - "CPU*Temp*"
      - "TempCPU*"
      - "Processor*Temperature"
      - "*Fan*Speed*"
      - "Power*"
    exclude:
      - "*Voltage*"
      - "*Debug*"
  
  mappings.yaml: |
    # Vendor-specific mappings
    dell:
      "TempCPU1": "cpu_1_temperature"
      "TempCPU2": "cpu_2_temperature"
      "FanSpeed1": "fan_1_speed"
    
    hpe:
      "CPU1Temp": "cpu_1_temperature"
      "CPU2Temp": "cpu_2_temperature"
      "Fan1Speed": "fan_1_speed"
    
    lenovo:
      "Processor1Temperature": "cpu_1_temperature"
      "Processor2Temperature": "cpu_2_temperature"
      "SystemFan1Speed": "fan_1_speed"
---
apiVersion: metal.ironcore.dev/v1alpha1
kind: BMC
metadata:
  name: server-001-bmc
spec:
  endpointRef:
    name: server-001-endpoint
  
  # Reference shared config
  metricsConfigRef:
    name: bmc-metrics-config
```

**Pros**: Shared config across BMCs, vendor-aware mappings, easier to maintain
**Cons**: More complex, requires ConfigMap management
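
The vendor-keyed layout of `mappings.yaml` implies a two-level lookup. A minimal sketch of that lookup follows, assuming the YAML has already been parsed into nested maps and that the BMC's vendor is available as a string; how the vendor is determined is left open here. `VendorMappings` and `Normalize` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// VendorMappings mirrors mappings.yaml: vendor -> raw sensor name -> normalized name.
// Hypothetical type; vendor keys are assumed to be lowercase, as in the ConfigMap.
type VendorMappings map[string]map[string]string

// Normalize returns the normalized name for a vendor-specific sensor name.
// The boolean reports whether a mapping was found; without one the raw name is returned.
func (m VendorMappings) Normalize(vendor, raw string) (string, bool) {
	names, ok := m[strings.ToLower(vendor)]
	if !ok {
		return raw, false
	}
	mapped, ok := names[raw]
	if !ok {
		return raw, false
	}
	return mapped, true
}

func main() {
	m := VendorMappings{
		"dell":   {"TempCPU1": "cpu_1_temperature", "FanSpeed1": "fan_1_speed"},
		"hpe":    {"CPU1Temp": "cpu_1_temperature", "Fan1Speed": "fan_1_speed"},
		"lenovo": {"Processor1Temperature": "cpu_1_temperature", "SystemFan1Speed": "fan_1_speed"},
	}
	for _, s := range [][2]string{{"Dell", "TempCPU1"}, {"hpe", "Fan1Speed"}, {"supermicro", "CPU1 Temp"}} {
		name, found := m.Normalize(s[0], s[1])
		fmt.Printf("%s/%s -> %s (mapped=%v)\n", s[0], s[1], name, found)
	}
}
```

Returning the raw name for unknown vendors or sensors keeps unmapped metrics visible rather than silently dropping them; whether that is the desired default is open for discussion.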

### Option C: Operator-Level Configuration

Global defaults via operator command-line flags or operator ConfigMap, with per-BMC overrides:

```bash
# Operator flags
--metrics-sensor-include="CPU*Temp*,Fan*Speed"
--metrics-sensor-exclude="*Voltage*"
--metrics-config-map="metal-operator-system/bmc-metrics-config"
```

**Pros**: Simple defaults for all BMCs, opt-in overrides
**Cons**: Less flexible for heterogeneous environments
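
A sketch of how the two pattern flags could be parsed and combined with a per-BMC override. The flag names come from the example above; `FilterConfig`, `parseDefaults`, and `effective` are hypothetical, and `--metrics-config-map` is left out for brevity. The sketch assumes a per-BMC override replaces the global defaults wholesale rather than being merged with them, which is one of several possible policies.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// FilterConfig is a hypothetical holder for include/exclude glob patterns.
type FilterConfig struct {
	Include []string
	Exclude []string
}

// splitPatterns turns "a, b,,c" into ["a", "b", "c"], dropping empty entries.
func splitPatterns(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// parseDefaults reads the global default filter from operator flags.
func parseDefaults(args []string) (FilterConfig, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	include := fs.String("metrics-sensor-include", "", "comma-separated include globs")
	exclude := fs.String("metrics-sensor-exclude", "", "comma-separated exclude globs")
	if err := fs.Parse(args); err != nil {
		return FilterConfig{}, err
	}
	return FilterConfig{Include: splitPatterns(*include), Exclude: splitPatterns(*exclude)}, nil
}

// effective returns the per-BMC override when one is set, otherwise the defaults.
func effective(defaults FilterConfig, override *FilterConfig) FilterConfig {
	if override != nil {
		return *override
	}
	return defaults
}

func main() {
	defaults, err := parseDefaults([]string{
		"--metrics-sensor-include=CPU*Temp*,Fan*Speed",
		"--metrics-sensor-exclude=*Voltage*",
	})
	if err != nil {
		panic(err)
	}
	perBMC := &FilterConfig{Include: []string{"Power*"}}
	fmt.Printf("no override:   %+v\n", effective(defaults, nil))
	fmt.Printf("with override: %+v\n", effective(defaults, perBMC))
}
```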

## Alternatives considered

### 1. Prometheus Recording Rules
Filter/rename metrics using Prometheus recording rules after collection.

**Rejected because**: This doesn't reduce metric cardinality at the source. Recording rules only derive new series from data that has already been collected, transmitted, and stored, so the renamed series add to the total instead of replacing the originals.

### 2. Prometheus Scrape Config Filtering
Use `metric_relabel_configs` in Prometheus to drop unwanted metrics.

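**Rejected because**: This only solves part of the problem. `metric_relabel_configs` drops series before Prometheus stores them, which reduces storage, but the operator still collects and exposes every sensor and each scrape still transfers all of them.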

## Additional context

### Current Implementation

- **Metric collection**: `internal/serverevents/metrics.go` - `RedfishEventCollector` processes MetricReport events
- **Event server**: `internal/serverevents/server.go` - HTTP server receives `/serverevents/metricsreport/:hostname` requests
- **Vendor implementations**: `bmc/redfish_dell.go`, `bmc/redfish_hpe.go`, `bmc/redfish_lenovo.go`, `bmc/redfish_supermicro.go`
- **Prometheus metrics**: `redfish_monitor_reading` (sensor values), `redfish_event_alert_total` (alert counts)

### Current Behavior

- All MetricReport events are processed without filtering
- Sensor names are derived from `MetricID + MetricProperty` fields
- Simple type guessing (e.g., "temp" in MetricID → "Temperature" type)
- Metrics exposed with labels: `hostname`, `metric_id`, `type`, `unit`, `origin_context`
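
To illustrate what substring-based type guessing looks like and where it can go wrong, here is a small stand-in. It is not the code from `internal/serverevents/metrics.go`: only the "temp" → "Temperature" rule comes from the description above, while the other keywords and the "Unknown" fallback are assumptions added for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// guessType is an illustrative stand-in for the heuristic described above.
// Only the "temp" rule is taken from the issue text; the "fan" and "volt"
// rules and the "Unknown" fallback are hypothetical.
func guessType(metricID string) string {
	id := strings.ToLower(metricID)
	switch {
	case strings.Contains(id, "temp"):
		return "Temperature"
	case strings.Contains(id, "fan"):
		return "Fan"
	case strings.Contains(id, "volt"):
		return "Voltage"
	default:
		return "Unknown"
	}
}

func main() {
	for _, id := range []string{"CPU1Temp", "TempCPU1", "Fan1Speed", "AttemptCount"} {
		fmt.Printf("%-13s type=%s\n", id, guessType(id))
	}
}
```

Substring matching handles both "CPU1Temp" and "TempCPU1", but it would also classify an ID such as "AttemptCount" as a temperature because it contains "temp". Explicit mappings would avoid that kind of misclassification.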


