
Support BMC monitoring without relying on event subscriptions #813

@stefanhipfel

Problem

The metal-operator currently relies on Redfish event subscriptions for BMC monitoring. However, many BMC vendors/models do not support event subscriptions reliably:

  • HPE iLO has limited DeliveryRetryPolicy support
  • Older BMC firmware versions lack EventService entirely
  • Some vendors have incomplete or buggy event implementations
  • Event subscriptions can be lost during BMC resets/firmware updates

This limits our ability to monitor BMCs consistently across heterogeneous environments.

Solution

Implement a polling-based monitoring system as the default monitoring mode with flexible deployment options.

Key Features

  • Broad compatibility: Polling needs only standard Redfish GET requests, so it also works with BMCs that lack a usable EventService
  • Flexible deployment: Run monitoring as part of the operator (default) OR as a standalone pod
  • Smart caching: Collected metrics stored in memory; Prometheus scrapes from cache (not directly from BMCs)
  • Efficient: Session pooling prevents exhausting BMC connection limits
  • Vendor-aware: Handles endpoint variations across HPE, Dell, Supermicro, Lenovo
  • HA-ready: Leader election ensures only one instance collects metrics
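
To make the session-pooling idea concrete, here is a minimal sketch of a pool that keeps one shared session per BMC. The `Session` and `SessionPool` types and the `login` callback are hypothetical names for illustration, not existing metal-operator or Redfish client APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical stand-in for an authenticated Redfish session.
type Session struct {
	BMC   string
	Token string
}

// SessionPool hands out one shared session per BMC instead of logging in
// again on every poll.
type SessionPool struct {
	mu       sync.Mutex
	sessions map[string]*Session
	login    func(bmc string) (*Session, error) // assumed login hook
}

func NewSessionPool(login func(bmc string) (*Session, error)) *SessionPool {
	return &SessionPool{sessions: make(map[string]*Session), login: login}
}

// Get returns the cached session for bmc and logs in only if none exists.
// Holding the lock during login keeps the sketch short; a real pool would
// probably lock per BMC so one slow login does not block the others.
func (p *SessionPool) Get(bmc string) (*Session, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if s, ok := p.sessions[bmc]; ok {
		return s, nil
	}
	s, err := p.login(bmc)
	if err != nil {
		return nil, fmt.Errorf("login to %s: %w", bmc, err)
	}
	p.sessions[bmc] = s
	return s, nil
}

// Invalidate drops a session, for example after a 401 or a BMC reset.
func (p *SessionPool) Invalidate(bmc string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.sessions, bmc)
}

func main() {
	logins := 0
	pool := NewSessionPool(func(bmc string) (*Session, error) {
		logins++
		return &Session{BMC: bmc, Token: fmt.Sprintf("token-%d", logins)}, nil
	})
	for i := 0; i < 3; i++ {
		if _, err := pool.Get("bmc-a"); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("logins performed:", logins) // the three Gets share one login
}
```

Calling `Invalidate` when a request is rejected lets the next `Get` log in again, which would also cover sessions lost during BMC resets.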

Architecture Overview

Two Deployment Options:

  1. Embedded Mode (Default): Monitoring runs as a component within the operator pod

    • Simpler deployment
    • Shared resources with operator
    • Good for most use cases
  2. Standalone Mode: Monitoring runs as a separate deployment

    • Dedicated resources for monitoring
    • Independent scaling
    • Better for large deployments (>500 BMCs)
    • Can be updated independently from operator
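
One way both deployment options could share code is a single polling loop whose method set matches controller-runtime's `manager.Runnable` (`Start(ctx) error`) and `manager.LeaderElectionRunnable` (`NeedLeaderElection() bool`) interfaces. The sketch below deliberately avoids importing controller-runtime; the `Poller` type, its fields, and `runDemo` are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Poller is a hypothetical monitoring component. In embedded mode it could
// be registered with the operator's manager; in standalone mode a small
// main function could call Start directly.
type Poller struct {
	Interval time.Duration
	Targets  []string
	Collect  func(ctx context.Context, target string) error
}

// NeedLeaderElection reports that only the elected leader should poll,
// so replicas in an HA deployment do not all query the BMCs.
func (p *Poller) NeedLeaderElection() bool { return true }

// Start polls every target once per interval until ctx is cancelled.
func (p *Poller) Start(ctx context.Context) error {
	if p.Interval <= 0 {
		return fmt.Errorf("poll interval must be positive, got %s", p.Interval)
	}
	ticker := time.NewTicker(p.Interval)
	defer ticker.Stop()
	for {
		for _, target := range p.Targets {
			if err := p.Collect(ctx, target); err != nil {
				fmt.Printf("poll %s failed: %v\n", target, err)
			}
		}
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
		}
	}
}

// runDemo runs the poller briefly with a short interval and returns how
// many collect calls were made.
func runDemo() int {
	calls := 0
	p := &Poller{
		Interval: 10 * time.Millisecond,
		Targets:  []string{"bmc-a", "bmc-b"},
		Collect: func(ctx context.Context, target string) error {
			calls++
			return nil
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()
	_ = p.Start(ctx)
	return calls
}

func main() {
	fmt.Println("collect calls:", runDemo())
}
```

Targets are polled sequentially here for brevity; a deployment with hundreds of BMCs would likely need a bounded worker pool instead.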

Data Flow:

BMCs → Polling/Events → In-Memory Cache → Prometheus

The cache is key: it decouples Prometheus scraping (every 15-30s) from BMC polling (every 60-120s, see Monitoring Modes), so frequent scrapes never add load on the BMCs.
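
A minimal sketch of such a cache, assuming a simple map guarded by a read/write lock. `Reading`, `MetricsCache`, and the freshness check are illustrative assumptions, not the planned implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Reading is a hypothetical cached sensor value.
type Reading struct {
	Name  string
	Value float64
	Unit  string
}

type entry struct {
	readings  []Reading
	updatedAt time.Time
}

// MetricsCache sits between the poller (writer) and the Prometheus
// scrape handler (reader), so scrapes never trigger BMC requests.
type MetricsCache struct {
	mu      sync.RWMutex
	entries map[string]entry
	maxAge  time.Duration
}

func NewMetricsCache(maxAge time.Duration) *MetricsCache {
	return &MetricsCache{entries: make(map[string]entry), maxAge: maxAge}
}

// Store replaces the readings for one BMC; called after each poll.
func (c *MetricsCache) Store(bmc string, readings []Reading) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[bmc] = entry{
		readings:  append([]Reading(nil), readings...), // copy to avoid aliasing
		updatedAt: time.Now(),
	}
}

// Snapshot returns a copy of the cached readings, whether they are newer
// than maxAge, and whether the BMC is known at all.
func (c *MetricsCache) Snapshot(bmc string) ([]Reading, bool, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[bmc]
	if !ok {
		return nil, false, false
	}
	fresh := time.Since(e.updatedAt) <= c.maxAge
	return append([]Reading(nil), e.readings...), fresh, true
}

func main() {
	cache := NewMetricsCache(5 * time.Minute)
	cache.Store("bmc-a", []Reading{{Name: "CPU1 Temp", Value: 41, Unit: "Cel"}})

	readings, fresh, ok := cache.Snapshot("bmc-a")
	fmt.Println(readings, fresh, ok)
}
```

The `fresh` flag is one possible way to let the scrape handler mark stale data when polling for a BMC has stalled.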

What Gets Monitored

  • Sensors: Temperature, fan speeds, power supply status, voltages
  • Alerts: Critical events from BMC event logs
  • SEL: System Event Log entries for hardware events

Monitoring Modes

  • Polling (default): Actively queries BMC endpoints every 60-120 seconds

    • Works with all BMCs
    • Predictable load pattern
  • Events: Uses Redfish event subscriptions (existing implementation)

    • Lower latency when it works
    • Only for BMCs with reliable EventService support

The mode is configured globally for all BMCs via an operator CLI flag.
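
As an illustration, the flag could be parsed and validated roughly as follows. The flag name `--monitoring-mode` and the `MonitoringMode` type are assumptions; the issue does not fix either.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// MonitoringMode is a hypothetical type for the global monitoring mode.
type MonitoringMode string

const (
	ModePolling MonitoringMode = "polling"
	ModeEvents  MonitoringMode = "events"
)

// ParseMonitoringMode validates a raw flag value.
func ParseMonitoringMode(s string) (MonitoringMode, error) {
	switch m := MonitoringMode(s); m {
	case ModePolling, ModeEvents:
		return m, nil
	}
	return "", fmt.Errorf("invalid monitoring mode %q (want %q or %q)", s, ModePolling, ModeEvents)
}

func main() {
	// Polling is the default, matching the proposal above.
	raw := flag.String("monitoring-mode", string(ModePolling), "BMC monitoring mode: polling or events")
	flag.Parse()

	mode, err := ParseMonitoringMode(*raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("monitoring mode:", mode)
}
```

Rejecting unknown values at startup keeps a typo from silently falling back to a different mode.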

Implementation Plan

Phase 1: Foundation

  • Design architecture and interfaces
  • Implement embedded monitoring runnable
  • Implement standalone deployment option
  • Add configuration and validation

Phase 2: Data Collection

  • Sensor polling (Thermal, Power, Sensors endpoints)
  • Alert/event log polling
  • SEL polling
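
For sensor polling, the following sketch decodes a Redfish `Thermal` resource (typically served under `/redfish/v1/Chassis/{id}/Thermal`) into flat samples. The `Temperatures`, `Fans`, `ReadingCelsius`, `Reading`, and `ReadingUnits` properties come from the Redfish Thermal schema; the `Sample` type and `ParseThermal` function are hypothetical, and vendor quirks are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Sample is a hypothetical flattened sensor reading.
type Sample struct {
	Kind  string // "temperature" or "fan"
	Name  string
	Value float64
	Unit  string
}

// thermalPayload covers only the Thermal properties used here.
// Readings are pointers because absent sensors report null.
type thermalPayload struct {
	Temperatures []struct {
		Name           string
		ReadingCelsius *float64
	}
	Fans []struct {
		Name         string
		Reading      *float64
		ReadingUnits string
	}
}

// ParseThermal converts a Thermal response body into samples,
// skipping sensors that have no reading.
func ParseThermal(body []byte) ([]Sample, error) {
	var p thermalPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("decode Thermal payload: %w", err)
	}
	var out []Sample
	for _, t := range p.Temperatures {
		if t.ReadingCelsius == nil {
			continue
		}
		out = append(out, Sample{Kind: "temperature", Name: t.Name, Value: *t.ReadingCelsius, Unit: "Cel"})
	}
	for _, f := range p.Fans {
		if f.Reading == nil {
			continue
		}
		out = append(out, Sample{Kind: "fan", Name: f.Name, Value: *f.Reading, Unit: f.ReadingUnits})
	}
	return out, nil
}

func main() {
	body := []byte(`{
		"Temperatures": [
			{"Name": "CPU1 Temp", "ReadingCelsius": 41},
			{"Name": "Absent DIMM", "ReadingCelsius": null}
		],
		"Fans": [{"Name": "Fan1", "Reading": 5400, "ReadingUnits": "RPM"}]
	}`)
	samples, err := ParseThermal(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range samples {
		fmt.Printf("%s %s = %v %s\n", s.Kind, s.Name, s.Value, s.Unit)
	}
}
```

Newer Redfish versions move this data to `ThermalSubsystem` and `Sensors`, so a vendor-aware collector would probably need one decoder per endpoint style.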

Phase 3: Integration

  • BMC status updates
  • Prometheus alerts and Grafana dashboards

Phase 4: Quality

  • Unit tests (≥80% coverage)
  • Integration tests with Redfish simulator
  • Documentation

Acceptance Criteria

  • Polling works with all BMC vendors (HPE, Dell, Supermicro, Lenovo)
  • Can deploy as embedded runnable or standalone pod
  • Metrics cached in memory (Prometheus scrapes don't hit BMCs)
  • Session pooling prevents BMC connection exhaustion
  • Sensor, alert, and SEL data collected and exposed via Prometheus
  • BMC status reflects monitoring health
  • Works in HA deployments (leader election)
  • Comprehensive tests and documentation

Sub-Issues

Sub-issues will be created to break the implementation down into manageable tasks across the four phases.

Non-Goals

  • Historical data storage (use Prometheus for this)
  • BMC firmware updates or auto-remediation
  • Custom alerting (users configure Prometheus/Alertmanager)
  • Multi-cluster monitoring

References

  • Current event subscription: internal/serverevents/
  • BMC controller: internal/controller/bmc_controller.go
  • Redfish client: bmc/redfish.go
