Problem
The metal-operator currently relies on Redfish event subscriptions for BMC monitoring. However, many BMC vendors/models do not support event subscriptions reliably:
- HPE iLO has limited DeliveryRetryPolicy support
- Older BMC firmware versions lack EventService entirely
- Some vendors have incomplete or buggy event implementations
- Event subscriptions can be lost during BMC resets/firmware updates
This limits our ability to monitor BMCs consistently across heterogeneous environments.
Solution
Implement a polling-based monitoring system as the default monitoring mode with flexible deployment options.
Key Features
- Universal compatibility: Polling works with all BMC vendors (no EventService required)
- Flexible deployment: Run monitoring as part of the operator (default) OR as a standalone pod
- Smart caching: Collected metrics stored in memory; Prometheus scrapes from cache (not directly from BMCs)
- Efficient: Session pooling prevents exhausting BMC connection limits
- Vendor-aware: Handles endpoint variations across HPE, Dell, Supermicro, Lenovo
- HA-ready: Leader election ensures only one instance collects metrics
Architecture Overview
Two Deployment Options:
- Embedded Mode (Default): Monitoring runs as a component within the operator pod
  - Simpler deployment
  - Shared resources with operator
  - Good for most use cases
- Standalone Mode: Monitoring runs as a separate deployment
  - Dedicated resources for monitoring
  - Independent scaling
  - Better for large deployments (>500 BMCs)
  - Can be updated independently from operator
Data Flow:
BMCs → Polling/Events → In-Memory Cache → Prometheus
The cache is key: it decouples Prometheus scraping (every 15-30s) from BMC polling (every > 120s), preventing excessive load on BMCs.
What Gets Monitored
- Sensors: Temperature, fan speeds, power supply status, voltages
- Alerts: Critical events from BMC event logs
- SEL: System Event Log entries for hardware events
Monitoring Modes
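- Polling (default): Actively queries BMC endpoints every 60-120 seconds
- Events: Uses Redfish event subscriptions (existing implementation)
Mode is configured globally for all BMCs via an operator CLI flag.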
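The flag itself is not named in this proposal. The sketch below shows one way a global mode flag could be validated at startup using the standard `flag` package. The flag name `--monitoring-mode` and the values `polling` and `events` are assumptions chosen for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// MonitoringMode implements flag.Value so invalid values fail at parse time.
type MonitoringMode string

const (
	ModePolling MonitoringMode = "polling" // assumed value
	ModeEvents  MonitoringMode = "events"  // assumed value
)

func (m *MonitoringMode) String() string {
	if m == nil {
		return ""
	}
	return string(*m)
}

func (m *MonitoringMode) Set(s string) error {
	switch MonitoringMode(s) {
	case ModePolling, ModeEvents:
		*m = MonitoringMode(s)
		return nil
	}
	return fmt.Errorf("invalid monitoring mode %q (want %q or %q)", s, ModePolling, ModeEvents)
}

// parseMode parses args and returns the selected mode, defaulting to polling.
func parseMode(args []string) (MonitoringMode, error) {
	mode := ModePolling
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // the caller reports errors
	fs.Var(&mode, "monitoring-mode", "BMC monitoring mode: polling or events")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return mode, nil
}

func main() {
	mode, err := parseMode(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Println("monitoring mode:", mode)
}
```

Rejecting unknown values at parse time keeps a typo from silently falling back to the default mode.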
Implementation Plan
Phase 1: Foundation
- Design architecture and interfaces
- Implement embedded monitoring runnable
- Implement standalone deployment option
- Add configuration and validation
Phase 2: Data Collection
- Sensor polling (Thermal, Power, Sensors endpoints)
- Alert/event log polling
- SEL polling
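To illustrate the sensor-polling step, the sketch below decodes a Redfish `Thermal` resource into flat samples. The JSON property names (`Temperatures`, `ReadingCelsius`, `Fans`, `Reading`, `ReadingUnits`, `Status.Health`) come from the DMTF Thermal schema. The `Sample` type, the `parseThermal` helper, and the payload in `main` are hypothetical, and vendor-specific variations are not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type status struct {
	State  string `json:"State"`
	Health string `json:"Health"`
}

// thermal mirrors the subset of the Redfish Thermal resource used here.
type thermal struct {
	Temperatures []struct {
		Name           string   `json:"Name"`
		ReadingCelsius *float64 `json:"ReadingCelsius"` // null when no reading is available
		Status         status   `json:"Status"`
	} `json:"Temperatures"`
	Fans []struct {
		Name         string   `json:"Name"`
		Reading      *float64 `json:"Reading"`
		ReadingUnits string   `json:"ReadingUnits"`
		Status       status   `json:"Status"`
	} `json:"Fans"`
}

// Sample is one flattened reading. Hypothetical type.
type Sample struct {
	Kind, Name, Unit, Health string
	Value                    float64
}

// parseThermal converts a Thermal response body into samples,
// skipping sensors that report no reading.
func parseThermal(body []byte) ([]Sample, error) {
	var t thermal
	if err := json.Unmarshal(body, &t); err != nil {
		return nil, fmt.Errorf("decode Thermal: %w", err)
	}
	var out []Sample
	for _, s := range t.Temperatures {
		if s.ReadingCelsius == nil {
			continue
		}
		out = append(out, Sample{Kind: "temperature", Name: s.Name, Unit: "Cel",
			Health: s.Status.Health, Value: *s.ReadingCelsius})
	}
	for _, f := range t.Fans {
		if f.Reading == nil {
			continue
		}
		out = append(out, Sample{Kind: "fan", Name: f.Name, Unit: f.ReadingUnits,
			Health: f.Status.Health, Value: *f.Reading})
	}
	return out, nil
}

func main() {
	// Illustrative payload, not captured from a real BMC.
	body := []byte(`{
	  "Temperatures": [{"Name": "CPU1 Temp", "ReadingCelsius": 54, "Status": {"State": "Enabled", "Health": "OK"}}],
	  "Fans": [{"Name": "Fan1", "Reading": 4200, "ReadingUnits": "RPM", "Status": {"State": "Enabled", "Health": "OK"}}]
	}`)
	samples, err := parseThermal(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range samples {
		fmt.Printf("%s %q = %v %s (%s)\n", s.Kind, s.Name, s.Value, s.Unit, s.Health)
	}
}
```

Using pointer fields for readings lets the parser distinguish a missing reading from a genuine value of zero.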
Phase 3: Integration
- BMC status updates
- Prometheus alerts and Grafana dashboards
Phase 4: Quality
- Unit tests (≥80% coverage)
- Integration tests with Redfish simulator
- Documentation
Acceptance Criteria
Sub-Issues
Sub-issues will be created to break the implementation down into manageable tasks across the 4 phases.
Non-Goals
- Historical data storage (use Prometheus for this)
- BMC firmware updates or auto-remediation
- Custom alerting (users configure Prometheus/Alertmanager)
- Multi-cluster monitoring
References
- Current event subscription: internal/serverevents/
- BMC controller: internal/controller/bmc_controller.go
- Redfish client: bmc/redfish.go