Add Outage Time and Draining Time metrics for nodes
Summary
Implement new Prometheus metrics to track the time nodes spend in outage and draining states, enabling better observability of cluster recovery times and drain operations.
Problem
Currently, we only track the count of node failures via slurm_node_fails_total. We need visibility into:
- How long nodes remain in outage states before being restored
- How long nodes spend draining (finishing existing jobs while not accepting new ones)
These metrics are critical for understanding cluster reliability and recovery performance.
Requirements
Node State Definitions
- Outage state: A node is considered in outage when in state DOWN+* or IDLE+DRAIN+* (where * represents any additional flags)
- Draining state: A node is considered draining when in state DRAIN+ALLOC+* or DRAIN+MIXED+* (where * represents any additional flags)
- Note: Draining state is promoted to outage state when either the job stops running or the node goes offline. These states are mutually exclusive.
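The definitions above could be expressed as a small classifier. This is only a sketch: `NodeCondition` and `classify` are hypothetical names, and it assumes the state arrives as a single "+"-joined string whose flag order does not matter. Checking outage first is also an assumption, made so that a string matching both patterns resolves to outage and the two states stay mutually exclusive.

```python
from enum import Enum


class NodeCondition(Enum):
    """Hypothetical three-way classification used only for this sketch."""
    NORMAL = "normal"
    DRAINING = "draining"
    OUTAGE = "outage"


def classify(state: str) -> NodeCondition:
    """Classify a '+'-joined state string such as 'IDLE+DRAIN+NOT_RESPONDING'."""
    # Assumption: flags are separated by '+' and their order is irrelevant.
    flags = {part.strip().upper() for part in state.split("+") if part.strip()}

    # Outage: DOWN+* or IDLE+DRAIN+*. Checked first so it wins over draining.
    if "DOWN" in flags or {"IDLE", "DRAIN"} <= flags:
        return NodeCondition.OUTAGE

    # Draining: DRAIN+ALLOC+* or DRAIN+MIXED+*.
    if "DRAIN" in flags and ("ALLOC" in flags or "MIXED" in flags):
        return NodeCondition.DRAINING

    return NodeCondition.NORMAL
```

With this sketch, a node moving from DRAIN+MIXED to IDLE+DRAIN changes from draining to outage, which matches the promotion rule in the note.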
New Metrics
- slurm_node_outage_time_seconds: Gauge tracking completed recovery duration for nodes (with node_name label)
- slurm_node_draining_time_seconds: Gauge tracking how long nodes were in draining state (with node_name label)
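To illustrate what a scrape of these gauges might look like, the sketch below renders samples in the Prometheus text exposition format. `render_gauge` is a hypothetical helper written for this example, not part of the exporter or of any client library; the help text and node names are made up, and it assumes node names need no label-value escaping.

```python
def render_gauge(name: str, help_text: str, samples: dict[str, float]) -> str:
    """Render one gauge family with a node_name label as exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    # Assumption: node names contain no quotes, backslashes, or newlines,
    # so they can be written into the label value without escaping.
    for node_name, seconds in sorted(samples.items()):
        lines.append(f'{name}{{node_name="{node_name}"}} {seconds}')
    return "\n".join(lines) + "\n"


print(render_gauge(
    "slurm_node_outage_time_seconds",
    "Duration of the most recent completed outage per node.",
    {"worker-1": 125.0},
))
```

A real implementation would more likely register the gauges through the exporter's existing Prometheus client library instead of formatting text by hand.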
Implementation Details
The exporter will track node state transitions in memory:
- For outage time: Calculate the duration when a node transitions from outage to non-outage state
- For draining time: Calculate the duration when a node exits the draining state
Both gauges will report the duration of the most recent event for each node.
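The in-memory tracking described above might look roughly like the following. `DurationTracker` and `observe` are hypothetical names; the sketch assumes the exporter polls node states periodically and passes in a timestamp, so a measured duration is only as precise as the polling interval.

```python
class DurationTracker:
    """Tracks, per node, when a condition started and how long the most
    recent completed episode lasted. State lives in memory only."""

    def __init__(self) -> None:
        self._started_at: dict[str, float] = {}
        self.last_duration: dict[str, float] = {}

    def observe(self, node_name: str, in_condition: bool, now: float) -> None:
        """Record one poll result for a node at time `now` (in seconds)."""
        if in_condition:
            # Keep the first timestamp at which the condition was seen.
            self._started_at.setdefault(node_name, now)
            return
        # The node is outside the condition: close any open episode.
        started = self._started_at.pop(node_name, None)
        if started is not None:
            self.last_duration[node_name] = now - started


outage = DurationTracker()
outage.observe("worker-1", True, now=100.0)   # outage first seen
outage.observe("worker-1", True, now=130.0)   # still in outage
outage.observe("worker-1", False, now=160.0)  # recovered
print(outage.last_duration)
```

One tracker instance per metric (one for outage, one for draining) would be enough; `last_duration` then holds the value each gauge reports for a node. A freshly constructed tracker has no start timestamps, which is exactly the restart behaviour the issue accepts.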
Important: This solution provides an approximation of the real outage and draining times. When the exporter restarts, we lose the stored timestamps and will consider any ongoing outage/draining as if it just started. This is a tradeoff between accuracy and implementation simplicity, as we avoid introducing persistent state to the exporter.
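This approach allows tracking of individual node metrics without persistent state. The exporter resumes tracking normally after a restart, but an outage or drain that is in progress at restart time is measured only from the restart onward, and an event that both starts and ends while the exporter is down is missed entirely. This is acceptable.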
Other things to remember
- Update documentation
- ...