Add Outage Time and Draining Time metrics for nodes #1400

@theyoprst

Summary

Implement new Prometheus metrics to track the time nodes spend in outage and draining states, enabling better observability of cluster recovery times and drain operations.

Problem

Currently, we only track the count of node failures via slurm_node_fails_total. We need visibility into:

  1. How long nodes remain in outage states before being restored
  2. How long nodes spend draining (finishing existing jobs while not accepting new ones)

These metrics are critical for understanding cluster reliability and recovery performance.

Requirements

Node State Definitions

  • Outage state: A node is considered in outage when in state DOWN+* or IDLE+DRAIN+* (where * represents any additional flags)
  • Draining state: A node is considered draining when in state DRAIN+ALLOC+* or DRAIN+MIXED+* (where * represents any additional flags)
  • Note: A draining node is promoted to outage state as soon as its jobs stop running (it becomes IDLE+DRAIN+*) or the node goes offline (DOWN+*). The two states are mutually exclusive.
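
A minimal sketch of how these definitions could be turned into a classifier. It assumes the state arrives as a single '+'-joined string, that flag order does not matter, and that the flags are spelled exactly as written above; the function name classify_node_state is hypothetical and not part of the exporter.

```python
def classify_node_state(state: str) -> str:
    """Return "outage", "draining", or "other" for a '+'-joined state string.

    Assumption: flags are spelled exactly as in the issue text
    (DOWN, IDLE, DRAIN, ALLOC, MIXED) and may appear in any order.
    """
    flags = {part.strip().upper() for part in state.split("+") if part.strip()}

    # Outage: DOWN+* or IDLE+DRAIN+*. Checked first so that a node that
    # goes offline is never counted as draining.
    if "DOWN" in flags or {"IDLE", "DRAIN"} <= flags:
        return "outage"

    # Draining: DRAIN+ALLOC+* or DRAIN+MIXED+*.
    if "DRAIN" in flags and flags & {"ALLOC", "MIXED"}:
        return "draining"

    return "other"
```

Checking the outage condition first keeps the two states mutually exclusive even if a state string carries flags from both definitions.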

New Metrics

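  • slurm_node_outage_time_seconds - Gauge reporting the duration, in seconds, of each node's most recently completed outage, measured from entering the outage state to leaving it (with node_name label)
  • slurm_node_draining_time_seconds - Gauge reporting the duration, in seconds, of each node's most recently completed draining period (with node_name label)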
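
To illustrate what a scrape might return, here is a small sketch that renders one of these gauges in the Prometheus text exposition format. The helper render_gauge, the help text, and the node name worker-1 are all hypothetical; a real exporter would normally rely on a Prometheus client library instead, and this sketch assumes node names need no label escaping.

```python
from typing import Dict


def render_gauge(name: str, help_text: str, values: Dict[str, float]) -> str:
    """Render a gauge with a node_name label in Prometheus text format.

    Assumption: node names contain no quotes, backslashes, or newlines,
    so label values are written without escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for node_name, seconds in sorted(values.items()):
        lines.append(f'{name}{{node_name="{node_name}"}} {seconds}')
    return "\n".join(lines) + "\n"


# Hypothetical sample: worker-1's last completed outage lasted 420 seconds.
print(render_gauge(
    "slurm_node_outage_time_seconds",
    "Duration of the most recent completed outage per node",
    {"worker-1": 420.0},
))
```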

Implementation Details

The exporter will track node state transitions in memory:

  • For outage time: When a node transitions from an outage state to a non-outage state, record the time elapsed since it entered outage
  • For draining time: When a node exits the draining state, record the time elapsed since it started draining

Both gauges will report the duration of the most recent event for each node.
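
The in-memory tracking described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the exporter's actual code: the class NodeDurationTracker and its method names are hypothetical, and timestamps are taken at poll time, so the measured duration is only as precise as the polling interval.

```python
from typing import Dict, Optional


class NodeDurationTracker:
    """Tracks how long each node stays in one tracked state (outage or draining)."""

    def __init__(self) -> None:
        # Time at which each node was first seen in the tracked state.
        self._entered_at: Dict[str, float] = {}
        # Duration of the most recent completed event per node (the gauge value).
        self.last_duration_seconds: Dict[str, float] = {}

    def observe(self, node_name: str, in_state: bool, now: float) -> Optional[float]:
        """Record one poll result; return the duration if an event just ended."""
        started = self._entered_at.get(node_name)
        if in_state:
            if started is None:
                self._entered_at[node_name] = now  # event begins
            return None
        if started is None:
            return None  # node was not in the tracked state before
        # Event ends: compute the duration and publish it as the gauge value.
        del self._entered_at[node_name]
        duration = now - started
        self.last_duration_seconds[node_name] = duration
        return duration


outage = NodeDurationTracker()
outage.observe("worker-1", True, now=100.0)   # node enters outage
outage.observe("worker-1", True, now=160.0)   # still in outage
outage.observe("worker-1", False, now=400.0)  # restored: 300.0 seconds recorded
```

One tracker instance per metric (one for outage, one for draining) would be enough. When a draining node is promoted to outage, the draining tracker sees the node leave its state and records the draining duration, while the outage tracker starts a new event.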

Important: This solution provides an approximation of the real outage and draining times. When the exporter restarts, we lose the stored timestamps and will consider any ongoing outage/draining as if it just started. This is a tradeoff between accuracy and implementation simplicity, as we avoid introducing persistent state to the exporter.

This approach tracks metrics per node and keeps the exporter stateless: it recovers from a restart without any persisted data, at the cost of losing or under-reporting events that were in progress during the restart, which is acceptable.
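
The effect of a restart on a reported duration can be shown with a small worked example. The function reported_duration is hypothetical, and the sketch assumes the exporter observes the node immediately after starting and that the event ends while the exporter is running.

```python
def reported_duration(event_start: float, event_end: float, exporter_start: float) -> float:
    """Duration the exporter would report for an event that ends while it is running.

    Assumption: an event already in progress when the exporter starts is
    treated as if it began at exporter start time.
    """
    observed_start = max(event_start, exporter_start)
    return event_end - observed_start


# Outage begins at t=0 s, the exporter restarts at t=600 s,
# and the node recovers at t=900 s.
print(reported_duration(event_start=0.0, event_end=900.0, exporter_start=600.0))  # 300.0
```

In this scenario the real outage lasted 900 seconds but only 300 seconds are reported. An event that both starts and ends while the exporter is down is not observed at all.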

Other things to remember

  • Update documentation
  • ...
