# Add Outage Time and Draining Time metrics for nodes

## Summary
Implement new Prometheus metrics to track the time nodes spend in outage and draining states, enabling better observability of cluster recovery times and drain operations.

## Problem
Currently, we only track the count of node failures via `slurm_node_fails_total`. We need visibility into:
1. How long nodes remain in outage states before being restored
2. How long nodes spend draining (finishing existing jobs while not accepting new ones)

These metrics are critical for understanding cluster reliability and recovery performance.

## Requirements

### Node State Definitions
- **Outage state**: A node is considered in outage when in state `DOWN+*` or `IDLE+DRAIN+*` (where `*` represents any additional flags)
- **Draining state**: A node is considered draining when in state `DRAIN+ALLOC+*` or `DRAIN+MIXED+*` (where `*` represents any additional flags)
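- **Note**: A draining node is promoted to outage as soon as its jobs stop running (it becomes `IDLE+DRAIN+*`) or it goes offline (`DOWN+*`). The two states are mutually exclusive: a node is never counted as both draining and in outage at the same time.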
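
To make the definitions above concrete, here is a minimal sketch of a classifier. It assumes the state arrives as a `+`-joined string of flags in any order, as written in this issue; the names `NodeCondition` and `classify_node_state` are hypothetical and not part of the exporter.

```python
from enum import Enum


class NodeCondition(Enum):
    OUTAGE = "outage"
    DRAINING = "draining"
    OTHER = "other"


def classify_node_state(state: str) -> NodeCondition:
    """Map a '+'-joined state string to outage, draining, or other.

    Assumption: flags may appear in any order, so the string is treated
    as a set of flags rather than matched by prefix.
    """
    flags = {flag.strip().upper() for flag in state.split("+") if flag.strip()}

    # Outage: DOWN+* or IDLE+DRAIN+*. Checked first so the two
    # conditions stay mutually exclusive.
    if "DOWN" in flags or {"IDLE", "DRAIN"} <= flags:
        return NodeCondition.OUTAGE

    # Draining: DRAIN+ALLOC+* or DRAIN+MIXED+*.
    if "DRAIN" in flags and flags & {"ALLOC", "MIXED"}:
        return NodeCondition.DRAINING

    return NodeCondition.OTHER
```

Checking outage before draining is one way to guarantee that a node carrying both `DOWN` and `DRAIN` is counted only once.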

### New Metrics
- **`slurm_node_outage_time_seconds`** - Gauge reporting the duration, in seconds, of each node's most recently completed outage (with `node_name` label)
- **`slurm_node_draining_time_seconds`** - Gauge reporting how long, in seconds, each node spent in its most recent draining period (with `node_name` label)
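
As an illustration of what a scrape might return for these gauges, the sketch below renders samples in the Prometheus text exposition format. The helper `render_gauge`, the help text, and the node name `worker-1` are hypothetical; a real exporter would normally rely on a Prometheus client library, and this sketch does not escape special characters in label values.

```python
from typing import Mapping


def render_gauge(name: str, help_text: str, samples: Mapping[str, float]) -> str:
    """Render one gauge with a node_name label in text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for node_name, seconds in sorted(samples.items()):
        # One sample line per node: metric{node_name="..."} value
        lines.append(f'{name}{{node_name="{node_name}"}} {seconds}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_gauge(
            "slurm_node_outage_time_seconds",
            "Duration of the most recent completed outage per node.",
            {"worker-1": 312.5},
        )
    )
```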

## Implementation Details

The exporter will track node state transitions in memory:
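- For outage time: Record the timestamp when a node enters an outage state, and calculate the duration when it transitions back to a non-outage state
- For draining time: Record the timestamp when a node enters the draining state, and calculate the duration when it exits that state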

Both gauges will report the duration of the most recent event for each node.
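
The in-memory tracking described above could look roughly like the following sketch. It is not the exporter's actual code: the class `NodeDurationTracker`, its method `observe`, and the idea of passing the current time explicitly (to keep the logic testable) are all assumptions made for illustration. One instance would be used per condition, for example one for outage and one for draining.

```python
from typing import Dict


class NodeDurationTracker:
    """Tracks how long each node stays in one condition (e.g. outage)."""

    def __init__(self) -> None:
        # Timestamp at which each node was first seen in the condition.
        self._entered_at: Dict[str, float] = {}
        # Duration of the most recent completed event per node; this is
        # the value a gauge with a node_name label would expose.
        self.last_duration_seconds: Dict[str, float] = {}

    def observe(self, node_name: str, in_condition: bool, now: float) -> None:
        """Feed one observation of a node's state at time `now` (seconds)."""
        started = self._entered_at.get(node_name)
        if in_condition:
            if started is None:
                # Transition into the condition: remember when it began.
                self._entered_at[node_name] = now
        elif started is not None:
            # Transition out of the condition: record the duration.
            self.last_duration_seconds[node_name] = now - started
            del self._entered_at[node_name]


if __name__ == "__main__":
    tracker = NodeDurationTracker()
    tracker.observe("worker-1", True, now=100.0)   # outage begins
    tracker.observe("worker-1", True, now=130.0)   # still in outage
    tracker.observe("worker-1", False, now=160.0)  # node restored
    print(tracker.last_duration_seconds)
```

Because only the latest duration is stored per node, a later event overwrites the earlier value, which matches the "most recent event" behaviour of the gauges.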

**Important**: This solution provides an approximation of the real outage and draining times. When the exporter restarts, we lose the stored timestamps and will consider any ongoing outage/draining as if it just started. This is a tradeoff between accuracy and implementation simplicity, as we avoid introducing persistent state to the exporter.
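
A small worked example of the restart effect, under the simplifying assumption that the exporter observes every node immediately after it starts (the polling interval is ignored). The function `reported_duration` and all timestamps are hypothetical.

```python
def reported_duration(true_start: float, exit_time: float, exporter_start: float) -> float:
    """Duration the exporter would report for one outage or draining event.

    If the exporter (re)started after the event began, it only sees the
    event from its own start time onward, so the result is shorter than
    the real duration.
    """
    first_observed = max(true_start, exporter_start)
    return exit_time - first_observed


if __name__ == "__main__":
    # Outage begins at t=0 s, exporter restarts at t=600 s, node recovers
    # at t=900 s: the real outage lasted 900 s, but 300 s is reported.
    print(reported_duration(true_start=0.0, exit_time=900.0, exporter_start=600.0))
```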

This approach allows per-node tracking without any persistent state. The exporter keeps working across restarts, but events that are in progress during a restart will be under-reported or missed, which is acceptable.

## Other things to remember

- Update documentation
- ...

Add Outage Time and Draining Time metrics for nodes #1400
