SLURM Health Management Architecture
Problem Statement
Slow Filestore Issue → Fast health checks without filesystem dependency
Filestore is slower than SSD disks (controller state storage should not require a ReadWriteMany volume).
Slurm failover with two slurmctld replicas currently requires a shared file system for both SlurmdSpoolDir and StateSaveLocation, which complicates the failover setup and reduces the write performance of job state data.
DNS and Health Check Failures → Direct ping/sinfo checks + kubelet verification via K8s API
The controllers' health checks rely on the state of the spool directory and DNS records. We had several incidents where the controllers couldn't fail over because of DNS loss or because the primary controller was responding very slowly; in such cases, the entire cluster became non-functional.
Job Loss on Controller Crash → Controlled pod deletion with grace period = 0
If the controllers have independent PVCs, there may be a situation where a job is in the queue, but the active controller crashes — in that case, the backup controller would lose information about the running jobs.
Kubelet Down ≠ Container Down → Separate health checks for SLURM processes
A kubelet being down does not mean the container is down as well - [Issue #1099](openkruise/kruise#1099)
StatefulSet on NotReady Node → Forced node deletion via operator
If a pod managed by a StatefulSet ends up on a node that is NotReady, it will not be recreated.
TerminationGracePeriodSeconds=0 Safety → Only after confirmed failure through multiple checks
Setting TerminationGracePeriodSeconds to 0 is considered a dangerous operation because Kubernetes can no longer guarantee that only one instance of the pod is running at a given time: a NotReady kubelet does not necessarily mean that the controller pod has stopped working. This needs to be handled as a separate step or by a dedicated controller - [Issue #74947](kubernetes/kubernetes#74947)
Incomplete Liveness Probes → Comprehensive ping + sinfo + kubelet status checks
The current liveness probe does not cover all failure modes (for example, a controller that still answers ping but no longer serves sinfo).
Solution: SLURM Health Management Architecture
System Overview
The system consists of four main components that work together to ensure high availability of the SLURM cluster in Kubernetes:
- Soperator SlurmCluster Controller (Primary cluster controller)
- Health Check Controller (Primary health monitoring controller)
- Node Operator Controller (Node lifecycle management controller, part of soperatorchecks)
- Pod Lifecycle Controller (Force-deletes unhealthy controller pods so they are recreated on healthy nodes)
Architecture Components
1. Soperator SlurmCluster Controller
Purpose: Creates the Slurm controller.
Functions:
- Creates the Slurm cluster with a single replica
- Creates placeholder pods if the replica count is greater than one. These pods perform no work and have a lower priority than the main controller pod; their purpose is to keep the image pre-pulled so that relocating the controller pod does not cause downtime (e.g., if the registry becomes unavailable or the network fails)
- The main pod should have a liveness probe with `scontrol ping` in case the Health Check Controller becomes unavailable for any reason (see the sketch below)
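For illustration only, such a probe could be built in Go with the k8s.io/api types; the timings, the function name, and the assumption that `scontrol ping` exits non-zero when slurmctld is unreachable are not taken from soperator:

```go
// Package controller: hypothetical sketch of the liveness probe the
// SlurmCluster controller could attach to the main slurmctld container.
package controller

import corev1 "k8s.io/api/core/v1"

// slurmctldLivenessProbe builds an exec-based probe that fails when the local
// slurmctld stops answering `scontrol ping` (assumed to exit non-zero then).
func slurmctldLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"scontrol", "ping"},
			},
		},
		// Illustrative timings, not soperator defaults.
		InitialDelaySeconds: 30,
		PeriodSeconds:       15,
		TimeoutSeconds:      10,
		FailureThreshold:    3,
	}
}
```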
2. Health Check Controller
Purpose: Monitor the health of SLURM controllers and kubelets
Functions:
- Check SLURM controller availability via ping
- Verify SLURM controller response to the `sinfo` command
- Monitor kubelet status on nodes
- Update Custom Resources with health information
- Detect unresponsive SLURM controllers
- Force delete pods
- Prevent job queue data loss
Health Checks:
Check Algorithm:
- Every n seconds (15 by default), check SLURM controller availability
- If ping fails OR sinfo doesn't respond → mark pod as unhealthy
- Check kubelet status via Kubernetes API
- Record results in the `SlurmControllerHealth` CR (see the sketch below)
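A minimal Go sketch of one check iteration, assuming the checks run where the SLURM CLI is available and the controller's node name is known; the pod-exec plumbing and the CR status update are elided, and all names are hypothetical:

```go
// Package health: illustrative single iteration of the check algorithm above.
package health

import (
	"context"
	"os/exec"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// runSlurmCheck returns true when the given command exits successfully.
func runSlurmCheck(ctx context.Context, name string, args ...string) bool {
	return exec.CommandContext(ctx, name, args...).Run() == nil
}

// kubeletReady reports whether the node hosting the controller pod has a
// Ready condition with status True.
func kubeletReady(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}

// checkOnce performs a single iteration: ping, sinfo, kubelet status.
// The caller then records the result in the SlurmControllerHealth CR status.
func checkOnce(ctx context.Context, c kubernetes.Interface, nodeName string) (pingOK, sinfoOK, kubeletOK bool) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	pingOK = runSlurmCheck(ctx, "scontrol", "ping")
	sinfoOK = runSlurmCheck(ctx, "sinfo")
	kubeletOK, _ = kubeletReady(ctx, c, nodeName)
	return pingOK, sinfoOK, kubeletOK
}
```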
3. Node Operator Controller
Purpose: Manage node lifecycle based on their health status
Functions:
- Track nodes in the NotReady state via the `SlurmControllerHealth` CR
- Count failure occurrences on nodes within time windows
- Delete problematic nodes with protection against frequent deletions
- Manage cooldown periods and counters
Workflow Logic:
Counters and Timeouts:
- Failure Threshold: 3 failures within 15 seconds
- Delete Cooldown: 30 minutes between node deletions
- Max Deletions: Maximum 2 node deletions within 24 hours
- Manual Reset: via the `health.slurm.nebius.io/manual-reset=true` annotation (see the sketch below)
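A sketch of how these counters and timeouts could gate node deletion; the `DeletionPolicy` type, its field names, and the `shouldDeleteNode` helper are assumptions for illustration, not the soperatorchecks API:

```go
// Package nodeop: illustrative gating logic for node deletion.
package nodeop

import "time"

// DeletionPolicy mirrors the counters and timeouts listed above.
type DeletionPolicy struct {
	FailureThreshold int           // e.g. 3 failures
	FailureWindow    time.Duration // e.g. 15 * time.Second
	DeleteCooldown   time.Duration // e.g. 30 * time.Minute
	MaxDeletions     int           // e.g. 2
	DeletionWindow   time.Duration // e.g. 24 * time.Hour
}

// shouldDeleteNode decides whether a node may be deleted given its recent
// failure timestamps and the history of prior node deletions.
func shouldDeleteNode(p DeletionPolicy, now time.Time, failures, deletions []time.Time, manualReset bool) bool {
	if manualReset {
		// health.slurm.nebius.io/manual-reset=true clears the deletion history.
		deletions = nil
	}
	recentFailures := countSince(failures, now.Add(-p.FailureWindow))
	recentDeletions := countSince(deletions, now.Add(-p.DeletionWindow))

	if recentFailures < p.FailureThreshold {
		return false // not enough evidence that the node is broken
	}
	if recentDeletions >= p.MaxDeletions {
		return false // deletion budget exhausted; wait for manual reset
	}
	if len(deletions) > 0 && now.Sub(deletions[len(deletions)-1]) < p.DeleteCooldown {
		return false // still inside the cooldown after the previous deletion
	}
	return true
}

// countSince counts timestamps newer than the cutoff.
func countSince(ts []time.Time, cutoff time.Time) int {
	n := 0
	for _, t := range ts {
		if t.After(cutoff) {
			n++
		}
	}
	return n
}
```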
Custom Resources
SlurmControllerHealth
```yaml
apiVersion: health.slurm.io/v1
kind: SlurmControllerHealth
metadata:
  name: slurm-controller-primary
spec:
  checkInterval: 50s
status:
  nodeName: worker-node-1
  podName: slurm-controller-primary-0
  lastPingCheck: "2025-01-28T10:00:00Z"
  lastSinfoCheck: "2025-01-28T10:00:00Z"
  pingHealthy: true
  sinfoHealthy: false
  overallHealthy: false
  consecutiveFailures: 3
```
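For reference, kubebuilder-style Go types backing the CR above might look like this; the field names follow the YAML example, while everything else (package layout, markers) is an assumption:

```go
// Package v1: possible type definitions for the SlurmControllerHealth CRD.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// SlurmControllerHealthSpec configures how often checks run.
type SlurmControllerHealthSpec struct {
	// CheckInterval is how often ping/sinfo/kubelet checks run, e.g. "50s".
	CheckInterval metav1.Duration `json:"checkInterval,omitempty"`
}

// SlurmControllerHealthStatus records the latest check results.
type SlurmControllerHealthStatus struct {
	NodeName            string      `json:"nodeName,omitempty"`
	PodName             string      `json:"podName,omitempty"`
	LastPingCheck       metav1.Time `json:"lastPingCheck,omitempty"`
	LastSinfoCheck      metav1.Time `json:"lastSinfoCheck,omitempty"`
	PingHealthy         bool        `json:"pingHealthy"`
	SinfoHealthy        bool        `json:"sinfoHealthy"`
	OverallHealthy      bool        `json:"overallHealthy"`
	ConsecutiveFailures int32       `json:"consecutiveFailures"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// SlurmControllerHealth is the CR shown in the YAML example above.
type SlurmControllerHealth struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SlurmControllerHealthSpec   `json:"spec,omitempty"`
	Status SlurmControllerHealthStatus `json:"status,omitempty"`
}
```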
Operating Scenarios
Scenario 1: Kubelet NotReady
- Health Check Controller detects NotReady node
- Waits 15 sec (configurable)
- Node Operator Controller checks failure counters
- If the threshold is exceeded, set a condition on the node (soperatorchecks then deletes it, since it already has the permissions and code for that; see the sketch below)
- Pod Lifecycle Controller recreates pods on healthy nodes
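A sketch of step 4, marking the node with a custom condition via client-go so that soperatorchecks can act on it; the condition type name `SlurmControllerUnhealthy` is an assumption:

```go
// Package nodecond: illustrative node-condition marking for Scenario 1.
package nodecond

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeUnhealthy sets (or updates) a SlurmControllerUnhealthy condition on
// the node's status so a downstream controller can delete the node.
func markNodeUnhealthy(ctx context.Context, c kubernetes.Interface, nodeName, reason string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("SlurmControllerUnhealthy"), // assumed name
		Status:             corev1.ConditionTrue,
		Reason:             reason,
		Message:            "SLURM controller health checks failed repeatedly",
		LastTransitionTime: metav1.Now(),
	}

	replaced := false
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == cond.Type {
			node.Status.Conditions[i] = cond
			replaced = true
			break
		}
	}
	if !replaced {
		node.Status.Conditions = append(node.Status.Conditions, cond)
	}

	// Conditions live in the status subresource, so write via UpdateStatus.
	_, err = c.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```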
Scenario 2: SLURM Controller Unresponsive
- Health Check Controller tests ping and sinfo
- On failure of both checks - marks pod as unhealthy
- Pod Lifecycle Controller force deletes the pod (see the sketch below)
- StatefulSet automatically recreates the pod
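A minimal client-go sketch of the force deletion in step 3; the UID precondition is an extra safety measure assumed here, not something the text prescribes:

```go
// Package poddelete: illustrative force deletion of a stuck controller pod.
package poddelete

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// forceDeletePod removes the pod immediately (gracePeriodSeconds=0). The UID
// precondition guards against deleting a replacement pod with the same name;
// the StatefulSet then recreates the pod on a healthy node.
func forceDeletePod(ctx context.Context, c kubernetes.Interface, namespace, name string, uid types.UID) error {
	return c.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: ptr.To[int64](0),
		Preconditions:      &metav1.Preconditions{UID: &uid},
	})
}
```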
Scenario 3: Frequent Node Failures
- Node Operator Controller tracks deletion count
- When limit exceeded (2 within 24 hours) - blocks deletions
- Requires manual reset via annotation
- Administrator can reset counter after analysis
Definition of Done
- The controller stores its state on SSD disks
- There are health checks to monitor the controller's functionality, and the pod and/or Kubernetes node are recreated if the checks fail
- The controller pod has a high scheduling priority
- Optionally, add the CR schema to kube-state-metrics so that its status can be exposed as metrics