SLURM Health Management Architecture
Problem Statement
Slow Filestore Issue → Fast health checks without filesystem dependency
Filestore is slower than SSD disks (controller state storage should not require a ReadWriteMany volume).
Slurm failover with two slurmctld replicas currently requires a shared file system for both SlurmdSpoolDir and StateSaveLocation, which complicates the failover setup and reduces the write performance of job state data.
DNS and Health Check Failures → Direct ping/sinfo checks + kubelet verification via K8s API
The controllers' health checks rely on the state of the spool directory and DNS records. We had several incidents where the controllers couldn't fail over because of DNS loss or because the primary controller was responding very slowly; in such cases, the entire cluster became non-functional.
Job Loss on Controller Crash → Controlled pod deletion with grace period = 0
If the controllers have independent PVCs, there may be a situation where a job is in the queue, but the active controller crashes — in that case, the backup controller would lose information about the running jobs.
Kubelet Down ≠ Container Down → Separate health checks for SLURM processes
A kubelet being down does not mean the container is down as well - [Issue #1099](openkruise/kruise#1099)
StatefulSet on NotReady Node → Forced node deletion via operator
If a pod managed by a StatefulSet ends up on a node that is NotReady, it will not be recreated.
TerminationGracePeriodSeconds=0 Safety → Only after confirmed failure through multiple checks
Setting TerminationGracePeriodSeconds to 0 is considered a dangerous operation because Kubernetes can no longer guarantee that only one instance of the pod is running at a given time: a NotReady kubelet does not necessarily mean that the controller pod has stopped working. This needs to be handled as a separate step or by a dedicated controller - [Issue #74947](kubernetes/kubernetes#74947)
Incomplete Liveness Probes → Comprehensive ping + sinfo + kubelet status checks
The current liveness probe does not cover all failure modes (for example, a controller that still answers ping but no longer serves sinfo).
Solution: SLURM Health Management Architecture
System Overview
The system consists of four main components that work together to ensure high availability of the SLURM cluster in Kubernetes:
- Soperator SlurmCluster Controller (Primary cluster controller)
- Health Check Controller (Primary health monitoring controller)
- Node Operator Controller (Node lifecycle management controller, part of soperatorchecks)
- Pod Lifecycle Controller (Force-deletes unhealthy controller pods so they are recreated on healthy nodes)
Architecture Components
1. Soperator SlurmCluster Controller
Purpose: Creates the Slurm controller.
Functions:
- Creates the Slurm cluster with a single replica
- Creates placeholder pods if the replica count is greater than one. These pods perform no work and have a lower priority than the main controller pod; their purpose is to keep the image pre-pulled so that relocating the controller pod does not cause downtime (e.g., if the registry becomes unavailable or the network fails)
- The main pod should have a liveness probe with `scontrol ping` in case the Health Check Controller becomes unavailable for any reason (see the sketch below)
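For illustration only, such a probe could be built in Go with the k8s.io/api types; the timings, the function name, and the assumption that `scontrol ping` exits non-zero when slurmctld is unreachable are not taken from soperator:

```go
// Package controller: hypothetical sketch of the liveness probe the
// SlurmCluster controller could attach to the main slurmctld container.
package controller

import corev1 "k8s.io/api/core/v1"

// slurmctldLivenessProbe builds an exec-based probe that fails when the local
// slurmctld stops answering `scontrol ping` (assumed to exit non-zero then).
func slurmctldLivenessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"scontrol", "ping"},
			},
		},
		// Illustrative timings, not soperator defaults.
		InitialDelaySeconds: 30,
		PeriodSeconds:       15,
		TimeoutSeconds:      10,
		FailureThreshold:    3,
	}
}
```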
2. Health Check Controller
Purpose: Monitor the health of SLURM controllers and kubelets
Functions:
- Check SLURM controller availability via ping
- Verify SLURM controller response to the `sinfo` command
- Monitor kubelet status on nodes
- Update Custom Resources with health information
- Detect unresponsive SLURM controllers
- Force delete pods
- Prevent job queue data loss
Health Checks:
Check Algorithm:
- Every n seconds (15 by default), check SLURM controller availability
- If ping fails OR sinfo doesn't respond → mark pod as unhealthy
- Check kubelet status via Kubernetes API
- Record results in the `SlurmControllerHealth` CR (see the sketch below)
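A minimal Go sketch of one check iteration, assuming the checks run where the SLURM CLI is available and the controller's node name is known; the pod-exec plumbing and the CR status update are elided, and all names are hypothetical:

```go
// Package health: illustrative single iteration of the check algorithm above.
package health

import (
	"context"
	"os/exec"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// runSlurmCheck returns true when the given command exits successfully.
func runSlurmCheck(ctx context.Context, name string, args ...string) bool {
	return exec.CommandContext(ctx, name, args...).Run() == nil
}

// kubeletReady reports whether the node hosting the controller pod has a
// Ready condition with status True.
func kubeletReady(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	return false, nil
}

// checkOnce performs a single iteration: ping, sinfo, kubelet status.
// The caller then records the result in the SlurmControllerHealth CR status.
func checkOnce(ctx context.Context, c kubernetes.Interface, nodeName string) (pingOK, sinfoOK, kubeletOK bool) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	pingOK = runSlurmCheck(ctx, "scontrol", "ping")
	sinfoOK = runSlurmCheck(ctx, "sinfo")
	kubeletOK, _ = kubeletReady(ctx, c, nodeName)
	return pingOK, sinfoOK, kubeletOK
}
```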
3. Node Operator Controller
Purpose: Manage node lifecycle based on their health status
Functions:
- Track nodes in the NotReady state via the `SlurmControllerHealth` CR
- Count failure occurrences on nodes within time windows
- Delete problematic nodes with protection against frequent deletions
- Manage cooldown periods and counters
Workflow Logic:
Counters and Timeouts:
- Failure Threshold: 3 failures within 15 seconds
- Delete Cooldown: 30 minutes between node deletions
- Max Deletions: Maximum 2 node deletions within 24 hours
- Manual Reset: via the `health.slurm.nebius.io/manual-reset=true` annotation (see the sketch below)
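A sketch of how these counters and timeouts could gate node deletion; the `DeletionPolicy` type, its field names, and the `shouldDeleteNode` helper are assumptions for illustration, not the soperatorchecks API:

```go
// Package nodeop: illustrative gating logic for node deletion.
package nodeop

import "time"

// DeletionPolicy mirrors the counters and timeouts listed above.
type DeletionPolicy struct {
	FailureThreshold int           // e.g. 3 failures
	FailureWindow    time.Duration // e.g. 15 * time.Second
	DeleteCooldown   time.Duration // e.g. 30 * time.Minute
	MaxDeletions     int           // e.g. 2
	DeletionWindow   time.Duration // e.g. 24 * time.Hour
}

// shouldDeleteNode decides whether a node may be deleted given its recent
// failure timestamps and the history of prior node deletions.
func shouldDeleteNode(p DeletionPolicy, now time.Time, failures, deletions []time.Time, manualReset bool) bool {
	if manualReset {
		// health.slurm.nebius.io/manual-reset=true clears the deletion history.
		deletions = nil
	}
	recentFailures := countSince(failures, now.Add(-p.FailureWindow))
	recentDeletions := countSince(deletions, now.Add(-p.DeletionWindow))

	if recentFailures < p.FailureThreshold {
		return false // not enough evidence that the node is broken
	}
	if recentDeletions >= p.MaxDeletions {
		return false // deletion budget exhausted; wait for manual reset
	}
	if len(deletions) > 0 && now.Sub(deletions[len(deletions)-1]) < p.DeleteCooldown {
		return false // still inside the cooldown after the previous deletion
	}
	return true
}

// countSince counts timestamps newer than the cutoff.
func countSince(ts []time.Time, cutoff time.Time) int {
	n := 0
	for _, t := range ts {
		if t.After(cutoff) {
			n++
		}
	}
	return n
}
```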
Custom Resources
SlurmControllerHealth
```yaml
apiVersion: health.slurm.io/v1
kind: SlurmControllerHealth
metadata:
  name: slurm-controller-primary
spec:
  checkInterval: 50s
status:
  nodeName: worker-node-1
  podName: slurm-controller-primary-0
  lastPingCheck: "2025-01-28T10:00:00Z"
  lastSinfoCheck: "2025-01-28T10:00:00Z"
  pingHealthy: true
  sinfoHealthy: false
  overallHealthy: false
  consecutiveFailures: 3
```
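For reference, kubebuilder-style Go types backing the CR above might look like this; the field names follow the YAML example, while everything else (package layout, markers) is an assumption:

```go
// Package v1: possible type definitions for the SlurmControllerHealth CRD.
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// SlurmControllerHealthSpec configures how often checks run.
type SlurmControllerHealthSpec struct {
	// CheckInterval is how often ping/sinfo/kubelet checks run, e.g. "50s".
	CheckInterval metav1.Duration `json:"checkInterval,omitempty"`
}

// SlurmControllerHealthStatus records the latest check results.
type SlurmControllerHealthStatus struct {
	NodeName            string      `json:"nodeName,omitempty"`
	PodName             string      `json:"podName,omitempty"`
	LastPingCheck       metav1.Time `json:"lastPingCheck,omitempty"`
	LastSinfoCheck      metav1.Time `json:"lastSinfoCheck,omitempty"`
	PingHealthy         bool        `json:"pingHealthy"`
	SinfoHealthy        bool        `json:"sinfoHealthy"`
	OverallHealthy      bool        `json:"overallHealthy"`
	ConsecutiveFailures int32       `json:"consecutiveFailures"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// SlurmControllerHealth is the CR shown in the YAML example above.
type SlurmControllerHealth struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SlurmControllerHealthSpec   `json:"spec,omitempty"`
	Status SlurmControllerHealthStatus `json:"status,omitempty"`
}
```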
Operating Scenarios
Scenario 1: Kubelet NotReady
- Health Check Controller detects NotReady node
- Waits 15 sec (configurable)
- Node Operator Controller checks failure counters
- If the threshold is exceeded, set a condition on the node (soperatorchecks then deletes it, since it already has the permissions and code for that; see the sketch below)
- Pod Lifecycle Controller recreates pods on healthy nodes
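A sketch of step 4, marking the node with a custom condition via client-go so that soperatorchecks can act on it; the condition type name `SlurmControllerUnhealthy` is an assumption:

```go
// Package nodecond: illustrative node-condition marking for Scenario 1.
package nodecond

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markNodeUnhealthy sets (or updates) a SlurmControllerUnhealthy condition on
// the node's status so a downstream controller can delete the node.
func markNodeUnhealthy(ctx context.Context, c kubernetes.Interface, nodeName, reason string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("SlurmControllerUnhealthy"), // assumed name
		Status:             corev1.ConditionTrue,
		Reason:             reason,
		Message:            "SLURM controller health checks failed repeatedly",
		LastTransitionTime: metav1.Now(),
	}

	replaced := false
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == cond.Type {
			node.Status.Conditions[i] = cond
			replaced = true
			break
		}
	}
	if !replaced {
		node.Status.Conditions = append(node.Status.Conditions, cond)
	}

	// Conditions live in the status subresource, so write via UpdateStatus.
	_, err = c.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```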
Scenario 2: SLURM Controller Unresponsive
- Health Check Controller tests ping and sinfo
- On failure of both checks - marks pod as unhealthy
- Pod Lifecycle Controller force deletes the pod (see the sketch below)
- StatefulSet automatically recreates the pod
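A minimal client-go sketch of the force deletion in step 3; the UID precondition is an extra safety measure assumed here, not something the text prescribes:

```go
// Package poddelete: illustrative force deletion of a stuck controller pod.
package poddelete

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/utils/ptr"
)

// forceDeletePod removes the pod immediately (gracePeriodSeconds=0). The UID
// precondition guards against deleting a replacement pod with the same name;
// the StatefulSet then recreates the pod on a healthy node.
func forceDeletePod(ctx context.Context, c kubernetes.Interface, namespace, name string, uid types.UID) error {
	return c.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: ptr.To[int64](0),
		Preconditions:      &metav1.Preconditions{UID: &uid},
	})
}
```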
Scenario 3: Frequent Node Failures
- Node Operator Controller tracks deletion count
- When limit exceeded (2 within 24 hours) - blocks deletions
- Requires manual reset via annotation
- Administrator can reset counter after analysis
Definition of Done
- The controller stores its state on SSD disks
- There are health checks to monitor the controller's functionality, and the pod and/or Kubernetes node are recreated if the checks fail
- The controller pod has a high scheduling priority
- Optionally, add the CR schema to kube-state-metrics so that its status can be exposed as metrics