Description
Describe the bug
When the Slurm GCP controller's resume.py script encounters a GCP API error during instance creation (specifically bulkInsert), such as ZONE_RESOURCE_POOL_EXHAUSTED (stockout), it calls down_nodes_notify_jobs. This function explicitly executes scontrol update nodename=... state=down.
This behavior permanently marks the node as DOWN in Slurm, requiring manual administrator intervention (scontrol update node=... state=resume) to make the node available again. This defeats the purpose of an autoscaling cluster during transient errors like stockouts, as Slurm cannot retry provisioning the node or allocate a different node for the pending job once the node is marked DOWN.
Steps to reproduce
Steps to reproduce the behavior:
- Configure a Slurm cluster using the schedmd-slurm-gcp-v6-controller module.
- Submit a batch job that requests nodes in a region/zone currently experiencing high demand or stockouts (to trigger ZONE_RESOURCE_POOL_EXHAUSTED).
- Observe the behavior of the resume.py script and the node state in slurmctld when the API call fails.
Expected behavior
If provisioning fails due to a cloud provider error (especially a capacity issue), the node should be marked as POWER_DOWN. This would allow Slurm's power management logic to put the node back into the idle pool (IDLE~), enabling the scheduler to retry provisioning later or select a different node for the job automatically.
Actual behavior
The node is marked as DOWN with the reason set to the GCP API error message. The node remains in that state indefinitely until an administrator manually resumes it.
Version (gcluster --version)
v1.70.0
Blueprint
NA
Expanded Blueprint
NA
Output and logs
[2025-12-04T15:11:25.071] update_node: node smallslurm-a21-0 reason set to: GCP Error: ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'projects/sq-cash-ml-slurm-prod/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.
[2025-12-04T15:11:25.071] Killing JobId=16571 on failed node smallslurm-a21-0
[2025-12-04T15:11:25.072] update_node: node smallslurm-a21-0 state set to DOWN
...
[2025-12-04T15:18:57.028] node smallslurm-a21-0 not resumed by ResumeTimeout(600), setting DOWN and POWERED_DOWN
Execution environment
- OS: Linux (Slurm Controller)
- Shell: bash
- go version: N/A
Additional context
The issue is located in resume.py in the down_nodes_notify_jobs function.
Current code:
run(f"{lookup().scontrol} update nodename={nodelist} state=down reason={reason_quoted}", check=False)

Proposed fix: change the state from down to power_down.
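The proposed change can be sketched as follows. This is a minimal, hypothetical illustration, not the actual resume.py code: build_update_cmd is an invented helper that only builds the scontrol command string, so the current and proposed invocations can be compared side by side.

```python
import shlex

# Hypothetical helper (not present in resume.py): builds the scontrol
# command string used to transition failed nodes.
def build_update_cmd(scontrol: str, nodelist: str, reason: str, state: str) -> str:
    reason_quoted = shlex.quote(reason)
    return f"{scontrol} update nodename={nodelist} state={state} reason={reason_quoted}"

node = "smallslurm-a21-0"
err = "GCP Error: ZONE_RESOURCE_POOL_EXHAUSTED"

# Current behavior: the node is hard-DOWN until an admin resumes it manually.
current = build_update_cmd("scontrol", node, err, state="down")

# Proposed behavior: POWER_DOWN lets slurmctld return the node to the
# powered-down idle pool (IDLE~), so the scheduler can retry provisioning
# later or place the pending job on a different node.
proposed = build_update_cmd("scontrol", node, err, state="power_down")

print(current)
print(proposed)
```

With power_down, the node re-enters Slurm's power-management cycle automatically instead of requiring a manual scontrol update node=... state=resume.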