
Nodes permanently marked DOWN on resume failure (e.g., stockouts) instead of returning to pool #4940

@casassg

Description

Describe the bug

When the Slurm GCP controller's resume.py script encounters a GCP API error during instance creation (specifically bulkInsert), such as ZONE_RESOURCE_POOL_EXHAUSTED (a stockout), it calls down_nodes_notify_jobs, which explicitly executes scontrol update nodename=... state=down.

This behavior permanently marks the node as DOWN in Slurm, requiring manual administrator intervention (scontrol update node=... state=resume) to make the node available again. This defeats the purpose of an autoscaling cluster during transient errors like stockouts, as Slurm cannot retry provisioning the node or allocate a different node for the pending job once the node is marked DOWN.
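
For illustration, the failure path looks roughly like the simplified sketch below. This is a reconstruction, not the actual resume.py code: bulk_insert_instances and resume_nodes are hypothetical stand-ins, and only the scontrol invocation mirrors the real call in down_nodes_notify_jobs.

import subprocess

def bulk_insert_instances(nodelist: str) -> None:
    # Stand-in for the GCP compute bulkInsert call; here it simply simulates a stockout.
    raise RuntimeError("ZONE_RESOURCE_POOL_EXHAUSTED: not enough resources available")

def down_nodes_notify_jobs(nodelist: str, reason: str) -> None:
    # Marks the nodes DOWN; they stay DOWN until an admin runs
    # `scontrol update node=... state=resume`.
    reason_quoted = f'"{reason}"'
    subprocess.run(
        f"scontrol update nodename={nodelist} state=down reason={reason_quoted}",
        shell=True,
        check=False,
    )

def resume_nodes(nodelist: str) -> None:
    try:
        bulk_insert_instances(nodelist)
    except Exception as e:  # e.g. ZONE_RESOURCE_POOL_EXHAUSTED during a stockout
        down_nodes_notify_jobs(nodelist, f"GCP Error: {e}")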

Steps to reproduce

Steps to reproduce the behavior:

  1. Configure a Slurm cluster using the schedmd-slurm-gcp-v6-controller module.
  2. Submit a batch job that requests nodes in a region/zone currently experiencing high demand or stockouts (to trigger ZONE_RESOURCE_POOL_EXHAUSTED).
  3. Observe the behavior of the resume.py script and the node state in slurmctld when the API call fails.

Expected behavior

If provisioning fails due to a cloud provider error (especially a capacity issue), the node should be marked as POWER_DOWN. This would allow Slurm's power management logic to put the node back into the idle pool (IDLE~), enabling the scheduler to retry provisioning later or select a different node for the job automatically.

Actual behavior

The node is marked as DOWN with the reason set to the GCP API error message. The node remains in this state indefinitely until an admin manually fixes it.

Version (gcluster --version)

v1.70.0

Blueprint

NA

Expanded Blueprint

NA

Output and logs

[2025-12-04T15:11:25.071] update_node: node smallslurm-a21-0 reason set to: GCP Error: ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'projects/sq-cash-ml-slurm-prod/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.
[2025-12-04T15:11:25.071] Killing JobId=16571 on failed node smallslurm-a21-0
[2025-12-04T15:11:25.072] update_node: node smallslurm-a21-0 state set to DOWN
...
[2025-12-04T15:18:57.028] node smallslurm-a21-0 not resumed by ResumeTimeout(600), setting DOWN and POWERED_DOWN

Execution environment

  • OS: Linux (Slurm Controller)
  • Shell: bash
  • go version: N/A

Additional context

The issue is located in resume.py in the down_nodes_notify_jobs function.
Current code:

run(f"{lookup().scontrol} update nodename={nodelist} state=down reason={reason_quoted}", check=False)

The proposed fix is to change the state from down to power_down, as sketched below.
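
A minimal sketch of the proposed change, reusing the run and lookup helpers already present in resume.py; only the state token changes:

run(
    f"{lookup().scontrol} update nodename={nodelist} state=power_down reason={reason_quoted}",
    check=False,
)

With power_down, slurmctld's power-save logic can return the node to the idle pool (IDLE~) once it has powered down, so the scheduler can retry provisioning later or place the pending job on a different node instead of leaving the node DOWN until an admin intervenes.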
