
Nodes permanently marked DOWN on resume failure (e.g., stockouts) instead of returning to pool #4940

@casassg

Description

Describe the bug

When the Slurm GCP controller's resume.py script encounters a GCP API error during instance creation (specifically bulkInsert), such as ZONE_RESOURCE_POOL_EXHAUSTED (a stockout), it calls down_nodes_notify_jobs, which explicitly executes scontrol update nodename=... state=down.

This behavior permanently marks the node as DOWN in Slurm, requiring manual administrator intervention (scontrol update node=... state=resume) to make the node available again. This defeats the purpose of an autoscaling cluster during transient errors like stockouts, as Slurm cannot retry provisioning the node or allocate a different node for the pending job once the node is marked DOWN.
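
For illustration, the failure path looks roughly like the simplified sketch below. This is a reconstruction, not the actual resume.py code: bulk_insert_instances and resume_nodes are hypothetical stand-ins, and only the scontrol invocation mirrors the real call in down_nodes_notify_jobs.

import subprocess

def bulk_insert_instances(nodelist: str) -> None:
    # Stand-in for the GCP compute bulkInsert call; here it simply simulates a stockout.
    raise RuntimeError("ZONE_RESOURCE_POOL_EXHAUSTED: not enough resources available")

def down_nodes_notify_jobs(nodelist: str, reason: str) -> None:
    # Marks the nodes DOWN; they stay DOWN until an admin runs
    # `scontrol update node=... state=resume`.
    reason_quoted = f'"{reason}"'
    subprocess.run(
        f"scontrol update nodename={nodelist} state=down reason={reason_quoted}",
        shell=True,
        check=False,
    )

def resume_nodes(nodelist: str) -> None:
    try:
        bulk_insert_instances(nodelist)
    except Exception as e:  # e.g. ZONE_RESOURCE_POOL_EXHAUSTED during a stockout
        down_nodes_notify_jobs(nodelist, f"GCP Error: {e}")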

Steps to reproduce

Steps to reproduce the behavior:

  1. Configure a Slurm cluster using the schedmd-slurm-gcp-v6-controller module.
  2. Submit a batch job that requests nodes in a region/zone currently experiencing high demand or stockouts (to trigger ZONE_RESOURCE_POOL_EXHAUSTED).
  3. Observe the behavior of the resume.py script and the node state in slurmctld when the API call fails.

Expected behavior

If provisioning fails due to a cloud provider error (especially a capacity issue), the node should be marked as POWER_DOWN. This would allow Slurm's power management logic to put the node back into the idle pool (IDLE~), enabling the scheduler to retry provisioning later or select a different node for the job automatically.

Actual behavior

The node is marked as DOWN with the reason set to the GCP API error message. The node remains in this state indefinitely until an admin manually fixes it.

Version (gcluster --version)

v1.70.0

Blueprint

NA

Expanded Blueprint

NA

Output and logs

[2025-12-04T15:11:25.071] update_node: node smallslurm-a21-0 reason set to: GCP Error: ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: The zone 'projects/sq-cash-ml-slurm-prod/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.
[2025-12-04T15:11:25.071] Killing JobId=16571 on failed node smallslurm-a21-0
[2025-12-04T15:11:25.072] update_node: node smallslurm-a21-0 state set to DOWN
...
[2025-12-04T15:18:57.028] node smallslurm-a21-0 not resumed by ResumeTimeout(600), setting DOWN and POWERED_DOWN

Execution environment

  • OS: Linux (Slurm Controller)
  • Shell: bash
  • go version: N/A

Additional context

The issue is located in resume.py in the down_nodes_notify_jobs function.
Current code:

run(f"{lookup().scontrol} update nodename={nodelist} state=down reason={reason_quoted}", check=False)

The proposed fix is to change the state from down to power_down, as sketched below.
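
A minimal sketch of the proposed change, reusing the run and lookup helpers already present in resume.py; only the state token changes:

run(
    f"{lookup().scontrol} update nodename={nodelist} state=power_down reason={reason_quoted}",
    check=False,
)

With power_down, slurmctld's power-save logic can return the node to the idle pool (IDLE~) once it has powered down, so the scheduler can retry provisioning later or place the pending job on a different node instead of leaving the node DOWN until an admin intervenes.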
