Description
The _wait_for_job_start method in monarch/_src/job/slurm.py fails on SLURM clusters that use a different JSON format for reporting allocated nodes.
Current Behavior
The code assumes job_resources["nodes"] is a dictionary with an allocation key (SLURM 24.11+ format):
nodes_info = job_resources.get("nodes", {})
allocation = nodes_info.get("allocation", [])
hostnames = [node["name"] for node in allocation]
This fails with AttributeError: 'str' object has no attribute 'get' on clusters where job_resources["nodes"] is a string rather than a dictionary.
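The failure can be reproduced without a cluster. The sketch below is a minimal illustration, assuming a `job_resources` payload trimmed down to the `nodes` value reported in this issue; it is not the code from monarch/_src/job/slurm.py, only the same two lookups in isolation.

```python
# Minimal reproduction sketch. The payload is trimmed from the squeue --json
# output quoted in this issue; only the "nodes" key matters here.
job_resources = {"nodes": "lrdn3456"}

nodes_info = job_resources.get("nodes", {})  # a str on this cluster, not a dict

try:
    # Same lookup the current code performs; str has no .get() method.
    allocation = nodes_info.get("allocation", [])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)  # 'str' object has no attribute 'get'
```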
SLURM Version Tested
The issue was observed on a cluster running SLURM 23.11.10 (see Environment), where squeue --json returns:
"job_resources": {
  "nodes": "lrdn3456",
  "allocated_cores": 1,
  "allocated_cpus": 0,
  "allocated_hosts": 1,
  "allocated_nodes": [
    {
      "nodename": "lrdn3456",
      "cpus_used": 0,
      "memory_used": 0,
      "memory_allocated": 15400
    }
  ]
}
Note that:
- nodes is a string (node list), not a dictionary
- Hostnames are in allocated_nodes[*].nodename, not nodes.allocation[*].name
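As a quick check of the two observations above, the following sketch parses the reported fragment (wrapped in braces so it is valid JSON) and reads the hostnames from `allocated_nodes`. The variable names are illustrative only and do not come from the Monarch source.

```python
import json

# The job_resources fragment from this issue, wrapped in braces to form valid JSON.
payload = json.loads("""
{
  "job_resources": {
    "nodes": "lrdn3456",
    "allocated_cores": 1,
    "allocated_cpus": 0,
    "allocated_hosts": 1,
    "allocated_nodes": [
      {
        "nodename": "lrdn3456",
        "cpus_used": 0,
        "memory_used": 0,
        "memory_allocated": 15400
      }
    ]
  }
}
""")

job_resources = payload["job_resources"]

# "nodes" is a plain string in this format, so it cannot be indexed by key.
nodes_is_string = isinstance(job_resources["nodes"], str)

# Hostnames live under allocated_nodes[*].nodename instead.
hostnames = [node["nodename"] for node in job_resources["allocated_nodes"]]

print(nodes_is_string, hostnames)  # True ['lrdn3456']
```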
Expected Behavior
The code should handle multiple SLURM JSON formats:
- SLURM 24.11+: nodes.allocation[*].name
- Older SLURM: allocated_nodes[*].nodename
- Fallback: Parse the nodes string using scontrol show hostnames
Proposed Fix
if "RUNNING" in job_state:
    job_resources = job_info.get("job_resources", {})
    nodes_info = job_resources.get("nodes", {})
    hostnames = []
    if isinstance(nodes_info, dict) and "allocation" in nodes_info:
        # SLURM 24.11+ format
        allocation = nodes_info.get("allocation", [])
        hostnames = [node["name"] for node in allocation]
    elif "allocated_nodes" in job_resources:
        # Older SLURM format
        allocated_nodes = job_resources.get("allocated_nodes", [])
        hostnames = [node["nodename"] for node in allocated_nodes]
    elif isinstance(nodes_info, str) and nodes_info:
        # Fallback: nodes is a string, expand using scontrol
        import subprocess
        result = subprocess.run(
            ['scontrol', 'show', 'hostnames', nodes_info],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            hostnames = [h for h in result.stdout.strip().split('\n') if h]
Environment
- Monarch version: torchmonarch-nightly
- SLURM version: slurm 23.11.10-BullSequana.1.2.1