SlurmJob._wait_for_job_start fails with older SLURM versions due to different JSON format for node allocation #2330

@DWarez

Description

The _wait_for_job_start method in monarch/_src/job/slurm.py fails on SLURM clusters that use a different JSON format for reporting allocated nodes.

Current Behavior

The code assumes job_resources["nodes"] is a dictionary with an allocation key (SLURM 24.11+ format):

nodes_info = job_resources.get("nodes", {})
allocation = nodes_info.get("allocation", [])
hostnames = [node["name"] for node in allocation]

This fails with AttributeError: 'str' object has no attribute 'get' on clusters where nodes_info is a string.
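
The failure can be reproduced without a cluster. The following is a minimal sketch, assuming a job_resources payload reduced to just the nodes field (the hostname is the one from this report); it mirrors the access pattern quoted above rather than calling Monarch itself:

```python
# Payload shape reported by the affected cluster: "nodes" is a plain string.
job_resources = {"nodes": "lrdn3456"}

nodes_info = job_resources.get("nodes", {})

try:
    # Same access pattern as the snippet above: treats nodes_info as a dict.
    allocation = nodes_info.get("allocation", [])
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```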

SLURM Version Tested

The issue was observed on a cluster running SLURM 23.11.10, where squeue --json returns:

"job_resources": {
  "nodes": "lrdn3456",
  "allocated_cores": 1,
  "allocated_cpus": 0,
  "allocated_hosts": 1,
  "allocated_nodes": [
    {
      "nodename": "lrdn3456",
      "cpus_used": 0,
      "memory_used": 0,
      "memory_allocated": 15400
    }
  ]
}

Note that:

  • nodes is a string (node list), not a dictionary
  • Hostnames are in allocated_nodes[*].nodename, not nodes.allocation[*].name
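
To make the two observations concrete, here is a short sketch that parses a trimmed copy of the job_resources object shown above and reads the hostnames from allocated_nodes. The variable names are illustrative only and are not taken from the Monarch code:

```python
import json

# Trimmed copy of the job_resources object from the squeue --json output above.
raw = """
{
  "nodes": "lrdn3456",
  "allocated_hosts": 1,
  "allocated_nodes": [
    {"nodename": "lrdn3456", "cpus_used": 0, "memory_used": 0, "memory_allocated": 15400}
  ]
}
"""

job_resources = json.loads(raw)

# In this format "nodes" is a string, so hostnames come from allocated_nodes.
nodes_is_string = isinstance(job_resources["nodes"], str)
hostnames = [node["nodename"] for node in job_resources["allocated_nodes"]]

print(nodes_is_string, hostnames)
```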

Expected Behavior

The code should handle multiple SLURM JSON formats:

  1. SLURM 24.11+: nodes.allocation[*].name
  2. Older SLURM (observed on 23.11.10): allocated_nodes[*].nodename
  3. Fallback: Parse nodes string using scontrol show hostnames

Proposed Fix

if "RUNNING" in job_state:
    job_resources = job_info.get("job_resources", {})
    nodes_info = job_resources.get("nodes", {})

    hostnames = []

    if isinstance(nodes_info, dict) and "allocation" in nodes_info:
        # SLURM 24.11+ format: nodes.allocation[*].name
        allocation = nodes_info.get("allocation") or []
        hostnames = [node["name"] for node in allocation]
    elif "allocated_nodes" in job_resources:
        # Older SLURM format: allocated_nodes[*].nodename
        allocated_nodes = job_resources.get("allocated_nodes") or []
        hostnames = [node["nodename"] for node in allocated_nodes]
    elif isinstance(nodes_info, str) and nodes_info:
        # Fallback: nodes is a node list string, expand it using scontrol
        import subprocess  # could instead live with the module-level imports
        result = subprocess.run(
            ["scontrol", "show", "hostnames", nodes_info],
            capture_output=True, text=True
        )
        # On a non-zero exit status hostnames stays empty
        if result.returncode == 0:
            hostnames = [h for h in result.stdout.splitlines() if h]
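
One way to make this logic testable without a SLURM installation is to pull it into a standalone function with an injectable node list expander. This is only a sketch: extract_hostnames, expand_with_scontrol, and fake_expand are hypothetical names that do not exist in Monarch, and the sample payloads other than the lrdn3456 one are made up for illustration.

```python
import subprocess
from typing import Any, Callable, Dict, List


def expand_with_scontrol(nodelist: str) -> List[str]:
    """Expand a SLURM node list string via `scontrol show hostnames`."""
    try:
        result = subprocess.run(
            ["scontrol", "show", "hostnames", nodelist],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # scontrol is not installed on this machine.
        return []
    if result.returncode != 0:
        return []
    return [h for h in result.stdout.splitlines() if h]


def extract_hostnames(
    job_resources: Dict[str, Any],
    expand: Callable[[str], List[str]] = expand_with_scontrol,
) -> List[str]:
    """Hypothetical helper: return hostnames from any of the three formats."""
    nodes_info = job_resources.get("nodes", {})
    if isinstance(nodes_info, dict) and "allocation" in nodes_info:
        # SLURM 24.11+ format
        return [node["name"] for node in nodes_info["allocation"]]
    if "allocated_nodes" in job_resources:
        # Older SLURM format
        return [node["nodename"] for node in job_resources["allocated_nodes"]]
    if isinstance(nodes_info, str) and nodes_info:
        # Fallback: expand the node list string
        return expand(nodes_info)
    return []


def fake_expand(nodelist: str) -> List[str]:
    """Stub expander so the checks below run without SLURM (assumption)."""
    return nodelist.split(",")


new_format = {"nodes": {"allocation": [{"name": "node1"}, {"name": "node2"}]}}
old_format = {"nodes": "lrdn3456", "allocated_nodes": [{"nodename": "lrdn3456"}]}
string_only = {"nodes": "node1,node2"}

print(extract_hostnames(new_format, fake_expand))
print(extract_hostnames(old_format, fake_expand))
print(extract_hostnames(string_only, fake_expand))
```

Passing the expander as a parameter keeps the scontrol call out of unit tests while leaving the default behavior unchanged on a real cluster.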

Environment

  • Monarch version: torchmonarch-nightly
  • SLURM version: slurm 23.11.10-BullSequana.1.2.1
