Nested slurm tasks can occasionally cause outer job to hang, even when inner job completes #110

@reesekneeland

Hello! I am running into an issue when running nested Slurm jobs. I have an outer Slurm job (in this case a model training job) that internally submits another Slurm job (in this case a job that preprocesses and prepares data). When I run several of these outer jobs at the same time via a grid-search-style submission loop with different training parameters, the inner jobs (which are identical) conflict and one of them fails or is cancelled. The outer training job then hangs forever, waiting for an inner job that has already been cancelled.

I am not sure of the exact mechanism causing the issue; could it be this logic? The symptom I observe is that when I submit a multi-job grid search script that launches 80 jobs, roughly 20 of them hang forever with this issue, and my system logs show that the inner preprocessing jobs for these are cancelled only 5 seconds after submission. As a debugging step, I tried adding a sleep(1) call in my job submission loop to give the jobs a buffer so they would not overlap, but that did not help.
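To make the hang itself easier to reason about, here is a minimal sketch of a deadline-based wait that an outer job could use instead of blocking indefinitely on the inner job. It is not taken from submitit or from my actual script: `wait_for_inner_job`, `timeout_s`, and `poll_s` are hypothetical names, and the only assumption about the job handle is that it exposes a `done()` method, as submitit's `Job` objects do.

```python
import time


def wait_for_inner_job(job, timeout_s=600.0, poll_s=5.0):
    """Poll job.done() until it returns True or the deadline passes.

    Returns True if the job reported completion in time, False otherwise.
    `job` is any object with a done() method (hypothetical helper, not part
    of submitit itself).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if job.done():
            return True
        time.sleep(poll_s)
    # One last check so a job finishing during the final sleep is not missed.
    return bool(job.done())
```

With a guard like this, the outer job could log the inner job's ID and exit with an error when `False` comes back, rather than waiting forever on a job that was cancelled.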

Another thing I have noticed that may be related: whenever I have nested Slurm jobs, I get lots of these warning messages in my log. Any idea what might be causing them? Is there some configuration on my Slurm cluster that I need to change?

submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:03.202337 - WARNING - submitit:135 [slurm_51557] - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:06.454158 - WARNING - submitit:135 [slurm_51557] - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:12.497061 - WARNING - submitit:135 [slurm_51557] - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:24,544) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.

In my Slurm .err file, these look like:

submitit WARNING (2025-09-11 09:31:01,161) - Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
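The errors above show sacct being refused a connection on localhost:6819 (6819 is the default slurmdbd port). A minimal sketch for checking, from inside a running job, whether sacct can reach the accounting database at all is below. The helper names `is_accounting_db_error` and `sacct_reachable` are hypothetical, the marker strings are copied from the log above, and the sacct arguments mirror the ones in the warning; whether an unreachable database is what actually causes the hang is an assumption I have not verified.

```python
import shutil
import subprocess
from typing import Optional

# Substrings copied from the sacct errors in the log above.
DB_ERROR_MARKERS = (
    "failed to open persistent connection",
    "Problem talking to the database",
)


def is_accounting_db_error(stderr: str) -> bool:
    """Return True if sacct's stderr contains one of the database errors."""
    return any(marker in stderr for marker in DB_ERROR_MARKERS)


def sacct_reachable(job_id: str) -> Optional[bool]:
    """Run sacct for one job ID and report whether the database answered.

    Returns None when sacct is not installed on this node.
    """
    if shutil.which("sacct") is None:
        return None
    proc = subprocess.run(
        ["sacct", "-o", "JobID,State,NodeList", "--parsable2", "-j", job_id],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.returncode == 0 and not is_accounting_db_error(proc.stderr)
```

Running `sacct_reachable` once on the login node and once inside an outer job would show whether the refusal only happens on compute nodes.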
