Description
Hello! I am running into an issue when running multiple nested Slurm jobs. I have an outer Slurm job (in this case a model training job) that internally submits another Slurm job (in this case a job that preprocesses and prepares data). When I run several of these outer jobs at the same time, via a grid-search style submission loop with different training parameters, the inner jobs (which are identical) conflict and one of them fails or is cancelled. The outer training job then hangs forever, waiting for an inner job that has already been cancelled.
I am not sure of the exact mechanism causing the issue; could it be this logic? The symptom I observe is that when I submit a multi-job grid search script that launches 80 jobs, ~20 of them hang forever with this issue, and in my system logs I can see that their inner preprocessing jobs get cancelled only 5 seconds after submission. As a debugging step I tried adding a sleep(1) statement to my job submission loop to give the jobs a buffer so they would not overlap, but that did not help.
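To illustrate the hang itself: a minimal sketch of how the outer job could bound its wait on the inner job instead of blocking indefinitely. This is not what my code or submitit does today; `wait_until_done` and its parameters are hypothetical names, and `is_done` stands in for whatever completion check is available.

```python
import time
from typing import Callable


def wait_until_done(
    is_done: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 5.0,
) -> bool:
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True if completion was observed in time, False on timeout.
    (Hypothetical helper, not part of submitit.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_s, remaining))
```

With submitit, I assume `inner_job.done` could be passed as `is_done`, and the outer job could raise or resubmit when the function returns False. Note that this only bounds the wait; it does not explain why the inner job is cancelled, and the status it polls may itself be inaccurate while `sacct` is failing.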
Another thing I have noticed that may be related: whenever I have nested Slurm jobs, I get lots of these warning messages in my log. Any idea what might be causing them? Is there some configuration in my Slurm cluster that I need to change?
submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:03.202337 - WARNING - submitit:135 [slurm_51557] - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:06.454158 - WARNING - submitit:135 [slurm_51557] - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
2025-09-11T09:31:12.497061 - WARNING - submitit:135 [slurm_51557] - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
submitit WARNING (2025-09-11 09:31:24,544) - Call #6 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
In my slurm .err file these look like:
submitit WARNING (2025-09-11 09:31:01,161) - Call #2 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:03,202) - Call #3 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:06,454) - Call #4 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
submitit WARNING (2025-09-11 09:31:12,497) - Call #5 - Bypassing sacct error Command '['sacct', '-o', 'JobID,State,NodeList', '--parsable2', '-j', '51627', '-j', '51631']' returned non-zero exit status 1., status may be inaccurate.
sacct: error: _open_persist_conn: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
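The "Connection refused" lines suggest that `sacct`, when run from inside the outer job, tries to reach the accounting database at `localhost:6819` and finds nothing listening there. A small diagnostic sketch that could be run on both the login node and a compute node to compare, assuming this is a plain TCP reachability problem (`can_connect` is a hypothetical helper name; the host and port are copied from the error message):

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # localhost:6819 is the address sacct reports in the errors above.
    print("accounting port reachable:", can_connect("localhost", 6819))
```

If this prints True on the node where I submit the grid search but False inside a running job, that would point at the cluster's accounting configuration rather than at submitit, though I have not confirmed this.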