Graceful shutdown when using DDP on SLURM #20649
Unanswered
Unturned3 asked this question in DDP / multi-GPU / multi-node
Replies: 0 comments
How can we gracefully terminate a Lightning DDP training run on SLURM? Simply running `scancel <jobid>` doesn't seem to trigger a "graceful" shutdown the way Ctrl-C does in an interactive, single-GPU session. For example, Weights & Biases still thinks the run is alive and later displays "Crashed" instead of "Finished" (which is what it shows after Ctrl-C).
In general, I'm confused about how Lightning handles graceful shutdowns; the documentation seems quite sparse on this topic. Thanks in advance for any help or suggestions!
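To make the question concrete, here is a minimal sketch of the behavior I'm after, written with only the Python standard library. It relies on the fact that `scancel` sends SIGTERM by default (followed by SIGKILL after the cluster's grace period), whereas Ctrl-C sends SIGINT. The names `stop_requested`, `_handle_sigterm`, and `train` are hypothetical and not part of Lightning. I'm assuming that in a real run the flag would be checked from something like a callback that sets `trainer.should_stop = True`, and that every DDP rank would need to receive the signal; I haven't verified how this interacts with Lightning's own signal handling.

```python
import signal
import threading

# Hypothetical flag: set when SIGTERM arrives, polled by the training loop.
stop_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Only record the request here; the actual cleanup happens in the loop.
    stop_requested.set()


# Must be called from the main thread of the process.
signal.signal(signal.SIGTERM, _handle_sigterm)


def train(max_steps, on_step=None):
    """Toy stand-in for a training loop; returns the number of completed steps."""
    steps_done = 0
    for step in range(max_steps):
        if stop_requested.is_set():
            # Leave the loop cleanly instead of being killed mid-step.
            break
        if on_step is not None:
            on_step(step)
        steps_done += 1
    # Cleanup that should run on both normal and early exit would go here,
    # e.g. saving a checkpoint or closing the experiment logger.
    return steps_done
```

The handler only sets a flag, so the loop can finish its current step and run the cleanup before the process exits. Whether that cleanup completes on a real cluster depends on it fitting within the grace period before SIGKILL.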