This repository was archived by the owner on Sep 19, 2022. It is now read-only.
Training hangs after one of the master/worker pods restarts #359
Open
Description
Hello!
I'm setting up distributed training with PyTorchJobs and I've run into a problem: if one of the pods (master or worker, it doesn't matter which) restarts, the whole training process hangs. The restarts happen for various reasons, usually Google Compute Engine node rescheduling. I also tried killing pods myself, and the behavior was the same.
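For context, here is a minimal sketch of the kind of training script I run inside the job (the model, data, and hyperparameters are placeholders). It relies on the standard `env://` rendezvous, using the `MASTER_ADDR`/`MASTER_PORT`/`RANK`/`WORLD_SIZE` variables that, as far as I understand, the operator injects into each pod:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Static rendezvous: all WORLD_SIZE ranks must be present to form the
    # group, and a rank that restarts cannot rejoin the existing one.
    dist.init_process_group(backend="gloo", init_method="env://")

    model = torch.nn.Linear(10, 1)          # placeholder model
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for step in range(1000):
        inputs = torch.randn(32, 10)        # placeholder data
        targets = torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                     # gradient allreduce: this is where the
                                            # remaining ranks seem to hang once a
                                            # peer pod has been restarted
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

My understanding is that with this static setup a restarted pod comes back as a fresh process that cannot rejoin the old process group, while the surviving ranks keep waiting in the collective, so the whole job stalls.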
Is there a way to avoid this behavior and make the training tolerant to pod restarts?