This repository was archived by the owner on Sep 19, 2022. It is now read-only.

PytorchJob DDP training will stop if I delete a worker pod #364

Open
@Shuai-Xie

Description


Hi, everyone.

I want to test the fault tolerance of PytorchJob.

I started a PytorchJob with 1 master and 3 workers.
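The original spec was not posted; a minimal PyTorchJob manifest for this setup might look like the sketch below (the job name matches the pod names above, but the container name and image are placeholders):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: mnist-ddp:latest   # placeholder image
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: mnist-ddp:latest   # placeholder image
```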

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   11.80.0.36   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   11.80.0.37   11.71.1.160
mnist-ddp-worker-1   1/1     Running   0          2m55s   11.80.0.38   11.71.1.160
mnist-ddp-worker-2   1/1     Running   0          89s     11.80.0.46   11.71.1.160

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Since I set restartPolicy: OnFailure, the deleted pod restarts quickly under the same name, mnist-ddp-worker-1.

But sadly, the restarted worker never rejoins the DDP training.
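This is expected with plain torch.distributed: the operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into each pod, and init_process_group performs a one-shot rendezvous across exactly WORLD_SIZE processes. A sketch of the usual env:// initialization (the environment defaults below are only there so the snippet runs standalone; in the cluster the operator sets them):

```python
import os
import torch.distributed as dist

# In a PytorchJob the operator injects these variables into every pod;
# the defaults here just make the sketch runnable on a single machine.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# One-shot rendezvous across exactly WORLD_SIZE processes.
dist.init_process_group(backend="gloo", init_method="env://")

# Group membership is static: if a rank dies mid-training, the surviving
# ranks hang or error in their next collective call, and a restarted pod
# that calls init_process_group again cannot rejoin the old group.
print(dist.is_initialized())

dist.destroy_process_group()
```

So a restarted worker re-runs this initialization from scratch, but the remaining ranks are still stuck in the old process group; without an elastic mechanism (e.g. torch elastic) the job cannot recover.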

Thanks.
