Different DDP training results of PytorchJob and Bare Metal #354
Description
Dear developers, I have run into a new problem.
I compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.
Experiment settings
- Two V100 GPU machines, 48 and 49 (10.252.192.48 / 10.252.192.49). Each has 4 cards, so 8 GPUs in total.
- DDP training of ResNet-18 on the MNIST dataset with batch-size=256 and epochs=1.
- Random seed set to 1 (see the seeding sketch after this list).
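For reference, seeding presumably follows the usual pattern below; the set_seed helper and the cuDNN flags are my assumptions, not code taken from mnist_ddp_launch.py / mnist_ddp_mp.py.

import random
import numpy as np
import torch

def set_seed(seed: int = 1) -> None:
    # Seed every RNG that can influence training so that runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG (and CUDA in recent PyTorch)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False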
Experiment results
1. BM DDP training
I recorded the training process for three ways of launching DDP training (a minimal sketch of the three init methods follows this list):
- torch.distributed.launch with the default init_method env://
- mp.spawn() with tcp://
- mp.spawn() with file://
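For context, the three launch styles mainly differ in the init_method passed to dist.init_process_group. The snippet below is a minimal sketch of that difference; the init_distributed wrapper is mine, not code from the scripts.

import torch.distributed as dist

def init_distributed(method: str, rank: int, world_size: int) -> None:
    if method == "env":
        # torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK and
        # WORLD_SIZE, so env:// needs no explicit rank/world_size arguments.
        dist.init_process_group(backend="nccl", init_method="env://")
    elif method == "tcp":
        # mp.spawn() script: rendezvous at an explicit TCP address.
        dist.init_process_group(backend="nccl",
                                init_method="tcp://10.252.192.49:22222",
                                rank=rank, world_size=world_size)
    else:
        # mp.spawn() script: rendezvous through a file on shared storage (NFS).
        dist.init_process_group(
            backend="nccl",
            init_method="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile",
            rank=rank, world_size=world_size)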
And the results are below.
(1.1) 2 machines, nproc_per_node=4, nnodes=2
# launch
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 37.32453942298889 # 18.465958833694458
best_acc: 64.12
# mp tcp
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 41.56801748275757
best_acc: 64.12
# mp file
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 41.899426221847534
best_acc: 64.12
(1.2) 2 machines, nproc_per_node=2, nnodes=4
# launch
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=0 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=1 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=2 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=3 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 38.14672040939331
best_acc: 64.12
# mp tcp
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 34.46080470085144
best_acc: 64.12
(1.3) 2 machines, nproc_per_node=1, nnodes=8
# mp tcp
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=4 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=5 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=6 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=7 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 42.66786456108093
best_acc: 64.12
The 3 experiments above show that:
- When the total number of processes (nproc_per_node * nnodes) is the same (8 in this setting), training does not depend on the number of distributed nodes nnodes: the training loss is reproduced exactly and the test accuracies are equal (see the sampler sketch below).
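This is consistent with how DistributedSampler partitions the data: each shard depends only on the global world size and rank, not on how ranks are spread across nodes. A minimal sketch for one rank follows; the sampler arguments, including seed, are my assumptions based on the settings above, not taken from the scripts.

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
# The split depends only on (num_replicas, rank, seed), so 2x4, 4x2 and 8x1
# process layouts should all see identical shards.
sampler = DistributedSampler(train_set, num_replicas=8, rank=0, shuffle=True, seed=1)
loader = DataLoader(train_set, batch_size=256, sampler=sampler)
# 60000 MNIST samples / 8 replicas = 7500 per rank; ceil(7500 / 256) = 30 steps,
# matching the "[20/30]" progress lines (assuming --batch-size is per process).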
2. PJ DDP training
When using PJ DDP training, I expected to see the same results as BM.
However, the experiment results confused me.
Before running the same group of experiments as on BM, I used the recommended way to launch DDP training.
The YAML file is below.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "mnist-ddp"
namespace: "default"
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
]
resources:
limits:
nvidia.com/gpu: 1
hostIPC: true
hostNetwork: true
dnsPolicy: "ClusterFirstWithHostNet"
Worker:
replicas: 7
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
]
resources:
limits:
nvidia.com/gpu: 1
hostIPC: true
hostNetwork: true
dnsPolicy: "ClusterFirstWithHostNet"
This launches 8 pods, which is similar to experiment (1.3).
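For reference, the operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into every pod, so with this spec each single-GPU pod can presumably initialize straight from the environment, roughly as sketched below (the actual contents of mnist_ddp_launch.py may differ).

import os
import torch
import torch.distributed as dist

# PyTorchJob sets these per replica: RANK (replica index, master = 0),
# WORLD_SIZE (total replica count), and the master replica's address/port.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(0)  # each pod requests a single GPU via nvidia.com/gpu: 1
print(f"rank {rank}/{world_size} initialized")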
However, I got the results below, which are quite different from the BM results.
# pod network
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 28.12745976448059
best_acc: 67.97
# host network
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 27.12745976448059
best_acc: 67.97
At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.
However, the following experiments show that this is not the key factor.
We set hostNetwork=true in all the experiments below.
(2.1) 2 Pods * 4 cards
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.9048 # same as all BM results
training seconds: 48.71152639389038
best_acc: 64.12
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.9048 # same as all BM results
training seconds: 51.17721652984619
best_acc: 64.12
(2.2) 4 Pods * 2 cards
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.8723
training seconds: 48.09228801727295
best_acc: 39.76
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 52.30190896987915
best_acc: 67.97
(2.3) 8 Pods * 1 card
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 26.12745976448059
best_acc: 67.97
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.8723
training seconds: 52.18285155296326
best_acc: 39.76
Only exp (2.1) reproduces the BM results, which really confuses me.
Dear developers, please let me know if I have made any mistakes.
Thanks a lot.
Happy Mid-Autumn Festival!