This repository was archived by the owner on Sep 19, 2022. It is now read-only.

Different DDP training results of PytorchJob and Bare Metal #354

@Shuai-Xie

Dear developers, I've run into a new problem.

I've compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.

Experiment settings

  • Two V100 GPU machines, 48 and 49. Each has 4 cards, so we have 8 GPUs in total.
  • DDP training of resnet18 on the MNIST dataset with batch-size=256 and epochs=1
  • Random seed set to 1 (a seeding sketch follows this list)
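
For reference, the seeding is along these lines. The exact code in my scripts is not shown in this issue, so treat this as a sketch of the setup rather than a verbatim excerpt:

import random
import numpy as np
import torch

def set_seed(seed: int = 1):
    # Seed every RNG that influences data order and weight init.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic so repeated runs match exactly.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False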

Experiment results

1. BM DDP training

I recorded the training process for three ways of launching DDP training (a minimal sketch of the mp.spawn() entry point follows the list).

  • torch.distributed.launch with default init_method env://
  • mp.spawn() with tcp://
  • mp.spawn() with file://
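
A minimal sketch of how such an mp.spawn() entry point is commonly structured. The argument names (--nproc_per_node, --nnodes, --node_rank, --dist-url) are taken from the commands below; the rest is my assumption about mnist_ddp_mp.py, not a verbatim excerpt:

import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, args):
    # Global rank = node_rank * nproc_per_node + local_rank.
    rank = args.node_rank * args.nproc_per_node + local_rank
    world_size = args.nnodes * args.nproc_per_node
    dist.init_process_group(backend="nccl",
                            init_method=args.dist_url,  # tcp://... or file://...
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build resnet18, wrap it in DistributedDataParallel, train ...

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--nproc_per_node", type=int, default=1)
    p.add_argument("--nnodes", type=int, default=1)
    p.add_argument("--node_rank", type=int, default=0)
    p.add_argument("--dist-url", type=str, required=True)
    args = p.parse_args()
    # One process per local GPU; each receives its local_rank as the first argument.
    mp.spawn(worker, nprocs=args.nproc_per_node, args=(args,))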

And the results are below.

(1.1) 2 machines, nproc_per_node=4, nnodes=2

# launch
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 37.32453942298889		# 18.465958833694458
best_acc: 64.12

# mp tcp
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 41.56801748275757
best_acc: 64.12

# mp file
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile"  --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile"  --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 41.899426221847534
best_acc: 64.12

(1.2) 2 machines, nproc_per_node=2, nnodes=4

# launch
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=0 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=1 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=2 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=3 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 38.14672040939331
best_acc: 64.12

# mp tcp
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 34.46080470085144
best_acc: 64.12

(1.3) 2 machines, nproc_per_node=1, nnodes=8

# mp tcp
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=4 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=5 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=6 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=7 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 42.66786456108093
best_acc: 64.12

The 3 experiments above show that:

  • When the total number of processes (nproc_per_node * nnodes) is the same (e.g. 8 in this setting), the training process does not depend on the number of distributed nodes nnodes: the training loss is reproduced and the test accuracies are identical. (See the DistributedSampler sketch after this list.)
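
My reading of why this holds (an assumption, not verified against the scripts): with a fixed seed, DistributedSampler shards the dataset purely by (num_replicas, rank), so the physical placement of ranks across machines does not change what each rank trains on. A small standalone sketch:

from torch.utils.data import DistributedSampler

# Dummy stand-in for MNIST (60000 samples); only len() matters here.
dataset = list(range(60000))

# The shard each process sees depends only on (seed, epoch, num_replicas, rank),
# not on which physical node or pod the rank happens to run on.
shards = []
for rank in range(8):  # 8 total processes, however they are spread across nodes
    sampler = DistributedSampler(dataset, num_replicas=8, rank=rank,
                                 shuffle=True, seed=1)
    sampler.set_epoch(0)  # keeps the shuffle consistent across ranks per epoch
    shards.append(list(sampler))

# The 8 shards partition the dataset the same way for 2x4, 4x2 or 8x1 layouts.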

2. PJ DDP training

When using PJ for DDP training, I expected to see the same results as with BM.

However, the experiment results confuse me.

Before running the same experiment group as on BM, I used the recommended way to launch DDP training.

The YAML file is below.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "mnist-ddp"
  namespace: "default"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 1
          hostIPC: true
          hostNetwork: true
          dnsPolicy: "ClusterFirstWithHostNet"
    Worker:
      replicas: 7
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 1
          hostIPC: true
          hostNetwork: true
          dnsPolicy: "ClusterFirstWithHostNet"

It launches 8 pods, which is similar to experiment (1.3).
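
With this recommended launch (one GPU per pod, no explicit launcher), mnist_ddp_launch.py presumably relies on the env:// init method and the MASTER_ADDR / MASTER_PORT / WORLD_SIZE / RANK environment variables that the PyTorch operator injects into each replica. A rough sketch of that init path, under that assumption:

import os
import torch
import torch.distributed as dist

# MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are injected by the
# PyTorch operator into every Master/Worker replica, so env:// init
# needs no extra arguments here.
dist.init_process_group(backend="nccl", init_method="env://")

# With one GPU per pod there is only one local device; LOCAL_RANK may
# not be set, so fall back to 0.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")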

However, I get the results below, which are quite different from the BM results.

# pod network
Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 28.12745976448059
best_acc: 67.97

# host network
Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 27.12745976448059
best_acc: 67.97

At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.

However, the following experiments show that this is not the cause.

We set hostNetwork=true in all the experiments below.

(2.1) 2 Pods * 4 cards

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.9048			# same as all BM results
training seconds: 48.71152639389038
best_acc: 64.12

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.9048			# same as all BM results
training seconds: 51.17721652984619
best_acc: 64.12

(2.2) 4 Pods * 2 cards

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.8723
training seconds: 48.09228801727295
best_acc: 39.76

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.6989			
training seconds: 52.30190896987915
best_acc: 67.97

(2.3) 8 Pods * 1 card

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 26.12745976448059
best_acc: 67.97

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.8723
training seconds: 52.18285155296326
best_acc: 39.76

Only exp (2.1) gives the same results as BM. It really confuses me.

Dear developers, please let me know if I have made any mistakes.

Thanks a lot.

Happy Mid-Autumn Festival!
