Different DDP training results of PytorchJob and Bare Metal #354
Description
Dear developers, I have run into a new problem.
I compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.
Experiment settings
- Two V100 GPU machines, 48 and 49 (10.252.192.48 / 10.252.192.49). Each has 4 cards, so 8 GPUs in total.
- DDP training of ResNet-18 on the MNIST dataset with batch-size=256 and epochs=1.
- Random seed set to 1 (see the seeding sketch after this list).
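For reference, seeding presumably follows the usual pattern below; the set_seed helper and the cuDNN flags are my assumptions, not code taken from mnist_ddp_launch.py / mnist_ddp_mp.py.

import random
import numpy as np
import torch

def set_seed(seed: int = 1) -> None:
    # Seed every RNG that can influence training so that runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG (and CUDA in recent PyTorch)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False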
Experiment results
1. BM DDP training
I recorded the training process for three ways of launching DDP training (a minimal sketch of the three init methods follows this list):
- torch.distributed.launch with the default init_method env://
- mp.spawn() with tcp://
- mp.spawn() with file://
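For context, the three launch styles mainly differ in the init_method passed to dist.init_process_group. The snippet below is a minimal sketch of that difference; the init_distributed wrapper is mine, not code from the scripts.

import torch.distributed as dist

def init_distributed(method: str, rank: int, world_size: int) -> None:
    if method == "env":
        # torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK and
        # WORLD_SIZE, so env:// needs no explicit rank/world_size arguments.
        dist.init_process_group(backend="nccl", init_method="env://")
    elif method == "tcp":
        # mp.spawn() script: rendezvous at an explicit TCP address.
        dist.init_process_group(backend="nccl",
                                init_method="tcp://10.252.192.49:22222",
                                rank=rank, world_size=world_size)
    else:
        # mp.spawn() script: rendezvous through a file on shared storage (NFS).
        dist.init_process_group(
            backend="nccl",
            init_method="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile",
            rank=rank, world_size=world_size)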
And the results are below.
(1.1) 2 machines, nproc_per_node=4, nnodes=2
# launch
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 37.32453942298889 # 18.465958833694458
best_acc: 64.12
# mp tcp
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 41.56801748275757
best_acc: 64.12
# mp file
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 41.899426221847534
best_acc: 64.12
(1.2) 2 machines, nproc_per_node=2, nnodes=4
# launch
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=0 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=1 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=2 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=3 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 38.14672040939331
best_acc: 64.12
# mp tcp
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 34.46080470085144
best_acc: 64.12
(1.3) 2 machines, nproc_per_node=1, nnodes=8
# mp tcp
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=4 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=5 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=6 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=7 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 42.66786456108093
best_acc: 64.12
The 3 experiments above show that:
- When the total number of processes (nproc_per_node * nnodes) is the same (8 in this setting), training does not depend on the number of distributed nodes nnodes: the training loss is reproduced exactly and the test accuracies are equal (see the sampler sketch below).
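This is consistent with how DistributedSampler partitions the data: each shard depends only on the global world size and rank, not on how ranks are spread across nodes. A minimal sketch for one rank follows; the sampler arguments, including seed, are my assumptions based on the settings above, not taken from the scripts.

import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
# The split depends only on (num_replicas, rank, seed), so 2x4, 4x2 and 8x1
# process layouts should all see identical shards.
sampler = DistributedSampler(train_set, num_replicas=8, rank=0, shuffle=True, seed=1)
loader = DataLoader(train_set, batch_size=256, sampler=sampler)
# 60000 MNIST samples / 8 replicas = 7500 per rank; ceil(7500 / 256) = 30 steps,
# matching the "[20/30]" progress lines (assuming --batch-size is per process).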
2. PJ DDP training
When using PJ DDP training, I expected to see the same results as BM.
However, the experiment results confused me.
Before running the same group of experiments as on BM, I used the recommended way to launch DDP training.
The YAML file is below.
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
name: "mnist-ddp"
namespace: "default"
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
]
resources:
limits:
nvidia.com/gpu: 1
hostIPC: true
hostNetwork: true
dnsPolicy: "ClusterFirstWithHostNet"
Worker:
replicas: 7
restartPolicy: Never
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- name: pytorch
image: shuaix/pytorch-dist-mnist:1.0
imagePullPolicy: IfNotPresent
command:
[
"python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
]
resources:
limits:
nvidia.com/gpu: 1
hostIPC: true
hostNetwork: true
dnsPolicy: "ClusterFirstWithHostNet"
This launches 8 pods, which is similar to experiment (1.3).
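For reference, the operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into every pod, so with this spec each single-GPU pod can presumably initialize straight from the environment, roughly as sketched below (the actual contents of mnist_ddp_launch.py may differ).

import os
import torch
import torch.distributed as dist

# PyTorchJob sets these per replica: RANK (replica index, master = 0),
# WORLD_SIZE (total replica count), and the master replica's address/port.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(0)  # each pod requests a single GPU via nvidia.com/gpu: 1
print(f"rank {rank}/{world_size} initialized")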
However, I got the results below, which are quite different from the BM results.
# pod network
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 28.12745976448059
best_acc: 67.97
# host network
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 27.12745976448059
best_acc: 67.97
At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.
However, the following experiments show that this is not the key factor.
We set hostNetwork=true in all the experiments below.
(2.1) 2 Pods * 4 cards
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.9048 # same as all BM results
training seconds: 48.71152639389038
best_acc: 64.12
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.9048 # same as all BM results
training seconds: 51.17721652984619
best_acc: 64.12
(2.2) 4 Pods * 2 cards
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.8723
training seconds: 48.09228801727295
best_acc: 39.76
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 52.30190896987915
best_acc: 67.97
(2.3) 8 Pods * 1 card
# launch.py
# container command
[
"sh",
"-c",
"python -m torch.distributed.launch --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 26.12745976448059
best_acc: 67.97
# mp tcp
# container command
[
"sh",
"-c",
"python mnist_ddp_mp.py --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]
Train Epoch: 0 [20/30] loss=0.8723
training seconds: 52.18285155296326
best_acc: 39.76
Only exp (2.1) reproduces the BM results, which really confuses me.
Dear developers, please let me know if I have made any mistakes.
Thanks a lot.
Happy Mid-Autumn Festival!