This repository was archived by the owner on Sep 19, 2022. It is now read-only.
Pytorch version may have an effect on the training reproduction #355
Open
Description
I am trying to figure out why Bare Metal (BM) and PyTorchJob (PJ) produce different training results in #354 (comment).
I have now found that PyTorch v1.8.0 and v1.9.0 produce different training results on both BM and PJ.
Experiment settings
- Two V100 GPU machines (48/49) with 4 cards each, 8 GPUs in total.
- DDP training of resnet18 on the MNIST dataset with batchsize=256 and epochs=1.
- Random seed fixed at seed=1.
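For context on the seed=1 setting, here is a minimal sketch of what seeding a PyTorch run usually involves. It is an assumption for illustration, not the code from the linked repository, and `seed_everything` is a hypothetical helper name. The PyTorch reproducibility notes state that identical results are not guaranteed across PyTorch releases, so a fixed seed alone may not make 1.8.x and 1.9.0 agree.

```python
import random

import torch


def seed_everything(seed=1):
    """Seed the RNGs that commonly affect a PyTorch training run (hypothetical helper)."""
    random.seed(seed)                 # Python's built-in RNG
    torch.manual_seed(seed)           # CPU RNG (also seeds CUDA in recent releases)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # Prefer deterministic cuDNN kernels over autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding reproduces the same random draw within one PyTorch version.
seed_everything(1)
first = torch.rand(3)
seed_everything(1)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

In a DDP job each process would call the helper before building the model and the data loaders. Whether every rank should share one seed or use a per-rank offset is a design choice this sketch leaves open.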
BM
# torch 1.8.0+cu111
# torchvision 0.9.0+cu111
Train Epoch: 0 [0/30] loss=2.5691
Train Epoch: 0 [10/30] loss=2.2320
Train Epoch: 0 [20/30] loss=0.8108
Test Epoch: 0 [0/40] acc=33.5938
Test Epoch: 0 [10/40] acc=35.5469
Test Epoch: 0 [20/40] acc=34.7098
Test Epoch: 0 [30/40] acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72
# torch 1.9.0+cu111
# torchvision 0.10.0+cu111
Train Epoch: 0 [0/30] loss=2.5137
Train Epoch: 0 [10/30] loss=2.4295
Train Epoch: 0 [20/30] loss=0.9048
Test Epoch: 0 [0/40] acc=63.2812
Test Epoch: 0 [10/40] acc=64.9858
Test Epoch: 0 [20/40] acc=63.8021
Test Epoch: 0 [30/40] acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12
PJ
I built Docker images from different versions of the PyTorch base image.
# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30] loss=2.5691
Train Epoch: 0 [10/30] loss=2.5132
Train Epoch: 0 [20/30] loss=0.7198
Test Epoch: 0 [0/40] acc=38.2812
Test Epoch: 0 [10/40] acc=40.9091
Test Epoch: 0 [20/40] acc=39.8996
Test Epoch: 0 [30/40] acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96
# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30] loss=2.5137
Train Epoch: 0 [10/30] loss=2.3939
Train Epoch: 0 [20/30] loss=0.6989
Test Epoch: 0 [0/40] acc=67.5781
Test Epoch: 0 [10/40] acc=69.2827
Test Epoch: 0 [20/40] acc=68.4152
Test Epoch: 0 [30/40] acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97
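The logs above can be compared mechanically to see where two runs first disagree. The sketch below uses only the training lines quoted in this issue; `parse_losses` and `first_divergence` are hypothetical helper names, and the tolerance is an arbitrary choice. It shows that BM 1.8.0 and PJ 1.8.1 agree at step 0 and diverge by step 10, while BM 1.8.0 and BM 1.9.0 already differ at step 0. That pattern is consistent with a version-dependent difference at initialization or in the first batch, although it does not prove it.

```python
import re

# Matches lines such as "Train Epoch: 0 [10/30] loss=2.2320".
TRAIN_LINE = re.compile(r"Train Epoch: (\d+) \[(\d+)/(\d+)\] loss=([0-9.]+)")


def parse_losses(log_text):
    """Return {(epoch, step): loss} for every training line in a log."""
    losses = {}
    for line in log_text.splitlines():
        match = TRAIN_LINE.search(line)
        if match:
            epoch, step, _total, loss = match.groups()
            losses[(int(epoch), int(step))] = float(loss)
    return losses


def first_divergence(log_a, log_b, tol=1e-4):
    """Return the first (epoch, step) logged in both runs where losses differ, or None."""
    a, b = parse_losses(log_a), parse_losses(log_b)
    for key in sorted(a.keys() & b.keys()):
        if abs(a[key] - b[key]) > tol:
            return key
    return None


# Training lines copied from the logs in this issue.
bm_180 = """Train Epoch: 0 [0/30] loss=2.5691
Train Epoch: 0 [10/30] loss=2.2320
Train Epoch: 0 [20/30] loss=0.8108"""
pj_181 = """Train Epoch: 0 [0/30] loss=2.5691
Train Epoch: 0 [10/30] loss=2.5132
Train Epoch: 0 [20/30] loss=0.7198"""
bm_190 = """Train Epoch: 0 [0/30] loss=2.5137
Train Epoch: 0 [10/30] loss=2.4295
Train Epoch: 0 [20/30] loss=0.9048"""

print(first_divergence(bm_180, pj_181))  # (0, 10)
print(first_divergence(bm_180, bm_190))  # (0, 0)
```

Since the logs only print every 10th step, the reported step is the first logged step that differs, not necessarily the first iteration that differs.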
Please let me know if there is a mistake in my code. I've posted it here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.