This repository was archived by the owner on Sep 19, 2022. It is now read-only.

Pytorch version may have an effect on the training reproduction #355

Open
@Shuai-Xie


I am trying to figure out why Bare Metal (BM) and PytorchJob (PJ) runs give different training results in #354 (comment).

I have now found that PyTorch v1.8.0 and v1.9.0 give different training results on both BM and PJ.

Experiment settings

  • Two V100 GPU machines (48 and 49), each with 4 cards, for 8 GPUs in total.
  • DDP training of ResNet-18 on the MNIST dataset with batchsize=256 and epochs=1.
  • Random seed set to 1.
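
For reference, a minimal sketch of what "set random seed=1" usually involves in a PyTorch script. This is not copied from the linked repository: the helper name `seed_everything` is hypothetical, and the NumPy and PyTorch parts are skipped if those packages are not installed. Note that the PyTorch reproducibility notes state that identical results are not guaranteed across PyTorch releases, so seeding alone may not make 1.8.0 and 1.9.0 agree.

```python
import os
import random


def seed_everything(seed: int = 1) -> None:
    """Seed the RNGs a typical training script touches (hypothetical helper)."""
    # Only affects Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; only the stdlib RNG was seeded

    torch.manual_seed(seed)           # CPU generator (also seeds CUDA devices)
    torch.cuda.manual_seed_all(seed)  # safe to call when CUDA is unavailable
    # Ask cuDNN for deterministic kernels and disable autotuning,
    # which can otherwise pick different algorithms between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(1)
```

In a DDP job, each rank would call this once at startup, before the model and data loaders are created.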

BM

# torch             1.8.0+cu111
# torchvision       0.9.0+cu111
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
Test Epoch: 0 [0/40]    acc=33.5938
Test Epoch: 0 [10/40]   acc=35.5469
Test Epoch: 0 [20/40]   acc=34.7098
Test Epoch: 0 [30/40]   acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72

# torch             1.9.0+cu111
# torchvision       0.10.0+cu111
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
Test Epoch: 0 [0/40]    acc=63.2812
Test Epoch: 0 [10/40]   acc=64.9858
Test Epoch: 0 [20/40]   acc=63.8021
Test Epoch: 0 [30/40]   acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12

PJ

I built Docker images from different versions of the PyTorch base image.

# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
Test Epoch: 0 [0/40]    acc=38.2812
Test Epoch: 0 [10/40]   acc=40.9091
Test Epoch: 0 [20/40]   acc=39.8996
Test Epoch: 0 [30/40]   acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96

# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
Test Epoch: 0 [0/40]    acc=67.5781
Test Epoch: 0 [10/40]   acc=69.2827
Test Epoch: 0 [20/40]   acc=68.4152
Test Epoch: 0 [30/40]   acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97
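
One pattern can be read directly off the training logs above: the loss at iteration 0 is identical on BM and PJ for the same release line, but differs between 1.8.x and 1.9.0. The sketch below only extracts that comparison from the logged lines. The run labels and the `parse_losses` helper are made up for illustration. It is consistent with the starting point (initial weights or first batch) depending on the PyTorch version rather than on BM vs PJ, but it does not prove which of those changed.

```python
import re

# Matches lines such as "Train Epoch: 0 [10/30]  loss=2.2320".
LOSS_RE = re.compile(r"Train Epoch: \d+ \[(\d+)/\d+\]\s+loss=([\d.]+)")


def parse_losses(log: str) -> dict:
    """Map iteration index -> logged training loss."""
    return {int(m.group(1)): float(m.group(2)) for m in LOSS_RE.finditer(log)}


# Training lines copied from the four runs reported above.
runs = {
    "BM torch 1.8.0": """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
""",
    "BM torch 1.9.0": """
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
""",
    "PJ image 1.8.1": """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
""",
    "PJ image 1.9.0": """
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
""",
}

losses = {name: parse_losses(log) for name, log in runs.items()}
first_loss = {name: per_iter[0] for name, per_iter in losses.items()}

for name, value in first_loss.items():
    print(f"{name}: loss at iteration 0 = {value:.4f}")
```

By iteration 10 the BM and PJ losses already differ within the same release line, so there appear to be two separate effects: a version-dependent starting point, and a smaller run-to-run divergence between the two environments.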

Please let me know if there is a mistake in my code. I've posted it here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.
