PyTorch version may affect training reproducibility

I am trying to figure out why Bare Metal (BM) and PyTorchJob (PJ) runs give different training results in https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536.

I have now found that PyTorch v1.8.0 and v1.9.0 also give different training results, on both BM and PJ.

## Experiment settings
- Two V100 GPU machines (48/49), each with 4 cards, for 8 GPUs in total.
- DDP training of ResNet-18 on the MNIST dataset with batchsize=256 and epochs=1.
- Random seed set to 1.
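
For reference, a fixed seed is usually applied to every random number generator the script touches, not only to PyTorch. The sketch below is a common pattern and not necessarily what the linked repository does; `seed_everything` is a hypothetical helper name. Note that the PyTorch documentation states that fully reproducible results are not guaranteed across PyTorch releases, so a shared seed alone may not make 1.8.x and 1.9.0 match.

```python
import random


def seed_everything(seed: int = 1) -> None:
    """Seed the RNGs a typical training script draws from (hypothetical helper)."""
    random.seed(seed)  # Python's built-in generator

    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's global generator, if NumPy is installed
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return  # nothing more to seed without PyTorch

    torch.manual_seed(seed)           # PyTorch default generator
    torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA
    # Ask cuDNN for deterministic kernels and disable its autotuner,
    # which may otherwise pick different algorithms from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(1)
```

In a DDP job this would be called once per process, before the model and data loaders are created.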

## BM

```shell
# torch             1.8.0+cu111
# torchvision       0.9.0+cu111
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
Test Epoch: 0 [0/40]    acc=33.5938
Test Epoch: 0 [10/40]   acc=35.5469
Test Epoch: 0 [20/40]   acc=34.7098
Test Epoch: 0 [30/40]   acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72

# torch             1.9.0+cu111
# torchvision       0.10.0+cu111
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
Test Epoch: 0 [0/40]    acc=63.2812
Test Epoch: 0 [10/40]   acc=64.9858
Test Epoch: 0 [20/40]   acc=63.8021
Test Epoch: 0 [30/40]   acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12
```
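
The `[x/30]` and `[x/40]` counters in these logs are consistent with the settings above, assuming batchsize=256 is the per-GPU batch size, the training set is sharded across the 8 GPUs, and each process evaluates the full test set. These are assumptions inferred from the numbers, not confirmed from the code. A quick check of the arithmetic:

```python
import math

MNIST_TRAIN, MNIST_TEST = 60_000, 10_000  # standard MNIST split sizes
WORLD_SIZE = 8     # 2 machines x 4 GPUs, from the experiment settings
BATCH_SIZE = 256   # assumed to be per GPU


def steps_per_epoch(num_samples: int, batch_size: int, world_size: int = 1) -> int:
    """Batches per process per epoch, assuming no batch is dropped."""
    per_rank = math.ceil(num_samples / world_size)  # equal shard for each process
    return math.ceil(per_rank / batch_size)


print(steps_per_epoch(MNIST_TRAIN, BATCH_SIZE, WORLD_SIZE))  # 30
print(steps_per_epoch(MNIST_TEST, BATCH_SIZE))               # 40
```

With only 30 optimizer steps in the single epoch, the model is still early in training, so small numerical differences may translate into large accuracy differences.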

## PJ

I built Docker images from two versions of the PyTorch base image.

```sh
# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
Test Epoch: 0 [0/40]    acc=38.2812
Test Epoch: 0 [10/40]   acc=40.9091
Test Epoch: 0 [20/40]   acc=39.8996
Test Epoch: 0 [30/40]   acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96

# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
Test Epoch: 0 [0/40]    acc=67.5781
Test Epoch: 0 [10/40]   acc=69.2827
Test Epoch: 0 [20/40]   acc=68.4152
Test Epoch: 0 [30/40]   acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97
```
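
Comparing the logs, the loss at step 0 is identical on BM and PJ for the same version line (2.5691 for 1.8.x, 2.5137 for 1.9.0) and differs by step 10. A small script like the one below can locate the first diverging step automatically; it is a sketch that assumes the log format shown above, and `train_losses` and `first_divergence` are hypothetical helper names.

```python
import re

# Matches lines such as "Train Epoch: 0 [10/30]  loss=2.2320"
LINE = re.compile(r"Train Epoch: (\d+) \[(\d+)/\d+\]\s+loss=([\d.]+)")


def train_losses(log: str) -> dict:
    """Map (epoch, step) to the logged training loss."""
    return {(int(m[1]), int(m[2])): float(m[3]) for m in LINE.finditer(log)}


def first_divergence(log_a: str, log_b: str, tol: float = 1e-4):
    """Return the first (epoch, step) whose losses differ by more than tol."""
    a, b = train_losses(log_a), train_losses(log_b)
    for key in sorted(a.keys() & b.keys()):
        if abs(a[key] - b[key]) > tol:
            return key
    return None


bm_18 = """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
"""
pj_18 = """
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
"""
print(first_divergence(bm_18, pj_18))  # (0, 10)
```

Logging the loss at every step instead of every 10 steps would narrow down where the two runs first differ.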

Please let me know if there is a mistake in my code. I have posted it here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.
