
multigpu_torchrun.py does not show speed up when training on multi GPUs! #1298

Open
@MostafaCham

Description

I have tried the same example provided in multigpu_torchrun.py, training on the MNIST dataset with the model replaced by a simple CNN. However, when I increase the number of GPUs on a single node, the training time increases.

I have also tried doubling the batch size whenever I double the number of GPUs, but I still see no improvement: the total training time increases instead of decreasing.

Again, my code is identical to the one in the repository. I would appreciate any help identifying the issue. Thanks in advance.
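To make the setup concrete, here is a minimal sketch of the kind of training loop the tutorial uses, with MNIST and a placeholder SimpleCNN standing in for my model. The per-epoch timing is added here only to show where I measure wall-clock time; this is not my exact script.

# Minimal sketch (not the exact tutorial or repro code) of the DDP setup in
# multigpu_torchrun.py. SimpleCNN and the MNIST transform are placeholders;
# the per-epoch timing shows where the slowdown could be narrowed down.
import os
import time

import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


class SimpleCNN(nn.Module):  # placeholder model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(16 * 14 * 14, 10),
        )

    def forward(self, x):
        return self.net(x)


def main(epochs: int, batch_size: int):
    init_process_group(backend="nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    # DistributedSampler gives each rank ~len(dataset)/world_size samples,
    # so per-GPU work per epoch shrinks as GPUs are added.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
                        sampler=DistributedSampler(dataset), pin_memory=True)

    model = DDP(SimpleCNN().to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        loader.sampler.set_epoch(epoch)
        torch.cuda.synchronize()
        start = time.time()
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradient all-reduce happens here
            optimizer.step()
        torch.cuda.synchronize()
        if rank == 0:
            print(f"epoch {epoch} took {time.time() - start:.2f}s")

    destroy_process_group()


if __name__ == "__main__":
    main(epochs=50, batch_size=64)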

Here is the slurm file content I am using:

#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000 # Specific hardware constraint

nvidia-smi

torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
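For context on the numbers in that command (50 epochs, checkpoint every 5, per-GPU batch size 64): with DDP each process gets its own --batch_size, so the global batch grows and the number of steps per epoch per GPU shrinks as GPUs are added. A quick sketch of that arithmetic (the steps_per_epoch helper is hypothetical, just for illustration):

# With DDP, --batch_size is per process, so steps per epoch per GPU drop as
# GPUs are added -- useful when comparing wall-clock times across runs.
def steps_per_epoch(dataset_size: int, per_gpu_batch: int, num_gpus: int) -> int:
    shard = dataset_size // num_gpus          # samples each rank sees per epoch
    return (shard + per_gpu_batch - 1) // per_gpu_batch

MNIST_TRAIN = 60_000
for gpus in (1, 2, 4):
    print(gpus, "GPU(s):",
          steps_per_epoch(MNIST_TRAIN, per_gpu_batch=64, num_gpus=gpus),
          "steps/epoch, global batch", 64 * gpus)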
