multigpu_torchrun.py does not show a speedup when training on multiple GPUs! #1298
Description
I tried the same example provided in multigpu_torchrun.py, replaced the model with a simple CNN, and trained it on the MNIST dataset. However, when I increase the number of GPUs on a single node, the training time increases.
I have also tried doubling the batch size when doubling the number of GPUs, but I still see no time improvement; the total training time increases instead of decreasing.
Again, my code is identical to the one in the repository. I would appreciate any help identifying the issue. Thanks in advance.
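For context, here is a minimal sketch of the pattern my script follows (this is not my exact code: the simple CNN is replaced by a placeholder model and the dataset path is just an example):

```python
# Minimal sketch of the multigpu_torchrun.py pattern adapted to MNIST.
# Note: --batch_size is the per-GPU batch, and DistributedSampler shards
# the dataset, so each rank sees len(dataset) / world_size samples per epoch.
import os
import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

def main(total_epochs: int, batch_size: int):
    init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset)  # shards data across ranks
    loader = DataLoader(dataset, batch_size=batch_size,
                        sampler=sampler, num_workers=4, pin_memory=True)

    # Placeholder for the simple CNN used in my runs.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(total_epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    destroy_process_group()

if __name__ == "__main__":
    main(total_epochs=5, batch_size=64)
```

Since the sampler already splits the data, each GPU should run fewer steps per epoch as more GPUs are added, which is why I expected the wall-clock time to drop.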
Here is the slurm file content I am using:
#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000 # Specific hardware constraint
nvidia-smi
torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
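In case it helps with the diagnosis, this is roughly how the per-epoch time can be measured on each rank (timed_epoch and run_one_epoch are hypothetical names, not code from the tutorial); the barrier and CUDA synchronization are there so every rank is timed over the same interval:

```python
# Hypothetical timing helper (not part of multigpu_torchrun.py): run one
# epoch and report wall-clock time on rank 0, with a barrier and CUDA sync
# so queued GPU work or rank skew is not attributed to the wrong epoch.
import time
import torch
import torch.distributed as dist

def timed_epoch(run_one_epoch, local_rank: int) -> float:
    dist.barrier()                      # make sure all ranks start together
    torch.cuda.synchronize(local_rank)  # flush GPU work queued before timing
    start = time.perf_counter()
    run_one_epoch()                     # the usual forward/backward loop
    torch.cuda.synchronize(local_rank)  # wait for this rank's GPU to finish
    dist.barrier()                      # wait for the slowest rank
    elapsed = time.perf_counter() - start
    if dist.get_rank() == 0:
        print(f"epoch wall-clock time: {elapsed:.2f}s")
    return elapsed
```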