
multigpu_torchrun.py does not show speed up when training on multi GPUs! #1298

Open
@MostafaCham

Description

I have tried the same example provided in multigpu_torchrun.py, training on the MNIST dataset with the model replaced by a simple CNN. However, when I increase the number of GPUs on a single node, the training time increases.

I have also tried doubling the batch size whenever I double the number of GPUs, but I still see no improvement: the total training time increases instead of decreasing.

Again, my code is identical to the one in the repository. I would appreciate any help identifying the issue. Thanks in advance.
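To make the setup concrete, here is a minimal sketch of the kind of training loop the tutorial uses, with MNIST and a placeholder SimpleCNN standing in for my model. The per-epoch timing is added here only to show where I measure wall-clock time; this is not my exact script.

# Minimal sketch (not the exact tutorial or repro code) of the DDP setup in
# multigpu_torchrun.py. SimpleCNN and the MNIST transform are placeholders;
# the per-epoch timing shows where the slowdown could be narrowed down.
import os
import time

import torch
import torch.nn as nn
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


class SimpleCNN(nn.Module):  # placeholder model
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(16 * 14 * 14, 10),
        )

    def forward(self, x):
        return self.net(x)


def main(epochs: int, batch_size: int):
    init_process_group(backend="nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    # DistributedSampler gives each rank ~len(dataset)/world_size samples,
    # so per-GPU work per epoch shrinks as GPUs are added.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
                        sampler=DistributedSampler(dataset), pin_memory=True)

    model = DDP(SimpleCNN().to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        loader.sampler.set_epoch(epoch)
        torch.cuda.synchronize()
        start = time.time()
        for x, y in loader:
            x, y = x.to(rank), y.to(rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # gradient all-reduce happens here
            optimizer.step()
        torch.cuda.synchronize()
        if rank == 0:
            print(f"epoch {epoch} took {time.time() - start:.2f}s")

    destroy_process_group()


if __name__ == "__main__":
    main(epochs=50, batch_size=64)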

Here is the slurm file content I am using:

#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000 # Specific hardware constraint

nvidia-smi

torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
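For context on the numbers in that command (50 epochs, checkpoint every 5, per-GPU batch size 64): with DDP each process gets its own --batch_size, so the global batch grows and the number of steps per epoch per GPU shrinks as GPUs are added. A quick sketch of that arithmetic (the steps_per_epoch helper is hypothetical, just for illustration):

# With DDP, --batch_size is per process, so steps per epoch per GPU drop as
# GPUs are added -- useful when comparing wall-clock times across runs.
def steps_per_epoch(dataset_size: int, per_gpu_batch: int, num_gpus: int) -> int:
    shard = dataset_size // num_gpus          # samples each rank sees per epoch
    return (shard + per_gpu_batch - 1) // per_gpu_batch

MNIST_TRAIN = 60_000
for gpus in (1, 2, 4):
    print(gpus, "GPU(s):",
          steps_per_epoch(MNIST_TRAIN, per_gpu_batch=64, num_gpus=gpus),
          "steps/epoch, global batch", 64 * gpus)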
