Some benchmarking results #2

Open
@markmp

For anyone who wants a sanity check, these are the results I achieved with the TensorFlow example in this repo. All runs used the DL AMI (Version 13, Ubuntu/Conda) on p3 instances with the default params (90 epochs, 256 batch size, etc.).

1x8 (8 GPUs on a single p3.16xlarge node): training took 6 hrs 12 minutes, 75.421% (top-1), 92.660% (top-5)

2x8 (16 GPUs across 2 p3.16xlarge nodes): training took 3 hrs 16 minutes, 74.868% (top-1), 92.398% (top-5)

8x8 (64 GPUs total): training took 54 minutes, 75.404% (top-1), 92.65% (top-5)
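As a rough cross-check, the wall-clock times above can be turned into speedups relative to the 1x8 run. This is only a sketch: it assumes the reported times are directly comparable (same number of epochs, same evaluation and checkpoint overhead), and the helper name `time_scaling` is made up for illustration.

```python
# Wall-clock training times reported above, in minutes, keyed by GPU count.
train_minutes = {8: 6 * 60 + 12, 16: 3 * 60 + 16, 64: 54}

# Assumption: the single-node 1x8 run is the baseline for comparison.
BASELINE_GPUS = 8


def time_scaling(gpus):
    """Return (speedup vs. the 1x8 run, fraction of ideal linear speedup)."""
    speedup = train_minutes[BASELINE_GPUS] / train_minutes[gpus]
    ideal = gpus / BASELINE_GPUS
    return speedup, speedup / ideal


for gpus in sorted(train_minutes):
    speedup, fraction = time_scaling(gpus)
    print(f"{gpus:>2} GPUs: {speedup:.2f}x speedup, {fraction:.0%} of linear")
```

Under those assumptions the 2x8 run comes out at roughly 1.9x and the 8x8 run at roughly 6.9x over the single node. Wall-clock time includes more than the training loop, so these fractions need not match throughput-based efficiency figures.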

And this was the training throughput I was able to achieve. Each machine had its own copy of the data on a 256 GB gp2 EBS volume (I did not use the BeeGFS filesystem here). One test did use a ramdisk, which didn't make much of a difference.

1x1 GPU: 740 img/sec
1x2 GPU: 1481.3 img/sec
1x8 GPU: 5000 img/sec
1x8 GPU (ramdisk): 5100-5200 img/sec (ramdisk seems 1-2% faster, but barely noticeable)
2x8 GPU (16 GPUs total): 9860 img/sec (83% efficiency)
4x8 GPU (32 GPUs total): 21300 img/sec (with bind_to slot option, which AWS said might be faster)
8x8 GPU (64 GPUs total): ~37000 img/sec (with bind_to slot option, 79% efficiency)

These efficiencies were pretty good, but the 64-GPU test definitely did not reach the 90% reported in the blog post; I got closer to 79%. I'm curious whether there are any other params/settings that you think would help.
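For reference, here is a minimal sketch of how the efficiency figures can be reproduced. It assumes efficiency means measured throughput divided by GPU count times the 1x1 throughput (740 img/sec); that definition is an assumption on my part, chosen because it reproduces the 83% quoted for 16 GPUs. The function name `scaling_efficiency` is hypothetical.

```python
# Measured throughput in img/sec, keyed by total GPU count (figures from above).
measured = {2: 1481.3, 8: 5000.0, 16: 9860.0, 32: 21300.0, 64: 37000.0}

# Assumption: the 1x1 GPU result is the reference for perfect linear scaling.
SINGLE_GPU_IMG_PER_SEC = 740.0


def scaling_efficiency(gpus, img_per_sec, single=SINGLE_GPU_IMG_PER_SEC):
    """Fraction of ideal linear throughput achieved on `gpus` GPUs."""
    return img_per_sec / (gpus * single)


for gpus in sorted(measured):
    eff = scaling_efficiency(gpus, measured[gpus])
    print(f"{gpus:>2} GPUs: {measured[gpus]:>8.1f} img/sec, {eff:.1%} of linear")
```

With this definition the 16-GPU run lands at about 83% and the 64-GPU run at about 78% for the rounded 37000 img/sec figure, in line with the "closer to 79%" estimate.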

My MPI run command was as follows:
mpirun -np 64 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 --bind-to socket --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 -mca orte_base_help_aggregate 0 -mca pml ob1 -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py --batch_size=256 --num_epochs=90 --fp16 --data_dir /home/ubuntu/imagenet/train-resized --model resnet50 --log_dir results_2x8gpu_test1 --display_every 100 --save_interval=3600
