Description
For anyone who wants to sanity check, these are the results I was able to achieve with the TensorFlow example in this repo. All of these were run using the DL AMI (Version 13, Ubuntu/Conda) on p3 instances, with the default params (90 epochs, batch size 256, etc.).
1x8 (8 GPUs on a single p3.16xlarge node): training took 6 hrs 12 minutes, 75.421% top-1, 92.660% top-5
2x8 (16 GPUs across 2 p3.16xlarge nodes): training took 3 hrs 16 minutes, 74.868% top-1, 92.398% top-5
8x8 (64 GPUs across 8 p3.16xlarge nodes): training took 54 minutes, 75.404% top-1, 92.65% top-5
And this was the training throughput I was able to achieve. Each machine had its own copy of the data on a 256 GB gp2 EBS volume (I did not use the BeeGFS filesystem here; one test used a ramdisk, which didn't make much of a difference).
1x1 GPU: 740 img/sec
1x2 GPU: 1481.3 img/sec
1x8 GPU: 5000 img/sec
1x8 GPU (ramdisk): 5100-5200 img/sec (the ramdisk seems 1-2% faster, but barely noticeable)
2x8 GPU (16 GPUs total): 9860 img/sec (83% efficiency)
4x8 GPU (32 GPUs total): 21300 img/sec (with the bind-to slot option, which AWS said might be faster)
8x8 GPU (64 GPUs total): ~37000 img/sec (with the bind-to slot option; 79% efficiency)
These efficiencies were pretty good, but the 64-GPU test definitely did not get the 90% scaling efficiency reported in the blog post; I got closer to 79%. Curious if there are any other params/settings you think would help.
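For anyone checking the math, here is roughly how the efficiency percentages above fall out, assuming efficiency = measured throughput / (GPU count * the 740 img/sec single-GPU rate):

# Sketch of the scaling-efficiency arithmetic, assuming
# efficiency = measured throughput / (num_gpus * single-GPU throughput)
single_gpu = 740.0  # img/sec measured on 1 GPU (see list above)

measured = [
    (16, 9860),   # 2x8
    (32, 21300),  # 4x8
    (64, 37000),  # 8x8 (approximate)
]

for num_gpus, throughput in measured:
    efficiency = throughput / (num_gpus * single_gpu)
    print("{} GPUs: {:.1%} of linear scaling".format(num_gpus, efficiency))
# prints roughly 83%, 90%, and 78%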
My MPI run command was as follows:
mpirun -np 64 --hostfile hosts \
    -mca plm_rsh_no_tree_spawn 1 \
    --bind-to socket --map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 \
    -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 \
    -mca orte_base_help_aggregate 0 -mca pml ob1 \
    -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py \
        --batch_size=256 --num_epochs=90 --fp16 \
        --data_dir /home/ubuntu/imagenet/train-resized --model resnet50 \
        --log_dir results_2x8gpu_test1 --display_every 100 --save_interval=3600
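For context, this is roughly the Horovod wiring that a script like aws_tf_hvd_cnn.py is built around. It is just a minimal sketch of the standard horovod.tensorflow (TF1) pattern with a dummy model, not the actual script:

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each MPI rank to a single local GPU (local_rank is 0-7 on a p3.16xlarge).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Tiny stand-in model so the sketch runs end to end.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

# Scale the learning rate by the world size and wrap the optimizer so
# gradients are averaged across all ranks (NCCL allreduce).
opt = tf.train.MomentumOptimizer(0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so every worker starts identically;
# only rank 0 writes checkpoints.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
checkpoint_dir = '/tmp/hvd_sketch' if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks, config=config) as sess:
    for _ in range(10):
        xb = np.random.rand(32, 10).astype(np.float32)
        yb = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: xb, y: yb})

Each of the 64 ranks launched by mpirun runs its own copy of the script and pins itself to one GPU via local_rank; Horovod then handles the gradient averaging over NCCL.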