Hi,
I'm trying to benchmark multi-node all_gather performance using the param tests with buffers up to 2G, but the test OOMs at a buffer size around 1G, while the same config works with nccl-tests. AR and RS tests are fine, and their results are very similar to nccl-tests. You can reproduce this on A100-40G or H100 clusters (p4d or p5 on AWS). Any ideas or insight would be helpful. Thank you!
I see this with PyTorch nightly with CUDA 12.1 as well as PyTorch 2.0.1 with CUDA 11.8.
For param, I'm launching the following way:
mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
--tag-output \
--oversubscribe --allow-run-as-root \
$MPI_OPTIONS /fsx/lawei/param/train/comms/pt/comms.py \
--master-ip ip-172-31-49-213 \
--b 32M \
--e 2048M \
--n 100 \
--z 0 \
--backend nccl \
--device cuda \
--collective all_gather
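For context on why all_gather might OOM so much earlier than the other collectives: my understanding (an assumption, not verified against param's source) is that param's --b/--e set the per-rank input size, while nccl-tests' -b/-e for all_gather_perf denote the total output size, so param would allocate roughly world_size times more memory at the same nominal size. A rough sketch of the per-GPU footprint under that assumption:

# Rough sketch (my assumption, not verified): compare per-GPU memory for
# all_gather if param's --b/--e are per-rank input sizes while nccl-tests'
# -b/-e are the total output size.
GiB = 1024 ** 3
MiB = 1024 ** 2
world_size = 2 * 8  # e.g. two p4d nodes with 8 GPUs each

def param_bytes_per_gpu(msg: int) -> int:
    # per-rank input buffer plus the gathered output buffer
    return msg + msg * world_size

def nccl_tests_bytes_per_gpu(total: int) -> int:
    # per-rank input is total/world_size; output is the full total
    return total // world_size + total

for size in (512 * MiB, 1 * GiB, 2 * GiB):
    print(f"{size // MiB:5d}M: param ~{param_bytes_per_gpu(size) / GiB:5.1f} GiB, "
          f"nccl-tests ~{nccl_tests_bytes_per_gpu(size) / GiB:4.1f} GiB per GPU")

Under that reading, --e 2048M with 16 ranks needs roughly 34 GiB per GPU before any PyTorch caching-allocator overhead, which would already be tight on an A100-40G.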
For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.
mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
--tag-output \
--oversubscribe --allow-run-as-root \
bash run_nccl_test.sh
and in the bash script:
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
/usr/local/cuda-12.1/efa/test-cuda-12.1/all_gather_perf -b 32M -e 2048M -n 100 -z 0 -f 2 -g 1
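If it helps narrow things down, here's a minimal standalone repro sketch (the file name and torchrun launch are mine, hypothetical) that allocates the same buffers an all_gather benchmark would, to check whether plain torch.distributed OOMs at the same per-rank size independent of param:

import torch
import torch.distributed as dist

def main() -> None:
    # Expects RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT in the environment,
    # e.g. via: torchrun --nnodes $NUM_NODES --nproc-per-node 8 repro.py
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    numel = 1024 ** 3 // 4                           # 1 GiB of float32 per rank
    inp = torch.ones(numel, device="cuda")
    out = torch.empty(numel * world, device="cuda")  # world_size x input

    dist.all_gather_into_tensor(out, inp)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"ok: ~{out.numel() * 4 / 1024**3:.1f} GiB output per GPU")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this sketch OOMs at the same per-rank size, the limit is the buffer math itself rather than anything param-specific.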