Hi,
I'm trying to benchmark multi-node all_gather performance using the param tests with buffers up to 2G, but the test OOMs at a buffer size around 1G, while the same config works with nccl-tests. AR and RS tests are fine, and their results are very similar to nccl-tests. You can reproduce this on A100-40G or H100 clusters (p4d or p5 on AWS). Any ideas or insight would be helpful. Thank you!
I see this with PyTorch nightly with CUDA 12.1 as well as PyTorch 2.0.1 with CUDA 11.8.
For param, I'm launching the following way:
mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
--tag-output \
--oversubscribe --allow-run-as-root \
$MPI_OPTIONS /fsx/lawei/param/train/comms/pt/comms.py \
--master-ip ip-172-31-49-213 \
--b 32M \
--e 2048M \
--n 100 \
--z 0 \
--backend nccl \
--device cuda \
--collective all_gather
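For context on why all_gather might OOM so much earlier than the other collectives: my understanding (an assumption, not verified against param's source) is that param's --b/--e set the per-rank input size, while nccl-tests' -b/-e for all_gather_perf denote the total output size, so param would allocate roughly world_size times more memory at the same nominal size. A rough sketch of the per-GPU footprint under that assumption:

# Rough sketch (my assumption, not verified): compare per-GPU memory for
# all_gather if param's --b/--e are per-rank input sizes while nccl-tests'
# -b/-e are the total output size.
GiB = 1024 ** 3
MiB = 1024 ** 2
world_size = 2 * 8  # e.g. two p4d nodes with 8 GPUs each

def param_bytes_per_gpu(msg: int) -> int:
    # per-rank input buffer plus the gathered output buffer
    return msg + msg * world_size

def nccl_tests_bytes_per_gpu(total: int) -> int:
    # per-rank input is total/world_size; output is the full total
    return total // world_size + total

for size in (512 * MiB, 1 * GiB, 2 * GiB):
    print(f"{size // MiB:5d}M: param ~{param_bytes_per_gpu(size) / GiB:5.1f} GiB, "
          f"nccl-tests ~{nccl_tests_bytes_per_gpu(size) / GiB:4.1f} GiB per GPU")

Under that reading, --e 2048M with 16 ranks needs roughly 34 GiB per GPU before any PyTorch caching-allocator overhead, which would already be tight on an A100-40G.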
For nccl-tests, I'm using NCCL 2.18.3 + CUDA 12.1, but older versions also work.
mpirun -np $(($NUM_NODES*8)) -N 8 --hostfile $HOST_FILE \
--tag-output \
--oversubscribe --allow-run-as-root \
bash run_nccl_test.sh
and in the bash script:
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
/usr/local/cuda-12.1/efa/test-cuda-12.1/all_gather_perf -b 32M -e 2048M -n 100 -z 0 -f 2 -g 1
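If it helps narrow things down, here's a minimal standalone repro sketch (the file name and torchrun launch are mine, hypothetical) that allocates the same buffers an all_gather benchmark would, to check whether plain torch.distributed OOMs at the same per-rank size independent of param:

import torch
import torch.distributed as dist

def main() -> None:
    # Expects RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT in the environment,
    # e.g. via: torchrun --nnodes $NUM_NODES --nproc-per-node 8 repro.py
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    numel = 1024 ** 3 // 4                           # 1 GiB of float32 per rank
    inp = torch.ones(numel, device="cuda")
    out = torch.empty(numel * world, device="cuda")  # world_size x input

    dist.all_gather_into_tensor(out, inp)
    torch.cuda.synchronize()
    if rank == 0:
        print(f"ok: ~{out.numel() * 4 / 1024**3:.1f} GiB output per GPU")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this sketch OOMs at the same per-rank size, the limit is the buffer math itself rather than anything param-specific.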