Hello,
I am trying to set up multinode training using deepspeed (0.12.2) with torch (2.0.1) and the PDSH (2.35) launcher.
I installed PDSH by hand and everything works fine on a single node.
However, in a 2-node setting, the code blocks at the torch initialization step.
Digging into the details, it seems it is not managing to establish inter-node communication (it times out waiting for the workers on the 2nd node to connect to the master).
I assume this is an NCCL/InfiniBand issue, probably misconfigured on my side, but I am having trouble validating this hypothesis as most diagnostic tools require root.
Do you have any intuition as to why this is not working, or solutions/options/configurations I should explore? Can you see any issues in the NCCL config below, or could you confirm that everything is working on your side?
Thanks for the help!
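To rule out a basic networking problem first, below is a minimal check I can run without root (only a sketch, assuming the launcher exports the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE variables for each process). It uses the gloo backend, which goes over plain TCP sockets and does not touch NCCL or InfiniBand, so if this also hangs the problem is general inter-node reachability rather than the IB configuration:
import torch.distributed as dist

# Relies on MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE being set by the launcher.
print('Initializing gloo process group...', flush=True)
dist.init_process_group(backend='gloo')  # TCP only, no NCCL/IB involved
dist.barrier()
print(f'Rank {dist.get_rank()}/{dist.get_world_size()} passed the barrier', flush=True)
dist.destroy_process_group()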
Below are the NCCL environment variables set:
export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0,mlx5_2
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_BLOCKING_WAIT=1
export NCCL_LAUNCH_MODE=PARALLEL
The code I am using to test this:
import os
import torch
import torch.distributed
# print('MASTER_PORT = ', os.getenv('MASTER_PORT'))
# print('MASTER_ADDR = ', os.getenv('MASTER_ADDR'))
# print('WORLD_SIZE = ', os.getenv('WORLD_SIZE'))
# print('RANK = ', os.getenv('RANK'))
# print('LOCAL_RANK = ', os.getenv('LOCAL_RANK'))
print('About to initialize PyTorch Distributed...', flush=True)
torch.distributed.init_process_group(backend='nccl') # <=== Hangs here
print('Completed initialization of PyTorch Distributed', flush=True)
print('Entering barrier...', flush=True)
torch.distributed.barrier()
print('Done with barrier', flush=True)
Command used to run:
deepspeed --hostfile [absolute path]/[jobid]-hosts [absolute path]/test.py
Sample hostfile:
t007-007 slots=8
t007-008 slots=8
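One more thing I was planning to try, to narrow down whether the hang is IB-specific: forcing NCCL onto plain TCP sockets by disabling the InfiniBand transport. This is only a sketch; NCCL_IB_DISABLE and NCCL_SOCKET_IFNAME are standard NCCL variables, but 'eth0' is a placeholder for whatever Ethernet interface the nodes actually expose, and the variables have to be set before the first NCCL call:
import os

# Disable the IB transport so NCCL falls back to TCP sockets.
# These must be set before init_process_group creates the NCCL communicator.
os.environ['NCCL_IB_DISABLE'] = '1'
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # placeholder interface name

import torch
import torch.distributed

# Pin each rank to its own GPU before creating the communicator.
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

torch.distributed.init_process_group(backend='nccl')
torch.distributed.barrier()
print('NCCL over TCP works, so the hang is likely in the IB config', flush=True)
If this completes while the original script still hangs, that would point at the HCA/interface settings (NCCL_IB_HCA, NCCL_SOCKET_IFNAME=ib0) rather than the launcher itself.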