Hello,
I am trying to set up multinode training using deepspeed (0.12.2) with torch (2.0.1) and the PDSH (2.35) launcher.
I installed PDSH by hand and everything works fine on a single node.
However, in a 2-node setting, the code blocks at the torch initialization step.
Digging into the details, it seems it is not managing to establish inter-node communication (it times out waiting for the workers on the 2nd node to connect to the master).
I assume this is an NCCL/InfiniBand issue, probably misconfigured on my side, but I am having trouble validating this hypothesis as most diagnostic tools require root.
Do you have any intuition as to why this is not working, or solutions/options/configurations I should explore? Can you see any issues in the NCCL config below, or could you confirm that everything is working on your side?
Thanks for the help!
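To rule out a basic networking problem first, below is a minimal check I can run without root (only a sketch, assuming the launcher exports the usual MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE variables for each process). It uses the gloo backend, which goes over plain TCP sockets and does not touch NCCL or InfiniBand, so if this also hangs the problem is general inter-node reachability rather than the IB configuration:
import torch.distributed as dist

# Relies on MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE being set by the launcher.
print('Initializing gloo process group...', flush=True)
dist.init_process_group(backend='gloo')  # TCP only, no NCCL/IB involved
dist.barrier()
print(f'Rank {dist.get_rank()}/{dist.get_world_size()} passed the barrier', flush=True)
dist.destroy_process_group()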
Below are the NCCL environment variables set:
export NCCL_SOCKET_IFNAME=ib0
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0,mlx5_2
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_BLOCKING_WAIT=1
export NCCL_LAUNCH_MODE=PARALLEL
The code I am using to test this:
import os
import torch
import torch.distributed
# print('MASTER_PORT = ', os.getenv('MASTER_PORT'))
# print('MASTER_ADDR = ', os.getenv('MASTER_ADDR'))
# print('WORLD_SIZE = ', os.getenv('WORLD_SIZE'))
# print('RANK = ', os.getenv('RANK'))
# print('LOCAL_RANK = ', os.getenv('LOCAL_RANK'))
print('About to initialize PyTorch Distributed...', flush=True)
torch.distributed.init_process_group(backend='nccl') # <=== Hangs here
print('Completed initialization of PyTorch Distributed', flush=True)
print('Entering barrier...', flush=True)
torch.distributed.barrier()
print('Done with barrier', flush=True)
Command used to run:
deepspeed --hostfile [absolute path]/[jobid]-hosts [absolute path]/test.py
Sample hostfile:
t007-007 slots=8
t007-008 slots=8
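One more thing I was planning to try, to narrow down whether the hang is IB-specific: forcing NCCL onto plain TCP sockets by disabling the InfiniBand transport. This is only a sketch; NCCL_IB_DISABLE and NCCL_SOCKET_IFNAME are standard NCCL variables, but 'eth0' is a placeholder for whatever Ethernet interface the nodes actually expose, and the variables have to be set before the first NCCL call:
import os

# Disable the IB transport so NCCL falls back to TCP sockets.
# These must be set before init_process_group creates the NCCL communicator.
os.environ['NCCL_IB_DISABLE'] = '1'
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'  # placeholder interface name

import torch
import torch.distributed

# Pin each rank to its own GPU before creating the communicator.
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

torch.distributed.init_process_group(backend='nccl')
torch.distributed.barrier()
print('NCCL over TCP works, so the hang is likely in the IB config', flush=True)
If this completes while the original script still hangs, that would point at the HCA/interface settings (NCCL_IB_HCA, NCCL_SOCKET_IFNAME=ib0) rather than the launcher itself.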