Hi,
I'm having some issues with torchrun on more than one node. The following code works fine if I ask for 1 node and 1 task, but it fails on more than one node (it does work on other clusters):
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
...
export NUM_NODES=${SLURM_NNODES}
export GPUS_PER_NODE=4
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
srun torchrun \
    --nnodes $NUM_NODES \
    --node_rank $SLURM_PROCID \
    --nproc_per_node $GPUS_PER_NODE \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    --rdzv_id "$SLURM_JOB_ID" \
    train.py \
    --num_nodes $NUM_NODES \
    ....
See the logs attached below from 1 node (ok, 1node_ok_slurm-190800.txt), 2 nodes (failed, 2nodes_fail_slurm-190624.txt), and 2 nodes on another cluster (works, 2nodes_othercluster_ok_slurm-6731888.txt). I think it comes down to something about the rendezvous ports. I was hoping someone could let me know what the solution is here. Thanks!
Additionally, I saw on the website that
"special requests that extend beyond the above queue limits may potentially be accommodated on a case-by-case basis"
I was wondering how I can ask to extend beyond the queue limits. We're trying to train some big models, and it would be great if we could get longer queue limits and/or more nodes. Thanks!
Brief discussion of the errors below:
- The port I pass with --rdzv_endpoint (29500) is only used for the initial rendezvous.
- Rank 0 then asks the OS for any free port (in my log it picked 36609) and publishes that port to the other ranks via the rendezvous store.
- All workers must then be able to reach t008-004.hpcfund:36609. When they cannot, the connect() call to the C10d store times out (after 600000 ms in the trace below), producing the stack trace.
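A quick check that would confirm this, run from the second node (a sketch only; the FQDN and the port 36609 are taken from the failing log below, and the port changes every run, so this only makes sense while the job is still hanging):

getent hosts t008-004.hpcfund   # FQDN published by rank 0; "not found" would match the gai error in the log
getent hosts t008-004           # short name used for the 29500 rendezvous, which did succeed
# try to open a TCP connection to the dynamically chosen store port (pure bash, no extra tools)
timeout 5 bash -c 'exec 3<>/dev/tcp/t008-004.hpcfund/36609' && echo reachable || echo unreachable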
I0505 22:21:50.411871 1794449 torch/distributed/run.py:649] Using nproc_per_node=4.
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] entrypoint : train.py
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] nproc_per_node : 4
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_backend : c10d
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_endpoint : t008-004:29500
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_configs : {'timeout': 900}
[repeated for the second task/node]
I0505 22:21:50.436545 2283825 torch/distributed/elastic/agent/server/api.py:860] [default] starting workers for entrypoint: python
I0505 22:21:50.436951 2283825 torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] master_addr=t008-004.hpcfund
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] master_port=36609
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] group_rank=0
[repeated for the second task/node]
...
[W505 22:22:02.546732811 socket.cpp:755] [c10d] The IPv6 network addresses of (t008-004.hpcfund, 36609) cannot be retrieved (gai error: -2 - Name or service not known).
[repeated many times]
...
[E505 22:30:57.854404041 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (t008-004.hpcfund, 36609).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
...
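For what it's worth, a workaround I could try (only a sketch, untested on this cluster) would be to drop the c10d backend and use torchrun's static rendezvous, so the workers connect to the MASTER_PORT I already export (29500) on the hostname that already worked for the 29500 rendezvous, instead of a dynamically chosen port on the FQDN:

# would replace the srun line in the batch script above; the single quotes around bash -c
# make sure $SLURM_NODEID is expanded per task rather than once in the batch shell
srun bash -c 'torchrun \
    --nnodes $NUM_NODES \
    --node_rank $SLURM_NODEID \
    --nproc_per_node $GPUS_PER_NODE \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py --num_nodes $NUM_NODES'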