Hi,
I'm having some issues with torchrun on more than one node. The following code works fine if I ask for 1 node and 1 task, but it fails on more than one node (it does work on other clusters):
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
...
export NUM_NODES=${SLURM_NNODES}
export GPUS_PER_NODE=4
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
srun torchrun \
    --nnodes $NUM_NODES \
    --node_rank $SLURM_PROCID \
    --nproc_per_node $GPUS_PER_NODE \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    --rdzv_id "$SLURM_JOB_ID" \
    train.py \
    --num_nodes $NUM_NODES \
    ....
See the logs attached below from 1 node (ok, 1node_ok_slurm-190800.txt), 2 nodes (failed, 2nodes_fail_slurm-190624.txt), and 2 nodes on another cluster (works, 2nodes_othercluster_ok_slurm-6731888.txt). I think it comes down to something about the rendezvous ports. I was hoping someone could let me know what the solution is here. Thanks!
Additionally, I saw on the website that
"special requests that extend beyond the above queue limits may potentially be accommodated on a case-by-case basis"
I was wondering how I can ask to extend beyond the queue limits. We're trying to train some big models, and it would be great if we could get longer queue limits and/or more nodes. Thanks!
Brief discussion of the errors below:
- The port I pass with --rdzv_endpoint (29500) is only used for the initial rendezvous.
- Rank 0 then asks the OS for any free port (in my log it picked 36609) and publishes that port to the other ranks via the rendezvous store.
- All workers must then be able to reach t008-004.hpcfund:36609. When they cannot, the connect() call to the C10d store times out (after 600000 ms in the trace below), producing the stack trace.
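A quick check that would confirm this, run from the second node (a sketch only; the FQDN and the port 36609 are taken from the failing log below, and the port changes every run, so this only makes sense while the job is still hanging):

getent hosts t008-004.hpcfund   # FQDN published by rank 0; "not found" would match the gai error in the log
getent hosts t008-004           # short name used for the 29500 rendezvous, which did succeed
# try to open a TCP connection to the dynamically chosen store port (pure bash, no extra tools)
timeout 5 bash -c 'exec 3<>/dev/tcp/t008-004.hpcfund/36609' && echo reachable || echo unreachable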
I0505 22:21:50.411871 1794449 torch/distributed/run.py:649] Using nproc_per_node=4.
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] entrypoint : train.py
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] nproc_per_node : 4
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_backend : c10d
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_endpoint : t008-004:29500
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] rdzv_configs : {'timeout': 900}
[repeated for the second task/node]
I0505 22:21:50.436545 2283825 torch/distributed/elastic/agent/server/api.py:860] [default] starting workers for entrypoint: python
I0505 22:21:50.436951 2283825 torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] master_addr=t008-004.hpcfund
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] master_port=36609
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] group_rank=0
[repeated for the second task/node]
...
[W505 22:22:02.546732811 socket.cpp:755] [c10d] The IPv6 network addresses of (t008-004.hpcfund, 36609) cannot be retrieved (gai error: -2 - Name or service not known).
[repeated many times]
...
[E505 22:30:57.854404041 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (t008-004.hpcfund, 36609).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
...
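For what it's worth, a workaround I could try (only a sketch, untested on this cluster) would be to drop the c10d backend and use torchrun's static rendezvous, so the workers connect to the MASTER_PORT I already export (29500) on the hostname that already worked for the 29500 rendezvous, instead of a dynamically chosen port on the FQDN:

# would replace the srun line in the batch script above; the single quotes around bash -c
# make sure $SLURM_NODEID is expanded per task rather than once in the batch shell
srun bash -c 'torchrun \
    --nnodes $NUM_NODES \
    --node_rank $SLURM_NODEID \
    --nproc_per_node $GPUS_PER_NODE \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    train.py --num_nodes $NUM_NODES'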