
internode communication with torchrun #47

@pchliu

Description

Hi,

I'm having some issues with torchrun on more than one node. The following script works fine if I ask for 1 node and 1 task, but it fails when I request more than one node on this cluster (the same script works on other clusters):

#!/bin/bash
#SBATCH -N 2               # 2 nodes
#SBATCH -n 2               # 1 task (one torchrun agent) per node
...

export NUM_NODES=${SLURM_NNODES}
export GPUS_PER_NODE=4

# Use the first node in the allocation as the rendezvous host.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO

# One torchrun agent per node; each agent spawns GPUS_PER_NODE workers.
srun torchrun \
    --nnodes         $NUM_NODES \
    --node_rank      $SLURM_PROCID \
    --nproc_per_node $GPUS_PER_NODE \
    --rdzv_backend   c10d \
    --rdzv_endpoint  "$MASTER_ADDR:$MASTER_PORT" \
    --rdzv_id        "$SLURM_JOB_ID" \
    train.py \
        --num_nodes $NUM_NODES \
        ....
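
For context, torchrun sets MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK and WORLD_SIZE for every worker it spawns, and the entrypoint then just does the standard env:// initialization. A minimal sketch of such an entrypoint (not my actual train.py, just the usual pattern) looks like this:

# minimal_train.py - sketch of a torchrun worker entrypoint (illustrative only)
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment
    # that torchrun populated, and connects to MASTER_ADDR:MASTER_PORT.
    dist.init_process_group(backend="nccl")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} connected via "
          f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()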

See the logs attached below from 1 node (ok, 1node_ok_slurm-190800.txt), 2 nodes (failed, 2nodes_fail_slurm-190624.txt), and 2 nodes on another cluster (works, 2nodes_othercluster_ok_slurm-6731888.txt). I think it comes down to something about the rendezvous ports. Could someone let me know what the solution is here? Thanks!


Additionally, I saw on the website that

special requests that extend beyond the above queue limits may potentially be accommodated on a case-by-case basis

I was wondering how I can ask to extend beyond the queue limits? We're trying to train some big models, and it would be great if we could get access to longer queue limits and/or more nodes. Thanks!


Brief discussion of the errors below:

  1. The port I pass with --rdzv_endpoint (29500) is only used for the initial rendezvous.
  2. The agent on rank 0 then asks the OS for a free port (in my log it picked 36609) and publishes it to the other ranks via the rendezvous store (a minimal sketch of this step is shown just after this list).
  3. All workers must then be able to reach t008-004.hpcfund:36609. When they cannot, the connect() call in the C10d store eventually times out (after 600000 ms in the trace excerpted below).
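
For illustration, the sketch below shows the mechanism behind step 2 (it is not the actual torch.distributed.elastic code, just the same idea): binding a socket to port 0 makes the OS hand back an unused ephemeral port, and that host:port pair is what the agent writes into the rendezvous store for the other ranks to read.

# free_port_sketch.py - illustrative only, not the real elastic agent code
import socket

def pick_free_port() -> int:
    # Bind to port 0 so the OS assigns an arbitrary free ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

master_addr = socket.getfqdn()   # on the failing run this is t008-004.hpcfund
master_port = pick_free_port()   # e.g. 36609 in my log
print(f"publishing {master_addr}:{master_port} to the other ranks")

Relevant excerpts from 2nodes_fail_slurm-190624.txt:
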
I0505 22:21:50.411871 1794449 torch/distributed/run.py:649] Using nproc_per_node=4.
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195] Starting elastic_operator with launch configs:
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195]   entrypoint       : train.py
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195]   nproc_per_node   : 4
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195]   rdzv_backend     : c10d
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195]   rdzv_endpoint    : t008-004:29500
I0505 22:21:50.412315 1794449 torch/distributed/launcher/api.py:195]   rdzv_configs     : {'timeout': 900}
[repeated for the second task/node]

I0505 22:21:50.436545 2283825 torch/distributed/elastic/agent/server/api.py:860] [default] starting workers for entrypoint: python
I0505 22:21:50.436951 2283825 torch/distributed/elastic/agent/server/api.py:677] [default] Rendezvous'ing worker group
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525] [default] Rendezvous complete for workers. Result:
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525]   master_addr=t008-004.hpcfund
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525]   master_port=36609
I0505 22:21:51.593055 1794449 torch/distributed/elastic/agent/server/api.py:525]   group_rank=0
[repeated for the second task/node]
...

[W505 22:22:02.546732811 socket.cpp:755] [c10d] The IPv6 network addresses of (t008-004.hpcfund, 36609) cannot be retrieved (gai error: -2 - Name or service not known).
[repeated many times]
...

[E505 22:30:57.854404041 socket.cpp:1019] [c10d] The client socket has timed out after 600000ms while trying to connect to (t008-004.hpcfund, 36609).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1021 (most recent call first):
...
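
To double-check that this is a plain name-resolution / reachability problem rather than anything torch-specific, the small check below mimics what the C10d TCPStore client is doing: resolve the published master address, then open a TCP connection to it. The host and port are the values from my failing log and would need adjusting; the idea is to run it from the second node (e.g. via srun) while something on the first node is listening on that port.

# reachability_check.py - mimic the failing TCPStore client connect (illustrative)
import socket

HOST = "t008-004.hpcfund"   # master_addr published in the failing log
PORT = 36609                # master_port published in the failing log

try:
    infos = socket.getaddrinfo(HOST, PORT, type=socket.SOCK_STREAM)
    print("resolved:", [ai[4] for ai in infos])
except socket.gaierror as err:
    # Matches the "gai error: -2 - Name or service not known" warnings above.
    print("name resolution failed:", err)
else:
    try:
        with socket.create_connection((HOST, PORT), timeout=10):
            print("TCP connect OK")
    except OSError as err:
        # Corresponds to the C10d client socket timing out while connecting.
        print("connect failed:", err)

If the FQDN t008-004.hpcfund simply does not resolve from the other compute node here, that would also explain why the identical script works on the other cluster.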
