Why is distributed training not supported? (torch.distributed.run) #56

@Homerain

Distributed training would significantly improve training speed on multi-GPU servers. Isaac Lab 2.1 with Isaac Sim 4.5 supports distributed training (see https://isaac-sim.github.io/IsaacLab/v2.1.0/source/features/multi_gpu.html for details).
However, when I run either of the following commands in this project:

python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --distributed
# or
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --logger=tensorboard --num_envs=4096 --distributed

it always gets stuck at:

Finished preloading 
Synchronizing parameters for rank 0....
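
To narrow down a hang like this, it can help to confirm what each worker process sees before any parameter synchronization starts. The following is a minimal sketch, not code from this project or from Isaac Lab: `describe_worker` is a hypothetical helper that reads the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that `torch.distributed.run` sets for every worker, and derives the device each worker would use under the assumption of one GPU per local rank. If both workers report the same device, or `WORLD_SIZE` is not 2, that would be one possible explanation for a broadcast that never completes.

```python
import os


def describe_worker(env=None):
    """Hypothetical debugging helper (not part of this project or Isaac Lab).

    Summarizes the per-worker variables set by torch.distributed.run.
    Falls back to single-process defaults when a variable is missing.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global rank across all nodes
    local_rank = int(env.get("LOCAL_RANK", 0))  # rank within this node
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of workers
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        # Assumption: one GPU per local rank, so each worker gets its own device.
        "device": f"cuda:{local_rank}",
    }


if __name__ == "__main__":
    # Print once per worker; flush so output appears even if the run hangs later.
    print(describe_worker(), flush=True)
```

Launching this file with the same `python -m torch.distributed.run --nnodes=1 --nproc_per_node=2` prefix should print one line per worker, which shows whether the launcher itself is handing out distinct ranks before the training script is involved.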
