Why is distributed training not supported? (torch.distributed.run)

This would significantly improve training speed on multi-GPU servers. I noticed that Isaac Lab 2.1 + Isaac Sim 4.5 does support distributed training (see https://isaac-sim.github.io/IsaacLab/v2.1.0/source/features/multi_gpu.html for details).
However, when I run the command in this project as follows:
```bash
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --distributed
# or
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --logger=tensorboard --num_envs=4096 --distributed
``` 

it always gets stuck at:
```bash
Finished preloading 
Synchronizing parameters for rank 0....
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why is distributed training not supported? (torch.distributed.run) #56

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Why is distributed training not supported? (torch.distributed.run) #56

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions