This would significantly improve training speed on multi-GPU servers. I noticed that Isaac Lab 2.1 + Isaac Sim 4.5 does support distributed training (see https://isaac-sim.github.io/IsaacLab/v2.1.0/source/features/multi_gpu.html for details).
However, when I run either of the following commands in this project:
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --distributed
# or
python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 legged_lab/scripts/train.py --task=walk --headless --logger=tensorboard --num_envs=4096 --distributed
training always hangs after printing:
Finished preloading
Synchronizing parameters for rank 0....
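
One thing that might help narrow this down is checking that each worker actually receives its own rank from the launcher. The snippet below is only a sketch for that check, not part of this project: the function name `describe_rank` and the file name `check_ranks.py` are hypothetical, and the `cuda:<LOCAL_RANK>` mapping assumes one GPU per local rank. It reads the `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` environment variables that `torch.distributed.run` exports to every worker process.

```python
import os


def describe_rank(env=None):
    """Summarize the per-process variables exported by torch.distributed.run."""
    env = os.environ if env is None else env
    # The launcher sets LOCAL_RANK, RANK and WORLD_SIZE for each worker.
    # The defaults only apply when the script is started without the launcher.
    local_rank = int(env.get("LOCAL_RANK", 0))
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    master_addr = env.get("MASTER_ADDR", "unset")
    master_port = env.get("MASTER_PORT", "unset")
    return {
        "rank": rank,
        "local_rank": local_rank,
        "world_size": world_size,
        # Assumption: one GPU per local rank, indexed by LOCAL_RANK.
        "device": f"cuda:{local_rank}",
        "master": f"{master_addr}:{master_port}",
    }


if __name__ == "__main__":
    # flush=True so the line shows up even if a later step hangs.
    print(describe_rank(), flush=True)
```

Saved as `check_ranks.py` and launched with `python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 check_ranks.py`, it should print one line per worker. If both lines report the same `local_rank` or a `world_size` of 1, the hang is more likely in how the processes are launched than in the training script itself.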