Description
Hi all,
I'd like to run an IsaacLab container on a cluster that uses enroot. I build the image with ./docker/container.py start, which gives me a Docker container.
On my local machine, I start the container with the following command and get the usual training speed:
docker run --entrypoint tail --gpus all isaac-lab-base -f /dev/null
(NOTE: overriding the entrypoint is important. Otherwise, I think IsaacSim starts inside the container and my training runs are 3x slower.)
However, when I start the same container with enroot on the cluster, training is ~3x slower, even when I override the container's entrypoint.
The speed on the cluster should really be the same as on my local machine (the GPUs involved are an RTX 4090 and an H100). I made sure that I have GPU access both locally and remotely. If I train a plain PyTorch NN, I get the same training speed in both places.
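One way to compare the two environments beyond GPU access is to print a few runtime facts inside both containers and diff the output. The sketch below is only an illustration: the function name collect_runtime_facts is hypothetical, and the choice of what to inspect (usable CPU cores, a few thread- and device-related environment variables, the GPU reported by nvidia-smi) is an assumption about what might differ, not a confirmed cause of the slowdown.

```python
import os
import shutil
import subprocess


def collect_runtime_facts():
    """Gather a few settings that can differ between a local and a cluster run."""
    facts = {}

    # Total logical CPUs reported by the machine.
    facts["cpu_count"] = os.cpu_count()

    # CPUs this process is actually allowed to run on (Linux only).
    if hasattr(os, "sched_getaffinity"):
        facts["usable_cpus"] = len(os.sched_getaffinity(0))
    else:
        facts["usable_cpus"] = None

    # Environment variables that commonly influence threading and GPU visibility.
    for name in ("OMP_NUM_THREADS", "CUDA_VISIBLE_DEVICES", "SLURM_CPUS_PER_TASK"):
        facts[name] = os.environ.get(name)

    # GPU name and driver version, if nvidia-smi is available in the container.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        facts["gpus"] = result.stdout.strip().splitlines()
    else:
        facts["gpus"] = None

    return facts


if __name__ == "__main__":
    for key, value in collect_runtime_facts().items():
        print(f"{key}: {value}")
```

Running the script in the local Docker container and in the enroot container on the cluster gives two short listings that can be compared line by line.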
I have spent about 1.5 days on this and tried all the options I could find in Slurm, Docker, and enroot.
Has anyone had a similar experience, or does anyone have an idea what the issue could be?