Description
Hello, I am running habitat-sim 0.1.7 inside a Docker container. Training on a single RTX 3090 works fine, but as soon as I launch the same run on two GPUs it fails with the error below. Could you help me understand what is going wrong? (A minimal per-GPU check I plan to try is included after the log.)
CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash train 2333
train mode
/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:51,734 Initializing dataset VLN-CE-v1
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:52,398 SPLTI: train, NUMBER OF SCENES: 61
2025-01-02 13:50:55,648 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,650 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,717 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,720 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,727 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:55,731 Initializing dataset VLN-CE-v1
2025-01-02 13:50:56,349 initializing sim Sim-v1
2025-01-02 13:50:56,351 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
2025-01-02 13:50:56,430 initializing sim Sim-v1
2025-01-02 13:50:56,432 initializing sim Sim-v1
2025-01-02 13:50:56,436 initializing sim Sim-v1
2025-01-02 13:50:56,440 initializing sim Sim-v1
2025-01-02 13:50:56,443 initializing sim Sim-v1
2025-01-02 13:50:56,444 initializing sim Sim-v1
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 1
WindowlessContext: Unable to create windowless context
Platform::WindowlessEglApplication::tryCreateContext(): unable to find EGL device for CUDA device 0
WindowlessContext: Unable to create windowless context
Traceback (most recent call last):
File "run.py", line 113, in <module>
main()
File "run.py", line 49, in main
run_exp(**vars(args))
File "run.py", line 106, in run_exp
trainer.train()
File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
observation_space, action_space = self._init_envs()
File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
auto_reset_done=False
File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
workers_ignore_signals=workers_ignore_signals,
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in init
read_fn() for read_fn in self._connection_read_fns
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in
read_fn() for read_fn in self._connection_read_fns
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in call
res = self.read_fn()
File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
buf = self.recv_bytes()
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "run.py", line 113, in
main()
File "run.py", line 49, in main
run_exp(**vars(args))
File "run.py", line 106, in run_exp
trainer.train()
File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 451, in train
observation_space, action_space = self._init_envs()
File "/home/ETPNav/vlnce_baselines/ss_trainer_ETP.py", line 168, in _init_envs
auto_reset_done=False
File "/home/ETPNav/vlnce_baselines/common/env_utils.py", line 122, in construct_envs
workers_ignore_signals=workers_ignore_signals,
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in init
read_fn() for read_fn in self._connection_read_fns
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 194, in
read_fn() for read_fn in self._connection_read_fns
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in call
res = self.read_fn()
File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
buf = self.recv_bytes()
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7fec60fd9358>>
Traceback (most recent call last):
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in del
self.close()
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
read_fn()
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in call
res = self.read_fn()
File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
buf = self.recv_bytes()
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError:
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f8dac95c358>>
Traceback (most recent call last):
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 588, in del
self.close()
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 456, in close
read_fn()
File "/home/ETPNav/habitat-lab/habitat/core/vector_env.py", line 97, in call
res = self.read_fn()
File "/home/ETPNav/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
buf = self.recv_bytes()
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/root/miniconda3/envs/vlnce/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2299562) of binary: /root/miniconda3/envs/vlnce/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/miniconda3/envs/vlnce/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/run.py", line 692, in run
)(*cmd_args)
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/vlnce/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run.py FAILED
=======================================
Root Cause:
[0]:
time: 2025-01-02_13:50:58
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 2299562)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2025-01-02_13:50:58
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 2299563)
error_file: <N/A>
msg: "Process failed with exitcode 1"