Description
Distributed training on multiple GPUs raises the error below. No backend was specified, so PyTorch Lightning falls back to `accelerator="ddp_spawn"` (see the warning in the log).
```
dcrnn_gpu.py:16: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
2021-05-14 04:43:15.966166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "dcrnn_gpu.py", line 34, in <module>
    trainer.fit(model, data["train_loader"], data["val_loader"])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 107, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 131, in reduce_tensor
    storage = tensor.storage()
RuntimeError: sparse tensors do not have storage
```
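The last two frames show the root cause: `ddp_spawn` launches worker processes with `torch.multiprocessing.spawn`, which serializes the process object (and every tensor reachable from the model) with a `ForkingPickler`; PyTorch's registered tensor reducer calls `tensor.storage()`, and sparse tensors have no storage. A minimal sketch that exercises the same reduction path outside of Lightning, assuming the PyTorch 1.x build from the traceback (newer releases may serialize sparse tensors differently):

```python
import io
from multiprocessing.reduction import ForkingPickler

import torch
import torch.multiprocessing  # registers PyTorch's reduce_tensor with ForkingPickler

# Any sparse tensor reachable from the LightningModule -- e.g. a sparse
# adjacency matrix in a DCRNN -- goes through this reducer when ddp_spawn
# pickles the model for its worker processes.
adj = torch.eye(4).to_sparse()

try:
    ForkingPickler(io.BytesIO()).dump(adj)
except RuntimeError as err:
    print(err)  # -> sparse tensors do not have storage
```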
In a distributed environment, switching the sparse tensor implementation to dense operations solves this problem: dense tensors do have storage, so `ddp_spawn` can pickle them when it launches the worker processes. However, scalability must be taken into consideration, since the dense representation takes much more memory (roughly O(N²) for an N-node adjacency matrix versus O(nnz) for its sparse form).
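A minimal sketch of that workaround, assuming the sparse tensors are registered as module buffers (the attribute layout of the model in dcrnn_gpu.py isn't shown in this report, so `densify_buffers` is an illustrative helper, not part of the script):

```python
import torch

def densify_buffers(module: torch.nn.Module) -> torch.nn.Module:
    """Replace every sparse buffer with its dense equivalent, recursively."""
    for child in module.modules():  # yields the module itself and all submodules
        for name, buf in child.named_buffers(recurse=False):
            if buf.is_sparse:
                # Assigning a tensor to a registered buffer name updates the
                # buffer; to_dense() trades O(nnz) memory for the full matrix.
                setattr(child, name, buf.to_dense())
    return module

# Densify before handing the model to the trainer, e.g.:
# trainer.fit(densify_buffers(model), data["train_loader"], data["val_loader"])
```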