
Add support for distributed training #34

Open
@akashshah59

Description


Running distributed training on multiple GPUs produces the following error:

dcrnn_gpu.py:16: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
2021-05-14 04:43:15.966166: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "dcrnn_gpu.py", line 34, in <module>
    trainer.fit(model, data["train_loader"], data["val_loader"])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 107, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 179, in start_processes
    process.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in _init_
    super()._init_(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in _init_
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 131, in reduce_tensor
    storage = tensor.storage()
RuntimeError: sparse tensors do not have storage
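
As an aside, the YAMLLoadWarning at the top of the log is unrelated to the crash and goes away once an explicit loader is passed (a minimal sketch, assuming the config file is plain YAML; the path is illustrative):

```python
import yaml

with open("config.yaml") as f:  # illustrative path, not the project's actual config
    config = yaml.load(f, Loader=yaml.SafeLoader)  # equivalent to yaml.safe_load(f)
```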

The failure happens because `ddp_spawn` pickles the model to start the worker processes, and sparse tensors cannot be serialized by the `ForkingPickler` since they have no underlying storage. In a distributed environment, switching the sparse tensor operations to dense ones avoids this error. However, scalability must be taken into consideration, since dense implementations use considerably more memory.
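
A minimal sketch of the workaround, assuming the model keeps its graph as a sparse adjacency tensor (the names below are illustrative, not the project's actual API):

```python
import torch

# Toy sparse adjacency matrix standing in for the DCRNN graph.
indices = torch.tensor([[0, 1, 2], [1, 2, 0]])
values = torch.ones(3)
adj_sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

# Densify before the model is handed to the Trainer, so that ddp_spawn's
# ForkingPickler never sees a sparse tensor. The trade-off: memory grows
# from O(nnz) to O(N^2) for an N-node graph.
adj_dense = adj_sparse.to_dense()
```

Another option, hinted at by the warning in the log, would be to request the non-spawn backend explicitly (e.g. `Trainer(gpus=8, accelerator="ddp")`), which launches workers as separate script invocations and avoids pickling the model; whether the sparse ops then work end-to-end under DDP would still need to be verified.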
