
[Bug] Unable to run distributed training using TTS recipe for yourtts #113

@SenkuwuWasTaken


Describe the bug

I've been trying to train YourTTS on a Google Compute Engine instance, but it doesn't work with trainer.distribute.
In earlier runs it would get to the same point in initialization and then one of the training workers would crash while the others froze.
I am running largely unchanged code from the provided recipe; I have only reduced the worker count to suit the cloud instance and added my own dataset.
Without distributed training it previously trained fine until it ran out of VRAM, and training locally on a 3090 works fine, if slowly.

Also, TTS is installed at the latest version; I'm not sure why collect_env_info.py didn't pick it up.

To Reproduce

  1. On a Google Compute Engine instance, run `CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py` (repeated in the block after this list).
  2. Wait several seconds.
  3. Observe the error.
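
For copy-paste convenience, the exact launch command from step 1:

```shell
CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py
```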

Expected behavior

The training script runs with the workload split across the GPUs.

Logs

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list
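
For context, the final TypeError is the generic Python error raised when a plain list is indexed with a list rather than an int. A minimal, TTS-independent sketch that reproduces the message (the `samples` list here is a hypothetical stand-in for `Vits.samples`, not the actual dataset):

```python
# Hypothetical stand-in for Vits.samples; any plain Python list behaves the same.
samples = ["sample_a", "sample_b", "sample_c"]

print(samples[1])  # an int index works: prints "sample_b"

try:
    samples[[0, 1]]  # indexing a plain list with a list of indices fails
except TypeError as e:
    print(e)  # list indices must be integers or slices, not list
```

This is consistent with the DataLoader handing `Vits.__getitem__` a whole batch of indices (as a batch sampler would yield) instead of a single index, so that `idx` in `self.samples[idx]` is itself a list; that is a hypothesis based on the trace, not a verified root cause.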

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1+cu117",
        "Trainer": "v0.0.27",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "",
        "python": "3.10.10",
        "version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
    }
}

Additional context

No response
