Open
Labels
bug (Something isn't working)
Description
Describe the bug
I've been trying to train YourTTS on a Google Compute instance, but it doesn't work when launched with trainer.distribute.
Previously I could run it, but it would always reach the same point in initialization, then one of the training workers would crash and the others would freeze.
I am running largely unchanged code from the provided recipe. I have only reduced the worker count to fit the cloud instance and added my own dataset.
Without distributed training it previously trained fine until it ran out of VRAM, and training locally on a 3090 works fine, if slowly.
Also, TTS is installed at the latest version. I'm not sure why collect_env_info.py didn't pick it up.
To Reproduce
- Run CUDA_VISIBLE_DEVICES="0,1,2,3" python -m trainer.distribute --script train_yourtts.py on a Google Compute instance
- Wait several seconds
- Error.
Expected behavior
Runs the training script with processing split between the GPUs.
Logs
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1666, in fit
    self._fit()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1618, in _fit
    self.train_epoch()
  File "/opt/conda/lib/python3.10/site-packages/trainer/trainer.py", line 1350, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/TTS/tts/models/vits.py", line 263, in __getitem__
    item = self.samples[idx]
TypeError: list indices must be integers or slices, not list
Environment
{
"CUDA": {
"GPU": [
"Tesla T4",
"Tesla T4",
"Tesla T4",
"Tesla T4"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "2.0.1+cu117",
"Trainer": "v0.0.27",
"numpy": "1.23.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
""
],
"processor": "",
"python": "3.10.10",
"version": "#1 SMP Debian 5.10.179-1 (2023-05-12)"
}
}
Additional context
No response
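To illustrate what the final TypeError in the logs means, here is a minimal pure-Python sketch. It does not use PyTorch or TTS: ToyDataset and the fetch helper are hypothetical stand-ins that only copy the shape of the two lines shown in the traceback (the list comprehension from fetch.py and the self.samples[idx] lookup from vits.py). It shows that the error appears when each element handed to __getitem__ is itself a list of indices rather than a single integer. Why that happens under trainer.distribute is not established here; a sampler that yields whole batches of indices is one possible cause, and that is an assumption.

```python
class ToyDataset:
    """Hypothetical stand-in for the dataset in the traceback; wraps a plain list."""

    def __init__(self, samples):
        self.samples = samples

    def __getitem__(self, idx):
        # Mirrors `item = self.samples[idx]` from the traceback.
        return self.samples[idx]


def fetch(dataset, possibly_batched_index):
    # Mirrors the list comprehension shown at fetch.py line 51 in the traceback.
    return [dataset[idx] for idx in possibly_batched_index]


dataset = ToyDataset(["a", "b", "c", "d"])

# Normal case: every index is an integer, so the lookup succeeds.
ok = fetch(dataset, [0, 1])

# Failing case: every "index" is a list of indices, so the lookup raises TypeError.
try:
    fetch(dataset, [[0, 1], [2, 3]])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

If this matches the situation in the logs, the place to look would be wherever the index sequence for the DataLoader is constructed in the distributed code path, rather than the dataset itself.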