[Bug] distribute with --use_ddp=true times out with error "1/4 clients joined" #152

@devops724

Describe the bug

python -m TTS.bin.train_tts --config_path finetune_config.json --restore_path /home/user/.local/share/tts/tts_models--fa--custom--glow-tts/model_file.pth --use_ddp=true --gpus="0,1,2,3"
Found 24005 files in /home/user/workspace/dataset/train-tts3/dataset
Using model: glow_tts
Setting up Audio Processor...
| sample_rate: 22050
| resample: False
| num_mels: 80
| log_func: np.log10
| min_level_db: -100
| frame_shift_ms: None
| frame_length_ms: None
| ref_level_db: 20
| fft_size: 1024
| power: 1.5
| preemphasis: 0.0
| griffin_lim_iters: 60
| signal_norm: True
| symmetric_norm: True
| mel_fmin: 0
| mel_fmax: None
| pitch_fmin: 1.0
| pitch_fmax: 640.0
| spec_gain: 20.0
| stft_pad_mode: reflect
| max_norm: 4.0
| clip_norm: True
| do_trim_silence: True
| trim_db: 45
| do_sound_norm: False
| do_amp_to_db_linear: True
| do_amp_to_db_mel: True
| do_rms_norm: False
| db_level: None
| stats_path: None
| base: 10
| hop_length: 256
| win_length: 1024
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)

Training Environment:
| > Backend: Torch
| > Mixed precision: True
| > Precision: fp16
| > Current device: 0
| > Num. of GPUs: 4
| > Num. of CPUs: 48
| > Num. of Torch Threads: 24
| > Torch seed: 54321
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
| > Torch TF32 MatMul: False
Start Tensorboard: tensorboard --logdir=glowtts_persian_finetune-March-07-2025_01+37AM-0000000
Using PyTorch DDP
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 77, in <module>
    main()
  File "/home/user/workspace/dataset/coqui-ai-TTS/TTS/bin/train_tts.py", line 63, in main
    trainer = Trainer(
              ^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/trainer.py", line 310, in __init__
    init_distributed(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/trainer/utils/distributed.py", line 65, in init_distributed
    dist.init_process_group(
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1714, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 226, in _tcp_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/dataset/TTS/venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 194, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/4 clients joined.
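
For anyone reading the error above: the message says the rendezvous store expected 4 participants (one per GPU in --gpus="0,1,2,3") and only 1 connected before the timeout. One plausible reading, which is an assumption and not something this report confirms, is that only the rank-0 process was running and the other three ranks never started. The sketch below is only an analogy built on Python's standard `threading.Barrier`; it does not use PyTorch, and `simulate_rendezvous` is a hypothetical name. It shows the same pattern of "wait for N participants, give up after a timeout".

```python
import threading


def simulate_rendezvous(world_size: int, launched: int, timeout: float = 0.2) -> bool:
    """Return True if all `world_size` ranks meet before `timeout` seconds.

    Only `launched` ranks are actually started, mimicking a run in which
    some worker processes never come up. This is an analogy for a
    distributed rendezvous, not PyTorch's implementation.
    """
    barrier = threading.Barrier(world_size)
    outcomes = []  # one bool per launched rank

    def rank_main() -> None:
        try:
            barrier.wait(timeout=timeout)  # block until every rank has arrived
            outcomes.append(True)
        except threading.BrokenBarrierError:
            outcomes.append(False)  # gave up: not enough ranks showed up

    threads = [threading.Thread(target=rank_main) for _ in range(launched)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(outcomes) == world_size and all(outcomes)


if __name__ == "__main__":
    print(simulate_rendezvous(world_size=4, launched=4))  # every rank started
    print(simulate_rendezvous(world_size=4, launched=1))  # one rank only, as in the log
```

If that reading applies here, the thing to check is whether four training processes actually exist while rank 0 is waiting (for example with nvidia-smi or ps), rather than anything in the model config.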
cat finetune_config.json
{
"run_name": "glowtts_persian_finetune",
"model": "glow_tts",
"batch_size": 8,
"eval_batch_size": 4,
"num_loader_workers": 4,
"num_eval_loader_workers": 4,
"run_eval": true,
"test_delay_epochs": 5,
"epochs": 1000,
"text_cleaner": "phoneme_cleaners",
"use_phonemes": true,
"phoneme_language": "fa",
"phoneme_cache_path": "ph_cache",
"enable_eos_bos_chars": false,
"precompute_num_workers": 4,
"print_step": 10,
"print_eval": true,
"mixed_precision": true,
"output_path": "./",
"lr": 0.0001,
"characters": {
"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
"vocab_dict": null,
"pad": "",
"eos": "",
"bos": "",
"blank": "",
"characters": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^{}",
"punctuations": "!(),-.:;? \u0320\u060c\u061b\u061f\u200c<>",
"phonemes": "\u02c8\u02cc\u02d0\u02d1pbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029faegiouwy\u026a\u028a\u0329\u00e6\u0251\u0254\u0259\u025a\u025b\u025d\u0268\u0303\u0289\u028c\u028d0123456789\"#$%+/=ABCDEFGHIJKLMNOPRSTUVWXYZ[]^_{}",
"is_unique": true,
"is_sorted": true
},
"datasets": [
{
"formatter": "ljspeech",
"path": "./dataset/",
"meta_file_train": "tts_dataset.csv",
"ignored_speakers": []
}
],
"test_sentences": [
"\u0633\u0644\u0637\u0627\u0646 \u0645\u062d\u0645\u0648\u062f \u062f\u0631 \u0632\u0645\u0633\u062a\u0627\u0646\u06cc \u0633\u062e\u062a \u0628\u0647 \u0637\u0644\u062e\u06a9 \u06af\u0641\u062a \u06a9\u0647: \u0628\u0627 \u0627\u06cc\u0646 \u062c\u0627\u0645\u0647 \u06cc \u06cc\u06a9 \u0644\u0627 \u062f\u0631 \u0627\u06cc\u0646 \u0633\u0631\u0645\u0627 \u0686\u0647 \u0645\u06cc \u06a9\u0646\u06cc ",
"\u0645\u0631\u062f\u06cc \u0646\u0632\u062f \u0628\u0642\u0627\u0644\u06cc \u0622\u0645\u062f \u0648 \u06af\u0641\u062a \u067e\u06cc\u0627\u0632 \u0647\u0645 \u062f\u0647 \u062a\u0627 \u062f\u0647\u0627\u0646 \u0628\u062f\u0627\u0646 \u062e\u0648 \u0634\u0628\u0648\u06cc \u0633\u0627\u0632\u0645.",
"\u0627\u0632 \u0645\u0627\u0644 \u062e\u0648\u062f \u067e\u0627\u0631\u0647 \u0627\u06cc \u06af\u0648\u0634\u062a \u0628\u0633\u062a\u0627\u0646 \u0648 \u0632\u06cc\u0631\u0647 \u0628\u0627\u06cc\u06cc \u0645\u0639\u0637\u0651\u0631 \u0628\u0633\u0627\u0632",
"\u06cc\u06a9 \u0628\u0627\u0631 \u0647\u0645 \u0627\u0632 \u062c\u0647\u0646\u0645 \u0628\u06af\u0648\u06cc\u06cc\u062f.",
"\u06cc\u06a9\u06cc \u0627\u0633\u0628\u06cc \u0628\u0647 \u0639\u0627\u0631\u06cc\u062a \u062e\u0648\u0627\u0633\u062a"
]
}

To Reproduce

python -m TTS.bin.train_tts --config_path finetune_config.json --restore_path /home/user/.local/share/tts/tts_models--fa--custom--glow-tts/model_file.pth --use_ddp=true --gpus="0,1,2,3"

Expected behavior

No response

Logs

Environment

pip freeze | grep TTS
-e git+https://github.com/idiap/coqui-ai-TTS.git@4c593c620854d9cd2e177382abf48082f7c9f2ae#egg=coqui_tts
pip freeze | grep torch
torch==2.6.0
torchaudio==2.6.0

Additional context

No response
