Best Procedure For Voice Cloning - My Experience So Far #2507
Replies: 10 comments 17 replies
-
@danablend thanks for sharing your thoughts! Do you mind posting some audio samples of the trained outputs? What are some of your fine-tuning parameters for your…
-
So I just skimmed the YourTTS paper again, and it looks like the speaker_wav input is very important, as it's used to extract the speaker embedding that serves as the reference voice during inference. In the YourTTS paper, they used a clear 20-word sentence (about 5-6 s) from the speaker as the speaker embedding reference (see here: https://edresson.github.io/YourTTS), and it seemed to generate great results! I will play around with this speaker embedding stuff a bunch today - if anyone has know-how on the best way to select a speaker embedding reference file to get great results, let us know! :) So the YourTTS procedure seems clearer now. Something like the following could perhaps give us some good results:
If anyone has any inputs or ways to do this better, I would love to hear your insights!
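For quick experiments with the reference clip, the released multilingual YourTTS checkpoint can be driven through Coqui's high-level Python API, with the reference passed as speaker_wav. A minimal sketch (the text and filenames are placeholders):

```python
# Minimal YourTTS inference sketch using Coqui's high-level Python API.
# The clip passed as speaker_wav is what the speaker embedding is extracted
# from - per the paper, a clear ~5-6 s sentence works well.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This should sound like the reference speaker.",
    speaker_wav="reference_clear_sentence.wav",  # ~5-6 s, clean recording
    language="en",
    file_path="cloned_output.wav",
)
```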
-
@danablend Thanks for initiating this conversation. I mostly agree with the findings. I have taken NanoNomad's tutorials to heart and I am training a Hinglish model now. Can you help me with more resources on how to train a VITS model? You mentioned there are enough resources on the internet for that.
-
Hi @danablend, these are the config settings: dataset_config = BaseDatasetConfig(
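The config line above is cut off; for reference, a minimal sketch of what a typical BaseDatasetConfig looks like (the formatter, paths, and language below are placeholder assumptions, not the commenter's actual values):

```python
# A typical BaseDatasetConfig for an LJSpeech-formatted dataset.
# formatter, paths, and language here are placeholder assumptions.
# (Older Coqui TTS versions use name= instead of formatter=.)
from TTS.tts.configs.shared_configs import BaseDatasetConfig

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",            # metadata.csv in LJSpeech format
    meta_file_train="metadata.csv",
    path="/path/to/dataset/",
    language="en",
)
```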
(Attachment: output.mp4)
-
@danablend @Harrolee
I get the following error:
One more thing: the dataset I used to train my model was low-pitched, hence the volume of the synthesized speech is low. Can I modify any parameter in the audio config to increase the volume of the synthesized speech?
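One possible workaround is to peak-normalize the synthesized audio in a post-processing step. A minimal sketch, assuming numpy and soundfile are installed (filenames are placeholders); Coqui's audio config also has a do_sound_norm option that may be worth experimenting with:

```python
# Peak-normalize a synthesized wav as a post-processing step.
# Assumes numpy and soundfile are installed; filenames are placeholders.
import numpy as np
import soundfile as sf

wav, sr = sf.read("synthesized.wav")
peak = np.abs(wav).max()
if peak > 0:
    wav = wav / peak * 0.95  # leave a little headroom below full scale
sf.write("synthesized_louder.wav", wav, sr)
```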
-
Hi, we are trying to fine-tune with Hindi audio. Each epoch takes approximately 2.5 hours. Can you share the machine configuration used to create this, and the time it took to fine-tune the model? Thanks!
-
@danablend Could you share the code so I could refer to it and try to reproduce the same steps? It would be very helpful.
-
Hi @danablend and others!
To fine-tune, I basically took the following TTS code, train_vits.py, and replicated it in a Google Colab notebook. The only preprocessing procedure applied was downsampling from 44100 Hz to 22050 Hz, which is the sampling rate of VITS (see the resampling sketch below). The model I selected for fine-tuning was en/ljspeech/vits, starting from step 1,000,000 (the one you download by following the tutorial here). To do the fine-tuning, I'm using two Google accounts (when the GPU credits run out, I jump to the other account with the most recent checkpoint). Here are the hyperparameters I specify (on top of the base parameters):
Here are some samples taken during training. The challenges that I have faced so far:
Given @danablend's results, I'll keep training until 1,050,000 steps are reached to see if I can improve my results.
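For the downsampling step mentioned above, a minimal sketch assuming librosa and soundfile are installed (filenames are placeholders):

```python
# Downsample a 44100 Hz clip to the 22050 Hz rate expected by the VITS
# checkpoint. Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

wav, _ = librosa.load("clip_44100hz.wav", sr=44100)
wav_22k = librosa.resample(wav, orig_sr=44100, target_sr=22050)
sf.write("clip_22050hz.wav", wav_22k, 22050)
```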
-
Cool, thanks for sharing!
-
Scroll down to the replies to hear samples!
VITS vs YourTTS - the voice cloning showdown
Let's get to the bottom of this, once and for all!
All models mentioned here are English-language models.
Context
I've been looking for the best framework to clone my voice from a limited amount of audio (20-25 minutes) while keeping training fast and output audio quality high.
I've been playing with VITS and YourTTS for this so far, as the research papers suggest they are the highest-quality options, unless we're getting into Microsoft's NaturalSpeech paper from May 2022, for which no public implementation exists thus far.
Here are the procedures I have been playing with for VITS & YourTTS.
Settings I use when fine-tuning both VITS & YourTTS:
VITS Fine Tuning Procedure
YourTTS Fine Tuning Procedure
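For orientation, here is a condensed sketch along the lines of Coqui's train_vits.py recipe, restoring from a downloaded checkpoint so the run becomes a fine-tune. The dataset layout, paths, and batch sizes below are placeholder assumptions, not my exact settings:

```python
# Condensed sketch of a VITS fine-tune, modeled on Coqui's train_vits.py
# recipe. Paths, batch sizes, and the dataset layout are placeholder
# assumptions, not the settings used in this thread.
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "vits_finetune_output"

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",              # assumes an LJSpeech-style metadata.csv
    meta_file_train="metadata.csv",
    path="/path/to/my_voice_dataset/",
)

config = VitsConfig(
    run_name="vits_finetune_my_voice",
    batch_size=16,
    eval_batch_size=8,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

# Audio processor and tokenizer are built from the config, as in the recipe.
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)

# restore_path is what turns this into a fine-tune: weights are loaded from
# the pretrained LJSpeech checkpoint before training starts.
trainer = Trainer(
    TrainerArgs(restore_path="/path/to/pretrained_vits/model_file.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```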
If you know of better ways to do any of this, I would be extremely happy to hear about your procedure for fine-tuning on a new speaker!
Existing Documentation
The procedure to train the VITS model and fine-tune it to a new speaker seems straightforward, and there is enough information out there to make it work.
I cannot say the same thing for YourTTS, which does not have as much documentation around it, since it was released so recently.
YourTTS seems to beat VITS in voice quality when fine tuning to new speakers, based on the research papers for the two:
The YourTTS paper also suggests that much less data is required for high-quality fine-tuning on new speakers, which is a great upside.
Check NanoNomad's YouTube videos - they are great guides and include Google Colabs. Very helpful if you're just trying to get something running.
My Question For You
What are your experiences with VITS and YourTTS, and do you prefer one over the other? I would be very interested in the quality you experience and what your procedure looks like for YourTTS, as it's less documented.