Best Procedure For Voice Cloning - My Experience So Far #2507
Replies: 10 comments 17 replies
-
@danablend thanks for sharing your thoughts! Do you mind posting some audio samples of the trained outputs? What are some of your fine-tuning parameters for your…
-
So I just skimmed the YourTTS paper again, and it looks like the speaker_wav input is very important, as it's used to extract the speaker embedding that serves as the reference voice during inference. In the YourTTS paper, they used a clear 20-word sentence (about 5-6 s) from the speaker as the speaker embedding reference (see here: https://edresson.github.io/YourTTS), and it seemed to generate great results! I will play around with this speaker embedding stuff a bunch today - if anyone has know-how on the best way to select a speaker embedding reference file to get great results, let us know! :) So the YourTTS procedure seems clearer now. Something like the following could perhaps give us some good results:
If anyone has any inputs or ways to do this better, I would love to hear your insights!
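For quick experiments with the reference clip, the released multilingual YourTTS checkpoint can be driven through Coqui's high-level Python API, with the reference passed as speaker_wav. A minimal sketch (the text and filenames are placeholders):

```python
# Minimal YourTTS inference sketch using Coqui's high-level Python API.
# The clip passed as speaker_wav is what the speaker embedding is extracted
# from - per the paper, a clear ~5-6 s sentence works well.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This should sound like the reference speaker.",
    speaker_wav="reference_clear_sentence.wav",  # ~5-6 s, clean recording
    language="en",
    file_path="cloned_output.wav",
)
```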
-
@danablend Thanks for initiating this conversation. I mostly agree with the findings. I have taken NanoNomad's tutorials to heart and I am training a Hinglish model now. Can you help me with more resources on how to train a VITS model? You mentioned there are enough resources on the internet for that.
-
Hi @danablend, these are the config settings: dataset_config = BaseDatasetConfig(
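The config line above is cut off; for reference, a minimal sketch of what a typical BaseDatasetConfig looks like (the formatter, paths, and language below are placeholder assumptions, not the commenter's actual values):

```python
# A typical BaseDatasetConfig for an LJSpeech-formatted dataset.
# formatter, paths, and language here are placeholder assumptions.
# (Older Coqui TTS versions use name= instead of formatter=.)
from TTS.tts.configs.shared_configs import BaseDatasetConfig

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",            # metadata.csv in LJSpeech format
    meta_file_train="metadata.csv",
    path="/path/to/dataset/",
    language="en",
)
```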
(Attachment: output.mp4)
-
@danablend @Harrolee
I get the following error:
One more thing: the dataset I used to train my model was low-pitched, hence the volume of the synthesized speech is low. Can I modify any parameter in the audio config to increase the volume of the synthesized speech?
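One possible workaround is to peak-normalize the synthesized audio in a post-processing step. A minimal sketch, assuming numpy and soundfile are installed (filenames are placeholders); Coqui's audio config also has a do_sound_norm option that may be worth experimenting with:

```python
# Peak-normalize a synthesized wav as a post-processing step.
# Assumes numpy and soundfile are installed; filenames are placeholders.
import numpy as np
import soundfile as sf

wav, sr = sf.read("synthesized.wav")
peak = np.abs(wav).max()
if peak > 0:
    wav = wav / peak * 0.95  # leave a little headroom below full scale
sf.write("synthesized_louder.wav", wav, sr)
```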
-
Hi, we are trying to fine-tune with Hindi audio. Each epoch takes approximately 2.5 hours. Can you share the machine configuration used to create this, and the time it took to fine-tune the model? Thanks!
-
@danablend Could you share the code so I could refer to it and try to reproduce the same steps? It would be very helpful.
-
Hi @danablend and others!
To fine-tune, I basically took the following TTS code, train_vits.py, and replicated it in a Google Colab notebook. The only preprocessing procedure applied was downsampling from 44100 Hz to 22050 Hz, which is the sampling rate of VITS (see the resampling sketch below). The model I selected for fine-tuning was en/ljspeech/vits, starting from step 1,000,000 (the one you download by following the tutorial here). To do the fine-tuning, I'm using two Google accounts (when the GPU credits run out, I jump to the other account with the most recent checkpoint). Here are the hyperparameters I specify (on top of the base parameters):
Here are some samples taken during training. The challenges that I have faced so far:
Given @danablend's results, I'll keep training until 1,050,000 steps are reached to see if I can improve my results.
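For the downsampling step mentioned above, a minimal sketch assuming librosa and soundfile are installed (filenames are placeholders):

```python
# Downsample a 44100 Hz clip to the 22050 Hz rate expected by the VITS
# checkpoint. Assumes librosa and soundfile are installed.
import librosa
import soundfile as sf

wav, _ = librosa.load("clip_44100hz.wav", sr=44100)
wav_22k = librosa.resample(wav, orig_sr=44100, target_sr=22050)
sf.write("clip_22050hz.wav", wav_22k, 22050)
```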
-
Cool, thanks for sharing!
-
Scroll down to the replies to hear samples!
VITS vs YourTTS - the voice cloning showdown
Let's get to the bottom of this, once and for all!
All models mentioned here are English-language models.
Context
I've been looking for the best framework to clone my voice from a limited amount of audio (20-25 minutes) while keeping training fast and output audio quality high.
I've been playing with VITS and YourTTS for this so far, as the research papers suggest they are the highest-quality options, unless we're getting into Microsoft's NaturalSpeech paper from May 2022, for which no public implementation exists thus far.
Here are the procedures I have been playing with for VITS & YourTTS.
Settings I use when fine-tuning both VITS & YourTTS:
VITS Fine Tuning Procedure
YourTTS Fine Tuning Procedure
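For orientation, here is a condensed sketch along the lines of Coqui's train_vits.py recipe, restoring from a downloaded checkpoint so the run becomes a fine-tune. The dataset layout, paths, and batch sizes below are placeholder assumptions, not my exact settings:

```python
# Condensed sketch of a VITS fine-tune, modeled on Coqui's train_vits.py
# recipe. Paths, batch sizes, and the dataset layout are placeholder
# assumptions, not the settings used in this thread.
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "vits_finetune_output"

dataset_config = BaseDatasetConfig(
    formatter="ljspeech",              # assumes an LJSpeech-style metadata.csv
    meta_file_train="metadata.csv",
    path="/path/to/my_voice_dataset/",
)

config = VitsConfig(
    run_name="vits_finetune_my_voice",
    batch_size=16,
    eval_batch_size=8,
    epochs=1000,
    text_cleaner="english_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    output_path=output_path,
    datasets=[dataset_config],
)

# Audio processor and tokenizer are built from the config, as in the recipe.
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)

# restore_path is what turns this into a fine-tune: weights are loaded from
# the pretrained LJSpeech checkpoint before training starts.
trainer = Trainer(
    TrainerArgs(restore_path="/path/to/pretrained_vits/model_file.pth"),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```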
If you know of better ways to do any of this, I would be extremely happy to hear about your procedure for fine-tuning on a new speaker!
Existing Documentation
The procedure to train the VITS model and fine-tune it to a new speaker seems straightforward, and there is enough information out there to make it work.
I cannot say the same thing for YourTTS, which does not have as much documentation around it, since it was released so recently.
YourTTS seems to beat VITS in voice quality when fine tuning to new speakers, based on the research papers for the two:
The YourTTS paper also suggests that much less data is required for high-quality fine-tuning on new speakers, which is a great upside.
Check NanoNomad's YouTube videos - they are great guides and include Google Colabs. Very helpful if you're just trying to get something running.
My Question For You
What are your experiences with VITS and YourTTS, and do you prefer one over the other? I would be very interested in the quality you experience and what your procedure looks like for YourTTS, as it's less documented.