Replies: 1 comment
Hi @ahm3texe, the right setup depends on how you plan to use the model at inference time. There are two major cases:
Note that these two directions are not an either/or choice. You can adjust the proportion of uncensored and normalized data, to some extent, to strike a balance between them. On your three points in detail, here is some feedback based on our experience with the v1_base_zh_en model, for reference only:
Best wishes.
Hello everyone, and thank you for creating such a powerful tool.
I am fine-tuning F5-TTS on a dataset of natural, conversational Turkish speech. My goal is to produce a highly realistic and human-like voice. For context, Turkish is a highly phonetic ("what you see is what you get") language, which makes the following transcription choices particularly impactful on the model's performance.
I've encountered several critical decision points and I'm unsure about the best practices, especially for an alignment-free model like F5-TTS. My assumption is that the model will try to map whatever is in the transcript directly to the audio, making these decisions crucial. I'm hoping the community can share their experiences and help create a sort of "cookbook" or guide for these common challenges.
Here are the specific problems I'm facing:
1. Handling Conversational Disfluencies:
Filler Words ('uh', 'um'): What is the most effective approach?
- Transcribe them as spoken (e.g., "it's uhm very important").
- Replace them with a special token (e.g., "<filler>").
Non-Speech Sounds (Laughs, Coughs): Should these be transcribed with tokens like [laugh] or ignored completely?
2. Transcribing Standard Written Forms:
Spelled-Out Acronyms: For an acronym that is pronounced letter-by-letter (like "FBI" in English, or "TBMM" in my Turkish data), what's best?
- Space-separated letters: "T B M M" (Seems safer for forcing individual letter pronunciation).
- Phonetic spelling of each letter (e.g., "TE BE ME ME").
Numbers, Dates, and Currencies: What is the most robust method?
- Keep digits and symbols: "5", "1990", "50%", "100₺".
- Spell everything out as words ("five", "nineteen ninety", "fifty percent", "one hundred Turkish lira"). This seems more consistent, but is it the standard practice?
3. The Role of Punctuation:
For example, should I use a comma for every short pause and a period for a longer, sentence-ending pause, even if it's grammatically unconventional? My feeling is that this would better teach the model the rhythm of speech.
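To make the pause-based punctuation idea concrete, here is a minimal sketch. It assumes word-level timestamps are already available (for example from a forced aligner or an ASR system, neither of which is part of this thread), and the function name `punctuate_by_pauses` and both thresholds are hypothetical values chosen for illustration, not F5-TTS recommendations.

```python
def punctuate_by_pauses(words, comma_gap=0.25, period_gap=0.7):
    """Insert punctuation based on the silence between consecutive words.

    words: list of (token, start_sec, end_sec) tuples in time order.
    comma_gap / period_gap: assumed thresholds in seconds; tune on your data.
    """
    pieces = []
    for i, (token, _start, end) in enumerate(words):
        if i + 1 < len(words):
            # Silence between the end of this word and the start of the next.
            gap = words[i + 1][1] - end
            if gap >= period_gap:
                token += "."
            elif gap >= comma_gap:
                token += ","
        else:
            # Always close the utterance with a period.
            token += "."
        pieces.append(token)
    return " ".join(pieces)


if __name__ == "__main__":
    example = [
        ("merhaba", 0.0, 0.5),
        ("bugün", 0.9, 1.3),
        ("hava", 1.35, 1.6),
        ("güzel", 2.5, 2.9),
    ]
    print(punctuate_by_pauses(example))
```

Whether punctuation derived this way helps or hurts would have to be checked by listening to samples; the sketch only shows how such a rule could be applied consistently across a dataset.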
I would be incredibly grateful for any insights, rules of thumb, or experiences you could share on these topics. What has worked best for you in achieving a natural-sounding model with F5-TTS?
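For the written-form questions in section 2, the "spell everything out" option could look roughly like the sketch below. It is an illustration under stated assumptions, not an established F5-TTS preprocessing step: the names `SPELLED_ACRONYMS`, `LETTER_NAMES`, `number_to_turkish`, and `normalize` are hypothetical, the acronym list must be curated by hand (some acronyms are read as words), only integers from 0 to 9999 are handled, and percentages, currencies, and dates would need additional rules.

```python
import re

# Hypothetical, hand-curated set of acronyms that are read letter by letter.
SPELLED_ACRONYMS = {"TBMM"}

# Turkish letter names, keyed by uppercase letter to avoid I/İ casing issues.
LETTER_NAMES = {
    "A": "a", "B": "be", "C": "ce", "Ç": "çe", "D": "de", "E": "e",
    "F": "fe", "G": "ge", "H": "he", "I": "ı", "İ": "i", "J": "je",
    "K": "ke", "L": "le", "M": "me", "N": "ne", "O": "o", "Ö": "ö",
    "P": "pe", "R": "re", "S": "se", "Ş": "şe", "T": "te", "U": "u",
    "Ü": "ü", "V": "ve", "Y": "ye", "Z": "ze",
}

ONES = ["", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]


def number_to_turkish(n):
    """Spell an integer in the range 0..9999 in Turkish words."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n == 0:
        return "sıfır"
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    parts = []
    if thousands:
        # Turkish says "bin", not "bir bin", for one thousand.
        if thousands > 1:
            parts.append(ONES[thousands])
        parts.append("bin")
    if hundreds:
        # Likewise "yüz", not "bir yüz", for one hundred.
        if hundreds > 1:
            parts.append(ONES[hundreds])
        parts.append("yüz")
    if tens:
        parts.append(TENS[tens])
    if ones:
        parts.append(ONES[ones])
    return " ".join(parts)


def normalize(text):
    """Expand listed acronyms and standalone 1-4 digit numbers into words."""
    def spell_acronym(match):
        word = match.group(0)
        if word in SPELLED_ACRONYMS:
            # Unknown letters are left as-is rather than guessed.
            return " ".join(LETTER_NAMES.get(ch, ch) for ch in word)
        return word

    text = re.sub(r"\b[A-ZÇĞİÖŞÜ]{2,}\b", spell_acronym, text)
    # Longer digit runs are left untouched because they are out of range here.
    text = re.sub(
        r"(?<!\d)\d{1,4}(?!\d)",
        lambda m: number_to_turkish(int(m.group(0))),
        text,
    )
    return text


if __name__ == "__main__":
    print(normalize("TBMM 1990 yılında"))
```

Keeping the acronym list explicit is a deliberate choice in this sketch: a blanket "all caps means spell it out" rule would also split acronyms that speakers pronounce as ordinary words.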
Thank you for your time and help.