I have a 200-hour audiobook dataset from a single speaker; the speech is in Urdu.
The primary challenge is that the data has very limited emotional range:
The narration style is mostly monotone
Only contains basic question and surprise intonations
Lacks diverse emotional expressions (happy, sad, angry, etc.)
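For concreteness, here is roughly how I have organized the data for training. This is just my own preparation step: the segment layout, file names, and CSV manifest format below are assumptions for illustration, not a format Fish Speech requires.

```python
import csv
from pathlib import Path

# Assumed layout: one WAV per segmented utterance plus a matching .txt transcript.
# This only illustrates my own data preparation; it is not Fish Speech's expected format.
DATA_DIR = Path("urdu_audiobook/segments")

def build_manifest(out_path: str = "manifest.csv") -> None:
    """Collect (audio_path, transcript) pairs into a simple CSV manifest."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "text"])
        for wav in sorted(DATA_DIR.glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if txt.exists():
                writer.writerow([str(wav), txt.read_text(encoding="utf-8").strip()])

if __name__ == "__main__":
    build_manifest()
```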
My goal is to create a high-quality TTS model that can:
Accurately reproduce the speaker's voice characteristics
Generate speech with a full range of emotional expressions beyond what's present in the training data
Question 1: Best approach for high-quality model with emotions
What is the recommended strategy to create an emotionally expressive model when the training data lacks emotional diversity? Should I focus on:
Training solely on my current dataset
Adding synthetic emotional data (a rough sketch of what I mean follows this list)
Other approaches?
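To make the synthetic-data option concrete, this is the kind of thing I have in mind: use the base model (or any expressive TTS) to render emotional Urdu sentences and mix a small fraction of them into the real data. The `synthesize` function and the 10% ratio below are placeholders I made up for illustration, not a real Fish Speech API or a recommended setting.

```python
import random

# Emotions I would like the final model to cover but which are absent from my data.
EMOTIONS = ["happy", "sad", "angry", "excited"]

def synthesize(text: str, emotion: str) -> str:
    """Placeholder for an emotional TTS call; returns a made-up audio path."""
    return f"synthetic/{emotion}/{abs(hash(text)) % 10_000}.wav"

def add_synthetic_samples(real_rows, emotional_texts, synthetic_ratio=0.1, seed=0):
    """Append synthetic emotional samples amounting to ~synthetic_ratio of the real data."""
    rng = random.Random(seed)
    n_synth = min(int(len(real_rows) * synthetic_ratio), len(emotional_texts))
    rows = list(real_rows)
    for text in rng.sample(emotional_texts, n_synth):
        emotion = rng.choice(EMOTIONS)
        rows.append({"audio_path": synthesize(text, emotion), "text": text, "emotion": emotion})
    return rows
```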
Question 2: Leveraging base model's emotional capabilities
Can I fine-tune the model on my dataset while leveraging the emotional capabilities already present in the base model?
If so, how can I effectively utilize the base model's knowledge of emotions that weren't in my training data?
Are there specific fine-tuning techniques recommended for preserving emotional expressiveness?
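By "fine-tuning techniques for preserving expressiveness" I mean something like parameter-efficient adaptation: freeze most of the base model (or train low-rank adapters) so its learned emotional behaviour is not overwritten while it adapts to my speaker. The sketch below is plain PyTorch, and the keyword names are guesses; I do not know the actual module names inside the Fish Speech checkpoint.

```python
from torch import nn

def freeze_for_speaker_adaptation(model: nn.Module,
                                  trainable_keywords=("adapter", "lora", "speaker")) -> None:
    """Freeze every parameter except those whose names contain one of the keywords.

    The keywords are guesses for illustration; the real names would have to be
    checked against model.named_parameters() for the actual Fish Speech model.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} parameters")
```

Is something along these lines (or LoRA-style adapters, if the training pipeline supports them) the recommended way to keep the base model's expressiveness?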
Question 3: Augmenting with external datasets
Would augmenting my dataset with external emotional speech data (like Mozilla Common Voice) improve the model's emotional capabilities? If so, what is the recommended approach for combining datasets from different languages? (A rough sketch of what I have in mind follows this list.)
How much external data would be needed to see meaningful improvement?
Are there any risks or downsides to this approach?
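To clarify what I mean by "combining datasets": I would tag every sample with its language (and an emotion label where available) and interleave my Urdu single-speaker data with a small slice of external emotional speech. The sampling fraction and row format below are assumptions on my part, not values recommended anywhere in the Fish Speech docs.

```python
import random

def mix_datasets(urdu_rows, external_rows, external_fraction=0.2, seed=0):
    """Interleave the Urdu single-speaker data with a slice of external emotional speech.

    Each row is assumed to be a dict with at least audio_path, text, and a language
    tag; external rows also carry an emotion label. The 20% fraction is a guess,
    not a recommendation.
    """
    rng = random.Random(seed)
    n_ext = int(len(urdu_rows) * external_fraction / (1.0 - external_fraction))
    mixed = list(urdu_rows) + rng.sample(external_rows, min(n_ext, len(external_rows)))
    rng.shuffle(mixed)
    return mixed
```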
Additional context or comments
I'm particularly interested in understanding whether the cross-lingual emotion transfer capabilities of Fish Speech can be leveraged here.
The base model's ability to generate emotions in various languages is impressive, and I'm hoping to preserve this capability while adapting to my specific speaker.
Can you help us with this feature?
I am interested in contributing to this feature.
This revised title and issue focus on the universal technical problem of "creating an emotional model from non-emotional data," which is more likely to attract attention and useful responses from both the developers and the wider community, regardless of the specific language being worked with.