I have a 200-hour audiobook dataset from a single speaker; the speech is in Urdu.
The primary challenge is that the data has very limited emotional range:
The narration style is mostly monotone
Only contains basic question and surprise intonations
Lacks diverse emotional expressions (happy, sad, angry, etc.)
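For concreteness, here is roughly how I have organized the data for training. This is just my own preparation step: the segment layout, file names, and CSV manifest format below are assumptions for illustration, not a format Fish Speech requires.

```python
import csv
from pathlib import Path

# Assumed layout: one WAV per segmented utterance plus a matching .txt transcript.
# This only illustrates my own data preparation; it is not Fish Speech's expected format.
DATA_DIR = Path("urdu_audiobook/segments")

def build_manifest(out_path: str = "manifest.csv") -> None:
    """Collect (audio_path, transcript) pairs into a simple CSV manifest."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "text"])
        for wav in sorted(DATA_DIR.glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if txt.exists():
                writer.writerow([str(wav), txt.read_text(encoding="utf-8").strip()])

if __name__ == "__main__":
    build_manifest()
```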
My goal is to create a high-quality TTS model that can:
Accurately reproduce the speaker's voice characteristics
Generate speech with a full range of emotional expressions beyond what's present in the training data
Question 1: Best approach for high-quality model with emotions
What is the recommended strategy to create an emotionally expressive model when the training data lacks emotional diversity? Should I focus on:
Training solely on my current dataset
Adding synthetic emotional data (a rough sketch of what I mean follows this list)
Other approaches?
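To make the synthetic-data option concrete, this is the kind of thing I have in mind: use the base model (or any expressive TTS) to render emotional Urdu sentences and mix a small fraction of them into the real data. The `synthesize` function and the 10% ratio below are placeholders I made up for illustration, not a real Fish Speech API or a recommended setting.

```python
import random

# Emotions I would like the final model to cover but which are absent from my data.
EMOTIONS = ["happy", "sad", "angry", "excited"]

def synthesize(text: str, emotion: str) -> str:
    """Placeholder for an emotional TTS call; returns a made-up audio path."""
    return f"synthetic/{emotion}/{abs(hash(text)) % 10_000}.wav"

def add_synthetic_samples(real_rows, emotional_texts, synthetic_ratio=0.1, seed=0):
    """Append synthetic emotional samples amounting to ~synthetic_ratio of the real data."""
    rng = random.Random(seed)
    n_synth = min(int(len(real_rows) * synthetic_ratio), len(emotional_texts))
    rows = list(real_rows)
    for text in rng.sample(emotional_texts, n_synth):
        emotion = rng.choice(EMOTIONS)
        rows.append({"audio_path": synthesize(text, emotion), "text": text, "emotion": emotion})
    return rows
```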
Question 2: Leveraging base model's emotional capabilities
Can I fine-tune the model on my dataset while leveraging the emotional capabilities already present in the base model?
If so, how can I effectively utilize the base model's knowledge of emotions that weren't in my training data?
Are there specific fine-tuning techniques recommended for preserving emotional expressiveness?
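By "fine-tuning techniques for preserving expressiveness" I mean something like parameter-efficient adaptation: freeze most of the base model (or train low-rank adapters) so its learned emotional behaviour is not overwritten while it adapts to my speaker. The sketch below is plain PyTorch, and the keyword names are guesses; I do not know the actual module names inside the Fish Speech checkpoint.

```python
from torch import nn

def freeze_for_speaker_adaptation(model: nn.Module,
                                  trainable_keywords=("adapter", "lora", "speaker")) -> None:
    """Freeze every parameter except those whose names contain one of the keywords.

    The keywords are guesses for illustration; the real names would have to be
    checked against model.named_parameters() for the actual Fish Speech model.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} parameters")
```

Is something along these lines (or LoRA-style adapters, if the training pipeline supports them) the recommended way to keep the base model's expressiveness?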
Question 3: Augmenting with external datasets
Would augmenting my dataset with external emotional speech data (like Mozilla Common Voice) improve the model's emotional capabilities? If so, what is the recommended approach for combining datasets from different languages? (A rough sketch of what I have in mind follows this list.)
How much external data would be needed to see meaningful improvement?
Are there any risks or downsides to this approach?
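To clarify what I mean by "combining datasets": I would tag every sample with its language (and an emotion label where available) and interleave my Urdu single-speaker data with a small slice of external emotional speech. The sampling fraction and row format below are assumptions on my part, not values recommended anywhere in the Fish Speech docs.

```python
import random

def mix_datasets(urdu_rows, external_rows, external_fraction=0.2, seed=0):
    """Interleave the Urdu single-speaker data with a slice of external emotional speech.

    Each row is assumed to be a dict with at least audio_path, text, and a language
    tag; external rows also carry an emotion label. The 20% fraction is a guess,
    not a recommendation.
    """
    rng = random.Random(seed)
    n_ext = int(len(urdu_rows) * external_fraction / (1.0 - external_fraction))
    mixed = list(urdu_rows) + rng.sample(external_rows, min(n_ext, len(external_rows)))
    rng.shuffle(mixed)
    return mixed
```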
Additional context or comments
I'm particularly interested in understanding whether the cross-lingual emotion transfer capabilities of Fish Speech can be leveraged here.
The base model's ability to generate emotions in various languages is impressive, and I'm hoping to preserve this capability while adapting to my specific speaker.
Can you help us with this feature?
I am interested in contributing to this feature.
This revised title and issue focus on the universal technical problem of "creating an emotional model from non-emotional data," which is more likely to attract attention and useful responses from both the developers and the wider community, regardless of the specific language being worked with.