Checks
Environment Details
Python 3.10.10/3.12
Ubuntu 22.04
omnivoice - 0.1.5
Steps to Reproduce
When generating the same text repeatedly, even with the default voices, 2/10 generations skipped the [laughter] tag.
'''
किताब खोलते ही नींद आ जाती है [laughter]
'''
When the same text is generated with voice cloning, the [laughter] tag is skipped in 6/10 generations; the laughter is rendered 4 times or fewer. The reference audio was an 8-second clip with the corresponding text below.
Reference_audio
monica_lal_8sec.wav
Reference_text
"कूड़े-कबाड़ की भी एक कहानी है। पर यह कहानी सुनाने से पहले एक कहावत दोहरानी पड़ेगी। अंग्रेज़ी में कहते हैं..."
I have tried different voices as well as different reference texts. I have also tried reference audio that itself contains laughter, with [laughter] in the matching reference text, and the results are similar.
Performance on other tags is slightly worse. I have tried many different text combinations, and the results are generally the same.
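For anyone trying to reproduce the counts above, here is a minimal sketch of how the skip rate could be tallied. It deliberately does not call the omnivoice API: `generate` and `has_laughter` are hypothetical callables that you would supply yourself (the first wrapping whatever synthesis call you use, the second being a manual listen or any laughter detector), and the stubs at the bottom exist only so the sketch runs on its own.

```python
from typing import Any, Callable


def count_tag_skips(
    generate: Callable[[str, int], Any],
    has_laughter: Callable[[Any], bool],
    text: str,
    runs: int = 10,
) -> int:
    """Generate `text` `runs` times and count outputs with no laughter.

    `generate(text, seed)` and `has_laughter(audio)` are hypothetical
    hooks supplied by the caller; they are NOT part of omnivoice.
    """
    skipped = 0
    for seed in range(runs):
        audio = generate(text, seed)  # one generation per seed
        if not has_laughter(audio):
            skipped += 1
    return skipped


# Stand-in stubs so the sketch is runnable; replace with real calls.
def fake_generate(text: str, seed: int) -> dict:
    # Pretend the tag is dropped on seeds 3 and 7.
    return {"text": text, "laughter": seed not in (3, 7)}


def fake_has_laughter(audio: dict) -> bool:
    return audio["laughter"]


if __name__ == "__main__":
    text = "किताब खोलते ही नींद आ जाती है [laughter]"
    skipped = count_tag_skips(fake_generate, fake_has_laughter, text)
    print(f"{skipped}/10 generations skipped the tag")
```

Using the run index as the seed is only an assumption about how the runs would be made repeatable; with 10 runs per condition the resulting rates are rough estimates.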
To fix this, do I need to fine-tune the main model further? If so, what kind of data would I need to address the emotion tag issues, and would it need to include the original data as well?
Any guidance on this would be appreciated! Thanks in advance.
✔️ Expected Behavior
The [laughter] tag should be rendered consistently across all generations.
❌ Actual Behavior
Laughter is missing in 2/10 generations with the default voices and in 6/10 generations with voice cloning.