[Not for merge] Add Emilia Training Recipe for Llasa (cosyvoice2 token) #1887
egs/emilia/TTS/README.md
Outdated
See https://arxiv.org/pdf/2407.05361.

> [!CAUTION]
> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
I think the terms & conditions may have been taken from another framework and the name changed?
It may be safest to just delete this (assuming we decide it makes sense to merge the PR overall, which we can discuss separately).
Yeah, I copied it from the LibriTTS recipe here: https://github.com/k2-fsa/icefall/tree/master/egs/libritts/TTS#readme. Deleted now.
This seems like good work, and it's nice that you want to include it in our collection of recipes.
Thank you for your feedback! Indeed, the structure of this PR differs from other recipes in Icefall. Initially, I planned to implement it using Lhotse and Icefall training loops. However, I found it simpler to use the Hugging Face dataset and trainer since it's a language model token prediction task. I have added a [Not for merge] tag so that people can still reference the results in the PR.
Inspired by Llasa, this PR enables continued pretraining of the Qwen2 LLM for the CosyVoice2 semantic token prediction task.
The predicted semantic tokens can be used to generate audio with either the CosyVoice2 pretrained U-Net model or the DIT model PR.
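
For readers unfamiliar with the setup, here is a minimal sketch of how a text-plus-semantic-token training example could be laid out for a causal LM. It is not taken from this PR: the vocabulary sizes, the helper names (`speech_to_vocab_id`, `build_example`), and the choice to mask the text prompt out of the loss are all assumptions made for illustration.

```python
from typing import Dict, List

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100

# Hypothetical sizes, for illustration only; check the real tokenizer and codec.
TEXT_VOCAB_SIZE = 151_936   # assumed size of the base LLM vocabulary
NUM_SPEECH_TOKENS = 6_561   # assumed size of the semantic-token codebook


def speech_to_vocab_id(code: int) -> int:
    """Map a semantic-token code to an id appended after the text vocabulary."""
    if not 0 <= code < NUM_SPEECH_TOKENS:
        raise ValueError(f"speech code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_example(
    text_ids: List[int], speech_codes: List[int], eos_id: int
) -> Dict[str, List[int]]:
    """Concatenate text ids and shifted speech ids into one LM sequence.

    Assumption: the text prompt is masked out of the loss, so the model is
    trained only to predict the speech tokens and the end-of-sequence id.
    """
    speech_ids = [speech_to_vocab_id(c) for c in speech_codes]
    input_ids = list(text_ids) + speech_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(text_ids) + speech_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


example = build_example(text_ids=[10, 11], speech_codes=[0, 5], eos_id=2)
print(example["input_ids"])
print(example["labels"])
```

Dictionaries of this shape (`input_ids` plus `labels`) are what a causal-LM trainer typically consumes; the model's embedding table would also need to be enlarged to cover the added speech-token ids.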