[Not for merge] Add Emilia Training Recipe for Llasa (cosyvoice2 token) #1887

Open
wants to merge 8 commits into
base: master

yuekaizhang
Collaborator

@yuekaizhang yuekaizhang commented Mar 3, 2025

Inspired by Llasa, this PR enables continued pretraining of the Qwen2 LLM for the CosyVoice2 semantic token prediction task.

The predicted semantic tokens can be used to generate audio with either the CosyVoice2 pretrained U-Net model or the DIT model PR.

| LLM Model | Flow matching Model | Seed-TTS test_zh CER | Comment |
| --- | --- | --- | --- |
| pretrained cosyvoice2 0.5B | f5-tts-small (wenetspeech4tts 7k hours) | 1.79% (16 steps) | See PR |
| llasa_cosyvoice2_token 0.5B (Emilia_ZH 50k hours) | f5-tts-small (wenetspeech4tts 7k hours) | 1.81% (16 steps) | |
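
For readers unfamiliar with the metric in the table, here is a minimal sketch of how a character error rate (CER) can be computed from reference texts and ASR transcripts of the synthesized audio. It is an illustration only: the function names `edit_distance` and `cer` are hypothetical, and the evaluation pipeline that produced the numbers above (ASR model, text normalization) is not shown in this PR conversation and is not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, compared character by character."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER: total edit distance divided by total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


# Toy example (made-up sentences, not from the Seed-TTS test set):
# one substituted character out of six reference characters.
print(cer(["今天天气很好"], ["今天天汽很好"]))
```

A corpus-level ratio (summing errors before dividing) is assumed here; averaging per-utterance CER would give a different number.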

@yuekaizhang yuekaizhang requested a review from JinZr March 3, 2025 06:36
See https://arxiv.org/pdf/2407.05361.

> [!CAUTION]
> The next-gen Kaldi framework provides tools and models for generating high-quality, synthetic speech (Text-to-Speech, TTS).
Collaborator

I think the terms & conditions may have been taken from another framework and the name changed?
It may be safest to just delete this. (Assuming we decide it makes sense to merge the PR overall, which we can discuss separately.)

Collaborator Author

Yeah, I copied it from the libritts recipe here: https://github.com/k2-fsa/icefall/tree/master/egs/libritts/TTS#readme. Deleted now.

@danpovey
Collaborator

danpovey commented Mar 3, 2025

This seems like good work, and it's nice that you want to include it in our collection of recipes.
I also find it quite interesting. But I'm trying to come up with a good justification why it should be included here,
other than the fact that we are also interested in the TTS task right now. I.e. are we OK with icefall being a collection of
recipes even in cases where they have very little in common?
(BTW I notice that the instructions direct the user to install the k2 package, but I doubt this is actually needed).
Regardless of whether we merge it (and I'm open to input from our team members and others on this issue), I'm happy to have the pull request left here as an accessible place for discussion about this recipe and so that we can easily find it.

@yuekaizhang
Collaborator Author

> This seems like good work, and it's nice that you want to include it in our collection of recipes. I also find it quite interesting. But I'm trying to come up with a good justification why it should be included here, other than the fact that we are also interested in the TTS task right now. I.e. are we OK with icefall being a collection of recipes even in cases where they have very little in common? (BTW I notice that the instructions direct the user to install the k2 package, but I doubt this is actually needed). Regardless of whether we merge it (and I'm open to input from our team members and others on this issue), I'm happy to have the pull request left here as an accessible place for discussion about this recipe and so that we can easily find it.

Thank you for your feedback! Indeed, the structure of this PR differs from other recipes in Icefall. Initially, I planned to implement it using Lhotse and Icefall training loops. However, I found it simpler to use the Hugging Face dataset and trainer since it's a language model token prediction task.

I have added a [Not for merge] tag so that people can still reference the results in the PR.
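
To illustrate why a language-model-style dataset and trainer fit this task, here is a minimal sketch of how one text/speech-token pair could be flattened into a single next-token-prediction example. It is not taken from the PR: `build_example`, the id offsets, and the start/end marker ids are all hypothetical, and the actual Llasa or recipe format may differ. The only external convention relied on is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def build_example(text_ids, speech_ids, text_vocab_size, bos_speech_id, eos_speech_id):
    """Concatenate text tokens and offset speech tokens into one training sequence.

    All arguments are illustrative assumptions, not values from the PR.
    """
    # Shift speech token ids past the text vocabulary so the two id ranges do not collide.
    shifted = [text_vocab_size + s for s in speech_ids]
    input_ids = list(text_ids) + [bos_speech_id] + shifted + [eos_speech_id]
    # Mask the text prompt and the start marker; only speech tokens and the
    # end marker contribute to the loss.
    labels = [IGNORE_INDEX] * (len(text_ids) + 1) + shifted + [eos_speech_id]
    return {"input_ids": input_ids, "labels": labels}


# Toy numbers: a 100-entry text vocabulary, two speech tokens, made-up marker ids.
example = build_example(
    text_ids=[5, 6],
    speech_ids=[3, 7],
    text_vocab_size=100,
    bos_speech_id=200,
    eos_speech_id=201,
)
print(example)
```

Once examples have this shape, the problem looks like ordinary causal language modeling over an extended vocabulary, which is why a generic LM trainer can be reused.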

@yuekaizhang yuekaizhang changed the title from "Add Emilia Training Recipe for Llasa (cosyvoice2 token)" to "[Not for merge] Add Emilia Training Recipe for Llasa (cosyvoice2 token)" Mar 4, 2025
@yuekaizhang yuekaizhang removed the request for review from JinZr March 4, 2025 01:19