Commit 2ba665a

yuekaizhang authored
Add F5-TTS with semantic token training results (k2-fsa#1880)
* add cosy token
* update inference code
* add extract cosy token
* update results
* add requirements.txt
* update readme

Co-authored-by: yuekaiz <[email protected]>
1 parent da597ad commit 2ba665a

File tree: 8 files changed, +770 -28 lines changed

egs/wenetspeech4tts/TTS/README.md

Lines changed: 52 additions & 0 deletions
@@ -1,3 +1,10 @@

# Results

| Model | Seed-TTS test_zh CER | Comment |
|---|---|---|
| [vall-e](./valle) | 4.33% | ~150M |
| [f5-tts](./f5-tts) | 3.02% (16 steps) / 2.42% (32 steps) | F5-TTS-Small config, ~155M |
| [f5-tts-semantic-token](./f5-tts) | 1.79% (16 steps) | Uses pretrained CosyVoice2 semantic tokens as inputs rather than text tokens, ~155M |

# Introduction

[**WenetSpeech4TTS**](https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS) is a multi-domain **Mandarin** corpus derived from the open-sourced [WenetSpeech](https://arxiv.org/abs/2110.03370) dataset.
@@ -131,6 +138,51 @@ accelerate launch f5-tts/infer.py --nfe 16 --model-path $model_path --manifest-f
bash local/compute_wer.sh $output_dir $manifest
```

# F5-TTS-Semantic-Token

./f5-tts contains the code for training F5-TTS-Semantic-Token. We replace the text tokens in F5-TTS with pretrained CosyVoice2 semantic tokens; during inference, the pretrained CosyVoice2 LLM predicts the semantic tokens for the target audio. We observed that this approach converges faster and improves prosody modeling.
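
As a mental model of the change (a minimal sketch, not the icefall implementation): the conditioning fed to the flow-matching decoder is an embedding lookup over discrete IDs, so swapping text tokens for semantic tokens only changes the vocabulary behind that lookup. The vocabulary size below is an illustrative assumption; 768 matches `--decoder-dim` in the training command below.

```
import torch
import torch.nn as nn

# Minimal sketch, not the icefall implementation: the decoder conditions on
# embedded discrete IDs, so CosyVoice2 semantic tokens can stand in for text
# tokens at this point. VOCAB_SIZE is an illustrative assumption.
VOCAB_SIZE = 6561
DECODER_DIM = 768  # matches --decoder-dim in the training command

embed = nn.Embedding(VOCAB_SIZE, DECODER_DIM)
semantic_tokens = torch.randint(0, VOCAB_SIZE, (2, 50))  # dummy (batch, time) IDs
cond = embed(semantic_tokens)                            # (2, 50, 768) conditioning
print(cond.shape)
```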

Generated samples and training logs for the ~7k-hour WenetSpeech4TTS Basic subset can be found [here](https://huggingface.co/yuekai/f5-tts-semantic-token-small-wenetspeech4tts-basic/tree/main).

Preparation:

```
# extract cosyvoice2 semantic tokens
bash prepare.sh --stage 5 --stop_stage 7
```
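
Conceptually, these stages attach one discrete token sequence per utterance to the data manifests. A hypothetical sketch of that bookkeeping, where `extract_tokens` stands in for the CosyVoice2 speech tokenizer that prepare.sh actually invokes:

```
import json

def extract_tokens(wav_path: str) -> list[int]:
    # Hypothetical stand-in for the CosyVoice2 speech tokenizer that
    # prepare.sh invokes; it maps audio to a sequence of discrete IDs.
    return [101, 2049, 873]  # dummy IDs

utts = {"utt_0001": "data/wavs/utt_0001.wav"}  # hypothetical utterance -> wav map
with open("cosy_tokens.jsonl", "w") as f:
    for utt_id, wav_path in utts.items():
        f.write(json.dumps({"utt": utt_id, "tokens": extract_tokens(wav_path)}) + "\n")
```
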
The training command is given below:

```
# docker: ghcr.io/swivid/f5-tts:main
# pip install k2==1.24.4.dev20241030+cuda12.4.torch2.4.0 -f https://k2-fsa.github.io/k2/cuda.html
# pip install kaldialign lhotse tensorboard bigvganinference sentencepiece

world_size=8
exp_dir=exp/f5-tts-semantic-token-small
python3 f5-tts/train.py --max-duration 700 --filter-min-duration 0.5 --filter-max-duration 20 \
  --num-buckets 6 --dtype "bfloat16" --save-every-n 5000 --valid-interval 10000 \
  --base-lr 1e-4 --warmup-steps 20000 --average-period 0 \
  --num-epochs 10 --start-epoch 1 --start-batch 0 \
  --num-decoder-layers 18 --nhead 12 --decoder-dim 768 \
  --exp-dir ${exp_dir} --world-size ${world_size} \
  --decay-steps 600000 --prefix wenetspeech4tts_cosy_token --use-cosyvoice-semantic-token True
```
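
A note on the data flags: in icefall, `--max-duration` is the total seconds of audio per batch rather than a sentence count, and lhotse groups utterances of similar length into buckets. A minimal sketch of the equivalent lhotse setup (the manifest path is an assumption, not the recipe's actual file):

```
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")  # assumed path
cuts = cuts.filter(lambda c: 0.5 <= c.duration <= 20)  # --filter-{min,max}-duration
sampler = DynamicBucketingSampler(
    cuts, max_duration=700, num_buckets=6, shuffle=True  # seconds per batch, 6 buckets
)
```
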
To run inference with the F5-TTS-Small-Semantic-Token model trained on WenetSpeech4TTS in icefall, use:

```
huggingface-cli login
huggingface-cli download --local-dir ${exp_dir} yuekai/f5-tts-semantic-token-small-wenetspeech4tts-basic
huggingface-cli download nvidia/bigvgan_v2_24khz_100band_256x --local-dir bigvgan_v2_24khz_100band_256x

split=test_zh
model_path=${exp_dir}/epoch-10-avg-5.pt

accelerate launch f5-tts/infer.py --nfe 16 --model-path $model_path --split-name $split --output-dir $output_dir --decoder-dim 768 --nhead 12 --num-decoder-layers 18 --use-cosyvoice-semantic-token True
bash local/compute_wer.sh $output_dir $manifest
```
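
local/compute_wer.sh scores transcripts of the generated audio against the reference text in the manifest. A minimal sketch of the metric itself (not that script), using the kaldialign package from the pip dependencies above and treating each Mandarin character as a token:

```
from kaldialign import edit_distance

ref = list("今天天气很好")  # reference characters
hyp = list("今天天气真好")  # hypothesis characters
stats = edit_distance(ref, hyp)  # {'ins': 0, 'del': 0, 'sub': 1, 'total': 1}
cer = stats["total"] / len(ref)
print(f"CER = {cer:.2%}")  # 16.67%
```
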
# Credits

- [VALL-E](https://github.com/lifeiteng/vall-e)
- [F5-TTS](https://github.com/SWivid/F5-TTS)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
