Description
I’m trying to train CosyVoice2’s Flow-Matching model from scratch using the Dual Codec code, since I need its streaming inference capability.
⚙️ Training Setup
Hardware: 8 × A800 GPUs
Training time:
1e-5 LR → trained for 3 days; training is stable but the model is still underfitted (poor audio quality, inconsistent timbre).
1e-4 LR → trained for 1 day; gradients quickly explode and training does not converge.

Optimizer: AdamW
Dataset: internal speech dataset (not LibriTTS; roughly 2,000 hours)
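For reference, here is a minimal sketch of the kind of AdamW setup I could try next: a peak LR between the two values above, with linear warmup followed by inverse-sqrt decay. The peak LR, warmup length, and the `Linear` stand-in for the flow network are all assumptions, not CosyVoice2 code:

```python
import torch

# Hypothetical sketch: AdamW with linear warmup then inverse-sqrt decay,
# peaking between the two LRs tried (1e-5 underfits, 1e-4 diverges).
model = torch.nn.Linear(80, 80)  # stand-in for the flow-matching network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

warmup_steps = 10_000  # assumed value; tune for your dataset size

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return (step + 1) / warmup_steps       # linear warmup to peak LR
    return (warmup_steps / (step + 1)) ** 0.5  # inverse-sqrt decay after peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```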
📉 Observations
The 1e-5 model produces overly smooth, blurry results, while the 1e-4 model diverges rapidly.
Below are mel-spectrogram comparisons from the 1e-5 model:
Predicted mel:
Ground-truth mel:
❓ Question
Has anyone managed to successfully train the Flow-Matching model from scratch (not fine-tuning pretrained weights)?
Any advice or experience on:
Choosing an appropriate learning rate or LR schedule
Using EMA, gradient clipping, or warmup strategies
Adjusting flow noise schedule or loss balancing to stabilize early training
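In case it helps frame the discussion, this is a rough sketch of the gradient-clipping-plus-EMA combination I have in mind. The decay and clip values are assumptions, and the `Linear` module is just a placeholder for the flow network:

```python
import copy
import torch

# Hypothetical stabilization sketch: gradient clipping plus an EMA copy of
# the weights; values and names here are assumptions, not CosyVoice2 code.
model = torch.nn.Linear(80, 80)  # stand-in for the flow-matching network
ema_model = copy.deepcopy(model)
for p in ema_model.parameters():
    p.requires_grad_(False)

EMA_DECAY = 0.999  # assumed decay; closer to 1.0 = smoother, slower tracking
CLIP_NORM = 1.0    # assumed max global gradient norm

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(x: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    # clip before the optimizer step to tame exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
    # update the EMA copy after every step; evaluate with ema_model
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(EMA_DECAY).add_(p, alpha=1.0 - EMA_DECAY)
    return loss.item()
```

The idea is that clipping bounds the worst-case update (addressing the 1e-4 explosions), while the EMA weights average out the noisy early trajectory, which often helps the smeared outputs seen at 1e-5.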
Any hints would be greatly appreciated!