
Question about Flow-matching training from scratch #1625

@CharlesNii


I’m trying to train CosyVoice2’s Flow-Matching model from scratch using the Dual Codec's code, since I need its streaming inference capability.
⚙️ Training Setup
Hardware: 8 × A800 GPUs
Training time:
LR 1e-5 → trained for 3 days; training is stable but the model is still underfitted (poor audio quality, inconsistent timbre).

Image

LR 1e-4 → trained for 1 day; gradients exploded quickly and training did not converge.
Image
Optimizer: AdamW
Dataset: internal speech dataset of about 2000 hours (not LibriTTS)
📉 Observations
The model trained at 1e-5 produces overly smooth, unclear output, while the run at 1e-4 diverges rapidly.
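
For reference, the kind of gradient clipping I have in mind for the diverging 1e-4 run is clipping by global norm. Below is a minimal pure-Python sketch of the idea, not code from CosyVoice or Dual Codec; the function name `clip_by_global_norm`, the toy gradients, and the `max_norm=1.0` threshold are all illustrative assumptions. In a PyTorch training loop the same operation is what `torch.nn.utils.clip_grad_norm_` performs on the model parameters.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> float:
    """Rescale all gradients in place so their joint L2 norm is at most max_norm.

    Returns the norm measured before clipping, which is worth logging:
    a sudden spike in this value is an early sign of divergence.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm


# Toy gradients for two hypothetical parameter tensors.
grads = [[3.0, 4.0], [12.0]]
pre_clip_norm = clip_by_global_norm(grads, max_norm=1.0)  # hypothetical threshold
post_clip_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
```

Logging the pre-clip norm at every step would also show whether the explosion is gradual or caused by a few outlier batches.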

Below is a mel-spectrogram comparison for the 1e-5 model:

Predicted mel:

Image

Ground-truth mel:

Image

❓ Question
Has anyone managed to successfully train the Flow-Matching model from scratch (not fine-tuning pretrained weights)?
Any advice or experience on:

Choosing an appropriate learning rate or LR schedule

Using EMA, gradient clipping, or warmup strategies

Adjusting the flow noise schedule or loss balancing to stabilize early training
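
To make the warmup and EMA points concrete, here is a minimal sketch of what I mean, written in plain Python so it does not depend on any training framework. It is not taken from the CosyVoice or Dual Codec code; the names `lr_multiplier` and `ema_update`, and the values for `warmup_steps`, `total_steps`, `min_ratio`, and `decay`, are hypothetical. The multiplier is meant to scale a peak learning rate, which is the contract that `torch.optim.lr_scheduler.LambdaLR` expects from its `lr_lambda` argument.

```python
import math
from typing import List


def lr_multiplier(step: int, warmup_steps: int, total_steps: int,
                  min_ratio: float = 0.1) -> float:
    """Linear warmup to 1.0, then cosine decay down to min_ratio."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine


def ema_update(ema: List[float], params: List[float], decay: float) -> None:
    """In-place exponential moving average of the parameters."""
    for i, p in enumerate(params):
        ema[i] = decay * ema[i] + (1.0 - decay) * p


# Hypothetical numbers: peak LR between the two values tried so far.
peak_lr = 5e-5
schedule = [peak_lr * lr_multiplier(s, warmup_steps=10, total_steps=100)
            for s in range(100)]

ema_weights = [0.0]
ema_update(ema_weights, params=[1.0], decay=0.9)
```

With a schedule like this, a peak LR closer to 1e-4 might become usable because the early steps run at a small fraction of it, and the EMA weights would be the ones used for inference.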

Any hints would be greatly appreciated!
