
Multi-speaker TTS with ESPnet mel-spectrograms #209

@migi-gon

Description


Hello!

I have been following the system described in the paper by Y. Jia et al. (Link). So far, I have finished training the synthesizer module using the ESPnet Tacotron 2 multi-speaker TTS scripts provided here: Link. The trained model produces intelligible, albeit robotic, speech when the mel-spectrograms are inverted with Griffin-Lim.
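For reference, a minimal Griffin-Lim inversion sketch using librosa (not the ESPnet recipe's own scripts; the file path and feature parameters below are placeholders and must match the synthesizer's actual config) looks like this:

```python
# Minimal sketch: invert a synthesized mel-spectrogram to audio via
# Griffin-Lim. Assumes the mel is stored as a (n_mels, frames) numpy
# array; sr, n_fft, hop_length are placeholders for the recipe's values.
import numpy as np
import librosa
import soundfile as sf

mel = np.load("synthesized_mel.npy")  # hypothetical path

# librosa's mel_to_audio runs Griffin-Lim internally to estimate phase.
audio = librosa.feature.inverse.mel_to_audio(
    librosa.db_to_power(mel),  # assumes dB-scaled mel; drop if linear power
    sr=22050,
    n_fft=1024,
    hop_length=256,
    n_iter=60,  # more Griffin-Lim iterations slightly reduce artifacts
)
sf.write("gl_output.wav", audio, 22050)
```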

Now, to improve the synthesized outputs, I decided to train a WaveNet vocoder on the synthesized mel-spectrograms (the mel-specs produced for the training set), as described in the paper. I trained the model for 1000k steps, but the output was garbled speech. I then extended training (without changing the hparams) to 1600k steps, with still no improvement. Sample synthesized audio files (and the hparams file) can be found here: Link.
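One quick sanity check worth running (file paths below are hypothetical): compare the statistics of the mels the synthesizer produces against the mels the vocoder's own preprocessing would extract, since a normalization or feature-parameter mismatch is a common cause of garbled vocoder output regardless of training length.

```python
# Sanity check, not a fix: if the conditioning features at synthesis
# time differ in scale/normalization from what the vocoder saw during
# training, it produces noise. Paths and (n_mels, frames) layout are
# assumptions.
import numpy as np

espnet_mel = np.load("espnet_train_mel.npy")        # hypothetical: mel fed to the vocoder
vocoder_mel = np.load("vocoder_extracted_mel.npy")  # hypothetical: mel from the vocoder's preprocessing

for name, mel in [("espnet", espnet_mel), ("vocoder", vocoder_mel)]:
    print(f"{name}: shape={mel.shape}, min={mel.min():.3f}, "
          f"max={mel.max():.3f}, mean={mel.mean():.3f}")

# If one is roughly in [0, 1] while the other is mean-variance
# normalized (or uses a different log base), the vocoder is being
# conditioned out of distribution. Both pipelines must agree on
# n_fft, hop_length, n_mels, fmin/fmax, and normalization.
```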

Any help or insights on how I could continue would be very much appreciated. Thanks!
