
Multi-speaker TTS with ESPnet mel-spectrograms #209

@migi-gon

Description


Hello!

I have been following the system described in the paper by Y. Jia et al. (Link). So far, I have finished training the synthesizer module using the ESPnet Tacotron 2 multi-speaker TTS scripts provided here: Link. The trained model produces intelligible, albeit robotic, speech when the mel-spectrograms are inverted with Griffin-Lim.
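For reference, a minimal Griffin-Lim inversion sketch using librosa (not the ESPnet recipe's own scripts; the file path and feature parameters below are placeholders and must match the synthesizer's actual config) looks like this:

```python
# Minimal sketch: invert a synthesized mel-spectrogram to audio via
# Griffin-Lim. Assumes the mel is stored as a (n_mels, frames) numpy
# array; sr, n_fft, hop_length are placeholders for the recipe's values.
import numpy as np
import librosa
import soundfile as sf

mel = np.load("synthesized_mel.npy")  # hypothetical path

# librosa's mel_to_audio runs Griffin-Lim internally to estimate phase.
audio = librosa.feature.inverse.mel_to_audio(
    librosa.db_to_power(mel),  # assumes dB-scaled mel; drop if linear power
    sr=22050,
    n_fft=1024,
    hop_length=256,
    n_iter=60,  # more Griffin-Lim iterations slightly reduce artifacts
)
sf.write("gl_output.wav", audio, 22050)
```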

Now, to improve the synthesized outputs, I decided to train a WaveNet vocoder on the synthesized mel-spectrograms (the mel-specs produced for the training set), as described in the paper. I trained the model for 1000k steps, but the output was garbled speech. I then extended training (without changing the hparams) to 1600k steps, with still no improvement. Sample synthesized audio files (and the hparams file) can be found here: Link.
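One quick sanity check worth running (file paths below are hypothetical): compare the statistics of the mels the synthesizer produces against the mels the vocoder's own preprocessing would extract, since a normalization or feature-parameter mismatch is a common cause of garbled vocoder output regardless of training length.

```python
# Sanity check, not a fix: if the conditioning features at synthesis
# time differ in scale/normalization from what the vocoder saw during
# training, it produces noise. Paths and (n_mels, frames) layout are
# assumptions.
import numpy as np

espnet_mel = np.load("espnet_train_mel.npy")        # hypothetical: mel fed to the vocoder
vocoder_mel = np.load("vocoder_extracted_mel.npy")  # hypothetical: mel from the vocoder's preprocessing

for name, mel in [("espnet", espnet_mel), ("vocoder", vocoder_mel)]:
    print(f"{name}: shape={mel.shape}, min={mel.min():.3f}, "
          f"max={mel.max():.3f}, mean={mel.mean():.3f}")

# If one is roughly in [0, 1] while the other is mean-variance
# normalized (or uses a different log base), the vocoder is being
# conditioned out of distribution. Both pipelines must agree on
# n_fft, hop_length, n_mels, fmin/fmax, and normalization.
```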

Any help or insights on how I could continue would be very much appreciated. Thanks!
