Hi! Reading about ControlNet, I wonder whether it would be suitable for a text-to-speech task.
Instead of general images, the network would output mel-spectrograms: the input to the locked copy would be the text to synthesize, and the input to the trainable copy could be, for example, a suitable representation of a reference speech (though other conditioning signals are worth considering).
I was wondering if anyone has thought about this and believes it could work.
It is just an idea thrown at you, but I think it might work best with a ControlNet-like approach applied to diffusion models already trained to generate mel-spectrograms; a rough sketch of what I mean is below.
This is just a high-level idea that could be elaborated further if it sparks any creative ideas on your end.
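To make the wiring concrete, here is a minimal, hypothetical sketch in PyTorch. Everything in it is a stand-in for illustration (the toy `MelDenoiser`, the tensor shapes, the single zero-convolution): a real setup would clone the encoder blocks of an actual pretrained mel-spectrogram diffusion U-Net, freeze the original weights, and inject the control features through zero-initialized convolutions as in the ControlNet recipe.

```python
import copy

import torch
import torch.nn as nn


class MelDenoiser(nn.Module):
    """Stand-in for a pretrained text-conditioned mel-spectrogram denoiser.

    Hypothetical: a real model would be a full U-Net over [B, 1, n_mels, frames]
    with timestep conditioning and cross-attention to a text encoder.
    """

    def __init__(self, channels=64, text_dim=128):
        super().__init__()
        self.inp = nn.Conv2d(1, channels, 3, padding=1)
        self.mid = nn.Conv2d(channels, channels, 3, padding=1)
        self.out = nn.Conv2d(channels, 1, 3, padding=1)
        self.text_proj = nn.Linear(text_dim, channels)

    def forward(self, x, text_emb, control=None):
        h = self.inp(x) + self.text_proj(text_emb)[:, :, None, None]
        if control is not None:
            h = h + control  # features injected by the ControlNet branch
        h = torch.relu(self.mid(h))
        return self.out(h)


def zero_conv(channels):
    """Zero-initialized 1x1 conv, so the control branch starts as a no-op."""
    conv = nn.Conv2d(channels, channels, 1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class MelControlNet(nn.Module):
    """Trainable copy of the locked encoder, fed the reference-speech mel."""

    def __init__(self, locked, channels=64):
        super().__init__()
        self.copy_inp = copy.deepcopy(locked.inp)             # trainable clone
        self.cond_inp = nn.Conv2d(1, channels, 3, padding=1)  # reference speech in
        self.zero = zero_conv(channels)

    def forward(self, x, ref_mel):
        h = self.copy_inp(x) + self.cond_inp(ref_mel)
        return self.zero(h)


# Freeze the pretrained model; only the control branch is trained.
locked = MelDenoiser()
for p in locked.parameters():
    p.requires_grad_(False)
control_net = MelControlNet(locked)

x_noisy = torch.randn(2, 1, 80, 256)   # noisy mel: [B, 1, n_mels, frames]
text_emb = torch.randn(2, 128)         # assumed text-encoder output
ref_mel = torch.randn(2, 1, 80, 256)   # reference-speech mel

control = control_net(x_noisy, ref_mel)
pred_noise = locked(x_noisy, text_emb, control=control)
```

Because the zero-convolution makes the control branch a no-op at initialization, training would start from exactly the behavior of the pretrained mel-spectrogram model, which is the property that makes the ControlNet approach appealing here.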
Hi @Brugio96! I have actually been working a little on a ControlNet for audio, specifically for audio inpainting. It's not directly the text-to-speech task, but it applies a similar idea: using a ControlNet on a model trained to generate mel-spectrograms. Check it out if you're curious or want to use it as inspiration! Code here --> https://github.com/zachary-shah/riff-cnet (sorry, my repo is a little bit messy right now >.>)