Given that the model was pre-trained on a massive 100-million-hour audio dataset, it's reasonable to assume that this data would contain a significant amount of non-vocal audio, such as sound effects and music.
In theory, this should enable the model to synthesize not only speech but also various sound effects and musical elements. However, in my tests so far, I have not been able to get the model to produce such non-speech audio.
I am very interested to know whether you have explored the model's potential for music and sound-effect generation. Could you please share any insights or plans in this area?