Zhikang Niu1,2, Shujie Hu3, Jeongsoo Choi4, Yushen Chen1,2, Peining Chen1, Pengcheng Zhu5, Yunting Yang5, Bowen Zhang5, Jian Zhao5, Chunhui Wang5, Xie Chen1,2
1MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science,
Shanghai Jiao Tong University, China
2Shanghai Innovation Institute, China
3The Chinese University of Hong Kong, China
4Korea Advanced Institute of Science and Technology, South Korea
5Geely, China
🚀 [2025.10] We release all the cleaned-up code and pre-trained models (semantic-vae, acoustic-vae-dim16, and acoustic-vae-dim64) to promote research on semantic-aligned VAEs for speech synthesis.
- Semantic Alignment: A novel VAE framework that utilizes semantic alignment regularization to mitigate the reconstruction-generation optimization dilemma in high-dimensional latent spaces.
- Plug and Play for VAE-based TTS: Semantic-VAE can be easily integrated into existing VAE-based TTS models, providing a simple yet effective way to improve their performance.
- Accelerated Training: Semantic-VAE significantly accelerates the convergence of VAE-based TTS models while maintaining the same inference speed as the original models.
- High-Quality Speech Generation: Semantic-VAE achieves high-quality speech generation with improved intelligibility and speaker similarity, making it suitable for various TTS applications.
# We recommend using conda to create a new environment.
conda create -n semantic-vae python=3.11
conda activate semantic-vae
git clone https://github.com/ZhikangNiu/Semantic-VAE.git
cd Semantic-VAE
# Install PyTorch >= 2.2.0, e.g.,
pip install torch==2.4.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
# Install audiotools: https://github.com/descriptinc/audiotools
pip install git+https://github.com/descriptinc/audiotools
# Install editable version of Semantic-VAE
pip install -e .
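After installation, it can be useful to confirm that the installed PyTorch satisfies the >= 2.2.0 requirement. The helper below is a minimal sketch, not part of the repository: the names `parse_version` and `meets_minimum` are hypothetical, and it only compares the numeric major.minor.patch part of a version string, ignoring local build suffixes such as `+cu124`.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string like '2.4.0+cu124' into (2, 4, 0)."""
    public = version.split("+", 1)[0]  # drop a local build suffix if present
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", public)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. '2.2') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(installed: str, minimum: str = "2.2.0") -> bool:
    """Return True if `installed` is at least `minimum` (hypothetical helper)."""
    return parse_version(installed) >= parse_version(minimum)
```

In an environment where PyTorch is installed, calling `meets_minimum(str(torch.__version__))` after `import torch` would perform the check.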
We use pre-trained SSL models (WavLM) to extract features for semantic alignment. You can use the following command to extract features after preparing the dataset.
bash extract_ssl_features.sh
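To make the idea of semantic alignment more concrete, the sketch below computes a frame-wise cosine-distance penalty between VAE latent frames and SSL feature frames. It is only an illustration under stated assumptions: the loss form, the function names `cosine_similarity` and `alignment_loss`, and the premise that both sequences already share the same frame count and dimensionality (for example, after a projection layer) are assumptions of this sketch, not details taken from the repository.

```python
import math
from typing import Sequence

Frame = Sequence[float]


def cosine_similarity(a: Frame, b: Frame) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero frame as uncorrelated
    return dot / (norm_a * norm_b)


def alignment_loss(latents: Sequence[Frame], ssl_features: Sequence[Frame]) -> float:
    """Mean of (1 - cosine similarity) over frames (hypothetical loss form)."""
    if len(latents) != len(ssl_features):
        raise ValueError("sequences must have the same number of frames")
    if not latents:
        raise ValueError("need at least one frame")
    distances = [1.0 - cosine_similarity(z, s) for z, s in zip(latents, ssl_features)]
    return sum(distances) / len(distances)
```

Under this formulation the penalty is 0 when every latent frame points in the same direction as its SSL counterpart and grows toward 2 as they point in opposite directions. In training, such a term would typically be added to the usual VAE reconstruction and KL objectives with a weighting factor.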
If you want to train a custom VAE, use conf/svae/example.yml as a template and pass the path to your customized .yml file to the training script. The training command is as follows (see also train.sh):
torchrun --nproc_per_node 8 scripts/train.py --args.load custom_yml_path
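If you launch training from Python, for instance to vary the GPU count per machine, the command above can be assembled programmatically. The following is a minimal sketch: `build_train_command` is a hypothetical helper, and `conf/svae/my_config.yml` is a placeholder path for your own config. It only builds the argument list and does not start training.

```python
import shlex
from typing import List


def build_train_command(config_path: str, num_gpus: int = 8) -> List[str]:
    """Assemble the torchrun invocation as an argument list (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nproc_per_node", str(num_gpus),
        "scripts/train.py",
        "--args.load", config_path,
    ]


if __name__ == "__main__":
    # Placeholder config path; replace it with your customized .yml file.
    cmd = build_train_command("conf/svae/my_config.yml", num_gpus=4)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually launch the job.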
You can download a pre-trained model and follow the example in examples/example.py to reconstruct audio with it.
For multi-GPU processing, we also provide two utility scripts:
- extract_latent.py: Extracts latent representations from audio files.
- recon_latent_to_wave.py: Reconstructs waveforms from latent codes.
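A common way to spread such batch jobs over several GPUs is to give each process a disjoint slice of the file list. The sketch below illustrates that pattern with a round-robin split. It is an assumption about how such sharding can be done in general, not a description of what the two scripts actually do, and `shard_for_rank` is a hypothetical name.

```python
from typing import List, Sequence


def shard_for_rank(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the items assigned to process `rank` out of `world_size` processes."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Round-robin: rank r takes items r, r + world_size, r + 2 * world_size, ...
    return list(items[rank::world_size])
```

Every file is assigned to exactly one rank, and shard sizes differ by at most one, which keeps the per-GPU workload roughly balanced when files have similar durations.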
Our work is built upon the following open-source projects: descript-audio-codec, F5-TTS, and BigVGAN. Thanks to the authors for their great work. If you have any questions, you can first check the issue trackers of those projects.
Our code is released under the MIT License. If our work and codebase are useful to you, please cite:
@article{niu2025semantic,
title={Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis},
author={Niu, Zhikang and Hu, Shujie and Choi, Jeongsoo and Chen, Yushen and Chen, Peining and Zhu, Pengcheng and Yang, Yunting and Zhang, Bowen and Zhao, Jian and Wang, Chunhui and others},
journal={arXiv preprint arXiv:2509.22167},
year={2025}
}