Zhikang Niu1,2, Shujie Hu3, Jeongsoo Choi4, Yushen Chen1,2, Peining Chen1, Pengcheng Zhu5, Yunting Yang5, Bowen Zhang5, Jian Zhao5, Chunhui Wang5, Xie Chen1,2
1MoE Key Lab of Artificial Intelligence, X-LANCE Lab, School of Computer Science,
Shanghai Jiao Tong University, China
2Shanghai Innovation Institute, China
3The Chinese University of Hong Kong, China
4Korea Advanced Institute of Science and Technology, South Korea
5Geely, China
🚀 [2025.10] We release the cleaned-up code and all pre-trained models (semantic-vae, acoustic-vae-dim16, and acoustic-vae-dim64) to promote research on semantic-aligned VAEs for speech synthesis.
- Semantic Alignment: A novel VAE framework that uses semantic alignment regularization to mitigate the reconstruction-generation optimization dilemma in high-dimensional latent spaces (an illustrative sketch of such a regularizer follows this list).
- Plug and Play for VAE-based TTS: Semantic-VAE can be easily integrated into existing VAE-based TTS models, providing a simple yet effective way to improve their performance.
- Accelerated Training: Semantic-VAE significantly accelerates the convergence of VAE-based TTS models while maintaining the same inference speed as the original models.
- High-Quality Speech Generation: Semantic-VAE achieves high-quality speech generation with improved intelligibility and speaker similarity, making it suitable for various TTS applications.
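To make the first point concrete, below is a minimal sketch of what a semantic alignment regularizer can look like: VAE latents are projected into the SSL feature space and pulled toward frame-aligned WavLM features with a cosine loss. The dimensions and loss form are illustrative assumptions, not the paper's exact formulation; in training, such a term would be added to the usual reconstruction and KL objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAlignmentLoss(nn.Module):
    """Illustrative regularizer: project VAE latents into the SSL feature
    space and pull them toward frame-aligned WavLM features.
    Dimensions and the cosine loss are placeholders, not the paper's
    exact formulation."""

    def __init__(self, latent_dim: int = 64, ssl_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(latent_dim, ssl_dim)

    def forward(self, latents: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
        # latents:   (batch, frames, latent_dim), from the VAE encoder
        # ssl_feats: (batch, frames, ssl_dim), from a frozen SSL model;
        #            assumed already resampled to the same frame rate.
        pred = self.proj(latents)
        # 1 - cosine similarity, averaged over frames and batch
        return (1.0 - F.cosine_similarity(pred, ssl_feats, dim=-1)).mean()
```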
```bash
# We recommend using conda to create a new environment.
conda create -n semantic-vae python=3.11
conda activate semantic-vae

git clone https://github.com/ZhikangNiu/Semantic-VAE.git
cd Semantic-VAE

# Install PyTorch >= 2.2.0, e.g.,
pip install torch==2.4.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124

# Install audiotools: https://github.com/descriptinc/audiotools
pip install git+https://github.com/descriptinc/audiotools

# Install an editable version of Semantic-VAE
pip install -e .
```
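Optionally, you can sanity-check the environment before moving on (this snippet is illustrative and not part of the repo):

```python
# Optional sanity check: confirm library versions and CUDA visibility.
import torch
import torchaudio

print("torch:", torch.__version__)               # expect >= 2.2.0
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```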
We use a pre-trained SSL model (WavLM) to extract features for semantic alignment. After preparing the dataset, extract the features with:

```bash
bash extract_ssl_features.sh
```
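For a rough picture of what the extraction step does, here is a minimal sketch using a Hugging Face WavLM checkpoint. The checkpoint name, audio path, and output format are assumptions for the sketch; extract_ssl_features.sh is the authoritative pipeline.

```python
# Illustrative only — the real pipeline lives in extract_ssl_features.sh.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMModel

device = "cuda" if torch.cuda.is_available() else "cpu"
# Checkpoint choice is an assumption; pick the one matching your config.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large").to(device).eval()

wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt").to(device)
with torch.no_grad():
    feats = model(**inputs).last_hidden_state  # (1, frames, 1024)
torch.save(feats.cpu(), "sample.ssl.pt")
```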
If you want to train a custom VAE, modify the conf/svae/example.yml file as a template and specify the path to your customized .yml file. The training command is as follows (you can also check train.sh):

```bash
torchrun --nproc_per_node 8 scripts/train.py --args.load custom_yml_path
```

For inference, download the pretrained model and follow the example in examples/example.py to reconstruct audio.
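As a sketch of the reconstruction round trip (with a hypothetical loader and method names; examples/example.py is the authoritative reference):

```python
# Hypothetical sketch — the loader and method names below are placeholders;
# see examples/example.py for the actual API.
import torch
import torchaudio

model = load_pretrained_vae("semantic-vae.ckpt").eval()  # hypothetical loader

wav, sr = torchaudio.load("input.wav")
with torch.no_grad():
    latent = model.encode(wav.unsqueeze(0))   # waveform -> latent sequence
    recon = model.decode(latent)              # latent sequence -> waveform
torchaudio.save("recon.wav", recon.squeeze(0).cpu(), sr)
```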
For multi-GPU processing, we also provide two utility scripts (a generic sketch of their sharding pattern follows the list):
- `extract_latent.py`: extracts latent representations from audio files.
- `recon_latent_to_wave.py`: reconstructs waveforms from latent codes.
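Scripts like these typically shard the file list across ranks launched by torchrun. Below is a generic, illustrative version of that pattern; the data path and `process` worker are placeholders, not the scripts' actual code or flags.

```python
# Generic rank-sharding pattern, as typically launched with
# `torchrun --nproc_per_node 8 <script>`; illustrative, not the
# actual implementation of extract_latent.py / recon_latent_to_wave.py.
import os
from pathlib import Path


def process(path: Path) -> None:
    # Placeholder worker: extract a latent or reconstruct a waveform here.
    print(f"processing {path.name}")


rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

files = sorted(Path("data/wavs").glob("*.wav"))
for path in files[rank::world_size]:  # each rank handles a disjoint slice
    process(path)
```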
Our work is built upon the following open-source projects: descript-audio-codec, F5-TTS, and BigVGAN. Thanks to the authors for their great work. If you have questions about these components, please check their respective issue trackers first.
Our code is released under the MIT License. If our work and codebase are useful to you, please cite as:
@article{niu2025semantic,
title={Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis},
author={Niu, Zhikang and Hu, Shujie and Choi, Jeongsoo and Chen, Yushen and Chen, Peining and Zhu, Pengcheng and Yang, Yunting and Zhang, Bowen and Zhao, Jian and Wang, Chunhui and others},
journal={arXiv preprint arXiv:2509.22167},
year={2025}
}