MeloTTS is a high-quality, open-source text-to-speech system developed by MyShell AI. It is built on top of the VITS/VITS2 architecture and uses BERT-based linguistic features to produce natural-sounding speech. MeloTTS supports multiple languages and is designed to be fast enough for real-time CPU inference.
Strengths of the original MeloTTS:
- High naturalness and expressiveness in synthesized speech
- Fast inference — runs in real-time even on CPU
- Lightweight and easy to deploy
- Supports multiple languages (English, Chinese, Japanese, Korean, Spanish, French)
- Permissive MIT license, suitable for both commercial and non-commercial use
Limitations of the original MeloTTS:
- Not natively optimized for Vietnamese phonology (tones, phonemes)
- The default English/multilingual phonemizer does not handle Vietnamese tones and diacritics correctly
- No built-in support for Vietnamese-specific linguistic preprocessing
MeloTTS Vietnamese is a version of MeloTTS specifically optimized for the Vietnamese language. It inherits the high-quality and fast-inference characteristics of the original model while introducing targeted improvements to handle the unique phonological properties of Vietnamese — including its 6 tones, complex vowel system, and syllable structure.
This model is designed to produce natural, accurate Vietnamese speech and can be easily fine-tuned on custom Vietnamese datasets.
- Uses underthesea for Vietnamese text segmentation
- Integrates PhoBERT (vinai/phobert-base-v2) to extract Vietnamese linguistic features
- Full support for Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in melo/text/symbols.py
- Text-to-phoneme conversion:
  - Based on the Text2PhonemeSequence library
  - An improved, higher-performance version is available as Text2PhonemeFast
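To make the tone inventory concrete: the six standard orthographic tones of Vietnamese can be recovered from a syllable's combining diacritics. The sketch below is purely illustrative and not part of the repository (which defines a larger 8-tone symbol set); the `syllable_tone` helper is a hypothetical name, and it uses only Unicode NFD decomposition from the standard library:

```python
import unicodedata

# The five Vietnamese combining tone marks; the sixth tone, "ngang", is unmarked.
TONE_MARKS = {
    "\u0301": "sắc",    # acute
    "\u0300": "huyền",  # grave
    "\u0309": "hỏi",    # hook above
    "\u0303": "ngã",    # tilde
    "\u0323": "nặng",   # dot below
}

def syllable_tone(syllable: str) -> str:
    """Return the tone name of a Vietnamese syllable by scanning its
    combining diacritics after NFD decomposition."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark present

print(syllable_tone("má"))  # sắc
print(syllable_tone("mà"))  # huyền
print(syllable_tone("ma"))  # ngang
```

Vowel-quality diacritics (circumflex, breve, horn) survive the scan untouched, since only the five tone marks are matched.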
This model was fine-tuned from the base MeloTTS model by:
- Replacing symbols in the base phoneme set that occur in neither English nor Vietnamese with Vietnamese-specific phonemes
- Specifically, mapping the Korean phoneme symbols to their corresponding Vietnamese equivalents
- Adjusting model parameters to match Vietnamese phonetic characteristics
- GitHub: MeloTTS Vietnamese
- The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
- Note on data quality: this dataset has several limitations, including suboptimal voice quality, missing punctuation, and imprecise phonetic transcriptions; training the same recipe on high-quality internal data yields significantly better results.
The pre-trained model can be downloaded from Hugging Face. To set up the code:

```bash
git clone https://github.com/manhcuong02/MeloTTS_Vietnamese.git
cd MeloTTS_Vietnamese
pip install -r requirements.txt
```

Download the model checkpoint and config from Hugging Face and place them in your desired directory.
Refer to the notebook test_infer.ipynb for a full example. Basic usage:
```python
from melo.api import TTS

# Speed is adjustable
speed = 1.0

# You can set device to 'cpu', 'cuda', 'cuda:0', or 'mps'
device = "cuda:0"  # use "cpu" if no GPU is available

# Load the Vietnamese TTS model
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter your text here"
output_path = "output.wav"
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
```

The full data preparation process is detailed in docs/training.md. At minimum, you need:
- Audio files (recommended sample rate: 44100 Hz)
- A metadata file in the following format (one `|`-separated line per utterance):

```
path/to/audio_001.wav|<speaker_name>|<language_code>|<text_001>
path/to/audio_002.wav|<speaker_name>|<language_code>|<text_002>
```
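A quick sanity check on metadata lines can catch formatting mistakes before preprocessing. The `parse_metadata_line` helper below is a hypothetical sketch, not part of the repository (the real parsing happens inside melo/preprocess_text.py):

```python
def parse_metadata_line(line: str):
    """Split one metadata line into (audio_path, speaker, language, text).
    Raises ValueError if the line does not have exactly four '|' fields."""
    parts = line.strip().split("|")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    audio_path, speaker, language, text = (p.strip() for p in parts)
    return audio_path, speaker, language, text

# Example usage with a well-formed line:
print(parse_metadata_line("path/to/audio_001.wav|speaker_name|VI|Nhập văn bản tại đây"))
```

Running such a check over the whole .list file before training surfaces malformed rows early, instead of mid-preprocessing.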
Run the preprocessing script to prepare training data:
```bash
python melo/preprocess_text.py \
    --metadata /path/to/text_training.list \
    --config_path /path/to/config.json \
    --device cuda:0 \
    --val-per-spk 10 \
    --max-val-total 500
```

Alternatively, use the shell script melo/preprocess_text.sh with appropriate parameters.
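To make the split flags concrete, here is a hedged, self-contained sketch of the semantics implied by --val-per-spk and --max-val-total: take up to N validation utterances per speaker, capped at a global maximum, with the rest going to training. The `split_train_val` helper is hypothetical; the actual logic lives in melo/preprocess_text.py and may differ in detail:

```python
import random

def split_train_val(entries, val_per_spk=10, max_val_total=500, seed=0):
    """Assign up to `val_per_spk` validation entries per speaker,
    capped at `max_val_total` overall; everything else is training data.
    Each entry is a dict with at least a "speaker" key."""
    rng = random.Random(seed)
    by_speaker = {}
    for e in entries:
        by_speaker.setdefault(e["speaker"], []).append(e)
    train, val = [], []
    for spk, items in by_speaker.items():
        items = items[:]
        rng.shuffle(items)
        take = min(val_per_spk, max_val_total - len(val))
        val.extend(items[:take])
        train.extend(items[take:])
    return train, val

# One speaker with 30 utterances: 10 go to validation, 20 to training.
entries = [{"speaker": "spk1", "text": f"utt{i}"} for i in range(30)]
train, val = split_train_val(entries, val_per_spk=10, max_val_total=500)
print(len(train), len(val))  # 20 10
```

The global cap matters for many-speaker corpora: once the validation set reaches --max-val-total, remaining speakers contribute nothing to validation.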
Follow the training instructions in docs/training.md.
The Vietnamese adaptation, code implementation, and fine-tuning of this model were developed by Nguyễn Mạnh Cường.
- GitHub: manhcuong02
- Repository: MeloTTS Vietnamese
Listen to sample outputs from the model:
"Buổi sáng ở thành phố bắt đầu bằng tiếng xe cộ nhộn nhịp và ánh nắng nhẹ xuyên qua những tòa nhà cao tầng." ("Morning in the city begins with the bustle of traffic and soft sunlight filtering through the high-rise buildings.")
"Người đi làm vội vã, học sinh ríu rít trò chuyện, còn quán cà phê góc phố thì thoang thoảng mùi thơm dễ chịu." ("Commuters hurry along, students chatter away, and the corner café gives off a pleasant, lingering aroma.")
"Cuối cùng, hãy thử thì thầm một câu thật nhẹ nhàng, rồi bất ngờ chuyển sang giọng nói to, rõ và đầy năng lượng." ("Finally, try whispering a sentence very softly, then suddenly switch to a loud, clear, energetic voice.")
This project is licensed under the MIT License, consistent with the original MeloTTS project. It may be used for both commercial and non-commercial purposes.
This implementation is based on TTS, VITS, VITS2, and Bert-VITS2. We appreciate their outstanding work.
