MeloTTS is a high-quality, open-source text-to-speech system developed by MyShell AI. It is built on top of the VITS/VITS2 architecture and uses BERT-based linguistic features to produce natural-sounding speech. MeloTTS supports multiple languages and is designed to be fast enough for real-time CPU inference.
Strengths of the original MeloTTS:
- High naturalness and expressiveness in synthesized speech
- Fast inference — runs in real-time even on CPU
- Lightweight and easy to deploy
- Supports multiple languages (English, Chinese, Japanese, Korean, Spanish, French)
- Permissive MIT license, suitable for both commercial and non-commercial use
Limitations of the original MeloTTS:
- Not natively optimized for Vietnamese phonology (tones, phonemes)
- The default English/multilingual phonemizer does not handle Vietnamese tones and diacritics correctly
- No built-in support for Vietnamese-specific linguistic preprocessing
MeloTTS Vietnamese is a version of MeloTTS specifically optimized for the Vietnamese language. It inherits the high-quality and fast-inference characteristics of the original model while introducing targeted improvements to handle the unique phonological properties of Vietnamese — including its 6 tones, complex vowel system, and syllable structure.
This model is designed to produce natural, accurate Vietnamese speech and can be easily fine-tuned on custom Vietnamese datasets.
- Uses underthesea for Vietnamese text segmentation
- Integrates PhoBERT (vinai/phobert-base-v2) to extract Vietnamese linguistic features
- Full support for Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in melo/text/symbols.py
- Text-to-phoneme conversion:
  - Based on the Text2PhonemeSequence library
  - An improved, higher-performance version is available as Text2PhonemeFast
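To make the tone inventory concrete: the six standard orthographic tones of Vietnamese can be recovered from a syllable's combining diacritics. The sketch below is purely illustrative and not part of the repository (which defines a larger 8-tone symbol set); the `syllable_tone` helper is a hypothetical name, and it uses only Unicode NFD decomposition from the standard library:

```python
import unicodedata

# The five Vietnamese combining tone marks; the sixth tone, "ngang", is unmarked.
TONE_MARKS = {
    "\u0301": "sắc",    # acute
    "\u0300": "huyền",  # grave
    "\u0309": "hỏi",    # hook above
    "\u0303": "ngã",    # tilde
    "\u0323": "nặng",   # dot below
}

def syllable_tone(syllable: str) -> str:
    """Return the tone name of a Vietnamese syllable by scanning its
    combining diacritics after NFD decomposition."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark present

print(syllable_tone("má"))  # sắc
print(syllable_tone("mà"))  # huyền
print(syllable_tone("ma"))  # ngang
```

Vowel-quality diacritics (circumflex, breve, horn) survive the scan untouched, since only the five tone marks are matched.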
This model was fine-tuned from the base MeloTTS model by:
- Replacing symbols in the base phoneme set that occur in neither English nor Vietnamese with Vietnamese-specific phonemes
- Specifically, mapping the Korean phoneme symbols to their corresponding Vietnamese equivalents
- Adjusting model parameters to match Vietnamese phonetic characteristics
- GitHub: MeloTTS Vietnamese
- The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
- Note on data quality: this dataset has several limitations, including suboptimal voice quality, missing punctuation, and imprecise phonetic transcriptions; training the same recipe on high-quality internal data yields significantly better results.
The pre-trained model can be downloaded from Hugging Face. To set up the code:

```bash
git clone https://github.com/manhcuong02/MeloTTS_Vietnamese.git
cd MeloTTS_Vietnamese
pip install -r requirements.txt
```

Download the model checkpoint and config from Hugging Face and place them in your desired directory.
Refer to the notebook test_infer.ipynb for a full example. Basic usage:
```python
from melo.api import TTS

# Speed is adjustable
speed = 1.0

# You can set device to 'cpu', 'cuda', 'cuda:0', or 'mps'
device = "cuda:0"  # use "cpu" if no GPU is available

# Load the Vietnamese TTS model
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter your text here"
output_path = "output.wav"
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
```

The full data preparation process is detailed in docs/training.md. At minimum, you need:
- Audio files (recommended sample rate: 44100 Hz)
- A metadata file in the following format (one `|`-separated line per utterance):

```
path/to/audio_001.wav|<speaker_name>|<language_code>|<text_001>
path/to/audio_002.wav|<speaker_name>|<language_code>|<text_002>
```
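A quick sanity check on metadata lines can catch formatting mistakes before preprocessing. The `parse_metadata_line` helper below is a hypothetical sketch, not part of the repository (the real parsing happens inside melo/preprocess_text.py):

```python
def parse_metadata_line(line: str):
    """Split one metadata line into (audio_path, speaker, language, text).
    Raises ValueError if the line does not have exactly four '|' fields."""
    parts = line.strip().split("|")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    audio_path, speaker, language, text = (p.strip() for p in parts)
    return audio_path, speaker, language, text

# Example usage with a well-formed line:
print(parse_metadata_line("path/to/audio_001.wav|speaker_name|VI|Nhập văn bản tại đây"))
```

Running such a check over the whole .list file before training surfaces malformed rows early, instead of mid-preprocessing.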
Run the preprocessing script to prepare training data:
```bash
python melo/preprocess_text.py \
    --metadata /path/to/text_training.list \
    --config_path /path/to/config.json \
    --device cuda:0 \
    --val-per-spk 10 \
    --max-val-total 500
```

Alternatively, use the shell script melo/preprocess_text.sh with appropriate parameters.
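To make the split flags concrete, here is a hedged, self-contained sketch of the semantics implied by --val-per-spk and --max-val-total: take up to N validation utterances per speaker, capped at a global maximum, with the rest going to training. The `split_train_val` helper is hypothetical; the actual logic lives in melo/preprocess_text.py and may differ in detail:

```python
import random

def split_train_val(entries, val_per_spk=10, max_val_total=500, seed=0):
    """Assign up to `val_per_spk` validation entries per speaker,
    capped at `max_val_total` overall; everything else is training data.
    Each entry is a dict with at least a "speaker" key."""
    rng = random.Random(seed)
    by_speaker = {}
    for e in entries:
        by_speaker.setdefault(e["speaker"], []).append(e)
    train, val = [], []
    for spk, items in by_speaker.items():
        items = items[:]
        rng.shuffle(items)
        take = min(val_per_spk, max_val_total - len(val))
        val.extend(items[:take])
        train.extend(items[take:])
    return train, val

# One speaker with 30 utterances: 10 go to validation, 20 to training.
entries = [{"speaker": "spk1", "text": f"utt{i}"} for i in range(30)]
train, val = split_train_val(entries, val_per_spk=10, max_val_total=500)
print(len(train), len(val))  # 20 10
```

The global cap matters for many-speaker corpora: once the validation set reaches --max-val-total, remaining speakers contribute nothing to validation.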
Follow the training instructions in docs/training.md.
The Vietnamese adaptation, code implementation, and fine-tuning of this model were developed by Nguyễn Mạnh Cường.
- GitHub: manhcuong02
- Repository: MeloTTS Vietnamese
Listen to sample outputs from the model:
"Buổi sáng ở thành phố bắt đầu bằng tiếng xe cộ nhộn nhịp và ánh nắng nhẹ xuyên qua những tòa nhà cao tầng." ("Morning in the city begins with the bustle of traffic and soft sunlight filtering through the high-rise buildings.")
"Người đi làm vội vã, học sinh ríu rít trò chuyện, còn quán cà phê góc phố thì thoang thoảng mùi thơm dễ chịu." ("Commuters hurry along, students chatter away, and the corner café gives off a pleasant, lingering aroma.")
"Cuối cùng, hãy thử thì thầm một câu thật nhẹ nhàng, rồi bất ngờ chuyển sang giọng nói to, rõ và đầy năng lượng." ("Finally, try whispering a sentence very softly, then suddenly switch to a loud, clear, energetic voice.")
This project is licensed under the MIT License, consistent with the original MeloTTS project. It may be used for both commercial and non-commercial purposes.
This implementation is based on TTS, VITS, VITS2, and Bert-VITS2. We appreciate their outstanding work.
