VieNeu-TTS is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.
> [!TIP]
> **Voice Cloning:** All model variants (including GGUF) support instant voice cloning with just 3-5 seconds of reference audio.
This project features two core architectures trained on the VieNeu-TTS-1000h dataset:
- VieNeu-TTS (0.5B): An enhanced model fine-tuned from the NeuTTS Air architecture for maximum stability.
- VieNeu-TTS-0.3B: A specialized model trained from scratch, delivering 2x faster inference and ultra-low latency.
These represent a significant upgrade from the previous VieNeu-TTS-140h with the following improvements:
- Enhanced pronunciation: More accurate and stable Vietnamese pronunciation
- Code-switching support: Seamless transitions between Vietnamese and English
- Better voice cloning: Higher fidelity and speaker consistency
- Real-time synthesis: 24 kHz waveform generation on CPU or GPU
- Multiple model formats: Support for PyTorch, GGUF Q4/Q8 (CPU optimized), and ONNX codec
VieNeu-TTS delivers production-ready speech synthesis fully offline.
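As a quick sanity check on the numbers above: at the 24 kHz output rate, clip duration maps directly to sample count. The helper below is illustrative only (not part of the VieNeu-TTS package) and shows, for example, how large a 3-5 second voice-cloning reference clip is in samples:

```python
SAMPLE_RATE_HZ = 24_000  # VieNeu-TTS generates 24 kHz waveforms

def num_samples(seconds: float, rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(seconds * rate)

def duration_s(samples: int, rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration in seconds of a buffer of `samples` samples."""
    return samples / rate

# A 3-5 second reference clip for voice cloning:
print(num_samples(3.0))  # 72000
print(num_samples(5.0))  # 120000
```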
Author: Phạm Nguyễn Ngọc Bảo
- Backbone:
  - VieNeu-TTS (0.5B): Qwen-0.5B fine-tuned from NeuTTS Air.
  - VieNeu-TTS-0.3B: Custom 0.3B model trained from scratch, optimized for extreme speed (2x faster).
- Audio codec: NeuCodec (torch implementation; ONNX & quantized variants supported)
- Context window: 2,048 tokens shared by prompt text and speech tokens
- Output watermark: Enabled by default
- Training data: VieNeu-TTS-1000h — 443,641 curated Vietnamese samples (used for both versions).
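Because the 2,048-token context window is shared between the text prompt and the generated speech tokens, longer prompts leave less room for audio. A rough budgeting helper (an illustration under that assumption, not project code):

```python
CONTEXT_WINDOW = 2048  # tokens shared by prompt text and speech tokens

def speech_token_budget(prompt_tokens: int, context: int = CONTEXT_WINDOW) -> int:
    """Speech tokens remaining after the text prompt is accounted for."""
    if prompt_tokens >= context:
        raise ValueError("prompt alone exceeds the context window")
    return context - prompt_tokens

print(speech_token_budget(300))  # 1748
```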
| Model | Format | Device | Quality | Speed |
|---|---|---|---|---|
| VieNeu-TTS | PyTorch | GPU/CPU | ⭐⭐⭐⭐⭐ | Very Fast with lmdeploy |
| VieNeu-TTS-0.3B | PyTorch | GPU/CPU | ⭐⭐⭐⭐ | Ultra Fast (2x) |
| VieNeu-TTS-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Fast |
| VieNeu-TTS-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Very Fast |
| VieNeu-TTS-0.3B-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐ | Ultra Fast (1.5x) |
| VieNeu-TTS-0.3B-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐ | Extreme Speed (2x) |
Recommendations:
- GPU users: Use `VieNeu-TTS` (PyTorch) for best quality.
- CPU users: Use `VieNeu-TTS-0.3B-q4-gguf` for the fastest inference, or `VieNeu-TTS-0.3B-q8-gguf` for the best CPU quality.
- Streaming: Only GGUF models support streaming inference (requires `llama-cpp-python >= 0.3.16`).
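The recommendations above can be encoded as a tiny selection helper. This is purely illustrative (the function and its parameters are ours, not part of the package); the returned names match the model table:

```python
def pick_model(device: str, prefer_speed: bool = False) -> str:
    """Pick a VieNeu-TTS variant per the recommendation table.

    device: "gpu" or "cpu"; prefer_speed trades quality for latency on CPU.
    """
    if device == "gpu":
        return "VieNeu-TTS"  # PyTorch, best quality (pairs well with lmdeploy)
    if device == "cpu":
        # Q4 is fastest on CPU; Q8 keeps more quality.
        return "VieNeu-TTS-0.3B-q4-gguf" if prefer_speed else "VieNeu-TTS-0.3B-q8-gguf"
    raise ValueError(f"unknown device: {device}")

print(pick_model("gpu"))                      # VieNeu-TTS
print(pick_model("cpu", prefer_speed=True))   # VieNeu-TTS-0.3B-q4-gguf
```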
- Publish safetensor artifacts
- Release GGUF Q4 / Q8 models
- Release datasets (1000h and 140h)
- Enable streaming on GPU
- Provide Dockerized setup
- Release fine-tuning code (LoRA)
- LoRA Adapter integration in Gradio
VieNeu-TTS now officially supports LoRA (Low-Rank Adaptation). This allows you to:
- Use custom fine-tuned voices from Hugging Face.
- Achieve much higher quality and similarity than zero-shot voice cloning.
- Switch between different adapters seamlessly in the Gradio UI.
For more details, see docs/LORA_USAGE.md.
You can now train VieNeu-TTS on your own voice dataset!
- Simple Workflow: Follow the step-by-step guide in finetune/README.md.
- Notebook Support: Use `finetune/finetune_VieNeu-TTS.ipynb` for an interactive experience.
```bash
git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
```

Phonemizer requires eSpeak NG to function.
- Windows: Download the installer from eSpeak NG Releases (recommended: `.msi`).
- macOS: `brew install espeak`
- Ubuntu/Debian: `sudo apt install espeak-ng`
- Arch Linux: `paru -S aur/espeak-ng`
This is the fastest and most reliable way to manage dependencies.
A. Install uv (if you haven't already):
- Windows: `powershell -c "irm https://astral.sh/uv/install.ps1 | iex"`
- Linux/macOS: `curl -LsSf https://astral.sh/uv/install.sh | sh`
B. Choose your hardware:
Option A: For GPU Users (NVIDIA 30xx/40xx/50xx)
> [!IMPORTANT]
> Update your NVIDIA drivers and install the CUDA Toolkit! This project uses CUDA 12.8. Please ensure your NVIDIA driver is up to date (supporting CUDA 12.8 or newer) to avoid compatibility issues, especially on the RTX 30 series.
> To use lmdeploy, you MUST install the NVIDIA GPU Computing Toolkit: https://developer.nvidia.com/cuda-downloads.
```bash
uv sync
```

Option B: For CPU-only Users
- Switch to the CPU configuration:

  ```bash
  # Windows:
  ren pyproject.toml pyproject.toml.bak
  copy pyproject.toml.cpu pyproject.toml

  # Linux/macOS:
  mv pyproject.toml pyproject.toml.bak
  cp pyproject.toml.cpu pyproject.toml
  ```

- Install dependencies:

  ```bash
  uv sync
  ```
C. Run the Application:
```bash
uv run gradio_app.py
```

Then access the Web UI at http://127.0.0.1:7860.
Best if you have make installed (standard on Linux/macOS, or via Git Bash on Windows). It handles configuration swaps automatically.
- Setup GPU: `make setup-gpu`
- Setup CPU: `make setup-cpu`
- Run Demo: `make demo`
Then access the Web UI at http://127.0.0.1:7860.
For a quick start or production deployment without manually installing dependencies, use Docker.
Copy `.env.example` to `.env`:

```bash
cp .env.example .env
```

Build and start the container:

```bash
# Run with CPU
docker compose --profile cpu up

# Run with GPU (requires NVIDIA Container Toolkit)
docker compose --profile gpu up
```

Access the Web UI at http://localhost:7860.
For detailed deployment instructions, including production setup, see docs/Deploy.md.
```
VieNeu-TTS/
├── vieneu_tts/          # Core engine implementation (VieNeuTTS & FastVieNeuTTS)
├── finetune/            # LoRA training pipeline
│   ├── configs/         # Training & LoRA configurations
│   ├── data_scripts/    # Data filtering & VQ encoding tools
│   ├── dataset/         # Training data storage
│   ├── output/          # Saved checkpoints & LoRA adapters
│   └── train.py         # Main training script
├── utils/               # Text normalization and phonemization logic
├── sample/              # Built-in reference voices (audio + transcript + codes)
├── docs/                # Detailed documentation for LoRA, Deployment, and Docker
├── examples/            # Usage examples and testing audio references
├── gradio_app.py        # Modern Web UI with LoRA & Streaming support
├── config.yaml          # Model, Codec, and Voice registry
├── pyproject.toml       # Dependency management (UV/PIP)
├── Makefile             # Shortcuts for setup and execution
└── docker-compose.yml   # Docker orchestration for CPU/GPU modes
```
- GitHub Repository
- Hugging Face Model (0.5B)
- Hugging Face Model (0.3B)
- LoRA Usage Guide
- Fine-tuning Guide
- VieNeu-TTS-1000h dataset
- VieNeu-TTS (0.5B): Original terms (Apache 2.0).
- VieNeu-TTS-0.3B: Released under CC BY-NC 4.0 (Non-Commercial).
- This version is currently experimental.
- Commercial use is prohibited without authorization. Please contact the author for commercial licensing.
```bibtex
@misc{vieneutts2026,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}
```

Contributions are welcome!
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m "Add amazing feature"`
- Push the branch: `git push origin feature/amazing-feature`
- Open a pull request
- GitHub Issues: github.com/pnnbao97/VieNeu-TTS/issues
- Hugging Face: huggingface.co/pnnbao-ump
- Discord: Join us
- Facebook: Phạm Nguyễn Ngọc Bảo
This project builds upon NeuTTS Air for the original 0.5B model. The 0.3B version is a custom architecture trained from scratch using the VieNeu-TTS-1000h dataset.
Made with ❤️ for the Vietnamese TTS community