VieNeu-TTS


VieNeu-TTS is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.

Tip

Voice Cloning: All model variants (including GGUF) support instant voice cloning with just 3-5 seconds of reference audio.
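The 3-5 second window is easy to sanity-check before cloning. Below is a minimal stdlib sketch (not part of the VieNeu-TTS API — the duration bounds come from the tip above, and the file path is a placeholder):

```python
import wave

def check_reference_clip(path, min_s=3.0, max_s=5.0):
    """Return the clip duration in seconds and whether it fits the 3-5 s window."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration, min_s <= duration <= max_s
```

A clip that falls outside the window can be trimmed, or a different sample chosen, before cloning.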

This project features two core architectures trained on the VieNeu-TTS-1000h dataset:

  • VieNeu-TTS (0.5B): An enhanced model fine-tuned from the NeuTTS Air architecture for maximum stability.
  • VieNeu-TTS-0.3B: A specialized model trained from scratch, delivering 2x faster inference and ultra-low latency.

These represent a significant upgrade from the previous VieNeu-TTS-140h with the following improvements:

  • Enhanced pronunciation: More accurate and stable Vietnamese pronunciation
  • Code-switching support: Seamless transitions between Vietnamese and English
  • Better voice cloning: Higher fidelity and speaker consistency
  • Real-time synthesis: 24 kHz waveform generation on CPU or GPU
  • Multiple model formats: Support for PyTorch, GGUF Q4/Q8 (CPU optimized), and ONNX codec

VieNeu-TTS delivers production-ready speech synthesis fully offline.

Author: Phạm Nguyễn Ngọc Bảo


(Demo video: Screen.Recording.2025-12-10.201011.mp4)

🔬 Model Overview

  • Backbone:
    • VieNeu-TTS (0.5B): Qwen-0.5B fine-tuned from NeuTTS Air.
    • VieNeu-TTS-0.3B: Custom 0.3B model trained from scratch, optimized for extreme speed (2x faster).
  • Audio codec: NeuCodec (torch implementation; ONNX & quantized variants supported)
  • Context window: 2,048 tokens shared by prompt text and speech tokens
  • Output watermark: Enabled by default
  • Training data: VieNeu-TTS-1000h — 443,641 curated Vietnamese samples (used to train both versions)
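Because the 2,048-token context window is shared by the text prompt and the generated speech tokens, longer prompts leave less room for audio. A rough budgeting sketch — note the 50-tokens-per-second codec rate is an illustrative assumption, not a published figure:

```python
CONTEXT_WINDOW = 2048          # shared by prompt text and speech tokens
SPEECH_TOKENS_PER_SEC = 50     # assumed codec rate, for illustration only

def max_audio_seconds(prompt_tokens: int) -> float:
    """Estimate how many seconds of audio fit after the text prompt."""
    budget = CONTEXT_WINDOW - prompt_tokens
    return max(budget, 0) / SPEECH_TOKENS_PER_SEC

# e.g. a 300-token prompt leaves (2048 - 300) / 50 ≈ 35 s for audio
```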

Model Variants

Model                   | Format  | Device  | Quality | Speed
VieNeu-TTS              | PyTorch | GPU/CPU | ⭐⭐⭐⭐⭐   | Very fast (with lmdeploy)
VieNeu-TTS-0.3B         | PyTorch | GPU/CPU | ⭐⭐⭐⭐    | Ultra fast (2x)
VieNeu-TTS-q8-gguf      | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐    | Fast
VieNeu-TTS-q4-gguf      | GGUF Q4 | CPU/GPU | ⭐⭐⭐     | Very fast
VieNeu-TTS-0.3B-q8-gguf | GGUF Q8 | CPU/GPU | ⭐⭐⭐⭐    | Ultra fast (1.5x)
VieNeu-TTS-0.3B-q4-gguf | GGUF Q4 | CPU/GPU | ⭐⭐⭐     | Extreme speed (2x)

Recommendations:

  • GPU users: use VieNeu-TTS (PyTorch) for the best quality.
  • CPU users: use VieNeu-TTS-0.3B-q4-gguf for the fastest inference, or VieNeu-TTS-0.3B-q8-gguf for the best CPU quality.
  • Streaming: only the GGUF models support streaming inference (requires llama-cpp-python >= 0.3.16).
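Streaming matters because playback can begin as soon as the first chunk arrives instead of after the whole utterance is synthesized. The sketch below stubs the generator side (real chunks would come from a GGUF model through llama-cpp-python, whose exact API is not shown here) and writes 16-bit PCM chunks to a WAV file incrementally:

```python
import wave

def fake_stream(seconds=2, rate=24000, chunk=4800):
    """Stand-in for a streaming TTS generator yielding 16-bit mono PCM chunks."""
    total = seconds * rate
    for start in range(0, total, chunk):
        n = min(chunk, total - start)
        yield b"\x00\x00" * n  # silence here; real chunks carry audio

def write_stream(path, chunks, rate=24000):
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        for pcm in chunks:        # each chunk is written (and could be
            out.writeframes(pcm)  # played) before synthesis finishes
```

In a real pipeline each chunk would go straight to the sound device; buffering only one chunk keeps time-to-first-audio near chunk/rate ≈ 0.2 s at these settings.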

✅ Todo & Status

  • Publish safetensor artifacts
  • Release GGUF Q4 / Q8 models
  • Release datasets (1000h and 140h)
  • Enable streaming on GPU
  • Provide Dockerized setup
  • Release fine-tuning code (LoRA)
  • LoRA Adapter integration in Gradio

🌟 New Feature: LoRA Adapters

VieNeu-TTS now officially supports LoRA (Low-Rank Adaptation). This allows you to:

  • Use custom fine-tuned voices from Hugging Face.
  • Achieve much higher quality and similarity than zero-shot voice cloning.
  • Switch between different adapters seamlessly in the Gradio UI.

For more details, see docs/LORA_USAGE.md.


🛠️ Fine-tuning

You can now train VieNeu-TTS on your own voice dataset!

  • Simple Workflow: Follow the step-by-step guide in finetune/README.md.
  • Notebook Support: Use finetune/finetune_VieNeu-TTS.ipynb for an interactive experience.

🏁 Getting Started

1. Clone the repository

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS

2. Install eSpeak NG (Required)

Phonemizer requires eSpeak NG to function.

  • Windows: Download installer from eSpeak NG Releases (Recommended: .msi).
  • macOS: brew install espeak
  • Ubuntu/Debian: sudo apt install espeak-ng
  • Arch Linux: paru -S aur/espeak-ng

3. Environment Setup (Choose ONE method)

Method 1: Standard with uv (Recommended)

This is the fastest and most reliable way to manage dependencies.

A. Install uv (If you haven't already):

  • Windows: powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
  • Linux/macOS: curl -LsSf https://astral.sh/uv/install.sh | sh

B. Choose your hardware:

Option A: For GPU Users (NVIDIA 30xx/40xx/50xx)

Important

Update your NVIDIA drivers and install the CUDA Toolkit. This project uses CUDA 12.8; make sure your NVIDIA driver is up to date (supports CUDA 12.8 or newer) to avoid compatibility issues, especially on the RTX 30 series.

To use lmdeploy, you MUST install the NVIDIA GPU Computing Toolkit: https://developer.nvidia.com/cuda-downloads.

uv sync

Option B: For CPU-only Users

  1. Switch to CPU configuration:
    # Windows:
    ren pyproject.toml pyproject.toml.bak
    copy pyproject.toml.cpu pyproject.toml
    
    # Linux/macOS:
    mv pyproject.toml pyproject.toml.bak
    cp pyproject.toml.cpu pyproject.toml
  2. Install dependencies:
    uv sync

C. Run the Application:

uv run gradio_app.py

Then access the Web UI at http://127.0.0.1:7860.


Method 2: Automatic with Makefile (Alternative)

Best if you have make installed (standard on Linux/macOS, or via Git Bash on Windows). It handles configuration swaps automatically.

  • Setup GPU: make setup-gpu
  • Setup CPU: make setup-cpu
  • Run Demo: make demo

Then access the Web UI at http://127.0.0.1:7860.



🐋 Docker Deployment

For a quick start or production deployment without manually installing dependencies, use Docker.

Quick Start

Copy .env.example to .env

cp .env.example .env

Build and start container

# Run with CPU
docker compose --profile cpu up

# Run with GPU (requires NVIDIA Container Toolkit)
docker compose --profile gpu up

Access the Web UI at http://localhost:7860.

For detailed deployment instructions, including production setup, see docs/Deploy.md.


📦 Project Structure

VieNeu-TTS/
├── vieneu_tts/            # Core engine implementation (VieNeuTTS & FastVieNeuTTS)
├── finetune/              # LoRA training pipeline
│   ├── configs/           # Training & LoRA configurations
│   ├── data_scripts/      # Data filtering & VQ encoding tools
│   ├── dataset/           # Training data storage
│   ├── output/            # Saved checkpoints & LoRA adapters
│   └── train.py           # Main training script
├── utils/                 # Text normalization and phonemization logic
├── sample/                # Built-in reference voices (audio + transcript + codes)
├── docs/                  # Detailed documentation for LoRA, Deployment, and Docker
├── examples/              # Usage examples and testing audio references
├── gradio_app.py          # Modern Web UI with LoRA & Streaming support
├── config.yaml            # Model, Codec, and Voice registry
├── pyproject.toml         # Dependency management (UV/PIP)
├── Makefile               # Shortcuts for setup and execution
└── docker-compose.yml     # Docker orchestration for CPU/GPU modes


📄 License

  • VieNeu-TTS (0.5B): Original terms (Apache 2.0).
  • VieNeu-TTS-0.3B: Released under CC BY-NC 4.0 (Non-Commercial).
    • This version is currently experimental.
    • Commercial use is prohibited without authorization. Please contact the author for commercial licensing.

📑 Citation

@misc{vieneutts2026,
  title        = {VieNeu-TTS: Vietnamese Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS}}
}

🤝 Contributing

Contributions are welcome!

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m "Add amazing feature"
  4. Push the branch: git push origin feature/amazing-feature
  5. Open a pull request


🙏 Acknowledgements

This project builds upon NeuTTS Air for the original 0.5B model. The 0.3B version is a custom architecture trained from scratch using the VieNeu-TTS-1000h dataset.


Made with ❤️ for the Vietnamese TTS community
