This project provides a simple and efficient Text-to-Speech (TTS) service for multiple languages, utilizing Facebook's Massively Multilingual Speech (MMS) models. It features a lightweight FastAPI backend and a clean, easy-to-use web interface for speech generation.
- High-Quality TTS: Leverages state-of-the-art MMS models from Hugging Face.
- FastAPI Backend: Provides a robust and fast API for TTS synthesis.
- Simple Web UI: A static HTML/CSS/JS interface to interact with the TTS engine.
- Command-Line Model Management: Add and download models easily through a configuration file and a script.
- Dockerized: Includes a Dockerfile for easy deployment and scaling.
- CPU and GPU Support: Can run on both CPU and GPU.
```
.
├── api
│   ├── ...
├── models
│   └── mms-tts-khm
│       ├── config.json
│       ├── ...
│       └── vocab.json
├── static
│   ├── ...
├── templates
│   └── ...
├── Dockerfile
├── download_models.py
├── models.json
├── README.md
└── requirements.txt
```
- `api/`: Contains all backend source code.
- `models/`: The default directory where downloaded models are stored. Each model has its own subdirectory.
- `static/`: Contains the CSS and JavaScript files for the web UI.
- `templates/`: Contains the `index.html` for the web UI.
- `download_models.py`: A utility to pre-download models from the Hugging Face Hub.
- `models.json`: A configuration file listing the available models.
- `requirements.txt`: A list of Python dependencies.
- `Dockerfile`: For building the Docker image.
- Clone the repository:

  ```shell
  git clone https://github.com/your-username/mms-tts-fastapi.git
  cd mms-tts-fastapi
  ```

- Create and activate a virtual environment:

  ```shell
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```
Managing models is a two-step process:
- Add a model configuration: Define the model's properties.
- Download the model files: Download the actual model data from the Hugging Face Hub.
You can add a model configuration in two ways: using the command-line script or by manually editing models.json.
Use the add_model.py script to add a new model configuration.
Usage:
```shell
python add_model.py --id <model_id> --type <model_type> --huggingface-repo-id <repo_id> [--languages <lang1> <lang2> ...] [--voices <voice1> <voice2> ...]
```

Arguments:

- `--id`: A unique identifier for the model (e.g., `mms-fra`).
- `--type`: The model type (e.g., `vits`).
- `--huggingface-repo-id`: The repository ID on the Hugging Face Hub (e.g., `facebook/mms-tts-fra`).
- `--languages`: (Optional) A space-separated list of supported languages.
- `--voices`: (Optional) A space-separated list of available voices.
Example:
```shell
python add_model.py --id mms-fra --type vits --huggingface-repo-id facebook/mms-tts-fra --languages fra
```

You can also manually add a new JSON object to the `models.json` file.
Example models.json entry:
```json
{
  "id": "mms-fra",
  "type": "vits",
  "huggingface_repo_id": "facebook/mms-tts-fra",
  "languages": ["fra"],
  "voices": []
}
```

After adding the model configuration, download its files from the Hugging Face Hub using the `download_models.py` script:

```shell
python download_models.py --model_name <huggingface_repo_id>
```

Replace `<huggingface_repo_id>` with the `huggingface_repo_id` from the `models.json` entry (e.g., `facebook/mms-tts-fra`). The script will download the model into the `models/` directory.
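If you prefer to fetch model files from your own Python code rather than the script, the same result can be sketched with `huggingface_hub`'s `snapshot_download` (an assumption about how the script works internally; the `huggingface_hub` import is deferred so the snippet loads even without the package installed, and `model_target_dir` mirrors the `models/<name>` layout shown in the project tree above):

```python
from pathlib import Path


def model_target_dir(repo_id: str, models_dir: str = "models") -> Path:
    """Map a Hub repo id (e.g. facebook/mms-tts-fra) to its local directory."""
    return Path(models_dir) / repo_id.split("/")[-1]


def fetch_model(repo_id: str, models_dir: str = "models") -> Path:
    """Download a full model snapshot into the models/ directory."""
    # Deferred import: huggingface_hub is pulled in by the transformers stack.
    from huggingface_hub import snapshot_download

    target = model_target_dir(repo_id, models_dir)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

For example, `fetch_model("facebook/mms-tts-fra")` would place the files under `models/mms-tts-fra`, matching the layout of `models/mms-tts-khm` shown above.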
To start the application, run:
```shell
uvicorn api.main:app --host 0.0.0.0 --port 8000
```

Then, open your web browser and navigate to `http://localhost:8000/webui`.
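With the server running, the speech endpoint documented below can also be called programmatically. A minimal client sketch using only the Python standard library (the `/speech/generate` path and field names come from the API reference in this README; `mms-eng` is an example model id that must already be downloaded):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # adjust host/port if you changed them


def build_payload(text: str, model_id: str = "mms-eng", device: str = "cpu") -> dict:
    """Request body for /speech/generate, as documented in the API reference."""
    return {"model_id": model_id, "text": text, "device": device}


def synthesize(text: str, model_id: str = "mms-eng", device: str = "cpu") -> bytes:
    """POST to /speech/generate and return the raw WAV bytes."""
    data = json.dumps(build_payload(text, model_id, device)).encode()
    req = request.Request(
        BASE_URL + "/speech/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()


# Example (server must be running and the model downloaded):
# with open("output.wav", "wb") as f:
#     f.write(synthesize("Hello world"))
```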
The application exposes the following API endpoints:
- Description: Returns a list of all available models configured in `models.json` and their download status.
- Response: A JSON object containing a list of models.
Example Response:
```json
{
  "models": [
    {
      "id": "mms-khm",
      "type": "vits",
      "downloaded": true,
      "languages": ["khm"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-khm"
    },
    {
      "id": "mms-eng",
      "type": "vits",
      "downloaded": false,
      "languages": ["eng"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-eng"
    }
  ]
}
```

- Description: Generates audio from text using the specified model.
- Request Body: A JSON object with the following fields:
  - `model_id` (str): The ID of the model to use (must be downloaded).
  - `text` (str): The text to synthesize.
  - `device` (str): The device to use for inference (`cpu` or `cuda`).
  - `language` (str, optional): The language of the text.
  - `voice` (str, optional): The voice to use for synthesis.
- Response: A WAV audio file.
Example Request (using curl):
```shell
curl -X POST -H "Content-Type: application/json" \
  -d '{"model_id": "mms-eng", "text": "Hello world", "device": "cpu"}' \
  http://localhost:8000/speech/generate -o output.wav
```

"VITS" and other similar terms refer to different types of architectures for Text-to-Speech (TTS) models. Here's a breakdown of some of the most common ones:
- What it is: VITS is a modern, high-quality TTS architecture known for producing very natural and human-like speech.
- How it works: It's an end-to-end model, meaning it handles the entire process from text to audio in a single, unified network. It cleverly combines a few advanced techniques (like Variational Autoencoders and Generative Adversarial Networks) to learn the relationship between text and speech in a very sophisticated way.
- What it is: Tacotron 2 is another very popular and influential TTS architecture. For a long time, it was the standard for high-quality speech synthesis.
- How it works: It's a two-stage process. First, Tacotron 2 converts your input text into a mel-spectrogram, which is a visual representation of sound. Then, a separate model called a vocoder (like WaveNet or Griffin-Lim) takes that spectrogram and converts it into an actual audio waveform.
- What it is: As the name suggests, FastSpeech 2 is designed for speed. It can generate speech much faster than models like Tacotron 2.
- How it works: It's also a two-stage model that generates a spectrogram from text. However, it's "non-autoregressive," which means it can generate the entire spectrogram in parallel, rather than one piece at a time. This makes it incredibly fast and efficient, which is great for real-time applications.
| Model Type | Key Feature |
|---|---|
| VITS | All-in-one, highly natural speech |
| Tacotron 2 | Two-stage, high-quality standard |
| FastSpeech 2 | Two-stage, extremely fast |
In this project, you specify the model type (e.g., "vits") when adding a new model.
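For reference, this is roughly how an MMS VITS checkpoint is loaded and run with the `transformers` library (a sketch only; the project's actual inference code lives in `api/`, and the imports are deferred so the snippet reads without `torch` installed):

```python
def tts_waveform(text: str, repo_id: str = "facebook/mms-tts-eng"):
    """Return (waveform tensor, sampling rate) for `text` using an MMS VITS model."""
    # Deferred imports: requires the `torch` and `transformers` packages.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = VitsModel.from_pretrained(repo_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    return waveform.squeeze(0), model.config.sampling_rate
```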
To build and run the application with Docker, you can use the provided Dockerfile.
```shell
docker build -t mms-tts-cpu .
docker run -p 8000:8000 mms-tts-cpu
```

For GPU support, you need to have the NVIDIA Container Toolkit installed.
```shell
docker build \
  --build-arg BASE_IMAGE=nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 \
  --build-arg TORCH_INSTALL_CMD="pip install --no-cache-dir -r requirements.txt" \
  -t mms-tts-gpu .

docker run --gpus all -p 8000:8000 mms-tts-gpu
```