This project provides a simple and efficient Text-to-Speech (TTS) service for multiple languages, utilizing Facebook's Massively Multilingual Speech (MMS) models. It features a lightweight FastAPI backend and a clean, easy-to-use web interface for speech generation.
- High-Quality TTS: Leverages state-of-the-art MMS models from Hugging Face.
- FastAPI Backend: Provides a robust and fast API for TTS synthesis.
- Simple Web UI: A static HTML/CSS/JS interface to interact with the TTS engine.
- Command-Line Model Management: Add and download models easily through a configuration file and a script.
- Dockerized: Includes a Dockerfile for easy deployment and scaling.
- CPU and GPU Support: Can run on both CPU and GPU.
```
.
├── api
│   ├── ...
├── models
│   └── mms-tts-khm
│       ├── config.json
│       ├── ...
│       └── vocab.json
├── static
│   ├── ...
├── templates
│   └── ...
├── Dockerfile
├── download_models.py
├── models.json
├── README.md
└── requirements.txt
```
- `api/`: Contains all backend source code.
- `models/`: The default directory where downloaded models are stored. Each model has its own subdirectory.
- `static/`: Contains the CSS and JavaScript files for the web UI.
- `templates/`: Contains the `index.html` for the web UI.
- `download_models.py`: A utility to pre-download models from the Hugging Face Hub.
- `models.json`: A configuration file listing the available models.
- `requirements.txt`: A list of Python dependencies.
- `Dockerfile`: For building the Docker image.
- Clone the repository:

  ```shell
  git clone https://github.com/your-username/mms-tts-fastapi.git
  cd mms-tts-fastapi
  ```

- Create and activate a virtual environment:

  ```shell
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install the dependencies:

  ```shell
  pip install -r requirements.txt
  ```
Managing models is a two-step process:
- Add a model configuration: Define the model's properties.
- Download the model files: Download the actual model data from the Hugging Face Hub.
You can add a model configuration in two ways: using the command-line script or by manually editing models.json.
Use the add_model.py script to add a new model configuration.
Usage:
```shell
python add_model.py --id <model_id> --type <model_type> --huggingface-repo-id <repo_id> [--languages <lang1> <lang2> ...] [--voices <voice1> <voice2> ...]
```

Arguments:

- `--id`: A unique identifier for the model (e.g., `mms-fra`).
- `--type`: The model type (e.g., `vits`).
- `--huggingface-repo-id`: The repository ID on the Hugging Face Hub (e.g., `facebook/mms-tts-fra`).
- `--languages`: (Optional) A space-separated list of supported languages.
- `--voices`: (Optional) A space-separated list of available voices.
Example:
```shell
python add_model.py --id mms-fra --type vits --huggingface-repo-id facebook/mms-tts-fra --languages fra
```

You can also manually add a new JSON object to the `models.json` file.
Example models.json entry:
```json
{
  "id": "mms-fra",
  "type": "vits",
  "huggingface_repo_id": "facebook/mms-tts-fra",
  "languages": ["fra"],
  "voices": []
}
```

After adding the model configuration, download its files from the Hugging Face Hub using the `download_models.py` script:

```shell
python download_models.py --model_name <huggingface_repo_id>
```

Replace `<huggingface_repo_id>` with the `huggingface_repo_id` from the `models.json` entry (e.g., `facebook/mms-tts-fra`). The script will download the model into the `models/` directory.
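If you prefer to fetch model files from your own Python code rather than the script, the same result can be sketched with `huggingface_hub`'s `snapshot_download` (an assumption about how the script works internally; the `huggingface_hub` import is deferred so the snippet loads even without the package installed, and `model_target_dir` mirrors the `models/<name>` layout shown in the project tree above):

```python
from pathlib import Path


def model_target_dir(repo_id: str, models_dir: str = "models") -> Path:
    """Map a Hub repo id (e.g. facebook/mms-tts-fra) to its local directory."""
    return Path(models_dir) / repo_id.split("/")[-1]


def fetch_model(repo_id: str, models_dir: str = "models") -> Path:
    """Download a full model snapshot into the models/ directory."""
    # Deferred import: huggingface_hub is pulled in by the transformers stack.
    from huggingface_hub import snapshot_download

    target = model_target_dir(repo_id, models_dir)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

For example, `fetch_model("facebook/mms-tts-fra")` would place the files under `models/mms-tts-fra`, matching the layout of `models/mms-tts-khm` shown above.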
To start the application, run:
```shell
uvicorn api.main:app --host 0.0.0.0 --port 8000
```

Then, open your web browser and navigate to `http://localhost:8000/webui`.
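With the server running, the speech endpoint documented below can also be called programmatically. A minimal client sketch using only the Python standard library (the `/speech/generate` path and field names come from the API reference in this README; `mms-eng` is an example model id that must already be downloaded):

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # adjust host/port if you changed them


def build_payload(text: str, model_id: str = "mms-eng", device: str = "cpu") -> dict:
    """Request body for /speech/generate, as documented in the API reference."""
    return {"model_id": model_id, "text": text, "device": device}


def synthesize(text: str, model_id: str = "mms-eng", device: str = "cpu") -> bytes:
    """POST to /speech/generate and return the raw WAV bytes."""
    data = json.dumps(build_payload(text, model_id, device)).encode()
    req = request.Request(
        BASE_URL + "/speech/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return resp.read()


# Example (server must be running and the model downloaded):
# with open("output.wav", "wb") as f:
#     f.write(synthesize("Hello world"))
```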
The application exposes the following API endpoints:
- Description: Returns a list of all available models configured in `models.json` and their download status.
- Response: A JSON object containing a list of models.
Example Response:
```json
{
  "models": [
    {
      "id": "mms-khm",
      "type": "vits",
      "downloaded": true,
      "languages": ["khm"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-khm"
    },
    {
      "id": "mms-eng",
      "type": "vits",
      "downloaded": false,
      "languages": ["eng"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-eng"
    }
  ]
}
```

- Description: Generates audio from text using the specified model.
- Request Body: A JSON object with the following fields:
  - `model_id` (str): The ID of the model to use (must be downloaded).
  - `text` (str): The text to synthesize.
  - `device` (str): The device to use for inference (`cpu` or `cuda`).
  - `language` (str, optional): The language of the text.
  - `voice` (str, optional): The voice to use for synthesis.
- Response: A WAV audio file.
Example Request (using curl):
```shell
curl -X POST -H "Content-Type: application/json" \
  -d '{"model_id": "mms-eng", "text": "Hello world", "device": "cpu"}' \
  http://localhost:8000/speech/generate -o output.wav
```

"VITS" and other similar terms refer to different types of architectures for Text-to-Speech (TTS) models. Here's a breakdown of some of the most common ones:
- What it is: VITS is a modern, high-quality TTS architecture known for producing very natural and human-like speech.
- How it works: It's an end-to-end model, meaning it handles the entire process from text to audio in a single, unified network. It cleverly combines a few advanced techniques (like Variational Autoencoders and Generative Adversarial Networks) to learn the relationship between text and speech in a very sophisticated way.
- What it is: Tacotron 2 is another very popular and influential TTS architecture. For a long time, it was the standard for high-quality speech synthesis.
- How it works: It's a two-stage process. First, Tacotron 2 converts your input text into a mel-spectrogram, which is a visual representation of sound. Then, a separate model called a vocoder (like WaveNet or Griffin-Lim) takes that spectrogram and converts it into an actual audio waveform.
- What it is: As the name suggests, FastSpeech 2 is designed for speed. It can generate speech much faster than models like Tacotron 2.
- How it works: It's also a two-stage model that generates a spectrogram from text. However, it's "non-autoregressive," which means it can generate the entire spectrogram in parallel, rather than one piece at a time. This makes it incredibly fast and efficient, which is great for real-time applications.
| Model Type | Key Feature |
|---|---|
| VITS | All-in-one, highly natural speech |
| Tacotron 2 | Two-stage, high-quality standard |
| FastSpeech 2 | Two-stage, extremely fast |
In this project, you specify the model type (e.g., "vits") when adding a new model.
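For reference, this is roughly how an MMS VITS checkpoint is loaded and run with the `transformers` library (a sketch only; the project's actual inference code lives in `api/`, and the imports are deferred so the snippet reads without `torch` installed):

```python
def tts_waveform(text: str, repo_id: str = "facebook/mms-tts-eng"):
    """Return (waveform tensor, sampling rate) for `text` using an MMS VITS model."""
    # Deferred imports: requires the `torch` and `transformers` packages.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = VitsModel.from_pretrained(repo_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    return waveform.squeeze(0), model.config.sampling_rate
```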
To build and run the application with Docker, you can use the provided Dockerfile.
```shell
docker build -t mms-tts-cpu .
docker run -p 8000:8000 mms-tts-cpu
```

For GPU support, you need to have the NVIDIA Container Toolkit installed.
```shell
docker build \
  --build-arg BASE_IMAGE=nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 \
  --build-arg TORCH_INSTALL_CMD="pip install --no-cache-dir -r requirements.txt" \
  -t mms-tts-gpu .

docker run --gpus all -p 8000:8000 mms-tts-gpu
```