MMS-TTS FastAPI Wrapper

This project provides a simple and efficient Text-to-Speech (TTS) service for multiple languages, utilizing Facebook's Massively Multilingual Speech (MMS) models. It features a lightweight FastAPI backend and a clean, easy-to-use web interface for speech generation.

Features

  • High-Quality TTS: Leverages state-of-the-art MMS models from Hugging Face.
  • FastAPI Backend: Provides a robust and fast API for TTS synthesis.
  • Simple Web UI: A static HTML/CSS/JS interface to interact with the TTS engine.
  • Command-Line Model Management: Add and download models easily through a configuration file and a script.
  • Dockerized: Includes a Dockerfile for easy deployment and scaling.
  • CPU and GPU Support: Runs inference on either CPU or a CUDA-capable GPU.

Project Structure

.
├── api
│   ├── ...
├── models
│   └── mms-tts-khm
│       ├── config.json
│       ├── ...
│       └── vocab.json
├── static
│   ├── ...
├── templates
│   └── ...
├── Dockerfile
├── download_models.py
├── models.json
├── README.md
└── requirements.txt
  • api/: Contains all backend source code.
  • models/: The default directory where downloaded models are stored. Each model has its own subdirectory.
  • static/: Contains the CSS and JavaScript files for the web UI.
  • templates/: Contains the index.html for the web UI.
  • download_models.py: A utility to pre-download models from the Hugging Face Hub.
  • models.json: A configuration file listing the available models.
  • requirements.txt: A list of Python dependencies.
  • Dockerfile: For building the Docker image.

Installation

  1. Clone the repository:

    git clone https://github.com/Aishete/mms-tts-khm-fastapi.git
    cd mms-tts-khm-fastapi
  2. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate
  3. Install the dependencies:

    pip install -r requirements.txt

Model Management

Managing models is a two-step process:

  1. Add a model configuration: Define the model's properties.
  2. Download the model files: Download the actual model data from the Hugging Face Hub.

1. Add a Model Configuration

You can add a model configuration in two ways: using the command-line script or by manually editing models.json.

Option A: Using the CLI (Recommended)

Use the add_model.py script to add a new model configuration.

Usage:

python add_model.py --id <model_id> --type <model_type> --huggingface-repo-id <repo_id> [--languages <lang1> <lang2> ...] [--voices <voice1> <voice2> ...]

Arguments:

  • --id: A unique identifier for the model (e.g., mms-fra).
  • --type: The model type (e.g., vits).
  • --huggingface-repo-id: The repository ID on the Hugging Face Hub (e.g., facebook/mms-tts-fra).
  • --languages: (Optional) A space-separated list of supported languages.
  • --voices: (Optional) A space-separated list of available voices.

Example:

python add_model.py --id mms-fra --type vits --huggingface-repo-id facebook/mms-tts-fra --languages fra
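Under the hood, the script only needs to append an entry to models.json. A minimal sketch of what add_model.py might look like follows; it is hypothetical (the actual script may differ) and assumes models.json holds a top-level JSON array of model entries:

import argparse
import json

# Hypothetical, minimal re-implementation of add_model.py.
# Assumes models.json is a top-level JSON array of model entries.
parser = argparse.ArgumentParser(description="Add a model entry to models.json")
parser.add_argument("--id", required=True)
parser.add_argument("--type", required=True)
parser.add_argument("--huggingface-repo-id", required=True)
parser.add_argument("--languages", nargs="*", default=[])
parser.add_argument("--voices", nargs="*", default=[])
args = parser.parse_args()

with open("models.json") as f:
    models = json.load(f)

# Same keys as the manual-editing example in Option B below.
models.append({
    "id": args.id,
    "type": args.type,
    "huggingface_repo_id": args.huggingface_repo_id,
    "languages": args.languages,
    "voices": args.voices,
})

with open("models.json", "w") as f:
    json.dump(models, f, indent=2)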

Option B: Manual Editing

You can also manually add a new JSON object to the models.json file.

Example models.json entry:

{
  "id": "mms-fra",
  "type": "vits",
  "huggingface_repo_id": "facebook/mms-tts-fra",
  "languages": ["fra"],
  "voices": []
}

2. Download the Model Files

After adding the model configuration, download its files from the Hugging Face Hub using the download_models.py script.

python download_models.py --model_name <huggingface_repo_id>

Replace <huggingface_repo_id> with the huggingface_repo_id from the models.json entry (e.g., facebook/mms-tts-fra). The script will download the model into the models/ directory.
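If you want to script the download yourself, the same result can be achieved with the huggingface_hub library. This is a sketch, not the actual contents of download_models.py:

from huggingface_hub import snapshot_download

# Fetch every file in the repo into models/<repo name>,
# e.g. facebook/mms-tts-fra -> models/mms-tts-fra.
repo_id = "facebook/mms-tts-fra"
local_dir = "models/" + repo_id.split("/")[-1]
snapshot_download(repo_id=repo_id, local_dir=local_dir)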

Usage

To start the application, run:

uvicorn api.main:app --host 0.0.0.0 --port 8000

Then, open your web browser and navigate to http://localhost:8000/webui.
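Equivalently, the server can be started from Python via uvicorn's programmatic API:

import uvicorn

if __name__ == "__main__":
    # Same as the uvicorn CLI invocation above.
    uvicorn.run("api.main:app", host="0.0.0.0", port=8000)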

API Documentation

The application exposes the following API endpoints:

GET /models

  • Description: Returns a list of all available models configured in models.json and their download status.
  • Response: A JSON object containing a list of models.

Example Response:

{
  "models": [
    {
      "id": "mms-khm",
      "type": "vits",
      "downloaded": true,
      "languages": ["khm"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-khm"
    },
    {
      "id": "mms-eng",
      "type": "vits",
      "downloaded": false,
      "languages": ["eng"],
      "voices": ["default"],
      "huggingface_repo_id": "facebook/mms-tts-eng"
    }
  ]
}
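For example, to query the endpoint from Python with the requests library and list only the models that are ready to use:

import requests

resp = requests.get("http://localhost:8000/models")
resp.raise_for_status()

# Keep only models whose files are already on disk.
ready = [m["id"] for m in resp.json()["models"] if m["downloaded"]]
print("Downloaded models:", ready)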

POST /speech/generate

  • Description: Generates audio from text using the specified model.
  • Request Body: A JSON object with the following fields:
    • model_id (str): The ID of the model to use (must be downloaded).
    • text (str): The text to synthesize.
    • device (str): The device to use for inference (cpu or cuda).
    • language (str, optional): The language of the text.
    • voice (str, optional): The voice to use for synthesis.
  • Response: A WAV audio file.

Example Request (using curl):

curl -X POST -H "Content-Type: application/json" \
-d '{"model_id": "mms-eng", "text": "Hello world", "device": "cpu"}' \
http://localhost:8000/speech/generate -o output.wav
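The same request from Python, writing the returned WAV bytes to disk:

import requests

payload = {"model_id": "mms-eng", "text": "Hello world", "device": "cpu"}
resp = requests.post("http://localhost:8000/speech/generate", json=payload)
resp.raise_for_status()

# The endpoint responds with raw WAV audio.
with open("output.wav", "wb") as f:
    f.write(resp.content)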

A Note on TTS Model Architectures

"VITS" and other similar terms refer to different types of architectures for Text-to-Speech (TTS) models. Here's a breakdown of some of the most common ones:

1. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)

  • What it is: VITS is a modern, high-quality TTS architecture known for producing very natural and human-like speech.
  • How it works: It's an end-to-end model, meaning it handles the entire process from text to audio in a single, unified network. It combines several techniques (a variational autoencoder, adversarial training, and normalizing flows) to learn the mapping from text to speech directly.
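For reference, this is roughly what bare VITS inference looks like with the Hugging Face transformers library, using the MMS Khmer model this project wraps (a standalone sketch, not this project's internal code):

import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-khm")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-khm")

inputs = tokenizer("some input text", return_tensors="pt")
with torch.no_grad():
    # End-to-end: token IDs go straight to a waveform, no separate vocoder.
    waveform = model(**inputs).waveform

scipy.io.wavfile.write(
    "vits_output.wav",
    rate=model.config.sampling_rate,
    data=waveform.squeeze().numpy(),
)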

2. Tacotron 2

  • What it is: Tacotron 2 is another very popular and influential TTS architecture. For a long time, it was the standard for high-quality speech synthesis.
  • How it works: It's a two-stage process. First, Tacotron 2 converts your input text into a mel-spectrogram, a time-frequency representation of the audio. Then, a separate vocoder (a neural model like WaveNet, or a classical signal-processing algorithm like Griffin-Lim) converts that spectrogram into an actual audio waveform.

3. FastSpeech 2

  • What it is: As the name suggests, FastSpeech 2 is designed for speed. It can generate speech much faster than models like Tacotron 2.
  • How it works: It's also a two-stage model that generates a spectrogram from text. However, it's "non-autoregressive," which means it can generate the entire spectrogram in parallel, rather than one piece at a time. This makes it incredibly fast and efficient, which is great for real-time applications.

Summary Table

Model Type     Key Feature
VITS           All-in-one, highly natural speech
Tacotron 2     Two-stage, high-quality standard
FastSpeech 2   Two-stage, extremely fast

In this project, you specify the model type (e.g., "vits") when adding a new model.

Docker

To build and run the application with Docker, you can use the provided Dockerfile.

CPU Build

docker build -t mms-tts-cpu .
docker run -p 8000:8000 mms-tts-cpu

GPU Build

For GPU support, you need to have the NVIDIA Container Toolkit installed.

docker build \
    --build-arg BASE_IMAGE=nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04 \
    --build-arg TORCH_INSTALL_CMD="pip install --no-cache-dir -r requirements.txt" \
    -t mms-tts-gpu .

docker run --gpus all -p 8000:8000 mms-tts-gpu
