🤝 This Blueprint was a result of an EleutherAI <> mozilla.ai collaboration, as part of their work on Open Datasets for LLM Training.
Transcribe audio files using Speaches
This tutorial will guide you through setting up and using Speaches to transcribe audio files from the command line or a locally hosted demo UI. Speaches is an OpenAI API-compatible server that provides streaming transcription, translation, and speech generation capabilities.
A ready-to-use demo of Speaches is hosted on Hugging Face Spaces. Note that the hardware behind that demo is limited, so transcription speed may be underwhelming for the larger Whisper models.
Using Docker to run Speaches
We can use pre-built Speaches images to bypass manual installation. This is the recommended way to run Speaches, as it simplifies setup and ensures you have all the necessary dependencies. Note that you can choose between the CPU and GPU images depending on your hardware.
Create a volume to store the downloaded models (so they persist even if you restart the container):
sudo docker volume create hf-hub-cache
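If you want to confirm the volume was created (and see where Docker keeps it on the host), you can inspect it:
# Optional: show the volume's mountpoint on the host
sudo docker volume inspect hf-hub-cache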
To run the GPU (CUDA) version:
sudo docker run \
--rm \
--detach \
--publish 8000:8000 \
--name speaches \
--volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub \
--gpus=all \
ghcr.io/speaches-ai/speaches:latest-cuda
Or, to run the CPU-only version instead:
sudo docker run \
--rm \
--detach \
--publish 8000:8000 \
--name speaches \
--volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub \
ghcr.io/speaches-ai/speaches:latest-cpu
This will pull the necessary Docker image and start the Speaches server in the background. The server will be accessible at http://localhost:8000.
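To verify the server is reachable, you can query its OpenAI-compatible model listing endpoint. This is just a quick sanity check and assumes the default port mapping shown above:
# List the models the server currently exposes
curl http://localhost:8000/v1/models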
When you're done, you can stop the Speaches server with:
sudo docker stop speaches
This will stop the container, but thanks to the volume we created, the downloaded models will be preserved for future use.
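If you later want to reclaim the disk space used by the cached models, you can also remove the volume (the models will simply be re-downloaded the next time they are needed):
sudo docker volume rm hf-hub-cache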
Speaches is designed to be compatible with the OpenAI API, so we can use the OpenAI CLI as our Speaches client for more control over the output. Install the OpenAI command-line interface to interact with the Speaches server:
pip install openai
Even though Speaches doesn't require an API key, the OpenAI client does. Configure these environment variables:
# Set the base URL to your local Speaches server
export OPENAI_BASE_URL=http://localhost:8000/v1/
# Use any non-empty string as the API key
export OPENAI_API_KEY="cant-be-empty"
Note: The API key doesn't need to be a valid OpenAI key, but it cannot be empty due to the OpenAI client requirements.
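As a quick check that the client is talking to your local Speaches server rather than OpenAI, you can try listing the models it exposes (this assumes the openai version you installed ships the models.list subcommand):
openai api models.list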
Now you're ready to transcribe your first audio file. Make sure you have an audio file ready (we'll use sample.mp3 in this example).
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format text > sample.txt
This command will:
- Connect to your local Speaches server
- Use the Systran/faster-whisper-medium model for transcription
- Transcribe the sample.mp3 file
- Return the result as plain text
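Once the command finishes, the transcript is in sample.txt and you can inspect it with ordinary shell tools:
# Preview the start of the transcript and count its words
head -c 300 sample.txt
wc -w sample.txt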
Speaches supports various options for transcription. Here are some useful ones:
- Choosing a Different Model:
Speaches supports various Whisper models. For higher accuracy (but slower transcription):
openai api audio.transcriptions.create -m Systran/faster-whisper-large-v3 -f sample.mp3 --response-format text > sample.txt
For faster transcription (but potentially lower accuracy):
openai api audio.transcriptions.create -m Systran/faster-whisper-tiny -f sample.mp3 --response-format text > sample.txt
- Getting JSON Output with Timestamps (see the jq sketch after these examples):
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format verbose_json > sample.json
- Specifying the Language (can improve accuracy):
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample_fr.mp3 --language fr --response-format text > sample_fr.txt
- Creating SRT Subtitle Files:
SRT (SubRip Text) files are commonly used for subtitles in videos. Speaches can generate these directly from your audio files:
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format srt > sample.srt
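If you generated verbose_json output as shown above, you can post-process it with a tool such as jq. The following is a minimal sketch that assumes jq is installed and that the output follows the OpenAI-style schema, with a top-level segments array whose entries carry start, end, and text fields:
# Print one line per segment: "start --> end: text"
jq -r '.segments[] | "\(.start) --> \(.end): \(.text)"' sample.json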
Speaches can use any compatible fine-tuned Whisper model from Hugging Face that has been converted to the CTranslate2 format: https://huggingface.co/models?other=ctranslate2.
For example, the smcproject/vegam-whisper-medium-ml model works particularly well for Malayalam audio because it was fine-tuned on Mozilla's Common Voice Malayalam subset:
openai api audio.transcriptions.create -m smcproject/vegam-whisper-medium-ml -f sample_ml.mp3 --language ml --response-format text > sample_ml.txt
For other languages or specific domains, you can search for fine-tuned Whisper models (faster-whisper variant!) on Hugging Face and use them in the same way.
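If you have a whole directory of recordings, you can wrap the same command in a small shell loop. A minimal sketch, assuming every file is an .mp3 and you want a plain-text transcript written next to each one:
# Transcribe every .mp3 in the current directory into a matching .txt file
for f in *.mp3; do
  openai api audio.transcriptions.create \
    -m Systran/faster-whisper-medium \
    -f "$f" \
    --response-format text > "${f%.mp3}.txt"
done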
Troubleshooting
If the transcription quality is poor:
- Ensure your audio file is in a supported format
- Check that your audio file actually contains speech
- Try a different model (e.g., switch from a language-specific model to a general one)
If you can't connect to the server:
- Verify the Speaches container is running with sudo docker ps
- Check the logs with sudo docker logs speaches
- Make sure your OPENAI_BASE_URL is correct
If the first transcription seems slow:
- The first time you use a model, Speaches will download it automatically
- This might take some time depending on your internet connection and the model size
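You can watch the download progress in the container logs while you wait:
# Stream the Speaches container logs (Ctrl+C to stop following)
sudo docker logs --follow speaches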
Once you're comfortable with basic transcription, you might want to explore other features of Speaches:
- Translation: Translating audio from one language to English
- Text-to-Speech: Converting text to spoken audio
- Voice Chat: Creating audio-based conversations with LLMs
Check out the Speaches documentation for more information on these advanced features.
Pre-requisites
- An audio file in a supported format: mp3, mp4, mpeg, mpga, m4a, wav, or webm
- System requirements:
  - OS: Linux, macOS, Windows (WSL)
  - Python 3.10 or higher
  - Docker
  - Minimum RAM: 16GB
  - Disk space: 40GB
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.