🤝 This Blueprint was a result of an EleutherAI <> mozilla.ai collaboration, as part of their work on Open Datasets for LLM Training.
Transcribe audio files using Speaches
This tutorial will guide you through setting up and using Speaches to transcribe audio files from the command line or a locally hosted demo UI. Speaches is an OpenAI API-compatible server that provides streaming transcription, translation, and speech generation capabilities.
A ready-to-use demo of Speaches is hosted on Hugging Face Spaces. Note that the hardware behind that demo is limited, so transcription speed may be underwhelming for the larger Whisper models.
Using Docker to run Speaches
We can use pre-built Speaches images to bypass manual installation. This is the recommended way to run Speaches, as it simplifies setup and ensures you have all the necessary dependencies. Note that you can choose between the CPU and GPU images depending on your hardware.
Create a volume to store the downloaded models (so they persist even if you restart the container):
sudo docker volume create hf-hub-cache
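If you want to confirm the volume was created (and see where Docker keeps it on the host), you can inspect it:
# Optional: show the volume's mountpoint on the host
sudo docker volume inspect hf-hub-cache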
To run the GPU (CUDA) version:
sudo docker run \
--rm \
--detach \
--publish 8000:8000 \
--name speaches \
--volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub \
--gpus=all \
ghcr.io/speaches-ai/speaches:latest-cuda
Or, to run the CPU-only version instead:
sudo docker run \
--rm \
--detach \
--publish 8000:8000 \
--name speaches \
--volume hf-hub-cache:/home/ubuntu/.cache/huggingface/hub \
ghcr.io/speaches-ai/speaches:latest-cpu
This will pull the necessary Docker image and start the Speaches server in the background. The server will be accessible at http://localhost:8000.
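To verify the server is reachable, you can query its OpenAI-compatible model listing endpoint. This is just a quick sanity check and assumes the default port mapping shown above:
# List the models the server currently exposes
curl http://localhost:8000/v1/models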
When you're done, you can stop the Speaches server with:
sudo docker stop speaches
This will stop the container, but thanks to the volume we created, the downloaded models will be preserved for future use.
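If you later want to reclaim the disk space used by the cached models, you can also remove the volume (the models will simply be re-downloaded the next time they are needed):
sudo docker volume rm hf-hub-cache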
Speaches is designed to be compatible with the OpenAI API, so we can use the OpenAI CLI as our Speaches client for more control over the output. Install the OpenAI command-line interface to interact with the Speaches server:
pip install openai
Even though Speaches doesn't require an API key, the OpenAI client does. Configure these environment variables:
# Set the base URL to your local Speaches server
export OPENAI_BASE_URL=http://localhost:8000/v1/
# Use any non-empty string as the API key
export OPENAI_API_KEY="cant-be-empty"
Note: The API key doesn't need to be a valid OpenAI key, but it cannot be empty due to the OpenAI client requirements.
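As a quick check that the client is talking to your local Speaches server rather than OpenAI, you can try listing the models it exposes (this assumes the openai version you installed ships the models.list subcommand):
openai api models.list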
Now you're ready to transcribe your first audio file. Make sure you have an audio file ready (we'll use sample.mp3 in this example).
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format text > sample.txt
This command will:
- Connect to your local Speaches server
- Use the Systran/faster-whisper-medium model for transcription
- Transcribe the sample.mp3 file
- Return the result as plain text
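Once the command finishes, the transcript is in sample.txt and you can inspect it with ordinary shell tools:
# Preview the start of the transcript and count its words
head -c 300 sample.txt
wc -w sample.txt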
Speaches supports various options for transcription. Here are some useful ones:
- Choosing a Different Model:
Speaches supports various Whisper models. For higher accuracy (but slower transcription):
openai api audio.transcriptions.create -m Systran/faster-whisper-large-v3 -f sample.mp3 --response-format text > sample.txt
For faster transcription (but potentially lower accuracy):
openai api audio.transcriptions.create -m Systran/faster-whisper-tiny -f sample.mp3 --response-format text > sample.txt
- Getting JSON Output with Timestamps (see the jq sketch after these examples):
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format verbose_json > sample.json
- Specifying the Language (can improve accuracy):
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample_fr.mp3 --language fr --response-format text > sample_fr.txt
- Creating SRT Subtitle Files:
SRT (SubRip Text) files are commonly used for subtitles in videos. Speaches can generate these directly from your audio files:
openai api audio.transcriptions.create -m Systran/faster-whisper-medium -f sample.mp3 --response-format srt > sample.srt
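If you generated verbose_json output as shown above, you can post-process it with a tool such as jq. The following is a minimal sketch that assumes jq is installed and that the output follows the OpenAI-style schema, with a top-level segments array whose entries carry start, end, and text fields:
# Print one line per segment: "start --> end: text"
jq -r '.segments[] | "\(.start) --> \(.end): \(.text)"' sample.json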
Speaches can use any compatible fine-tuned Whisper model from Hugging Face that has been converted to the CTranslate2 format: https://huggingface.co/models?other=ctranslate2.
For example, the smcproject/vegam-whisper-medium-ml model works particularly well for Malayalam audio because it was fine-tuned on Mozilla's Common Voice Malayalam subset:
openai api audio.transcriptions.create -m smcproject/vegam-whisper-medium-ml -f sample_ml.mp3 --language ml --response-format text > sample_ml.txt
For other languages or specific domains, you can search for fine-tuned Whisper models (faster-whisper variant!) on Hugging Face and use them in the same way.
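If you have a whole directory of recordings, you can wrap the same command in a small shell loop. A minimal sketch, assuming every file is an .mp3 and you want a plain-text transcript written next to each one:
# Transcribe every .mp3 in the current directory into a matching .txt file
for f in *.mp3; do
  openai api audio.transcriptions.create \
    -m Systran/faster-whisper-medium \
    -f "$f" \
    --response-format text > "${f%.mp3}.txt"
done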
Troubleshooting
If the transcription quality is poor:
- Ensure your audio file is in a supported format
- Check that your audio file actually contains speech
- Try a different model (e.g., switch from a language-specific model to a general one)
If you can't connect to the server:
- Verify the Speaches container is running with sudo docker ps
- Check the logs with sudo docker logs speaches
- Make sure your OPENAI_BASE_URL is correct
If the first transcription seems slow:
- The first time you use a model, Speaches will download it automatically
- This might take some time depending on your internet connection and the model size
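You can watch the download progress in the container logs while you wait:
# Stream the Speaches container logs (Ctrl+C to stop following)
sudo docker logs --follow speaches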
Once you're comfortable with basic transcription, you might want to explore other features of Speaches:
- Translation: Translating audio from one language to English
- Text-to-Speech: Converting text to spoken audio
- Voice Chat: Creating audio-based conversations with LLMs
Check out the Speaches documentation for more information on these advanced features.
Pre-requisites
- An audio file in a supported format: mp3, mp4, mpeg, mpga, m4a, wav, or webm
- System requirements:
  - OS: Linux, macOS, Windows (WSL)
  - Python 3.10 or higher
  - Docker
  - Minimum RAM: 16GB
  - Disk space: 40GB
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.