A Cog-based deployment of WhisperX for German speech-to-text transcription using the faster-whisper-large-v3-turbo model.
This repository packages WhisperX as a Replicate-compatible model, enabling easy deployment and inference via Cog. It uses:
- WhisperX (i4ds fork) for transcription with VAD (Voice Activity Detection)
- faster-whisper-large-v3-turbo model for fast, accurate German transcription
- Cog for containerization and deployment to Replicate
Prerequisites:

- Cog installed
- NVIDIA GPU with CUDA 12.1 support
- Docker
First, download the model to your Hugging Face cache. You can use the helper script:
```bash
python get_models.py
```

This downloads the model (i4ds/daily-brook-134) to your local Hugging Face cache.
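The helper script amounts to a cached snapshot download. A minimal sketch of what it likely does (the exact contents of get_models.py may differ; the use of `snapshot_download` from huggingface_hub is an assumption):

```python
# Sketch of a model-download helper; the real get_models.py may differ.
# Assumes the huggingface_hub package is installed.
MODEL_ID = "i4ds/daily-brook-134"  # model named in this README


def download_model(model_id: str = MODEL_ID) -> str:
    """Download the model snapshot into the local Hugging Face cache.

    Returns the local path of the cached snapshot.
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=model_id)


if __name__ == "__main__":
    print(download_model())
```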
Copy the cached model to the models/ directory:
```bash
./copy_models.sh
```

This creates the following structure:

```
models/
└── faster-whisper-large-v3-turbo/
    ├── config.json
    ├── tokenizer.json
    ├── vocabulary.json
    └── ...
```
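After copying, a quick sanity check that the expected files landed in place can catch a broken copy early. A minimal sketch (this helper is hypothetical, not part of the repository):

```python
from pathlib import Path

# Hypothetical sanity check: verify the copied model directory
# contains the files shown in the tree above.
REQUIRED_FILES = ("config.json", "tokenizer.json", "vocabulary.json")


def missing_files(model_dir: str) -> list[str]:
    """Return the names of required model files missing from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Running it against `models/faster-whisper-large-v3-turbo/` should return an empty list if the copy completed.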
Build the container:

```bash
cog build
```

Run a prediction:

```bash
cog predict -i audio_file=@your_audio.mp3
```

| Parameter | Description | Default |
|---|---|---|
| audio_file | Audio file to transcribe | (required) |
| language | Language (fixed to German) | de |
| batch_size | Parallelization for transcription | 8 |
| temperature | Sampling temperature | 0 |
| vad_onset | VAD onset threshold | 0.500 |
| vad_offset | VAD offset threshold | 0.363 |
| align_output | Enable word-level timestamps | False |
| debug | Print timing/memory info | True |
```bash
cog predict -i audio_file=@your_audio.mp3
```

The prediction returns:

- segments: Transcription in SRT subtitle format
- detected_language: The detected language code (e.g., de)
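Since segments is SRT-formatted text, downstream code can split it into timed entries. A minimal parser sketch (this follows the standard SRT block layout, not anything specific to this repository):

```python
import re


def parse_srt(srt_text: str) -> list[tuple[str, str, str]]:
    """Parse SRT text into (start, end, text) tuples."""
    entries = []
    # SRT blocks are separated by blank lines: an index line,
    # a "HH:MM:SS,mmm --> HH:MM:SS,mmm" line, then one or more text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        match = re.match(r"(\S+)\s+-->\s+(\S+)", lines[1])
        if match:
            entries.append((match.group(1), match.group(2), " ".join(lines[2:])))
    return entries
```

For example, `parse_srt("1\n00:00:00,000 --> 00:00:02,500\nGuten Tag.\n")` yields one entry with the start and end timestamps and the text "Guten Tag.".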
```
.
├── cog.yaml             # Cog configuration (CUDA, Python, dependencies)
├── predict.py           # Main prediction class for Cog
├── requirements.txt     # Python dependencies
├── copy_models.sh       # Script to copy model from HF cache
├── get_vad_model_url.py # Helper to download model
└── models/              # Local model directory
    └── faster-whisper-large-v3-turbo/
```
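predict.py follows Cog's Predictor convention: load the model once in setup(), then transcribe in predict(). A structural sketch under that assumption (parameter names mirror the table above; the repository's actual implementation may differ):

```python
# Structural sketch of a Cog predictor; the repository's predict.py may differ.
try:
    from cog import BasePredictor, Input, Path
except ImportError:  # allow reading the sketch without cog installed
    BasePredictor = object
    Path = str

    def Input(default=None, **kwargs):
        return default


class Predictor(BasePredictor):
    def setup(self):
        # Load the local faster-whisper model once per container start.
        import whisperx  # i4ds fork, per this README

        self.model = whisperx.load_model(
            "models/faster-whisper-large-v3-turbo",
            device="cuda",
            compute_type="float16",  # compute type noted in this README
            language="de",           # hardcoded to German
        )

    def predict(
        self,
        audio_file: Path = Input(description="Audio file to transcribe"),
        batch_size: int = Input(default=8),
    ) -> dict:
        import whisperx

        audio = whisperx.load_audio(str(audio_file))
        result = self.model.transcribe(audio, batch_size=batch_size)
        return {
            "segments": result["segments"],
            "detected_language": result["language"],
        }
```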
```bash
cog login
cog push r8.im/your-username/whisperx-german
```

- The model is hardcoded to German (de) transcription
- Uses float16 compute type for GPU efficiency
- VAD is enabled by default for better handling of speech segments