Whisper ASR Box is a comprehensive speech recognition toolkit that supports multiple ASR engines and advanced features. The service offers:
Core Features:
- Live transcription via WebSocket for real-time speech recognition
- Multiple ASR engines: OpenAI Whisper, Faster Whisper, WhisperX, and NbAiLab Whisper
- Optimized Norwegian support with NbAiLab models for best quality on Norwegian speech
- Multiple output formats: text, JSON, VTT, SRT, TSV with word-level timestamps
- Speaker diarization with WhisperX to distinguish between different speakers
- Voice Activity Detection (VAD) for noise filtering
- GPU acceleration for faster processing
- FFmpeg integration for broad audio and video format support
- REST API with Swagger documentation
- Web-based live player for real-time transcription in the browser
Whisper models are trained on large datasets of diverse audio and can perform multilingual speech recognition, speech translation, and language identification.
Current release (v1.9.0-dev) supports the following Whisper models:
- openai/whisper@v20240930
- SYSTRAN/faster-whisper@v1.1.0
- whisperX@v3.1.1
- NbAiLab Whisper via HuggingFace (e.g. NbAiLab/nb-whisper-large, NbAiLab/nb-whisper-small)
```bash
# Norwegian model (NbAiLab)
docker run -d -p 9000:9000 \
  -e ASR_MODEL=NbAiLab/nb-whisper-large \
  -e ASR_ENGINE=nbailab_whisper \
  sasund/whisper-asr-webservice:latest

# Standard OpenAI Whisper model
docker run -d -p 9000:9000 \
  -e ASR_MODEL=base \
  -e ASR_ENGINE=openai_whisper \
  sasund/whisper-asr-webservice:latest

# One-line equivalent for the Norwegian model
docker run -d -p 9000:9000 -e ASR_MODEL=NbAiLab/nb-whisper-large -e ASR_ENGINE=nbailab_whisper sasund/whisper-asr-webservice:latest
```

All supported models, including NbAiLab Whisper models, can run on GPU if you use a Docker image with GPU support and have the correct PyTorch installation.
```bash
# Norwegian model (NbAiLab) on GPU
docker run -d --gpus all -p 9000:9000 \
  -e ASR_MODEL=NbAiLab/nb-whisper-large \
  -e ASR_ENGINE=nbailab_whisper \
  sasund/whisper-asr-webservice:latest-gpu

# Standard OpenAI Whisper model on GPU
docker run -d --gpus all -p 9000:9000 \
  -e ASR_MODEL=base \
  -e ASR_ENGINE=openai_whisper \
  sasund/whisper-asr-webservice:latest-gpu
```

To reduce container startup time by avoiding repeated downloads, you can persist the cache directory:
```bash
# Persist the model cache between container restarts
docker run -d -p 9000:9000 \
  -v $PWD/cache:/root/.cache/ \
  sasund/whisper-asr-webservice:latest

# GPU variant with the Norwegian model and a persisted cache omitted, one-line form
docker run -d --gpus all -p 9000:9000 -e ASR_MODEL=NbAiLab/nb-whisper-large -e ASR_ENGINE=nbailab_whisper sasund/whisper-asr-webservice:latest-gpu
```

The project follows a modular architecture:

- app/asr_models/ - ASR engine implementations
- app/factory/ - Factory pattern for ASR models (see the sketch after this list)
- app/services/ - Business logic layer
- app/websockets/ - WebSocket handlers
- app/output/ - Output formatters
- app/exceptions/ - Custom exceptions
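The app/factory/ layer selects an engine implementation from configuration. As a rough, hypothetical sketch (the class names and the create_asr_engine function are invented for illustration and are not the project's actual API), a factory of this kind maps ASR_ENGINE values to engine classes:

```python
# Hypothetical sketch of an ASR engine factory; names are illustrative only.
import os


class OpenAIWhisperASR:
    """Placeholder for the OpenAI Whisper engine wrapper."""


class NbAiLabWhisperASR:
    """Placeholder for the NbAiLab (HuggingFace) engine wrapper."""


# Map ASR_ENGINE values to engine classes.
ENGINES = {
    "openai_whisper": OpenAIWhisperASR,
    "nbailab_whisper": NbAiLabWhisperASR,
}


def create_asr_engine():
    # Read the engine name from the environment, defaulting to openai_whisper.
    engine_name = os.getenv("ASR_ENGINE", "openai_whisper")
    if engine_name not in ENGINES:
        raise ValueError(f"Unsupported ASR_ENGINE: {engine_name}")
    return ENGINES[engine_name]()
```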
```bash
# Install dev dependencies
poetry install --with dev

# Run all tests
pytest

# Run with coverage
pytest --cov=app tests/
```

When using ASR_ENGINE=nbailab_whisper, you have access to a wide range of Norwegian-optimized models:
- NbAiLab/nb-whisper-tiny - Fastest, smallest model
- NbAiLab/nb-whisper-base - Good balance of speed and accuracy
- NbAiLab/nb-whisper-small - Better accuracy than base
- NbAiLab/nb-whisper-medium - High accuracy, moderate speed
- NbAiLab/nb-whisper-large - Best accuracy, slower inference
- NbAiLab/nb-whisper-tiny-beta - Latest tiny model
- NbAiLab/nb-whisper-base-beta - Latest base model
- NbAiLab/nb-whisper-small-beta - Latest small model
- NbAiLab/nb-whisper-medium-beta - Latest medium model
- NbAiLab/nb-whisper-large-beta - Latest large model
- NbAiLab/nb-whisper-tiny-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-base-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-small-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-medium-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-large-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-tiny-semantic - Better context understanding
- NbAiLab/nb-whisper-base-semantic - Better context understanding
- NbAiLab/nb-whisper-small-semantic - Better context understanding
- NbAiLab/nb-whisper-medium-semantic - Better context understanding
- NbAiLab/nb-whisper-large-semantic - Better context understanding
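The NbAiLab checkpoints listed above are regular Hugging Face models, so they can also be loaded directly with the transformers pipeline outside the service. A minimal sketch (the model choice and audio path are placeholders):

```python
# Minimal sketch: run an NbAiLab Whisper checkpoint with the Hugging Face pipeline.
# Requires: pip install transformers torch (plus ffmpeg for decoding audio files).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="NbAiLab/nb-whisper-base",  # any of the checkpoints listed above
)

# "audio.wav" is a placeholder path; 16 kHz mono audio works well.
result = asr("audio.wav", return_timestamps=True)
print(result["text"])
```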
```bash
# For production use (recommended)
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-large

# For faster inference
export ASR_MODEL=NbAiLab/nb-whisper-base

# For latest beta version
export ASR_MODEL=NbAiLab/nb-whisper-large-beta

# For preserving exact pronunciation
export ASR_MODEL=NbAiLab/nb-whisper-large-verbatim
```

- Multiple ASR engines support (OpenAI Whisper, Faster Whisper, WhisperX, NbAiLab Whisper)
- Multiple output formats (text, JSON, VTT, SRT, TSV)
- Word-level timestamps support
- Voice activity detection (VAD) filtering
- Speaker diarization (with WhisperX)
- FFmpeg integration for broad audio/video format support
- GPU acceleration support
- Configurable model loading/unloading
- REST API with Swagger documentation
- Live transcription via WebSocket (new!)
- Optimized Norwegian language support with NbAiLab models
- Fixed NbAiLab Whisper implementation: Corrected HuggingFace pipeline usage for optimal Norwegian transcription
- Improved language detection: Enhanced Norwegian language detection with proper confidence scoring
- Fixed result formatting: Resolved compatibility issues with output writers for NbAiLab models
- Removed unsupported parameters: Cleaned up initial_prompt handling for HuggingFace compatibility
- Warning suppression: Eliminated transformers warnings for cleaner logs
- Better Norwegian transcription: NbAiLab models now provide significantly better quality for Norwegian speech
- Stable live transcription: Fixed WebSocket implementation for reliable real-time transcription
- Proper error handling: Improved error messages and exception handling
- Memory optimization: Better model loading and caching for HuggingFace models
```bash
# For best Norwegian transcription quality
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-large

# For faster processing with good quality
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-medium
```

The service now supports real-time transcription via WebSocket. This allows you to send audio chunks and receive transcription results in real time. The WebSocket endpoint is:
ws://localhost:9000/ws/live-transcribe
A demo client is included to test live transcription:
```bash
# Install websockets if not already installed
pip install websockets

# Run demo with a WAV file
python demo_live_transcribe.py path/to/your/audio.wav

# Run demo with an MP3 file (automatic conversion)
python demo_live_transcribe.py path/to/your/audio.mp3

# With language specification
python demo_live_transcribe.py audio.wav --language no

# Custom options
python demo_live_transcribe.py audio.wav --host localhost --port 9000 --chunk-duration 1.0 --language en
```

- Supported formats: MP3, WAV, and all formats supported by FFmpeg
- Automatic conversion: Non-WAV files are automatically converted to WAV format
- Recommended settings: 16 kHz, mono, 16-bit (applied automatically during conversion; a manual conversion sketch follows this list)
- FFmpeg required: For MP3 and other format support
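If you prefer to convert files yourself before streaming them, here is a small sketch (paths are placeholders) that shells out to the system ffmpeg binary to produce the recommended 16 kHz, mono, 16-bit WAV:

```python
# Sketch: convert any FFmpeg-readable file to 16 kHz, mono, 16-bit PCM WAV.
# Requires ffmpeg on PATH; input/output paths below are placeholders.
import subprocess


def to_wav_16k_mono(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",       # overwrite the output file if it exists
            "-i", src,            # input file (MP3, MP4, ...)
            "-ar", "16000",       # 16 kHz sample rate
            "-ac", "1",           # mono
            "-c:a", "pcm_s16le",  # 16-bit PCM
            dst,
        ],
        check=True,
    )


to_wav_16k_mono("input.mp3", "output.wav")
```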
You can specify the language for live transcription:
- Auto-detect (default): Let the model detect the language automatically
- Norwegian: --language no
- English: --language en
- Swedish: --language sv
- Danish: --language da
- And all other languages supported by Whisper
```python
import asyncio
import websockets


async def live_transcribe():
    # Read raw audio bytes to send ("audio.wav" is a placeholder;
    # 16 kHz, mono, 16-bit WAV is recommended)
    with open("audio.wav", "rb") as f:
        audio_chunk = f.read()

    # With language specification
    uri = "ws://localhost:9000/ws/live-transcribe?language=no"
    async with websockets.connect(uri) as websocket:
        # Send audio chunks
        await websocket.send(audio_chunk)
        # Receive transcription
        transcription = await websocket.recv()
        print(f"Transcription: {transcription}")


asyncio.run(live_transcribe())
```

A web-based live transcription player is included with Video.js integration:
http://localhost:9000/static/live_player.html
To play audio files with live transcription:
http://localhost:9000/static/audio_player.html
- Real-time microphone transcription via WebSocket
- Audio file playback with transcription via WebSocket
- Video.js player for media playback
- Language selection (Norwegian, English, Swedish, etc.)
- Live captions with timestamps
- Responsive design for desktop and mobile
- Audio processing with noise suppression and echo cancellation
- Drag & drop file upload for audio files
- Progress tracking for audio playback
- Start the whisper-asr-webservice
- Open http://localhost:9000/static/live_player.html in your browser
- Allow microphone access when prompted
- Click "Start Live Transcription" to begin
- Speak into your microphone and see real-time transcription
- Modern browser with WebSocket support
- Microphone access permission
- HTTPS required for microphone access in production
Key configuration options:
- ASR_ENGINE: Engine selection (openai_whisper, faster_whisper, whisperx, nbailab_whisper)
- ASR_MODEL: Model selection (tiny, base, small, medium, large-v3, etc.)
- ASR_MODEL_PATH: Custom path to store/load models
- ASR_DEVICE: Device selection (cuda, cpu)
- MODEL_IDLE_TIMEOUT: Timeout for model unloading
For complete documentation, visit: https://ahmetoner.github.io/whisper-asr-webservice
```bash
# Install poetry
pip3 install poetry

# Install dependencies
poetry install

# Run service
poetry run whisper-asr-webservice --host 0.0.0.0 --port 9000
```

After starting the service, visit http://localhost:9000 or http://0.0.0.0:9000 in your browser to access the Swagger UI documentation and try out the API endpoints.
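As a quick scripted alternative to the Swagger UI, here is a sketch of calling the transcription endpoint with Python requests. It assumes the upstream whisper-asr-webservice POST /asr route with an audio_file form field and output/language query parameters; verify the exact parameters in your build's Swagger UI.

```python
# Sketch: call the transcription endpoint with the requests library.
# Assumes the upstream /asr route, audio_file form field, and output/language
# query parameters; check the Swagger UI at http://localhost:9000 to confirm.
import requests

with open("audio.wav", "rb") as f:  # placeholder audio file
    response = requests.post(
        "http://localhost:9000/asr",
        params={"output": "json", "language": "no"},
        files={"audio_file": f},
    )

response.raise_for_status()
print(response.json())
```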