Whisper ASR Box is a comprehensive speech recognition toolkit that supports multiple ASR engines and advanced features. The service offers:
Core Features:
- Live transcription via WebSocket for real-time speech recognition
- Multiple ASR engines: OpenAI Whisper, Faster Whisper, WhisperX, and NbAiLab Whisper
- Optimized Norwegian support with NbAiLab models for best quality on Norwegian speech
- Multiple output formats: text, JSON, VTT, SRT, TSV with word-level timestamps
- Speaker diarization with WhisperX to distinguish between different speakers
- Voice Activity Detection (VAD) for noise filtering
- GPU acceleration for faster processing
- FFmpeg integration for broad audio and video format support
- REST API with Swagger documentation
- Web-based live player for real-time transcription in the browser
Whisper models are trained on large datasets of diverse audio and can perform multilingual speech recognition, speech translation, and language identification.
Current release (v1.9.0-dev) supports the following Whisper models:
- openai/whisper@v20240930
- SYSTRAN/faster-whisper@v1.1.0
- whisperX@v3.1.1
- NbAiLab Whisper via HuggingFace (e.g. NbAiLab/nb-whisper-large, NbAiLab/nb-whisper-small)
```bash
# Norwegian model (NbAiLab)
docker run -d -p 9000:9000 \
  -e ASR_MODEL=NbAiLab/nb-whisper-large \
  -e ASR_ENGINE=nbailab_whisper \
  sasund/whisper-asr-webservice:latest

# Standard OpenAI Whisper model
docker run -d -p 9000:9000 \
  -e ASR_MODEL=base \
  -e ASR_ENGINE=openai_whisper \
  sasund/whisper-asr-webservice:latest

# One-line equivalent for the Norwegian model
docker run -d -p 9000:9000 -e ASR_MODEL=NbAiLab/nb-whisper-large -e ASR_ENGINE=nbailab_whisper sasund/whisper-asr-webservice:latest
```

All supported models, including NbAiLab Whisper models, can run on GPU if you use a Docker image with GPU support and have the correct PyTorch installation.
```bash
# Norwegian model (NbAiLab) on GPU
docker run -d --gpus all -p 9000:9000 \
  -e ASR_MODEL=NbAiLab/nb-whisper-large \
  -e ASR_ENGINE=nbailab_whisper \
  sasund/whisper-asr-webservice:latest-gpu

# Standard OpenAI Whisper model on GPU
docker run -d --gpus all -p 9000:9000 \
  -e ASR_MODEL=base \
  -e ASR_ENGINE=openai_whisper \
  sasund/whisper-asr-webservice:latest-gpu
```

To reduce container startup time by avoiding repeated downloads, you can persist the cache directory:
```bash
# Persist the model cache between container restarts
docker run -d -p 9000:9000 \
  -v $PWD/cache:/root/.cache/ \
  sasund/whisper-asr-webservice:latest

# GPU variant with the Norwegian model and a persisted cache omitted, one-line form
docker run -d --gpus all -p 9000:9000 -e ASR_MODEL=NbAiLab/nb-whisper-large -e ASR_ENGINE=nbailab_whisper sasund/whisper-asr-webservice:latest-gpu
```

The project follows a modular architecture:

- app/asr_models/ - ASR engine implementations
- app/factory/ - Factory pattern for ASR models (see the sketch after this list)
- app/services/ - Business logic layer
- app/websockets/ - WebSocket handlers
- app/output/ - Output formatters
- app/exceptions/ - Custom exceptions
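The app/factory/ layer selects an engine implementation from configuration. As a rough, hypothetical sketch (the class names and the create_asr_engine function are invented for illustration and are not the project's actual API), a factory of this kind maps ASR_ENGINE values to engine classes:

```python
# Hypothetical sketch of an ASR engine factory; names are illustrative only.
import os


class OpenAIWhisperASR:
    """Placeholder for the OpenAI Whisper engine wrapper."""


class NbAiLabWhisperASR:
    """Placeholder for the NbAiLab (HuggingFace) engine wrapper."""


# Map ASR_ENGINE values to engine classes.
ENGINES = {
    "openai_whisper": OpenAIWhisperASR,
    "nbailab_whisper": NbAiLabWhisperASR,
}


def create_asr_engine():
    # Read the engine name from the environment, defaulting to openai_whisper.
    engine_name = os.getenv("ASR_ENGINE", "openai_whisper")
    if engine_name not in ENGINES:
        raise ValueError(f"Unsupported ASR_ENGINE: {engine_name}")
    return ENGINES[engine_name]()
```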
```bash
# Install dev dependencies
poetry install --with dev

# Run all tests
pytest

# Run with coverage
pytest --cov=app tests/
```

When using ASR_ENGINE=nbailab_whisper, you have access to a wide range of Norwegian-optimized models:
- NbAiLab/nb-whisper-tiny - Fastest, smallest model
- NbAiLab/nb-whisper-base - Good balance of speed and accuracy
- NbAiLab/nb-whisper-small - Better accuracy than base
- NbAiLab/nb-whisper-medium - High accuracy, moderate speed
- NbAiLab/nb-whisper-large - Best accuracy, slower inference
- NbAiLab/nb-whisper-tiny-beta - Latest tiny model
- NbAiLab/nb-whisper-base-beta - Latest base model
- NbAiLab/nb-whisper-small-beta - Latest small model
- NbAiLab/nb-whisper-medium-beta - Latest medium model
- NbAiLab/nb-whisper-large-beta - Latest large model
- NbAiLab/nb-whisper-tiny-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-base-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-small-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-medium-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-large-verbatim - Preserves pronunciation details
- NbAiLab/nb-whisper-tiny-semantic - Better context understanding
- NbAiLab/nb-whisper-base-semantic - Better context understanding
- NbAiLab/nb-whisper-small-semantic - Better context understanding
- NbAiLab/nb-whisper-medium-semantic - Better context understanding
- NbAiLab/nb-whisper-large-semantic - Better context understanding
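The NbAiLab checkpoints listed above are regular Hugging Face models, so they can also be loaded directly with the transformers pipeline outside the service. A minimal sketch (the model choice and audio path are placeholders):

```python
# Minimal sketch: run an NbAiLab Whisper checkpoint with the Hugging Face pipeline.
# Requires: pip install transformers torch (plus ffmpeg for decoding audio files).
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="NbAiLab/nb-whisper-base",  # any of the checkpoints listed above
)

# "audio.wav" is a placeholder path; 16 kHz mono audio works well.
result = asr("audio.wav", return_timestamps=True)
print(result["text"])
```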
```bash
# For production use (recommended)
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-large

# For faster inference
export ASR_MODEL=NbAiLab/nb-whisper-base

# For latest beta version
export ASR_MODEL=NbAiLab/nb-whisper-large-beta

# For preserving exact pronunciation
export ASR_MODEL=NbAiLab/nb-whisper-large-verbatim
```

- Multiple ASR engines support (OpenAI Whisper, Faster Whisper, WhisperX, NbAiLab Whisper)
- Multiple output formats (text, JSON, VTT, SRT, TSV)
- Word-level timestamps support
- Voice activity detection (VAD) filtering
- Speaker diarization (with WhisperX)
- FFmpeg integration for broad audio/video format support
- GPU acceleration support
- Configurable model loading/unloading
- REST API with Swagger documentation
- Live transcription via WebSocket (new!)
- Optimized Norwegian language support with NbAiLab models
- Fixed NbAiLab Whisper implementation: Corrected HuggingFace pipeline usage for optimal Norwegian transcription
- Improved language detection: Enhanced Norwegian language detection with proper confidence scoring
- Fixed result formatting: Resolved compatibility issues with output writers for NbAiLab models
- Removed unsupported parameters: Cleaned up initial_prompt handling for HuggingFace compatibility
- Warning suppression: Eliminated transformers warnings for cleaner logs
- Better Norwegian transcription: NbAiLab models now provide significantly better quality for Norwegian speech
- Stable live transcription: Fixed WebSocket implementation for reliable real-time transcription
- Proper error handling: Improved error messages and exception handling
- Memory optimization: Better model loading and caching for HuggingFace models
```bash
# For best Norwegian transcription quality
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-large

# For faster processing with good quality
export ASR_ENGINE=nbailab_whisper
export ASR_MODEL=NbAiLab/nb-whisper-medium
```

The service now supports real-time transcription via WebSocket. This allows you to send audio chunks and receive transcription results in real time. The WebSocket endpoint is:
ws://localhost:9000/ws/live-transcribe
A demo client is included to test live transcription:
```bash
# Install websockets if not already installed
pip install websockets

# Run demo with a WAV file
python demo_live_transcribe.py path/to/your/audio.wav

# Run demo with an MP3 file (automatic conversion)
python demo_live_transcribe.py path/to/your/audio.mp3

# With language specification
python demo_live_transcribe.py audio.wav --language no

# Custom options
python demo_live_transcribe.py audio.wav --host localhost --port 9000 --chunk-duration 1.0 --language en
```

- Supported formats: MP3, WAV, and all formats supported by FFmpeg
- Automatic conversion: Non-WAV files are automatically converted to WAV format
- Recommended settings: 16 kHz, mono, 16-bit (applied automatically during conversion; a manual conversion sketch follows this list)
- FFmpeg required: For MP3 and other format support
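If you prefer to convert files yourself before streaming them, here is a small sketch (paths are placeholders) that shells out to the system ffmpeg binary to produce the recommended 16 kHz, mono, 16-bit WAV:

```python
# Sketch: convert any FFmpeg-readable file to 16 kHz, mono, 16-bit PCM WAV.
# Requires ffmpeg on PATH; input/output paths below are placeholders.
import subprocess


def to_wav_16k_mono(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",       # overwrite the output file if it exists
            "-i", src,            # input file (MP3, MP4, ...)
            "-ar", "16000",       # 16 kHz sample rate
            "-ac", "1",           # mono
            "-c:a", "pcm_s16le",  # 16-bit PCM
            dst,
        ],
        check=True,
    )


to_wav_16k_mono("input.mp3", "output.wav")
```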
You can specify the language for live transcription:
- Auto-detect (default): Let the model detect the language automatically
- Norwegian: --language no
- English: --language en
- Swedish: --language sv
- Danish: --language da
- And all other languages supported by Whisper
```python
import asyncio
import websockets


async def live_transcribe():
    # Read raw audio bytes to send ("audio.wav" is a placeholder;
    # 16 kHz, mono, 16-bit WAV is recommended)
    with open("audio.wav", "rb") as f:
        audio_chunk = f.read()

    # With language specification
    uri = "ws://localhost:9000/ws/live-transcribe?language=no"
    async with websockets.connect(uri) as websocket:
        # Send audio chunks
        await websocket.send(audio_chunk)
        # Receive transcription
        transcription = await websocket.recv()
        print(f"Transcription: {transcription}")


asyncio.run(live_transcribe())
```

A web-based live transcription player is included with Video.js integration:
http://localhost:9000/static/live_player.html
To play audio files with live transcription:
http://localhost:9000/static/audio_player.html
- Real-time microphone transcription via WebSocket
- Audio file playback with transcription via WebSocket
- Video.js player for media playback
- Language selection (Norwegian, English, Swedish, etc.)
- Live captions with timestamps
- Responsive design for desktop and mobile
- Audio processing with noise suppression and echo cancellation
- Drag & drop file upload for audio files
- Progress tracking for audio playback
- Start the whisper-asr-webservice
- Open http://localhost:9000/static/live_player.html in your browser
- Allow microphone access when prompted
- Click "Start Live Transcription" to begin
- Speak into your microphone and see real-time transcription
- Modern browser with WebSocket support
- Microphone access permission
- HTTPS required for microphone access in production
Key configuration options:
- ASR_ENGINE: Engine selection (openai_whisper, faster_whisper, whisperx, nbailab_whisper)
- ASR_MODEL: Model selection (tiny, base, small, medium, large-v3, etc.)
- ASR_MODEL_PATH: Custom path to store/load models
- ASR_DEVICE: Device selection (cuda, cpu)
- MODEL_IDLE_TIMEOUT: Timeout for model unloading
For complete documentation, visit: https://ahmetoner.github.io/whisper-asr-webservice
```bash
# Install poetry
pip3 install poetry

# Install dependencies
poetry install

# Run service
poetry run whisper-asr-webservice --host 0.0.0.0 --port 9000
```

After starting the service, visit http://localhost:9000 or http://0.0.0.0:9000 in your browser to access the Swagger UI documentation and try out the API endpoints.
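As a quick scripted alternative to the Swagger UI, here is a sketch of calling the transcription endpoint with Python requests. It assumes the upstream whisper-asr-webservice POST /asr route with an audio_file form field and output/language query parameters; verify the exact parameters in your build's Swagger UI.

```python
# Sketch: call the transcription endpoint with the requests library.
# Assumes the upstream /asr route, audio_file form field, and output/language
# query parameters; check the Swagger UI at http://localhost:9000 to confirm.
import requests

with open("audio.wav", "rb") as f:  # placeholder audio file
    response = requests.post(
        "http://localhost:9000/asr",
        params={"output": "json", "language": "no"},
        files={"audio_file": f},
    )

response.raise_for_status()
print(response.json())
```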