- Integrated Silero Voice Activity Detection (VAD)
- Automatically detects and skips alert tones, beeps, and leading silence
- Preserves 150ms buffer before speech starts
- Improves transcription accuracy for radio dispatch and pager recordings
- No configuration required - works automatically
- `detect_speech_start_sec_silero()`: Analyzes 16kHz mono audio to find where speech starts
- Uses 250ms minimum speech duration with 0.5 confidence threshold
- Falls back gracefully if VAD fails
- ~2MB model downloaded once on first run
- `torchaudio>=2.0.0`: Required for Silero VAD audio processing
- Updated all dependencies to use version constraints for stability
- Default model changed from `large-v2` to `large-v3`
- Default host changed from `localhost` to `0.0.0.0` (listen on all interfaces)
- Fixed `whisper.env` to use correct CLI argument format
- Fixed Dockerfile CMD to reference `whisper_server.py` instead of `whisper.py`
- Corrected model parameter format in `whisper.env`
- Comprehensive installation guide added (`INSTALL.md`)
- Updated `README.md` with:
  - Silero VAD feature description
  - Improved installation instructions for all platforms
  - Health check endpoint documentation
  - Troubleshooting section
  - Docker setup notes
- Added version-pinned dependencies to `requirements.txt`
None - All changes are backward compatible
If upgrading from a previous version:
- Update dependencies: `pip install -r requirements.txt --upgrade`
- First transcription will download Silero VAD model (~2MB, one-time)
- Update `whisper.env` if using Docker (see new format)
- VAD processing adds ~100-200ms per transcription
- Total time saved by skipping tones/silence typically exceeds VAD overhead
- No additional memory requirements beyond initial model load
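
To make the trimming behaviour in these notes concrete, here is a minimal sketch of how a detected speech start could be turned into a trim offset with the 150ms buffer and a graceful fallback. It is not the code in `whisper_server.py`: the helper names `speech_start_sec` and `trim_leading_audio` are hypothetical, and the input is assumed to be a list of `{"start": ..., "end": ...}` dicts holding sample indices, which is the shape Silero VAD's timestamp output is generally understood to have.

```python
from typing import Dict, List, Sequence

SAMPLE_RATE = 16_000           # 16kHz mono, as stated in the notes
PRE_SPEECH_BUFFER_SEC = 0.150  # 150ms kept before detected speech


def speech_start_sec(timestamps: List[Dict[str, int]],
                     sample_rate: int = SAMPLE_RATE,
                     buffer_sec: float = PRE_SPEECH_BUFFER_SEC) -> float:
    """Return the offset in seconds at which transcription should begin.

    `timestamps` is assumed to hold sample indices of detected speech
    segments. An empty list is treated as "VAD found nothing or failed",
    so the function falls back to 0.0 and the whole clip is kept.
    """
    if not timestamps:
        return 0.0
    first_start = min(seg["start"] for seg in timestamps)
    # Step back by the buffer, but never before the start of the clip.
    return max(0.0, first_start / sample_rate - buffer_sec)


def trim_leading_audio(samples: Sequence,
                       timestamps: List[Dict[str, int]],
                       sample_rate: int = SAMPLE_RATE) -> Sequence:
    """Drop tones, beeps, and silence that precede the first speech segment."""
    offset = int(round(speech_start_sec(timestamps, sample_rate) * sample_rate))
    return samples[offset:]
```

Clamping at zero keeps short clips intact when speech begins within the first 150ms, and returning 0.0 for an empty result means a failed detection costs nothing beyond the VAD overhead itself.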