Problem
Long audio recordings (>30s) with natural speech pauses produce degenerate output: repeated commas, truncated transcription, or hallucinated text. This is closely related to #134.
Root Cause
In non-batched mode, faster-whisper concatenates all VAD speech segments into a single audio blob and processes it in 30-second windows (`WhisperModel.transcribe()`, line `audio = np.concatenate(audio_chunks, axis=0)`). With `condition_on_previous_text=True` (the faster-whisper default), degenerate output from one window is fed as context into the next, causing a cascade failure that corrupts all subsequent transcription.
This is a known faster-whisper issue (see SYSTRAN/faster-whisper#843), and it is unlikely to be fixed upstream soon. Speaches can mitigate this today by exposing the condition_on_previous_text parameter that faster-whisper already supports but Speaches does not pass through.
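The cascade mechanism can be illustrated with a toy model of windowed decoding (a sketch only: `decode_window`, `WINDOW_S`, and the degeneration rule are invented for illustration and are not faster-whisper internals):

```python
# Toy model of 30s-window decoding with prompt conditioning.
# All names here are illustrative; this is NOT faster-whisper code.

WINDOW_S = 30

def decode_window(window_idx: int, prompt: str) -> str:
    """Pretend decoder: window 1 degenerates on its own, and any window
    fed a degenerate (all-comma) prompt also degenerates (context poisoning)."""
    degenerate = (window_idx == 1) or (prompt != "" and prompt.strip(", ") == "")
    return ",,,," if degenerate else f"text of window {window_idx}."

def transcribe_windows(n_windows: int, condition_on_previous_text: bool) -> list[str]:
    prompt = ""
    out = []
    for i in range(n_windows):
        text = decode_window(i, prompt if condition_on_previous_text else "")
        out.append(text)
        prompt = text  # fed into the next window when conditioning is on
    return out

print(transcribe_windows(3, condition_on_previous_text=True))   # windows 1 AND 2 degenerate
print(transcribe_windows(3, condition_on_previous_text=False))  # only window 1 degenerates; window 2 recovers
```

With conditioning on, one bad window poisons every window after it; with conditioning off, the damage stays local.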
The problem is especially pronounced with:
- Technical speech containing numbers and domain terms (e.g., "14,4 kV")
- Recordings longer than 30 seconds with natural pauses
- German and other non-English languages
Reproduction
Using model TheTobyB/whisper-large-v3-turbo-german-ct2 with German audio:
- Record ~70 seconds of technical speech with natural pauses between sentences
- Send to the API:
```sh
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -F "file=@recording.wav" \
  -F "language=de" \
  -F "model=TheTobyB/whisper-large-v3-turbo-german-ct2"
```
- Transcription cuts off after ~30s, ending with degenerate output like ",,"
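For scripted reproduction, the same request can be sent with only the Python standard library (a sketch equivalent to the curl call above; the endpoint path and form fields come from that call, everything else is generic multipart assembly):

```python
# Reproduction sketch: POST a WAV to the speaches OpenAI-compatible STT
# endpoint using only the stdlib (no requests/openai dependency).
import io
import urllib.request
import uuid

API = "http://localhost:8000/v1/audio/transcriptions"

def build_multipart(fields: dict, file_field: str,
                    filename: str, file_bytes: bytes):
    """Assemble a multipart/form-data body by hand; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
                  f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    buf.write(f"--{boundary}\r\nContent-Disposition: form-data; "
              f'name="{file_field}"; filename="{filename}"\r\n'
              "Content-Type: audio/wav\r\n\r\n".encode())
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe_file(path: str) -> str:
    with open(path, "rb") as f:
        audio = f.read()
    body, ctype = build_multipart(
        {"language": "de", "model": "TheTobyB/whisper-large-v3-turbo-german-ct2"},
        "file", path, audio)
    req = urllib.request.Request(API, data=body, headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    print(transcribe_file("recording.wav"))
```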
Logs show VAD keeps nearly the entire audio as one segment, and Whisper processes it in 3 x 30s windows:
```
Processing audio with duration 01:10.490
VAD filter removed 00:02.352 of audio
VAD filter kept the following audio segments: [00:02.352 -> 01:10.490]
Processing segment at 00:00.000
Processing segment at 00:30.000
Processing segment at 01:00.000
```
Proposed Solution
Expose `condition_on_previous_text` as a configurable parameter, similar to how #457 made `vad_filter` configurable:
- Add `condition_on_previous_text` to config or as a `Form()` parameter on the STT endpoints
- Pass it through to `whisper_model.transcribe(..., condition_on_previous_text=...)`
Setting it to False makes each 30s window independent, so degeneration in one window stays isolated and subsequent windows recover. This is the minimal, tested fix.
Tradeoff: Slight loss of cross-window context (a sentence spanning a 30s boundary loses prior context). In practice this is negligible compared to losing the entire second half of a transcription.
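A minimal sketch of the pass-through (the wiring is hypothetical and not speaches' actual code; `condition_on_previous_text` is the real faster-whisper parameter, everything else is illustrative, with a stand-in model so the shape of the fix is visible without loading weights):

```python
# Sketch of the one-line pass-through. `condition_on_previous_text` is a real
# faster_whisper.WhisperModel.transcribe() parameter; the endpoint wiring and
# FakeWhisperModel are illustrative stand-ins.

class FakeWhisperModel:
    """Stand-in for faster_whisper.WhisperModel that records its kwargs."""
    def transcribe(self, audio, **kwargs):
        self.last_kwargs = kwargs
        return [], None  # faster-whisper returns (segments, info)

def handle_transcription(model, audio, language: str,
                         condition_on_previous_text: bool = True):
    # The fix: forward the form/config value instead of silently
    # inheriting faster-whisper's default of True.
    return model.transcribe(
        audio,
        language=language,
        condition_on_previous_text=condition_on_previous_text,
    )

model = FakeWhisperModel()
handle_transcription(model, b"...", language="de",
                     condition_on_previous_text=False)
print(model.last_kwargs["condition_on_previous_text"])  # False
```

Defaulting the new parameter to True preserves current behavior, so the change is backward compatible.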
Workarounds
Until this is configurable, two workarounds exist:
| Approach | How | Tradeoff |
|---|---|---|
| `condition_on_previous_text=False` | Volume-mount patched `stt.py` | Minimal accuracy loss at window boundaries |
| `WHISPER__USE_BATCHED_MODE=true` | Config change, no code patch | More word errors on technical vocabulary (each VAD segment loses all prior context) |
Both produce complete transcriptions. `condition_on_previous_text=False` delivers better word accuracy on technical content.
Environment
- speaches `latest-cpu` (digest `sha256:21e3df06d842...`)
- Model: `TheTobyB/whisper-large-v3-turbo-german-ct2`
- Non-batched mode (default)