feat: expose condition_on_previous_text parameter to prevent cascading degeneration in long audio #619

@Benju1

Description

Problem

Long audio recordings (>30 s) with natural speech pauses produce degenerate output: repeated commas, truncated transcription, or hallucinated text. This is closely related to #134.

Root Cause

In non-batched mode, faster-whisper concatenates all VAD speech segments into a single audio blob and processes it in 30-second windows (in WhisperModel.transcribe(), at the line audio = np.concatenate(audio_chunks, axis=0)). With condition_on_previous_text=True (the faster-whisper default), degenerate output from one window is fed as context into the next, causing a cascade failure that corrupts all subsequent transcription.
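The cascade can be illustrated with a toy model (this is not faster-whisper code; decode() is a hypothetical stand-in for Whisper's per-window decoding step):

```python
def decode(window: str, prompt: str) -> str:
    """Hypothetical decoder: a degenerate prompt poisons the output."""
    if ",," in prompt:            # degenerate context propagates
        return ",,"
    if window == "pause-heavy":   # this window degenerates on its own
        return ",,"
    return f"text for {window}"


def transcribe(windows, condition_on_previous_text):
    """Process fixed windows, optionally feeding each output into the next."""
    prompt, out = "", []
    for w in windows:
        piece = decode(w, prompt if condition_on_previous_text else "")
        out.append(piece)
        prompt = piece
    return out


windows = ["intro", "pause-heavy", "outro"]
print(transcribe(windows, True))   # ['text for intro', ',,', ',,']
print(transcribe(windows, False))  # ['text for intro', ',,', 'text for outro']
```

With conditioning on, one bad window corrupts every window after it; with conditioning off, the damage stays local and the final window recovers.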

This is a known faster-whisper issue (see SYSTRAN/faster-whisper#843), and it is unlikely to be fixed upstream soon. Speaches can mitigate this today by exposing the condition_on_previous_text parameter that faster-whisper already supports but Speaches does not pass through.

The problem is especially pronounced with:

  • Technical speech containing numbers and domain terms (e.g., "14,4 kV")
  • Recordings longer than 30 seconds with natural pauses
  • German and other non-English languages

Reproduction

Using model TheTobyB/whisper-large-v3-turbo-german-ct2 with German audio:

  1. Record ~70 seconds of technical speech with natural pauses between sentences
  2. Send to the API:
    curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
      -F "file=@recording.wav" \
      -F "language=de" \
      -F "model=TheTobyB/whisper-large-v3-turbo-german-ct2"
  3. Transcription cuts off after ~30s, ending with degenerate output like ",,"

Logs show VAD keeps nearly the entire audio as one segment, and Whisper processes it in 3 × 30 s windows:

Processing audio with duration 01:10.490
VAD filter removed 00:02.352 of audio
VAD filter kept the following audio segments: [00:02.352 -> 01:10.490]
Processing segment at 00:00.000
Processing segment at 00:30.000
Processing segment at 01:00.000
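The three window starts follow from Whisper's fixed 30 s stride over the concatenated audio (the seek can advance by less when a segment ends early, but the log above shows full strides). A quick check against the log:

```python
import math

# Figures taken from the log above: total kept speech after VAD filtering.
kept = 70.490 - 2.352   # seconds of speech kept by VAD
stride = 30.0           # Whisper's fixed window length in seconds

window_starts = [i * stride for i in range(math.ceil(kept / stride))]
print(window_starts)  # [0.0, 30.0, 60.0] — matches the three log lines
```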

Proposed Solution

Expose condition_on_previous_text as a configurable parameter — similar to how #457 made vad_filter configurable:

  1. Add condition_on_previous_text to config or as a Form() parameter on the STT endpoints
  2. Pass it through to whisper_model.transcribe(..., condition_on_previous_text=...)

Setting it to False makes each 30s window independent, so degeneration in one window stays isolated and subsequent windows recover. This is the minimal, tested fix.
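A minimal sketch of the pass-through, using a stub in place of faster_whisper.WhisperModel (the real wiring would read the flag from a Form() field or a config setting in speaches' stt.py; the handler and stub names below are illustrative, not speaches' actual API):

```python
class StubWhisperModel:
    """Stand-in for faster_whisper.WhisperModel; records the flag it receives."""

    def transcribe(self, audio, *, condition_on_previous_text=True, **kwargs):
        # faster-whisper's real default for this parameter is also True.
        return {"condition_on_previous_text": condition_on_previous_text}


def transcription_endpoint(model, audio, *, condition_on_previous_text=True):
    """Sketch of the STT handler: forward the form/config value unchanged."""
    return model.transcribe(
        audio, condition_on_previous_text=condition_on_previous_text
    )


result = transcription_endpoint(
    StubWhisperModel(), b"<wav bytes>", condition_on_previous_text=False
)
print(result)  # {'condition_on_previous_text': False}
```

Defaulting the new parameter to True preserves current behavior, so existing clients are unaffected unless they opt in.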

Tradeoff: Slight loss of cross-window context (a sentence spanning a 30s boundary loses prior context). In practice this is negligible compared to losing the entire second half of a transcription.

Workarounds

Until this is configurable, two workarounds exist:

  • condition_on_previous_text=False: volume-mount a patched stt.py. Tradeoff: minimal accuracy loss at window boundaries.
  • WHISPER__USE_BATCHED_MODE=true: config change, no code patch. Tradeoff: more word errors on technical vocabulary (each VAD segment loses all prior context).

Both produce complete transcriptions. condition_on_previous_text=False delivers better word accuracy on technical content.

Environment

  • speaches latest-cpu (digest sha256:21e3df06d842...)
  • Model: TheTobyB/whisper-large-v3-turbo-german-ct2
  • Non-batched mode (default)
