feat(vad): drop silent audio windows to prevent hallucinations#1

Open
JMLX42 wants to merge 5 commits into main from feat/vad-gated-transcription

Conversation


JMLX42 commented Jan 14, 2026

Summary

  • When VAD detects no speech in an audio window, skip forwarding it to Whisper entirely
  • This prevents hallucinations like "Merci" or "Thank you for watching" that Whisper produces from silence with high confidence

Problem

Whisper hallucinates high-confidence text from silence. For example, complete silence can be transcribed as "Merci." with 88-97% token probability. Token-level filtering cannot catch this because the hallucination is a single, high-confidence token.

Solution

Filter at the VAD level. If VAD returns false (no speech detected), don't forward that audio window to Whisper. No audio → no hallucination.

Changes

  • process_ready_windows(): skip windows where VAD returns false
  • flush(): only forward final buffer if VAD detects speech
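The gate itself is simple: a window only reaches Whisper if the VAD reports speech in it. A minimal sketch of the idea — the function names and the trivial energy-based detector below are illustrative stand-ins, not the crate's real API:

```rust
/// Stand-in for the real VAD: returns true when the window contains speech.
/// A crude mean-energy threshold is used only for this sketch.
fn vad_has_speech(window: &[f32]) -> bool {
    let energy: f32 = window.iter().map(|s| s * s).sum::<f32>() / window.len() as f32;
    energy > 1e-4
}

/// Keep only windows the VAD flags as speech. Silent windows are never
/// forwarded, so Whisper has nothing to hallucinate from.
fn gate_windows(windows: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    windows.into_iter().filter(|w| vad_has_speech(w)).collect()
}

fn main() {
    let silence = vec![0.0f32; 160];
    let speech = vec![0.1f32; 160];
    let kept = gate_windows(vec![silence, speech.clone()]);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0], speech);
    println!("kept {} of 2 windows", kept.len());
}
```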

Test plan

  • Unit tests pass
  • Clippy clean
  • Manual test: record silence, verify no transcription output
  • Manual test: record speech, verify normal transcription
  • Manual test: record speech with silent gaps, verify gaps don't produce hallucinations

Add pass-through features for GPU backends:
- cuda: NVIDIA CUDA
- metal: Apple Metal
- hipblas: AMD ROCm
- vulkan: Cross-platform Vulkan
- coreml: Apple CoreML

This allows consumers to enable GPU acceleration by adding
the appropriate feature to their Cargo.toml, e.g.:

    scribble = { version = "0.5", features = ["cuda"] }
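A plausible shape for the pass-through wiring in scribble's own Cargo.toml — the `whisper-rs` dependency name here is an assumption, not confirmed by this PR:

```toml
# Hypothetical [features] table: each scribble feature simply forwards
# to the corresponding feature of the underlying Whisper backend crate.
[features]
cuda = ["whisper-rs/cuda"]
metal = ["whisper-rs/metal"]
hipblas = ["whisper-rs/hipblas"]
vulkan = ["whisper-rs/vulkan"]
coreml = ["whisper-rs/coreml"]
```

Pass-through features like this keep scribble itself backend-agnostic: enabling `cuda` on scribble just enables it on the backend crate, with no scribble code changes.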
…cases

By default, the incremental transcriber waits for 2+ segments before
emitting, treating the last segment as potentially incomplete. This
adds latency for short utterances like voice assistant commands.

The new `emit_single_segments` option (default: false) allows emitting
single segments immediately when detected. This is useful for:
- Voice assistants
- Real-time transcription
- Any application where low latency is more important than waiting
  for natural sentence boundaries

When enabled, single segments are emitted as soon as Whisper produces
them, rather than waiting for a second segment or the 30-second
force-flush timeout.

When VAD detects no speech in an audio window, skip forwarding it to
Whisper entirely. This prevents hallucinations like "Merci" or "Thank
you for watching" that Whisper produces from silence with high
confidence.

Changes:
- process_ready_windows(): skip windows where VAD returns false
- flush(): only forward final buffer if VAD detects speech

Also fixes pre-existing test compilation (missing emit_single_segments
field) and formatting issues.

Move VAD filtering from the high-level Scribble API into the backend
stream. This ensures VAD works regardless of which API consumers use
(direct backend access or high-level Scribble::transcribe).

Changes:
- WhisperStream now optionally wraps audio with VadStream when
  enable_voice_activity_detection is true
- Remove VAD wrapping from Scribble::transcribe_with_encoder() to
  avoid double-filtering
- Export VadProcessor, VadStream, VadStreamReceiver publicly
- Make VadStream methods public for use in backend

This fixes the issue where friday-daemon's direct backend usage
bypassed VAD filtering entirely.
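The structural change can be sketched like this: the backend stream decides whether to wrap its audio source in a VAD filter, so the gate applies no matter which entry point a consumer uses. All types below are illustrative stand-ins for the real WhisperStream/VadStream, assuming a pull-based window source:

```rust
/// Stand-in for an audio source yielding fixed-size f32 windows.
trait AudioSource {
    fn next_window(&mut self) -> Option<Vec<f32>>;
}

/// Trivial in-memory source for the sketch.
struct VecSource(std::vec::IntoIter<Vec<f32>>);
impl AudioSource for VecSource {
    fn next_window(&mut self) -> Option<Vec<f32>> {
        self.0.next()
    }
}

/// Illustrative VadStream: wraps any source and drops silent windows,
/// using a crude energy gate in place of a real VAD.
struct VadStream<S: AudioSource> {
    inner: S,
}
impl<S: AudioSource> AudioSource for VadStream<S> {
    fn next_window(&mut self) -> Option<Vec<f32>> {
        while let Some(w) = self.inner.next_window() {
            let energy: f32 = w.iter().map(|s| s * s).sum();
            if energy > 1e-2 {
                return Some(w);
            }
        }
        None
    }
}

/// The backend decides whether to wrap, so every consumer path —
/// direct backend access or the high-level API — is gated identically.
fn make_stream(src: VecSource, enable_vad: bool) -> Box<dyn AudioSource> {
    if enable_vad {
        Box::new(VadStream { inner: src })
    } else {
        Box::new(src)
    }
}

fn main() {
    let windows = vec![vec![0.0f32; 16], vec![0.5f32; 16]];
    let mut s = make_stream(VecSource(windows.into_iter()), true);
    let first = s.next_window().unwrap();
    assert_eq!(first[0], 0.5); // silent window was filtered out
    assert!(s.next_window().is_none());
    println!("silent window filtered at the backend");
}
```

Wrapping at the backend also avoids the double-filtering risk: the high-level API no longer wraps on its own, so audio passes through exactly one VAD stage.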