feat: add automatic filler word removal from transcriptions #589
cjpais merged 4 commits into cjpais:main
Add filter_transcription_output() that removes filler words (uh, um, hmm, etc.) and hallucination patterns ([AUDIO], (pause), <tag>...</tag>) from transcriptions. Inspired by VoiceInk's approach.
- Add regex-based filter in audio_toolkit/text.rs
- Integrate into transcription pipeline after custom word correction
- Add comprehensive tests for filler word removal
- Add regex crate dependency
Add collapse_stutters() to reduce model hallucination artifacts like "wh wh wh wh wh why" -> "wh why" or "I I I I think" -> "I think". Collapses any 1-2 letter word that repeats 3+ times consecutively to a single instance (case-insensitive matching).
Additional fix: Stutter collapse
Collapses any 1-2 letter word that repeats 3+ times consecutively to a single instance. Two repetitions like "no no" are preserved.
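The collapsing rule described above can be sketched as follows. This is a minimal, std-only illustration and not the merged implementation: the PR itself is regex-based, and the choices here (splitting on whitespace, keeping the casing of the first occurrence, treating only purely alphabetic tokens as candidates) are assumptions made for the example.

```rust
/// Collapse a run of one short (1-2 letter) word repeated 3+ times
/// into a single instance. Comparison is case-insensitive.
/// Assumption: tokens are whitespace-separated, so the output also
/// has its whitespace normalized to single spaces.
fn collapse_stutters(text: &str) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out: Vec<&str> = Vec::with_capacity(words.len());
    let mut i = 0;
    while i < words.len() {
        let word = words[i];
        // Find the end of the run of tokens equal to `word`, ignoring case.
        let mut j = i + 1;
        while j < words.len() && words[j].to_lowercase() == word.to_lowercase() {
            j += 1;
        }
        let run_len = j - i;
        let is_short = word.chars().count() <= 2 && word.chars().all(char::is_alphabetic);
        if is_short && run_len >= 3 {
            // Keep only the first occurrence of the stuttered word.
            out.push(word);
        } else {
            // Longer words and runs of fewer than 3 are left untouched.
            out.extend_from_slice(&words[i..j]);
        }
        i = j;
    }
    out.join(" ")
}
```

With this sketch, "wh wh wh wh wh why" becomes "wh why" and "I I I I think" becomes "I think", while a double such as "no no" is left alone.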
I skimmed this code and I think it's gonna get pulled in. This seems pretty reasonable and definitely should be a default. Thank you for implementing this.
Thanks. Incidentally I'm noticing a lot of stuttering issues with Parakeet V3, i.e. lots of repeated 2/3-letter words, like 5+ times.
Yeah, I get them quite frequently as well. And it happens if I do kind of a micro stutter, which normally you wouldn't notice but it really exaggerates it.
Analysis: Why VoiceInk has fewer stutters
I compared Handy's audio preprocessing with VoiceInk (which uses the same Parakeet V3 model but has noticeably fewer stuttering artifacts).
Key difference: VAD threshold
Location in Handy: let silero = SileroVad::new(vad_path, 0.3)
Why this matters
A lower threshold (0.3) means the VAD flags audio as "speech" even with low confidence. This sends more ambiguous audio to Parakeet, which then struggles and produces stuttering artifacts. VoiceInk's 0.7 threshold is more selective: it only passes audio that's clearly speech, giving the model cleaner input.
Proposed fix
Increase the VAD threshold. Would be happy to submit a follow-up PR with this change if you'd like to test it.
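To make the threshold trade-off concrete, here is a small sketch of how the same per-frame speech probabilities are treated under the two thresholds mentioned above. The function name, the probability values, and the use of `>=` for the comparison are all assumptions for illustration. This is not Silero's or Handy's actual decision logic, which may include smoothing or hangover frames.

```rust
/// Count how many frames a simple threshold rule would pass as speech.
/// Assumption: a frame counts as speech when its probability is at or
/// above the threshold; a real VAD may apply extra smoothing.
fn frames_passed(speech_probs: &[f32], threshold: f32) -> usize {
    speech_probs.iter().filter(|&&p| p >= threshold).count()
}

/// Hypothetical per-frame speech probabilities, invented for this example.
fn example_probs() -> Vec<f32> {
    vec![0.1, 0.35, 0.5, 0.72, 0.9]
}
```

For these made-up values, a 0.3 threshold passes four of the five frames and a 0.7 threshold passes two, which shows both sides of the discussion: the higher threshold drops ambiguous frames, and it also discards audio that might have been quiet speech.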
I think if we're gonna tune the vad threshold, we should add it as a slider in the debug menu so we can adjust it directly there and find a good default more broadly. I think I'm quite hesitant to edit the default since, on the whole, it seems to work relatively well for most people. And your post-processing step should clean up Parakeet's issues. I would rather have more audio than less audio, generally speaking.
Okay, I removed the bracketing changes mainly because I don't see them typically and I'd rather keep the code slim until we run into the issues for now.
Summary
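- Remove filler words from transcriptions
- Remove hallucination patterns: [AUDIO], (pause), <tag>...</tag>
- Inspired by VoiceInk's TranscriptionOutputFilter approach
Changes
- Add filter_transcription_output() function in audio_toolkit/text.rs
- Add regex crate dependency
Filler words removed
uh, um, uhm, umm, uhh, uhhh, ah, eh, hmm, hm, mmm, mm, mh, ha, ehh
Test plan
- Run the tests (cargo test --lib text)
🤖 Generated with Claude Code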
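As a rough illustration of the filler word removal described in this PR, the sketch below drops any whitespace-separated token whose alphanumeric core matches the filler list. It is a std-only approximation and not the merged code, which uses the regex crate. The name remove_filler_words is hypothetical, and punctuation attached to a removed filler (for example the comma in "Um,") is dropped along with it, which the real filter may handle differently.

```rust
/// Filler words listed in the PR description.
const FILLER_WORDS: &[&str] = &[
    "uh", "um", "uhm", "umm", "uhh", "uhhh", "ah", "eh",
    "hmm", "hm", "mmm", "mm", "mh", "ha", "ehh",
];

/// Remove filler words from a transcription (hypothetical helper).
/// Matching is case-insensitive and ignores leading or trailing
/// punctuation on each token. Whitespace is normalized to single spaces.
fn remove_filler_words(text: &str) -> String {
    text.split_whitespace()
        .filter(|token| {
            // Strip surrounding punctuation so "Um," and "hmm..." still match.
            let core = token
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            !FILLER_WORDS.contains(&core.as_str())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because whole tokens are compared, ordinary words that merely contain a filler, such as "human" or "umbrella", pass through unchanged.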