feat: add automatic filler word removal from transcriptions #589

Merged
cjpais merged 4 commits into cjpais:main from pchalasani:filler-removal
Jan 18, 2026

Conversation

@pchalasani
Contributor

Summary

  • Add automatic filtering of filler words (uh, um, hmm, etc.) from transcriptions
  • Remove hallucination patterns like [AUDIO], (pause), <tag>...</tag>
  • Inspired by VoiceInk's TranscriptionOutputFilter approach

Changes

  • Add filter_transcription_output() function in audio_toolkit/text.rs
  • Integrate filter into transcription pipeline (runs after custom word correction)
  • Add regex crate dependency
  • Add comprehensive unit tests

Filler words removed

uh, um, uhm, umm, uhh, uhhh, ah, eh, hmm, hm, mmm, mm, mh, ha, ehh
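
For illustration, the whole-word filtering described above could be sketched as follows. This is a minimal std-only sketch, not the PR's code: the PR uses the regex crate inside filter_transcription_output(), whereas the helper name remove_filler_words and the punctuation handling here are assumptions made for the example.

```rust
/// Filler words listed in the PR description.
const FILLER_WORDS: [&str; 15] = [
    "uh", "um", "uhm", "umm", "uhh", "uhhh", "ah", "eh",
    "hmm", "hm", "mmm", "mm", "mh", "ha", "ehh",
];

/// Hypothetical helper (not the PR's regex-based implementation):
/// drops whole-word fillers, case-insensitively, and rejoins the
/// remaining words with single spaces.
fn remove_filler_words(text: &str) -> String {
    text.split_whitespace()
        .filter(|word| {
            // Compare without surrounding punctuation so "Um," still matches.
            // Simplification: punctuation attached to a filler is dropped with it.
            let core = word
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            !FILLER_WORDS.iter().any(|filler| *filler == core.as_str())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Matching on whole words only means that words such as "human" or "aha" are left alone even though they contain "um" or "ha".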

Test plan

  • Unit tests pass (cargo test --lib text)
  • Manual testing with Parakeet V3 model - filler words are filtered correctly

🤖 Generated with Claude Code

Add filter_transcription_output() that removes filler words (uh, um, hmm,
etc.) and hallucination patterns ([AUDIO], (pause), <tag>...</tag>) from
transcriptions. Inspired by VoiceInk's approach.

- Add regex-based filter in audio_toolkit/text.rs
- Integrate into transcription pipeline after custom word correction
- Add comprehensive tests for filler word removal
- Add regex crate dependency

Add collapse_stutters() to reduce model hallucination artifacts like
"wh wh wh wh wh why" -> "wh why" or "I I I I think" -> "I think".

Collapses any 1-2 letter word that repeats 3+ times consecutively
to a single instance (case-insensitive matching).
@pchalasani
Contributor Author

Additional fix: Stutter collapse

Added collapse_stutters() to handle model hallucination artifacts where the model gets stuck repeating short tokens:

Before: "w wh wh wh wh wh wh wh wh wh why"
After: "wh why"

Before: "I I I I think so so so so"
After: "I think so"

Collapses any 1-2 letter word that repeats 3+ times consecutively to a single instance. Two repetitions like "no no" are preserved.
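
A minimal sketch of that rule, assuming whitespace-separated words and measuring length in characters. The body below is an illustration of the stated behavior and may differ from the collapse_stutters() that was actually merged.

```rust
/// Sketch: collapse any 1-2 letter word repeated 3+ times in a row
/// (case-insensitive) down to its first occurrence.
fn collapse_stutters(text: &str) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out: Vec<&str> = Vec::new();
    let mut i = 0;
    while i < words.len() {
        let key = words[i].to_lowercase();
        // Find the end of the run of identical words starting at i.
        let mut j = i + 1;
        while j < words.len() && words[j].to_lowercase() == key {
            j += 1;
        }
        let run_len = j - i;
        if words[i].chars().count() <= 2 && run_len >= 3 {
            out.push(words[i]); // keep a single instance of the stutter
        } else {
            out.extend_from_slice(&words[i..j]); // leave the run untouched
        }
        i = j;
    }
    out.join(" ")
}
```

Under these assumptions, runs of longer words (for example "the the the") and double repetitions such as "no no" pass through unchanged.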

@cjpais
Owner

cjpais commented Jan 16, 2026

I skimmed this code and I think it's gonna get pulled in. This seems pretty reasonable and definitely should be a default. Thank you for implementing this.

@pchalasani
Contributor Author

> I skimmed this code and I think it's gonna get pulled in. This seems pretty reasonable and definitely should be a default. Thank you for implementing this.

Thanks. Incidentally, I'm noticing a lot of stuttering issues with Parakeet V3, i.e. lots of repeated 2/3-letter words, often 5+ times in a row.

@cjpais
Owner

cjpais commented Jan 16, 2026

Yeah, I get them quite frequently as well. And it happens if I do kind of a micro stutter, which normally you wouldn't notice but it really exaggerates it.

@pchalasani
Contributor Author

Analysis: Why VoiceInk has fewer stutters

I compared Handy's audio preprocessing with VoiceInk (which uses the same Parakeet V3 model but has noticeably fewer stuttering artifacts).

Key difference: VAD threshold

| App | VAD Threshold | Effect |
|---|---|---|
| Handy | 0.3 | Very sensitive - captures marginal/unclear audio |
| VoiceInk | 0.7 | Conservative - only passes clear speech |

Location in Handy: managers/audio.rs:121

let silero = SileroVad::new(vad_path, 0.3)

Why this matters

A lower threshold (0.3) means the VAD flags audio as "speech" even with low confidence. This sends more ambiguous audio to Parakeet, which then struggles and produces stuttering artifacts like "wh wh wh wh why".

VoiceInk's 0.7 threshold is more selective - it only passes audio that's clearly speech, giving the model cleaner input.
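
To make the threshold effect concrete, here is a toy sketch of probability gating. It is not the SileroVad API: the function name frames_passed and the per-frame probabilities are hypothetical, and a real VAD typically adds smoothing and padding around detected speech.

```rust
/// Hypothetical gate: return the indices of frames whose speech
/// probability meets or exceeds the threshold.
fn frames_passed(speech_probs: &[f32], threshold: f32) -> Vec<usize> {
    speech_probs
        .iter()
        .enumerate()
        .filter_map(|(i, &p)| if p >= threshold { Some(i) } else { None })
        .collect()
}
```

With made-up probabilities such as [0.2, 0.35, 0.55, 0.8, 0.65], a threshold of 0.3 passes four of the five frames, while 0.7 passes only the single most confident one. The low-confidence frames that get through at 0.3 are the ambiguous audio described above.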

Proposed fix

Increase VAD threshold from 0.3 to 0.5 or 0.6. Trade-off: might occasionally clip very quiet speech, but should significantly reduce stuttering.

Would be happy to submit a follow-up PR with this change if you'd like to test it.

@cjpais
Owner

cjpais commented Jan 16, 2026

I think if we're gonna tune the VAD threshold, we should add it as a slider in the debug menu so we can adjust it directly there and find a good default more broadly. I'm quite hesitant to edit the default since, on the whole, it seems to work relatively well for most people. And your post-processing step should clean up Parakeet's issues. I would rather have more audio than less audio, generally speaking.

@cjpais
Owner

cjpais commented Jan 18, 2026

Okay, I removed the bracketing changes, mainly because I don't typically see those patterns and I'd rather keep the code slim until we actually run into the issues.

cjpais merged commit c84e863 into cjpais:main on Jan 18, 2026
2 checks passed
