This is a local Python prototype for detecting known voices, such as NFL commentators, and muting them in near real time.
It is a prototype, not a polished product. The focus is:
- enroll target voices from audio clips
- test detection accuracy against recorded clips
- listen to live audio from an input device
- compare short speech windows to enrolled voice profiles
- mute output when a blocked speaker is active
The pipeline is:
- Capture audio from a local input device
- Run voice activity detection to skip silence and low-speech chunks
- Generate a speaker embedding for each speech chunk
- Compare that embedding to saved speaker profiles
- Apply smoothing rules before muting or unmuting
Speaker recognition uses SpeechBrain's pretrained ECAPA embedding model.
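The comparison step can be sketched as a cosine-similarity lookup. This is a minimal illustration, not the project's code: `cosine_similarity` and `best_match` are hypothetical helper names, the short toy vectors stand in for real ECAPA embeddings, and the 0.72 default is borrowed from the example commands in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(embedding, profiles, threshold=0.72):
    """Return (name, score) for the closest profile, or (None, score) below threshold."""
    best_name, best_score = None, -1.0
    for name, profile_vec in profiles.items():
        score = cosine_similarity(embedding, profile_vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Toy vectors (assumption); real embeddings are much longer.
profiles = {"joe_buck": [0.9, 0.1, 0.0], "other": [0.0, 0.2, 0.9]}
name, score = best_match([0.8, 0.2, 0.1], profiles)
print(name, round(score, 3))
```

Returning the score even when no profile clears the threshold makes it easier to see how close a rejected window was while tuning.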
Use Python 3.11 or 3.12.
Many audio ML packages are not yet reliable on Python 3.14, so using a virtual environment on 3.11/3.12 will save you time.
- Install Python 3.11 or 3.12
- Create a virtual environment
- Install dependencies
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If pyaudio fails to install on macOS, install PortAudio first:
brew install portaudio
Create one WAV file per commentator you want to detect. Cleaner speech gives better results.
Suggested training data per speaker:
- at least 5-10 minutes of mostly clean speech
- multiple clips from different games
- minimal crowd noise when possible
Then enroll:
python -m commentator_muter.enroll \
--name joe_buck \
--audio samples/joe_buck_1.wav samples/joe_buck_2.wav
This creates a profile in profiles/joe_buck.json.
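As a rough picture of what enrollment might do, the sketch below averages one embedding per clip into a single unit-length vector and writes it as JSON. The profile schema (`name` and `embedding` keys) and the helper names `build_profile` and `save_profile` are assumptions for illustration; the actual file format written by the enroll command is not documented here.

```python
import json
import math
import tempfile
from pathlib import Path


def build_profile(name, embeddings):
    """Average per-clip embeddings and L2-normalize the result."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    mean = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return {"name": name, "embedding": [x / norm for x in mean]}


def save_profile(profile, profiles_dir):
    """Write the profile to <profiles_dir>/<name>.json and return the path."""
    out_dir = Path(profiles_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{profile['name']}.json"
    path.write_text(json.dumps(profile, indent=2))
    return path


# Toy embeddings (assumption) standing in for the model output of two clips.
profile = build_profile("joe_buck", [[1.0, 0.0], [0.0, 1.0]])
saved_path = save_profile(profile, Path(tempfile.mkdtemp()) / "profiles")
print(saved_path.name, profile["embedding"])
```

Averaging across clips from different games is one simple way to make a profile less sensitive to the noise in any single recording.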
If you want to build a better dataset directly from routed game audio, you can record labeled clips from a live input such as BlackHole 2ch.
List devices first:
python -m commentator_muter.capture_training_audio --list-devices
Then record clips for one label:
python -m commentator_muter.capture_training_audio \
--input-device 1 \
--label chris
The recorder will:
- keep listening on the selected input device
- start a take when you press Enter
- stop the take when you press Enter again
- show clip duration, RMS, and RNNoise speech probability
- let you save, discard, or redo the take
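For reference, the duration and RMS figures can be computed from a saved clip with the standard library alone. This is a sketch under assumptions: `clip_stats` is a hypothetical helper, it only handles 16-bit PCM WAV files, and it does not cover the RNNoise speech probability.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def clip_stats(path):
    """Return (duration_seconds, rms) for a 16-bit PCM WAV, RMS scaled to 0..1."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(n_frames)
    samples = struct.unpack(f"<{n_frames * channels}h", raw)
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return n_frames / rate, rms


# Write a one-second 440 Hz test tone so the example runs without real audio.
rate = 16000
tone = [int(16000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate)]
tone_path = Path(tempfile.mkdtemp()) / "tone.wav"
with wave.open(str(tone_path), "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(rate)
    wav.writeframes(struct.pack(f"<{len(tone)}h", *tone))

duration, rms = clip_stats(tone_path)
print(f"duration={duration:.2f}s rms={rms:.3f}")
```

A very low RMS usually means the take captured silence or the wrong input device, which is worth checking before saving the clip.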
By default, clips are written to:
samples/captured/<label>/
You can point it somewhere else if you want to record directly into a training folder:
python -m commentator_muter.capture_training_audio \
--input-device 1 \
--label mike \
--output-dir samples
You can also pull in a starter subset of Hugging Face datasets for noise negatives and speaker metadata.
This command downloads:
- a starter subset of MUSAN noise wav files
- VoxCeleb metadata files
- optionally, the smaller vox1_test_wav.zip archive if you want some VoxCeleb audio right away
python -m commentator_muter.bootstrap_hf_datasets
By default it writes to:
external_datasets/
If you want a larger MUSAN starter pack:
python -m commentator_muter.bootstrap_hf_datasets --musan-noise-count 250
If you want the optional VoxCeleb test audio archive too:
python -m commentator_muter.bootstrap_hf_datasets --include-vox1-test-audio
Notes:
- MUSAN is easy to use as noise/background negatives.
- The full VoxCeleb archives are very large, so this bootstrap command stays conservative by default.
List devices first if needed:
python -m commentator_muter.run --list-devices
Then start the prototype:
python -m commentator_muter.run \
--input-device 0 \
--output-device 1 \
--block joe_buck \
--threshold 0.72
Before wiring this into another app, you can test the speaker detector by running it on saved clips.
Example:
python -m commentator_muter.detect \
--audio tests/joe_clip.wav tests/troy_clip.wav \
--threshold 0.72
For each speech window, the tool prints:
- the timestamp inside the clip
- the best speaker match
- the best score
- the top profile scores for comparison
Then it prints a clip-level summary showing:
- how many windows cleared the threshold
- which speaker dominated the clip
- how often each speaker won
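The clip-level summary can be thought of as a small aggregation over per-window results. The sketch below is one plausible reading, not the tool's actual logic: `summarize_clip` is a hypothetical name, and it assumes a window only counts as a win when its best score clears the threshold.

```python
from collections import Counter


def summarize_clip(windows, threshold=0.72):
    """windows: list of (best_speaker, best_score) tuples, one per speech window."""
    accepted = [(spk, score) for spk, score in windows if score >= threshold]
    wins = Counter(spk for spk, _ in accepted)
    dominant = wins.most_common(1)[0][0] if wins else None
    return {
        "total_windows": len(windows),
        "accepted_windows": len(accepted),
        "dominant_speaker": dominant,
        "wins": dict(wins),
    }


# Made-up window results for illustration.
windows = [("joe_buck", 0.81), ("joe_buck", 0.76), ("troy", 0.74), ("joe_buck", 0.55)]
summary = summarize_clip(windows)
print(summary)
```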
If you want just summaries:
python -m commentator_muter.detect \
--audio tests/*.wav \
--summary-only
If you want machine-readable output:
python -m commentator_muter.detect \
--audio tests/*.wav \
--json-output
If you only care about one commentator right now, enroll just that voice and run the binary detector.
Example:
python -m commentator_muter.enroll \
--name chris \
--audio samples/chris_1.wav samples/chris_2.wav
python -m commentator_muter.detect_binary \
--target chris \
--audio tests/chris_clip.wav tests/other_clip.wav \
--threshold 0.72
This prints a simple result for each window:
- CHRIS
- NOT_CHRIS
- similarity score
Then it prints a clip-level summary with:
- final decision
- max score
- average score
- accepted windows over total speech windows
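To make the summary fields concrete, here is a small sketch that derives them from a list of per-window scores. The final decision rule is an assumption: this version calls the clip CHRIS when at least half of the speech windows clear the threshold, and `summarize_binary`, `min_accept_ratio`, and the NO_SPEECH result for an empty clip are hypothetical choices rather than the detector's documented behavior.

```python
def summarize_binary(scores, threshold=0.72, min_accept_ratio=0.5):
    """scores: similarity of each speech window to the target profile."""
    if not scores:
        return {"decision": "NO_SPEECH", "max": None, "avg": None, "accepted": 0, "total": 0}
    accepted = sum(1 for s in scores if s >= threshold)
    is_target = accepted / len(scores) >= min_accept_ratio
    return {
        "decision": "CHRIS" if is_target else "NOT_CHRIS",
        "max": max(scores),
        "avg": sum(scores) / len(scores),
        "accepted": accepted,
        "total": len(scores),
    }


# Made-up window scores for illustration.
result = summarize_binary([0.80, 0.75, 0.60, 0.78])
print(result)
```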
You can also run the detector live and have it print whether the active voice sounds like Chris.
List devices:
python -m commentator_muter.live_detect --list-devices
Then monitor a microphone or routed audio input:
python -m commentator_muter.live_detect \
--target chris \
--input-device 0
It will print one of:
- CHRIS
- CHRIS_HOLD
- NOT_CHRIS
- NO_SPEECH
If you want the audio output to line up with the detection timing, use the delayed live command.
It:
- reads live audio from an input device
- buffers it for a configurable delay
- plays the delayed audio to an output device
- runs detection against that same delayed timeline
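The buffering step can be sketched as a fixed-length queue of audio blocks. This is an illustration only: `DelayLine` is a hypothetical class, blocks are plain Python lists rather than device buffers, and the delay is rounded to a whole number of blocks.

```python
from collections import deque


class DelayLine:
    """Delay a stream of fixed-size audio blocks by a whole number of blocks."""

    def __init__(self, delay_seconds, block_seconds, block_size):
        self.delay_blocks = max(0, round(delay_seconds / block_seconds))
        silence = [0.0] * block_size
        # Pre-fill with silence so playback can start immediately.
        self.buffer = deque([silence] * self.delay_blocks)

    def push(self, block):
        """Store the newest block and return the block from delay_seconds ago."""
        self.buffer.append(block)
        return self.buffer.popleft()


# 1.5 s delay with 0.5 s blocks means each block comes out three pushes later.
delay = DelayLine(delay_seconds=1.5, block_seconds=0.5, block_size=4)
outputs = [delay.push([float(i + 1)] * 4) for i in range(5)]
print([block[0] for block in outputs])
```

Because the detector can score each block as soon as it is pushed, a decision about that block can be ready by the time it reaches the output.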
Example:
python -m commentator_muter.live_delay \
--target chris \
--input-device 1 \
--output-device 3 \
--delay-seconds 1.5
This is the first step toward a synced “delay, detect, then act” architecture for muting.
The live detector now uses a sticky threshold:
- Chris must cross the normal threshold to enter CHRIS
- for a short time after a strong match, the detector uses a slightly lower threshold
- that reduces flicker during crowd noise or weaker speech windows
- it also keeps a short decision window so overlapping or noisy chunks are judged with recent context instead of one chunk at a time
- before scoring, it now runs stronger speech enhancement and a target-focused extraction step that keeps the subsegments most likely to sound like Chris
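The sticky behavior is a form of hysteresis, which the sketch below illustrates. It is a simplified assumption of how the states could interact, not the detector's actual code: `StickyDetector` is a hypothetical class, the defaults are borrowed from the example flag values in this README, and it leaves out the decision window, enhancement, and extraction steps.

```python
class StickyDetector:
    """Hysteresis over per-chunk scores: enter high, stay on a lower bar."""

    def __init__(self, threshold=0.72, threshold_drop=0.05,
                 hold_seconds=2.0, negatives_to_release=3):
        self.threshold = threshold
        self.threshold_drop = threshold_drop
        self.hold_seconds = hold_seconds
        self.negatives_to_release = negatives_to_release
        self.active = False
        self.hold_until = None
        self.negatives = 0

    def update(self, score, now):
        """Return CHRIS, CHRIS_HOLD, or NOT_CHRIS for one scored chunk."""
        in_hold = self.active and self.hold_until is not None and now <= self.hold_until
        if score >= self.threshold:
            # Strong match: enter (or refresh) the active state.
            self.active = True
            self.hold_until = now + self.hold_seconds
            self.negatives = 0
            return "CHRIS"
        if in_hold and score >= self.threshold - self.threshold_drop:
            # Weaker match accepted because of the temporarily lowered bar.
            self.negatives = 0
            return "CHRIS_HOLD"
        if self.active:
            # Count consecutive misses before releasing.
            self.negatives += 1
            if self.negatives < self.negatives_to_release:
                return "CHRIS_HOLD"
            self.active = False
            self.negatives = 0
        return "NOT_CHRIS"


detector = StickyDetector()
scores = [0.60, 0.75, 0.69, 0.50, 0.50, 0.50, 0.69]
states = [detector.update(s, now=i * 0.5) for i, s in enumerate(scores)]
print(states)
```

In this run the 0.69 chunk right after a strong match is held, while the same score at the end is rejected because the detector has already released.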
You can tune it with:
python -m commentator_muter.live_detect \
--target chris \
--input-device 0 \
--extraction-subwindow-seconds 0.35 \
--extraction-subwindow-stride-seconds 0.175 \
--extraction-keep-ratio 0.5 \
--extraction-min-score 0.28 \
--decision-window-seconds 1.6 \
--min-positive-windows 2 \
--strong-positive-score 0.58 \
--sticky-hold-seconds 2.0 \
--sticky-threshold-drop 0.05 \
--negative-chunks-to-release 3
- Start with one blocked speaker first
- Use volume ducking before full mute if the hard cut feels too abrupt
- Expect some latency, usually around 1-2s in prototype form with the delayed decision window
- Overlapping speech and loud crowd noise are the hardest cases
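The ducking tip can be sketched as a gain that moves toward its target in small steps instead of jumping. This is an assumption about one way to do it: `next_gain` and `apply_gain` are hypothetical helpers, and the `duck_gain` and `step` values are arbitrary starting points.

```python
def next_gain(current, muted, duck_gain=0.2, step=0.25):
    """Move the output gain one step toward the duck level or back to full volume."""
    target = duck_gain if muted else 1.0
    if current < target:
        return min(target, current + step)
    return max(target, current - step)


def apply_gain(block, gain):
    """Scale every sample in an audio block by the current gain."""
    return [sample * gain for sample in block]


# Four muted blocks followed by two unmuted blocks.
gain = 1.0
gains = []
for muted in [True, True, True, True, False, False]:
    gain = next_gain(gain, muted)
    gains.append(round(gain, 2))
print(gains)
```

Stepping once per block is coarse; a smoother version would ramp the gain across the samples inside each block.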
- commentator_muter/enroll.py: build speaker profiles
- commentator_muter/detect.py: test speaker detection against recorded audio
- commentator_muter/detect_binary.py: test one speaker as yes-or-no
- commentator_muter/live_detect.py: live terminal feedback for one speaker
- commentator_muter/live_delay.py: delayed live playback plus synced detection
- commentator_muter/run.py: live detection and mute pipeline
- commentator_muter/speaker_id.py: embedding and profile matching
- commentator_muter/audio.py: stream handling and VAD helpers
- commentator_muter/config.py: runtime settings