This guide shows how to fine-tune the FSMN-VAD (Voice Activity Detection) model on your own data. Fine-tuning is recommended when the default model does not perform well on your specific audio domain (e.g., noisy environments, walkie-talkie audio, call center recordings).

Typical signs that fine-tuning is warranted:
- High miss rate (speech segments not detected)
- High false alarm rate (non-speech detected as speech)
- Domain-specific audio that differs significantly from the training data (e.g., 8kHz telephony, high-noise industrial environments)
If you only need minor adjustments, try parameter tuning first.
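
Before deciding, it can help to quantify miss rate and false alarm rate on a few labeled recordings. The following is a minimal sketch that compares reference and predicted segments at the frame level. The function names and the 10 ms frame size are assumptions made for illustration and are not part of FunASR.

```python
def to_mask(segments, total_ms, frame_ms=10):
    """Convert [[start_ms, end_ms], ...] into a per-frame speech mask."""
    n_frames = total_ms // frame_ms
    mask = [False] * n_frames
    for start, end in segments:
        for i in range(start // frame_ms, min(end // frame_ms, n_frames)):
            mask[i] = True
    return mask


def miss_and_false_alarm(reference, predicted, total_ms, frame_ms=10):
    """Return (miss_rate, false_alarm_rate) computed over frames."""
    ref = to_mask(reference, total_ms, frame_ms)
    hyp = to_mask(predicted, total_ms, frame_ms)
    speech = sum(ref)
    non_speech = len(ref) - speech
    # Missed speech: reference says speech, prediction says non-speech.
    missed = sum(r and not h for r, h in zip(ref, hyp))
    # False alarm: prediction says speech, reference says non-speech.
    false_alarms = sum(h and not r for r, h in zip(ref, hyp))
    miss_rate = missed / speech if speech else 0.0
    fa_rate = false_alarms / non_speech if non_speech else 0.0
    return miss_rate, fa_rate


# Hypothetical 4-second recording with hand-made segments.
reference = [[0, 1000], [2000, 3000]]
predicted = [[0, 500], [2000, 3500]]
miss_rate, fa_rate = miss_and_false_alarm(reference, predicted, total_ms=4000)
print(miss_rate, fa_rate)  # 0.25 0.25
```

In this toy case a quarter of the speech frames are missed and a quarter of the non-speech frames are flagged as speech. What counts as "high" depends on the application.
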
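Training requires two files:
- wav.scp: audio file list
- vad.txt: VAD labels (speech segment timestamps)

Each line of wav.scp contains the utterance ID followed by the path to the audio file: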
utt001 /path/to/audio1.wav
utt002 /path/to/audio2.wav
utt003 /path/to/audio3.wav
Each line of vad.txt contains the utterance ID followed by a list of speech segments [start_ms, end_ms], with times in milliseconds:
utt001 [[320, 2150], [3800, 6420], [8100, 12500]]
utt002 [[0, 5000], [6200, 9800]]
utt003 [[100, 3000]]
Prepare separate train_wav.scp / train_vad.txt and val_wav.scp / val_vad.txt for training and validation.
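
The two file formats described above can be sanity-checked before training. The sketch below parses both files and reports mismatched IDs and malformed segments. The helper names are hypothetical, and treating overlapping or out-of-order segments as errors is an assumption, not a documented FunASR requirement.

```python
import json


def parse_wav_scp(text):
    """Return {utt_id: wav_path} from the contents of a wav.scp file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            utt_id, path = line.split(maxsplit=1)
            entries[utt_id] = path
    return entries


def parse_vad_txt(text):
    """Return {utt_id: [[start_ms, end_ms], ...]} from a vad.txt file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            utt_id, raw = line.split(maxsplit=1)
            entries[utt_id] = json.loads(raw)  # segment list is valid JSON
    return entries


def check_labels(wavs, labels):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for utt_id in sorted(set(wavs) ^ set(labels)):
        problems.append(f"{utt_id}: present in only one of the two files")
    for utt_id, segments in labels.items():
        prev_end = 0
        for start, end in segments:
            if start >= end:
                problems.append(f"{utt_id}: empty or reversed segment [{start}, {end}]")
            if start < prev_end:
                problems.append(f"{utt_id}: segment [{start}, {end}] overlaps or is out of order")
            prev_end = max(prev_end, end)
    return problems


wavs = parse_wav_scp("utt001 /path/to/audio1.wav\nutt002 /path/to/audio2.wav\n")
labels = parse_vad_txt("utt001 [[320, 2150], [3800, 6420]]\nutt002 [[0, 5000], [6200, 9800]]\n")
print(check_labels(wavs, labels))  # []
```

To check real files, read them with `open(path, encoding="utf-8").read()` and pass the text to the two parsers.
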
- Minimum: 50-100 labeled audio files
- Recommended: 500+ for robust fine-tuning
- Include both positive (speech) and negative (silence/noise) examples representative of your target domain
data/list/
├── train_wav.scp
├── train_vad.txt
├── val_wav.scp
└── val_vad.txt
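
If the labeled data starts out as a single wav.scp / vad.txt pair, the four files in this layout can be produced with a small script. The following is one possible sketch. The function names, the 10% validation fraction, and the fixed seed are assumptions chosen for illustration. The demo runs in a temporary directory so it does not touch existing files.

```python
import random
import tempfile
from pathlib import Path


def read_lines_by_id(path):
    """Map utterance ID -> full line for an 'id value' style file."""
    lines = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            lines[line.split(maxsplit=1)[0]] = line
    return lines


def split_ids(utt_ids, val_fraction=0.1, seed=0):
    """Shuffle IDs deterministically and hold out a validation fraction."""
    ids = sorted(utt_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    return sorted(ids[n_val:]), sorted(ids[:n_val])


def write_subset(lines_by_id, utt_ids, path):
    """Write the lines for the given IDs, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(lines_by_id[u] + "\n" for u in utt_ids), encoding="utf-8")


def make_split(wav_scp, vad_txt, out_dir, val_fraction=0.1, seed=0):
    wavs = read_lines_by_id(wav_scp)
    labels = read_lines_by_id(vad_txt)
    # Keep only utterances that have both audio and labels.
    train_ids, val_ids = split_ids(set(wavs) & set(labels), val_fraction, seed)
    out = Path(out_dir)
    write_subset(wavs, train_ids, out / "train_wav.scp")
    write_subset(labels, train_ids, out / "train_vad.txt")
    write_subset(wavs, val_ids, out / "val_wav.scp")
    write_subset(labels, val_ids, out / "val_vad.txt")
    return train_ids, val_ids


# Demo with three utterances in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "wav.scp").write_text(
        "utt001 /path/to/audio1.wav\nutt002 /path/to/audio2.wav\nutt003 /path/to/audio3.wav\n",
        encoding="utf-8",
    )
    (tmp / "vad.txt").write_text(
        "utt001 [[320, 2150]]\nutt002 [[0, 5000]]\nutt003 [[100, 3000]]\n",
        encoding="utf-8",
    )
    train_ids, val_ids = make_split(
        tmp / "wav.scp", tmp / "vad.txt", tmp / "data" / "list", val_fraction=0.34
    )
    created = sorted(p.name for p in (tmp / "data" / "list").iterdir())

print(created)
print(len(train_ids), len(val_ids))
```

A random split is the simplest choice. If several recordings come from the same speaker or session, splitting by speaker or session instead may give a more honest validation score.
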
cd examples/industrial_data_pretraining/fsmn_vad_streaming
bash finetune.sh

Key parameters in finetune.sh:

| Parameter | Default | Description |
|---|---|---|
| model_name_or_model_dir | iic/speech_fsmn_vad_zh-cn-16k-common-pytorch | Base model (use *-8k-* for 8kHz audio) |
| train_conf.max_epoch | 20 | Training epochs |
| optim_conf.lr | 0.00005 | Learning rate (lower = more conservative) |
| train_conf.validate_interval | 1000 | Validation frequency (steps) |

from funasr import AutoModel
model = AutoModel(model="/path/to/outputs/model_dir")
result = model.generate(input="test.wav")

For minor adjustments, pass VAD parameters directly:
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={
        "max_end_silence_time": 500,       # Max silence within a sentence (ms)
        "speech_noise_thres": 0.5,         # Speech/noise threshold (lower = more sensitive)
        "max_single_segment_time": 60000,  # Max segment duration (ms)
    }
)

| Parameter | Default | Effect |
|---|---|---|
| max_end_silence_time | 800 | Increase to tolerate longer pauses within sentences |
| speech_noise_thres | 0.6 | Decrease to detect more speech (may increase false alarms) |
| max_single_segment_time | 60000 | Max duration of a single speech segment |
| speech_2_noise_ratio | 1.0 | Decrease to be more sensitive to speech |
| max_start_silence_time | 3000 | Max leading silence before speech starts |
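
As a rough mental model of max_end_silence_time, pauses shorter than the threshold stay inside one segment, while longer pauses split it. The sketch below imitates that effect offline by merging segments separated by short gaps. It is an illustration only, not FunASR's implementation, and `merge_short_gaps` is a hypothetical helper.

```python
def merge_short_gaps(segments, max_gap_ms):
    """Merge [start_ms, end_ms] segments whose gap is at most max_gap_ms."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Pause is short enough: extend the previous segment.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


# Gaps between consecutive segments: 450 ms and 1680 ms.
segments = [[320, 2150], [2600, 6420], [8100, 12500]]
print(merge_short_gaps(segments, 500))   # [[320, 6420], [8100, 12500]]
print(merge_short_gaps(segments, 2000))  # [[320, 12500]]
```

A larger threshold yields fewer, longer segments. That suits speech with long pauses but can fold in more non-speech audio.
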