The preprocessor for NeMo models misbehaves when used with VAD. (I haven't observed this with other models.)
If an audio segment begins with silence, NumPy emits a RuntimeWarning about an invalid value in a division (apparently a division by zero), and the segment's end time comes out as a huge negative number, as if it had overflowed.
Code:
import onnx_asr

asr = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
vad = onnx_asr.load_vad("silero")
asr_with_vad = asr.with_vad(
    vad,
    threshold=0.40,
    min_speech_duration_ms=150,
    max_speech_duration_s=30,
    min_silence_duration_ms=1200,
    speech_pad_ms=110,
)
result = asr_with_vad.recognize("test_files/test.wav")
for segment in result:
    # if segment.text:  # Dirty hack to skip the silent segments
    print(f"Start: {int(segment.start)//60:02d}:{int(segment.start)%60:02d}")
    print(f"End: {segment.end} s")
    print(f"Text: {segment.text}")
    print(f"Duration: {(segment.end - segment.start):.2f} s")
    print("-" * 40)
Console log output:
onnx_asr\preprocessors\numpy_preprocessor.py:171: RuntimeWarning: invalid value encountered in divide
  mean = np.divide(
Start: 00:00
End: -62499999999.89 s
Text:
Duration: -62499999999.89 s
----------------------------------------
Start: 00:03
End: 28.366 s
Text: Итак, давайте сейчас сделаем все необходимое, подключение цветов, шрифтов, в целом, bla-bla-bla
Duration: 24.80 s
----------------------------------------
The end time of the first (silent) segment is also clearly wrong, as if it had overflowed: segment.end = -62499999999.89.
numpy_preprocessor.py:
class NemoPreprocessorNumpy(_NumpyPreprocessor):
    ...
        mean = np.divide(
            np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
            features_lens[:, None, None],
            dtype=np.float32,
        )
    ...
Naive fix (not tested):
  mean = np.divide(
      np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
-     features_lens[:, None, None],
+     np.maximum(features_lens[:, None, None], 1),  # clamp the length to at least 1
      dtype=np.float32,
  )
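To show what the clamp changes, here is a minimal standalone sketch in plain NumPy. It is not the onnx_asr code: `masked_mean`, the way `mask` is built, and the assumed `(batch, frames, mel_bins)` layout are my own illustration based on the snippet above. It only shows that a length of 0 gives NaN with a RuntimeWarning, and that clamping the divisor to 1 gives zeros instead. I haven't checked whether this also fixes the wrong `segment.end`.

```python
import warnings

import numpy as np


def masked_mean(log_mel_spectrogram, features_lens, clamp=False):
    """Per-item mean over the first features_lens[i] frames.

    Hypothetical stand-in for the normalization step quoted above;
    assumes a (batch, frames, mel_bins) layout.
    """
    n_frames = log_mel_spectrogram.shape[1]
    # True for frames that belong to the item, False for padding.
    mask = np.arange(n_frames)[None, :, None] < features_lens[:, None, None]
    denom = features_lens[:, None, None]
    if clamp:
        # Proposed change: never divide by a length of zero.
        denom = np.maximum(denom, 1)
    return np.divide(
        np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
        denom,
        dtype=np.float32,
    )


features = np.ones((2, 4, 3), dtype=np.float32)
lens = np.array([0, 4], dtype=np.int64)  # first item has no valid frames

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    unclamped = masked_mean(features, lens)

print("warnings:", [str(w.message) for w in caught])
print("unclamped, empty item:", unclamped[0].ravel())

clamped = masked_mean(features, lens, clamp=True)
print("clamped, empty item:  ", clamped[0].ravel())
print("clamped, full item:   ", clamped[1].ravel())
```

With the clamp, an empty segment yields an all-zero mean instead of NaN, so the NaN no longer propagates into later steps. Skipping zero-length segments before they reach the preprocessor might be a cleaner fix, but that is a design decision for the maintainers.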