NemoPreprocessor: Divide by zero warning if silence #119

@Robotvasya

There is a problem with the preprocessor for NeMo models when using VAD. (I have not observed this issue with other models.)

If an audio segment begins with silence, NumPy emits a RuntimeWarning from a division (apparently 0/0, reported as "invalid value encountered in divide"). The end timestamp of that segment also comes out as a huge negative number.

Code:

import onnx_asr

asr = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3")
vad = onnx_asr.load_vad("silero")
asr_with_vad = asr.with_vad(
    vad,
    threshold=0.40,
    min_speech_duration_ms=150,
    max_speech_duration_s=30,
    min_silence_duration_ms=1200,
    speech_pad_ms=110,
)
result = asr_with_vad.recognize("test_files/test.wav")

for segment in result:
    # if segment.text:  # dirty hack: skip segments that contain only silence
    print(f"Начало: {int(segment.start)//60:02d}:{int(segment.start)%60:02d}")  # "Start" as mm:ss
    print(f"Конец: {segment.end} сек.")  # "End", in seconds
    print(f"Текст: {segment.text}")  # "Text"
    print(f"Длительность: {(segment.end - segment.start):.2f} сек.")  # "Duration", in seconds
    print("-" * 40)
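Until the preprocessor is fixed, the commented-out check in the loop can be made a bit more defensive. The following is only a sketch: `Segment` is a hypothetical stand-in for the objects yielded by `recognize` (only the `start`, `end` and `text` fields used above are assumed), and `is_usable` is a made-up helper name. The sample values are taken from the console output in this report.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a recognized segment (not the onnx_asr class)."""
    start: float
    end: float
    text: str


def is_usable(segment: Segment) -> bool:
    """Keep segments that have text and finite, correctly ordered timestamps."""
    return (
        bool(segment.text.strip())
        and math.isfinite(segment.start)
        and math.isfinite(segment.end)
        and segment.end > segment.start
    )


segments = [
    Segment(0.0, -62499999999.89, ""),                 # the broken silent segment
    Segment(3.566, 28.366, "some recognized text"),    # a normal segment
]
usable = [s for s in segments if is_usable(s)]
print(len(usable))
```

This only hides the symptom on the caller's side; the warning itself is still emitted by the preprocessor.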

Console log output:

onnx_asr\preprocessors\numpy_preprocessor.py:171: RuntimeWarning: invalid value encountered in divide
  mean = np.divide(
Начало: 00:00
Конец: -62499999999.89 сек.
Текст: 
Длительность: -62499999999.89 сек.
----------------------------------------
Начало: 00:03
Конец: 28.366 сек.
Текст: Итак, давайте сейчас сделаем все необходимое, подключение цветов, шрифтов, в целом, bla-bla-bla
Длительность: 24.80 сек.
----------------------------------------

The end timestamp of the silent segment is also bogus, as if it had overflowed: segment.end = -62499999999.89.

numpy_preprocessor.py:

class NemoPreprocessorNumpy(_NumpyPreprocessor):
    ... 
    mean = np.divide(
        np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
        features_lens[:, None, None],
        dtype=np.float32,
    )
    ...
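The warning can be reproduced with NumPy alone, outside the library. The sketch below mirrors the excerpt above under assumptions that are not confirmed against the library source: the array layout is (batch, frames, features), `mask` marks the first `features_lens[i]` frames of each item as valid, and a silent segment ends up with `features_lens == 0`. The name `masked_mean` is made up for illustration.

```python
import warnings

import numpy as np


def masked_mean(log_mel_spectrogram, features_lens):
    """Mean over valid frames, written like the excerpt (no guard for length 0)."""
    frames = log_mel_spectrogram.shape[1]
    # Assumed mask: True for the first features_lens[i] frames of item i.
    mask = (np.arange(frames)[None, :] < features_lens[:, None])[:, :, None]
    return np.divide(
        np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
        features_lens[:, None, None],
        dtype=np.float32,
    )


# Item 0 has zero valid frames (silence), item 1 has three valid frames.
log_mel = np.ones((2, 4, 3), dtype=np.float32)
features_lens = np.array([0, 3], dtype=np.int64)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean = masked_mean(log_mel, features_lens)

print([str(w.message) for w in caught])  # the RuntimeWarning text(s)
print(np.isnan(mean[0]).all())           # whether the silent item became NaN
```

With a length of zero the numerator is also zero, so the division is 0/0. That produces NaN and an "invalid value" warning rather than a "divide by zero" one, which matches the wording in the console log.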

Naive fix (untested):

mean = np.divide(
    np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
-    features_lens[:, None, None],
+    np.maximum(features_lens[:, None, None], 1),  # clamp to at least 1 so an empty segment gives 0/1 instead of 0/0
    dtype=np.float32,
)
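To check the arithmetic of this change in isolation (not inside onnx_asr itself), here is a small NumPy-only sketch. It assumes a (batch, frames, features) layout and a mask that marks the first `features_lens[i]` frames as valid; `masked_mean_clamped` is a hypothetical name, not a function from the library.

```python
import warnings

import numpy as np


def masked_mean_clamped(log_mel_spectrogram, features_lens):
    """Mean over valid frames with the divisor clamped to at least 1."""
    frames = log_mel_spectrogram.shape[1]
    # Assumed mask: True for the first features_lens[i] frames of item i.
    mask = (np.arange(frames)[None, :] < features_lens[:, None])[:, :, None]
    return np.divide(
        np.where(mask, log_mel_spectrogram, 0.0).sum(axis=1, keepdims=True),
        np.maximum(features_lens[:, None, None], 1),
        dtype=np.float32,
    )


# Item 0 has zero valid frames (silence), item 1 has three valid frames.
log_mel = np.ones((2, 4, 3), dtype=np.float32)
features_lens = np.array([0, 3], dtype=np.int64)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mean = masked_mean_clamped(log_mel, features_lens)

print(len(caught), np.isfinite(mean).all())
```

In this sketch the empty item gets a mean of 0 and non-empty items are unchanged, since their lengths are already at least 1. Whether later steps of the preprocessor (for example a variance or standard deviation computed from the same lengths) need a similar guard is not covered here.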

Labels: bug (Something isn't working)
