The simple speech recognition package with minimal dependencies:
* NumPy ([numpy](https://numpy.org/))
@@ -11,11 +11,17 @@ The simple speech recognition package with minimal dependencies:
The package does not yet have built-in VAD support, so in order to recognize long audio files, they must first be split into parts.
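As a rough illustration of the splitting step, the sketch below cuts a sequence of audio samples into fixed-length chunks. The function name `split_into_chunks` and the 30-second default are assumptions made for this example, not part of the package. Fixed-length cuts can land in the middle of a word, so this is only a crude stand-in for real voice activity detection. Each chunk would still have to be saved (for example as a WAV file) and recognized separately.

```python
def split_into_chunks(samples, sample_rate, max_seconds=30.0):
    """Split a 1-D sequence of samples into consecutive chunks.

    Hypothetical helper: every chunk holds at most `max_seconds` of audio,
    and the last chunk may be shorter. Works with lists and NumPy arrays.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len == 0:
        raise ValueError("max_seconds is too short for this sample rate")
    # Step through the signal in strides of chunk_len samples.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Ten samples at 2 Hz with a 2-second limit give chunks of 4, 4 and 2 samples.
chunks = split_into_chunks(list(range(10)), sample_rate=2, max_seconds=2)
```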

-## Supported models
+## Supported models architectures
+
+The package supports the following modern ASR model architectures ([comparison](#comparison-with-original-implementations) with original implementations):
* Nvidia NeMo Conformer/FastConformer (with CTC and RNN-T decoders)
When saving these models in ONNX format, usually only the encoder and decoder are saved. To run them, the corresponding preprocessor and decoding must be implemented. Therefore, the package contains these implementations for all supported models:
+* Log-mel spectrogram preprocessors
+* Greedy search decoding
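To make the decoding step concrete, here is a minimal sketch of greedy search for a CTC decoder: take the best-scoring token in each frame, merge consecutive repeats, and drop the blank token. This is an illustration rather than the package's own implementation. The function name `ctc_greedy_decode` and the default `blank_id=0` are assumptions, since the blank index differs between models, and RNN-T decoders use a different greedy procedure.

```python
def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decoding over per-frame scores.

    `logits` is a list of frames, each a list of scores per token id.
    `blank_id` is an assumed blank index; check the model's vocabulary.
    """
    tokens = []
    prev = None
    for frame in logits:
        # Pick the highest-scoring token id for this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the token changes and is not the blank.
        if best != prev and best != blank_id:
            tokens.append(best)
        prev = best
    return tokens


def one_hot_frame(token_id, vocab_size=3):
    """Build a toy frame whose highest score is at `token_id`."""
    frame = [0.1] * vocab_size
    frame[token_id] = 0.9
    return frame


# Frame-wise best ids: 1 1 0 1 2 2 0. The blank between the 1s keeps both.
frames = [one_hot_frame(t) for t in [1, 1, 0, 1, 2, 2, 0]]
decoded = ctc_greedy_decode(frames)
```

The resulting token ids would still need to be mapped to text through the model's vocabulary.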
## Installation
@@ -110,14 +116,53 @@ import onnx_asr
model = onnx_asr.load_model("gigaam-v2-ctc", "models/gigaam-onnx")
print(model.recognize("test.wav"))
```
-Supported model types:
+#### Supported model types:
* All models from [supported model names](#supported-model-names)
* `nemo-conformer-ctc` for NeMo Conformer with CTC decoder
* `nemo-conformer-rnnt` for NeMo Conformer with RNN-T decoder
* `kaldi-rnnt` or `vosk` for Kaldi Icefall Zipformer with stateless RNN-T decoder
* `whisper-ort` for Whisper (exported with [onnxruntime](#openai-whisper-with-onnxruntime-export))
* `whisper-hf` for Whisper (exported with [optimum](#openai-whisper-with-optimum-export))

+## Comparison with original implementations
+
+Packages with original implementations:
+* `gigaam` for GigaAM models ([github](https://github.com/salute-developers/GigaAM))
+* `nemo-toolkit` for NeMo models ([github](https://github.com/nvidia/nemo))
+* `openai-whisper` for Whisper models ([github](https://github.com/openai/whisper))
+* `sherpa-onnx` for Vosk models ([github](https://github.com/k2-fsa/sherpa-onnx), [docs](https://k2-fsa.github.io/sherpa/onnx/index.html))
+
+Tests were performed on a *test* subset of the [Russian LibriSpeech](https://openslr.org/96/) dataset.
+
+Hardware:
+* CPU - Intel i7-7700HQ
+* GPU - Nvidia T4 (in Google Colab)
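For readers unfamiliar with the metrics used in this comparison, the sketch below shows one common way to compute them. CER and WER are the Levenshtein edit distance between reference and hypothesis, counted over characters and words respectively and divided by the reference length. RTFx is the audio duration divided by the processing time, so higher is faster. The function names are hypothetical, and the text normalization used for the reported numbers is not stated here, so this is not necessarily how they were obtained.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio processed per second."""
    return audio_seconds / processing_seconds
```

In practice the texts are usually lowercased and stripped of punctuation before scoring, which can change both error rates noticeably.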
+
+| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |