Commit 05a2491

Update readme (add models comparison)
1 parent cdd09ea commit 05a2491

1 file changed

Lines changed: 49 additions & 4 deletions

README.md

@@ -1,8 +1,8 @@
11
# Automatic Speech Recognition in Python using ONNX models
22

3-
[![CI](https://github.com/istupakov/onnx-asr/actions/workflows/python-package.yml/badge.svg)](https://github.com/istupakov/onnx-asr/actions/workflows/python-package.yml)
43
[![PyPI - Version](https://img.shields.io/pypi/v/onnx-asr.svg)](https://pypi.org/project/onnx-asr)
54
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/onnx-asr.svg)](https://pypi.org/project/onnx-asr)
5+
[![CI](https://github.com/istupakov/onnx-asr/actions/workflows/python-package.yml/badge.svg)](https://github.com/istupakov/onnx-asr/actions/workflows/python-package.yml)
66

77
A simple speech recognition package with minimal dependencies:
88
* NumPy ([numpy](https://numpy.org/))
@@ -11,11 +11,17 @@ The simple speech recognition package with minimal dependencies:
1111

1212
The package does not yet have built-in VAD support, so long audio files must be split into parts before recognition.
1313

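As a rough illustration of the splitting step, the sketch below cuts a PCM WAV file into fixed-length pieces using only the standard-library `wave` module. The helper name `split_wav` and the 30-second default are assumptions made for this example and are not part of the package. Fixed-length cuts can fall in the middle of a word, so splitting on silence (for example with an external VAD) would likely give better transcripts.

```python
import wave
from pathlib import Path


def split_wav(path, out_dir, chunk_seconds=30.0):
    """Split a PCM WAV file into consecutive fixed-length chunks.

    Hypothetical helper for illustration; not part of onnx-asr.
    Returns the list of chunk file paths in playback order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        if frames_per_chunk < 1:
            raise ValueError("chunk_seconds is too short for this sample rate")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of file reached
            chunk_path = out_dir / f"{Path(path).stem}_{index:04d}.wav"
            with wave.open(str(chunk_path), "wb") as dst:
                # Copy channels, sample width and rate; the frame count
                # in the header is corrected when the data is written.
                dst.setparams(params)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each resulting file can then be passed to `model.recognize(...)` in turn, and the partial transcripts joined in order.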
14-
## Supported models
14+
## Supported model architectures
15+
16+
The package supports the following modern ASR model architectures ([comparison](#comparison-with-original-implementations) with original implementations):
1517
* Nvidia NeMo Conformer/FastConformer (with CTC and RNN-T decoders)
1618
* Kaldi Icefall Zipformer (with stateless RNN-T decoder) including Alpha Cephei Vosk 0.52+
1719
* Sber GigaAM v2 (with CTC and RNN-T decoders)
18-
* OpenAI Whisper (with simple decoding)
20+
* OpenAI Whisper
21+
22+
When these models are exported to ONNX format, usually only the encoder and decoder are saved. Running them also requires the matching preprocessor and decoding step, so the package implements both for all supported models:
23+
* Log-mel spectrogram preprocessors
24+
* Greedy search decoding
1925

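To make the "greedy search decoding" item above concrete, here is a minimal sketch of greedy decoding for a CTC model: take the best-scoring token in every frame, collapse consecutive repeats, and drop the blank token. It is written in plain Python for readability and is not the package's actual implementation; the function name, the toy vocabulary, and the blank index are assumptions for this example.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id):
    """Greedy CTC decoding (illustrative sketch, not the onnx-asr code).

    frame_scores: one list of per-token scores (e.g. log-probabilities)
                  for each encoder output frame.
    vocab:        list mapping token id -> text piece.
    blank_id:     id of the CTC blank token.
    """
    # Step 1: pick the best-scoring token id in every frame.
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: collapse consecutive repeats, then drop blanks.
    pieces = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            pieces.append(vocab[token_id])
        previous = token_id
    return "".join(pieces)


# Toy example: vocabulary of two letters plus a blank at index 2.
toy_vocab = ["a", "b", "<blank>"]
toy_scores = [
    [0.9, 0.0, 0.1],  # a
    [0.8, 0.1, 0.1],  # a (repeat, collapsed)
    [0.1, 0.1, 0.8],  # blank (separates the two "a" tokens)
    [0.7, 0.2, 0.1],  # a
    [0.1, 0.8, 0.1],  # b
    [0.2, 0.7, 0.1],  # b (repeat, collapsed)
]
decoded = ctc_greedy_decode(toy_scores, toy_vocab, blank_id=2)
```

RNN-T greedy search follows a different loop, since the decoder state is advanced after every emitted token, so this sketch covers the CTC case only.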
2026
## Installation
2127

@@ -110,14 +116,53 @@ import onnx_asr
110116
model = onnx_asr.load_model("gigaam-v2-ctc", "models/gigaam-onnx")
111117
print(model.recognize("test.wav"))
112118
```
113-
Supported model types:
119+
#### Supported model types:
114120
* All models from [supported model names](#supported-model-names)
115121
* `nemo-conformer-ctc` for NeMo Conformer with CTC decoder
116122
* `nemo-conformer-rnnt` for NeMo Conformer with RNN-T decoder
117123
* `kaldi-rnnt` or `vosk` for Kaldi Icefall Zipformer with stateless RNN-T decoder
118124
* `whisper-ort` for Whisper (exported with [onnxruntime](#openai-whisper-with-onnxruntime-export))
119125
* `whisper-hf` for Whisper (exported with [optimum](#openai-whisper-with-optimum-export))
120126

127+
## Comparison with original implementations
128+
129+
Packages with original implementations:
130+
* `gigaam` for GigaAM models ([github](https://github.com/salute-developers/GigaAM))
131+
* `nemo-toolkit` for NeMo models ([github](https://github.com/nvidia/nemo))
132+
* `openai-whisper` for Whisper models ([github](https://github.com/openai/whisper))
133+
* `sherpa-onnx` for Vosk models ([github](https://github.com/k2-fsa/sherpa-onnx), [docs](https://k2-fsa.github.io/sherpa/onnx/index.html))
134+
135+
Tests were performed on a *test* subset of the [Russian LibriSpeech](https://openslr.org/96/) dataset.
136+
137+
Hardware:
138+
* CPU - Intel i7-7700HQ
139+
* GPU - Nvidia T4 (in Google Colab)
140+
141+
| Model | Package / decoding | CER | WER | RTFx (CPU) | RTFx (GPU) |
142+
|--------------------------|----------------------|--------|--------|------------|--------------|
143+
| GigaAM v2 CTC | default | 1.06% | 5.23% | 7.2 | 44.2 |
144+
| GigaAM v2 CTC | onnx-asr | 1.06% | 5.23% | 11.4 | 58.6 |
145+
| GigaAM v2 RNN-T | default | 1.10% | 5.22% | 5.5 | 23.3 |
146+
| GigaAM v2 RNN-T | onnx-asr | 1.10% | 5.22% | 10.4 | 26 |
147+
| Nemo FastConformer CTC | default | 3.11% | 13.12% | 22.7 | 71.7 |
148+
| Nemo FastConformer CTC | onnx-asr | 3.11% | 13.12% | 43.1 | 88.8 |
149+
| Nemo FastConformer RNN-T | default | 2.63% | 11.62% | 15.9 | 13.9 |
150+
| Nemo FastConformer RNN-T | onnx-asr | 2.63% | 11.62% | 26.0 | 49 |
151+
| Vosk 0.52 small | greedy_search | 3.64% | 14.53% | 48.2 | 71.4 |
152+
| Vosk 0.52 small | modified_beam_search | 3.50% | 14.25% | 29.0 | 24.7 |
153+
| Vosk 0.52 small | onnx-asr | 3.64% | 14.53% | 42.5 | 60.2 |
154+
| Vosk 0.54 | greedy_search | 2.21% | 9.89% | 34.8 | 64.2 |
155+
| Vosk 0.54 | modified_beam_search | 2.21% | 9.85% | 23.9 | 24 |
156+
| Vosk 0.54 | onnx-asr | 2.21% | 9.89% | 32.2 | 55.9 |
157+
| Whisper base | default | 10.53% | 38.82% | 5.4 | 13.6 |
158+
| Whisper base | onnx-asr | 10.64% | 38.33% | 6.3** | 16.1*/19.4** |
159+
| Whisper large-v3-turbo | default | 2.96% | 10.27% | N/A | 11 |
160+
| Whisper large-v3-turbo | onnx-asr | 2.63% | 10.08% | N/A | 9.8* |
161+
162+
1. \* `whisper-hf` model ([model types](#supported-model-types)) with `fp16` quantization.
163+
2. \*\* `whisper-ort` model ([model types](#supported-model-types)).
164+
3. All other models were run with the default precision: `fp32` on CPU, and `fp32` or `fp16` (for some of the original models) on GPU.
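
For readers unfamiliar with the table columns, the sketch below shows how such metrics are commonly defined: WER and CER are the edit distance between reference and hypothesis divided by the reference length (in words or characters), and RTFx is the audio duration divided by the processing time, so higher is faster. The text normalization and aggregation actually used for the table are not stated here, so treat this as an assumption-laden illustration; all function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the strings as given (spaces included)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio per second of compute."""
    return audio_seconds / processing_seconds
```

For example, one substituted word in a three-word reference gives a WER of one third, and processing 60 seconds of audio in 5 seconds gives an RTFx of 12.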
165+
121166
## Convert model to ONNX
122167

123168
### Nvidia NeMo Conformer/FastConformer
