Summary
Would it be possible to add SenseVoice and Paraformer as alternative ASR backends alongside the current CTranslate2-based Whisper?
Motivation
faster-whisper excels at efficient Whisper inference. However, some use cases would benefit from alternative ASR architectures:
- SenseVoice (234M params): Non-autoregressive model reported to run ~25x faster than Whisper-large with comparable accuracy across 50+ languages. It also provides emotion detection and audio event classification.
- Paraformer: Non-autoregressive Chinese ASR with state-of-the-art accuracy on AISHELL benchmarks, including built-in VAD and punctuation.
- Fun-ASR-Nano: LLM-based ASR (SenseVoice encoder + Qwen3-0.6B decoder, 800M params) supporting 31 languages.
All models are available via FunASR (pip install funasr) and as ONNX exports via Sherpa-ONNX for CTranslate2-like optimized inference.
Benchmark comparison
| Model | Type | Speed (GPU) | Languages | Extra features |
| --- | --- | --- | --- | --- |
| Whisper-large-v3 | Autoregressive | 1x baseline | 99 | Translation |
| SenseVoice-Small | Non-AR | ~25x | 50+ | Emotion, events |
| Paraformer-large | Non-AR | ~170x realtime | Chinese | VAD, punctuation |
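Note that the speed column mixes two units: the SenseVoice figure is relative to Whisper-large, while the Paraformer figure is relative to real time. As a rough sketch of how to read a real-time multiple, taking the ~170x figure at face value (the function name `processing_seconds` is illustrative only, not part of any library):

```python
def processing_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Estimate wall-clock transcription time at a given real-time multiple."""
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_seconds / realtime_multiple


# One hour of audio at ~170x realtime.
estimate = processing_seconds(3600.0, 170.0)
print(f"{estimate:.1f} s")  # prints "21.2 s"
```

The two figures are not directly comparable without a measured real-time multiple for the Whisper baseline on the same hardware.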
Quick start
    pip install funasr

    from funasr import AutoModel

    # Load SenseVoice-Small by its model ID
    model = AutoModel(model="iic/SenseVoiceSmall")
    # Transcribe a local audio file
    result = model.generate(input="audio.wav")
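To make the request more concrete, here is a minimal sketch of what a pluggable backend could look like. The names `Transcript`, `ASRBackend`, and `FunASRBackend` are hypothetical and do not exist in faster-whisper or FunASR. The only FunASR calls used are the two from the quick start above, and the assumption that `generate` returns a list of dicts with a `"text"` key is marked in the code.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Transcript:
    text: str
    raw: Any  # backend-specific output, kept so extras (emotion, events) are not lost


class ASRBackend(Protocol):
    """Hypothetical interface that any backend would implement."""

    def transcribe(self, audio_path: str) -> Transcript: ...


class FunASRBackend:
    """Hypothetical adapter around the quick-start calls."""

    def __init__(self, model_name: str = "iic/SenseVoiceSmall", model: Any = None):
        if model is None:
            # Imported lazily so this module loads even without funasr installed.
            from funasr import AutoModel

            model = AutoModel(model=model_name)
        self._model = model

    def transcribe(self, audio_path: str) -> Transcript:
        result = self._model.generate(input=audio_path)
        # Assumption: result is a list of dicts, each with a "text" key.
        text = " ".join(item.get("text", "") for item in result)
        return Transcript(text=text, raw=result)
```

With funasr installed, `FunASRBackend().transcribe("audio.wav").text` would return the joined transcript. Passing a preloaded `model` keeps the adapter testable without downloading weights.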