Feature request: Add SenseVoice/Paraformer as alternative ASR backend #1447

## Summary

Would it be possible to add [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) and [Paraformer](https://github.com/modelscope/FunASR) as alternative ASR backends alongside the current CTranslate2-based Whisper?

## Motivation

faster-whisper excels at efficient Whisper inference. However, some use cases would benefit from alternative ASR architectures:

- **SenseVoice** (234M params): Non-autoregressive model reported to run ~25x faster than Whisper-large with comparable accuracy across 50+ languages. It also provides emotion detection and audio event classification.
- **Paraformer**: Non-autoregressive Chinese ASR model with state-of-the-art accuracy on AISHELL benchmarks and built-in VAD and punctuation.
- **Fun-ASR-Nano**: LLM-based ASR (SenseVoice encoder + Qwen3-0.6B decoder, 800M params) supporting 31 languages.

All models are available via [FunASR](https://github.com/modelscope/FunASR) (`pip install funasr`) and as ONNX exports via [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) for CTranslate2-like optimized inference.

## Benchmark comparison

| Model | Type | Speed (GPU) | Languages | Extra features |
|-------|------|-------------|-----------|----------------|
| Whisper-large-v3 | Autoregressive | 1x baseline | 99 | Translation |
| SenseVoice-Small | Non-AR | ~25x | 50+ | Emotion, events |
| Paraformer-large | Non-AR | ~170x realtime | Chinese | VAD, punctuation |
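The speed column mixes two kinds of figures: a speedup relative to Whisper-large-v3 and a multiple of real time. The sketch below shows how the two relate. The helper names are hypothetical, and apart from the ~170x figure from the table the inputs are purely illustrative, not measurements.

```python
def realtime_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, multiple: float) -> float:
    """Expected wall-clock time for a model running at `multiple` x real time."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return audio_seconds / multiple


def relative_speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of a candidate over a baseline on the same audio."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# At ~170x real time, one hour of audio would take roughly 21 seconds.
one_hour = processing_time(3600.0, 170.0)
print(f"{one_hour:.1f} s")
```

A relative speedup only converts to a real-time multiple if the baseline's own real-time multiple on the same hardware is known, so the two rows are not directly comparable as written.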

## Quick start

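```python
# Install first (run in a shell, not in Python): pip install funasr
from funasr import AutoModel

# Load SenseVoice-Small; the weights are downloaded on first use
model = AutoModel(model="iic/SenseVoiceSmall")

# generate() returns a list of result dicts; the transcript is under "text"
result = model.generate(input="audio.wav")
print(result[0]["text"])
```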
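To make the request more concrete, here is a minimal sketch of how an alternative backend could sit behind a common interface. Everything in it is an assumption for illustration: `Transcript`, `ASRBackend`, and `FunASRBackend` are hypothetical names, not part of faster-whisper or FunASR. The only FunASR calls used are `AutoModel(model=...)` and `generate(input=...)` from the quick start above.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Transcript:
    """Hypothetical minimal result type shared by all backends."""
    text: str
    backend: str


class ASRBackend(Protocol):
    """Hypothetical interface an alternative backend would implement."""

    def transcribe(self, audio_path: str) -> Transcript: ...


class FunASRBackend:
    """Hypothetical adapter around a FunASR-style model object."""

    def __init__(self, model: Any, name: str = "funasr") -> None:
        self._model = model
        self._name = name

    @classmethod
    def from_pretrained(cls, model_id: str = "iic/SenseVoiceSmall") -> "FunASRBackend":
        # Imported lazily so funasr stays an optional dependency.
        from funasr import AutoModel

        return cls(AutoModel(model=model_id), name=model_id)

    def transcribe(self, audio_path: str) -> Transcript:
        # generate() returns a list of dicts; join their "text" fields.
        results = self._model.generate(input=audio_path)
        text = " ".join(item.get("text", "") for item in results).strip()
        return Transcript(text=text, backend=self._name)
```

With this shape, `FunASRBackend.from_pretrained().transcribe("audio.wav")` would be the only call site that touches FunASR. SenseVoice output may also carry language and emotion tags in the text, so a real adapter would likely need a post-processing step that this sketch omits.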

## References

- FunASR: https://github.com/modelscope/FunASR (16K+ stars)
- SenseVoice paper: https://arxiv.org/abs/2407.04051
- Paraformer paper: https://arxiv.org/abs/2206.08317
