faster-whisper does a great job optimizing Whisper inference with CTranslate2. I wanted to bring up a discussion about non-autoregressive ASR models as a potential alternative approach.
The speed ceiling of autoregressive models
Even with CTranslate2 optimizations, Whisper's autoregressive architecture has a fundamental speed ceiling — each token depends on the previous one, limiting parallelism. faster-whisper achieves ~4x speedup over vanilla Whisper, reaching roughly 50x realtime on GPU.
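To make the sequential dependency concrete, here is a toy sketch of a greedy autoregressive decode loop. It is not faster-whisper or CTranslate2 code; `CountingStep` is a hypothetical stand-in for a decoder forward pass, and the token values are arbitrary. The point is only that each step needs the output of the previous one, so the number of sequential model calls grows with the output length.

```python
from typing import Callable, List


def greedy_decode(step: Callable[[List[int]], int], bos: int, eos: int, max_len: int) -> List[int]:
    """Toy autoregressive loop: every call to `step` sees all tokens produced so far."""
    tokens = [bos]
    for _ in range(max_len):
        next_token = step(tokens)  # cannot run until the previous token is known
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


class CountingStep:
    """Hypothetical stand-in for a decoder forward pass; counts how often it is called."""

    def __init__(self, eos: int) -> None:
        self.eos = eos
        self.calls = 0

    def __call__(self, prefix: List[int]) -> int:
        self.calls += 1
        # Emit last token + 1 until we reach 4, then emit EOS.
        return prefix[-1] + 1 if prefix[-1] < 4 else self.eos


step = CountingStep(eos=99)
decoded = greedy_decode(step, bos=0, eos=99, max_len=10)
print(decoded)     # tokens produced one at a time
print(step.calls)  # one sequential model call per generated token
```

A non-autoregressive model would instead produce all output positions from a single forward pass, which is where the extra parallelism comes from.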
Non-autoregressive alternative: SenseVoice
SenseVoice takes a fundamentally different approach — it's non-autoregressive, processing the entire sequence in parallel:

| Metric | faster-whisper (large-v3) | SenseVoice Small |
| --- | --- | --- |
| Architecture | Autoregressive (CTranslate2) | Non-autoregressive |
| GPU Speed | ~50x realtime | 170x realtime |
| CPU Speed | ~10x realtime | 17x realtime |
| Params | 1.5B | 234M |
| Chinese CER | 8.4% (AISHELL) | 3.2% |
| English WER | 5.1% (LibriSpeech) | Competitive |
| Punctuation | No | Built-in |
| Languages | 99 | 50+ |

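
For a sense of what the "x realtime" figures mean in wall-clock terms, here is a small arithmetic sketch. It assumes "Nx realtime" means N seconds of audio are processed per second of compute; the function name and the one-hour example are mine, and the speed figures are just the ones quoted in the table above.

```python
def processing_seconds(audio_seconds: float, speed_x_realtime: float) -> float:
    """Wall-clock time to transcribe audio at a given 'x realtime' speed."""
    if speed_x_realtime <= 0:
        raise ValueError("speed_x_realtime must be positive")
    return audio_seconds / speed_x_realtime


ONE_HOUR = 3600.0

# Speed figures copied from the comparison table; not measured here.
speeds = {
    "faster-whisper large-v3 (GPU)": 50.0,
    "SenseVoice Small (GPU)": 170.0,
    "faster-whisper large-v3 (CPU)": 10.0,
    "SenseVoice Small (CPU)": 17.0,
}

for name, speed in speeds.items():
    print(f"{name}: {processing_seconds(ONE_HOUR, speed):.1f} s per hour of audio")
```

At these numbers, an hour of audio takes about 72 s versus about 21 s on GPU, and 6 minutes versus roughly 3.5 minutes on CPU.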
Not a replacement, but a complement
I'm not suggesting replacing Whisper — the two architectures have different strengths. But for users who prioritize speed over language coverage (50+ vs 99 languages), SenseVoice could be an interesting alternative backend.
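As a rough illustration of what "alternative backend" could look like, here is a minimal sketch of a backend interface with a speed-first selection rule. Everything in it is hypothetical: `Backend`, `StubBackend`, and `pick_backend` are not part of faster-whisper, the stubs do not run any model, and the language sets are placeholders rather than real coverage lists.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Protocol


class Backend(Protocol):
    """Hypothetical interface a pluggable ASR backend might expose."""

    name: str
    speed_x_realtime: float

    def supports(self, language: str) -> bool: ...
    def transcribe(self, audio_path: str, language: str) -> str: ...


@dataclass
class StubBackend:
    """Placeholder backend; a real one would wrap an actual model."""

    name: str
    speed_x_realtime: float
    languages: FrozenSet[str]

    def supports(self, language: str) -> bool:
        return language in self.languages

    def transcribe(self, audio_path: str, language: str) -> str:
        # No inference happens here; this only shows the call shape.
        return f"[{self.name}:{language}] {audio_path}"


def pick_backend(backends: List[Backend], language: str) -> Backend:
    """Prefer the fastest backend that covers the requested language."""
    candidates = [b for b in backends if b.supports(language)]
    if not candidates:
        raise LookupError(f"no backend supports language {language!r}")
    return max(candidates, key=lambda b: b.speed_x_realtime)


# Placeholder language sets, chosen only to exercise both branches.
backends = [
    StubBackend("whisper-stub", 50.0, frozenset({"en", "zh", "sw"})),
    StubBackend("sensevoice-stub", 170.0, frozenset({"en", "zh"})),
]

print(pick_backend(backends, "zh").name)  # faster backend covers this language
print(pick_backend(backends, "sw").name)  # falls back to the wider-coverage backend
```

The design choice here is that speed wins only when coverage allows it, which matches the "complement, not replacement" framing.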
Resources
Curious what the community thinks about multi-model support in faster-whisper.