feat: add Paraformer, Zipformer, Punctuation, GPU acceleration (v0.3.5) #42
zhuzhuyule wants to merge 12 commits into
Conversation
|
Let's gooooo! Thank you i will test this and try and pull it in soon |
|
I just casually whipped up a table: |
|
Closing to recreate from a dedicated feature branch (this PR's head was fork/main which now contains unrelated changes). |
|
Okay, looking at this, it's becoming increasingly obvious that we need to significantly modify the codebase. I am going to do this, and then let's get these in. If you don't mind waiting and rebasing on top of this, that would be great. Basically I want to separate things out into engines or similar, so we can then implement models per engine as well. I think this will be a much better way forward, but it will require some better documentation. I think we can also get something going like automatic model porting from a given base implementation (usually HF transformers). We can potentially try to support the transformers implementations too, but I'm not focused on that for the moment. |
|
Before discovering your project, I actually used the sherpa-rs-sys crate, which worked exceptionally well. It not only supported streaming transcription but also allowed the integration of a wider range of models. The only drawback was the third-party code signing issue we encountered during project installation—this arose because we utilized third-party dynamic libraries in the project. You may want to try out the forked branch I built based on your 0.6.8 version:
I originally intended to submit a PR for these changes, but ultimately abandoned the idea due to the extensive scope of the modifications. You can check out this branch here: https://github.com/zhuzhuyule/Votype/tree/votype
|
Thanks for the plan! I totally support the idea of separating engines; it makes the architecture much cleaner. One thing I'd like to share: I chose sherpa-onnx specifically because it already supports a huge variety of languages and models (100+ languages with Paraformer/Zipformer). While it may not match the quality of the latest SOTA models, it's practically "good enough" for most use cases and covers far more languages than whisper.cpp alone. This makes me wonder: should broad language coverage be a high-priority goal for this project? If so, ONNX (via sherpa-onnx) might deserve some extra attention in the new engine architecture. The main downside of sherpa-onnx is that sherpa-rs-sys can be a bit tricky to install. Do you have any thoughts on how to handle that in the new setup? Or maybe there's a cleaner way to package the sherpa dependencies? |
|
Largely I love sherpa-onnx as well, and have used it in other projects. I mostly didn't pull it in due to dependency issues I ran into when trying to use it in Handy. And at this point AI can more or less reimplement inference engines based on another reference. Basically it's possible to automate porting from transformers or sherpa-onnx, and at the moment that seems to be a better solution to me, just because of all these dependency nightmares. I would rather contain the dependencies to a known tree and build from that. Broad language coverage is a goal for sure. Perhaps the bindings to sherpa are just not very good and there's a better way to build/distribute them. I've just not taken a deep look yet. But since most everything is ONNX anyway, porting is fairly straightforward and honestly I prefer it this way. There's probably fairly low-hanging fruit in terms of automating this pipeline too. Point at a transformers model and output:
Not just in terms of porting weights to the respective formats, but also automating the actual inference code generation too into a variety of languages. You could imagine this being done for Rust (like here), C/C++, Golang, Swift, JS/TS, etc. And have the logits verified. I think this is reasonable enough to do, and is a direction I'm thinking a lot about. If you want to help, would love to discuss further |
|
@zhuzhuyule for what it's worth I did the base level refactor. would love if you want to move this code into the new format. should be fairly straightforward I imagine. |
|
The zipformer models support my native language, would love this feature to be merged! |
|
I suggest that you use You can find doc at and examples at |
|
I think this is also probably the way forward, @csukuangfj. I just need to test that it plays nicely with the Handy CI/CD at this point. That almost certainly was one of the original blockers. Thanks for all the work you and your team do, sherpa-onnx is wonderful. It was very fun playing with it on some RK based boards recently, and using the NPU :) For what it's worth, if we add sherpa-onnx, which we probably should, it should be a new engine type. We may still choose to implement the ONNX ourselves, but being able to use the upstream would be much nicer on average. Also, quite frankly, it is more trustworthy than our implementations until we get better validation/verification of our own implementations |
Port compute_fbank_kaldi from backup branch as compute_kaldi_fbank with KaldiFbankConfig (sample_rate u32, Povey window, DC removal, natural log, negative high_freq Kaldi convention). Registered in features::mod.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
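The windowing steps named in this commit can be sketched roughly as follows. This is a minimal illustration of the standard Kaldi conventions (Povey window, per-frame DC removal, natural log with a floor), not the code from this PR; the function names `povey_window`, `remove_dc_and_window`, and `log_energy` are hypothetical.

```rust
use std::f32::consts::PI;

/// Povey window: a Hann-shaped window raised to the power 0.85.
/// (Hypothetical helper name; the formula is the usual Kaldi definition.)
fn povey_window(len: usize) -> Vec<f32> {
    if len < 2 {
        return vec![1.0; len];
    }
    let denom = (len - 1) as f32;
    (0..len)
        .map(|n| (0.5 - 0.5 * (2.0 * PI * n as f32 / denom).cos()).powf(0.85))
        .collect()
}

/// Subtract the frame mean (DC offset removal), then apply the window in place.
fn remove_dc_and_window(frame: &mut [f32], window: &[f32]) {
    assert_eq!(frame.len(), window.len());
    if frame.is_empty() {
        return;
    }
    let mean = frame.iter().sum::<f32>() / frame.len() as f32;
    for (x, w) in frame.iter_mut().zip(window) {
        *x = (*x - mean) * w;
    }
}

/// Natural-log compression with a small floor so that ln(0) never occurs.
fn log_energy(energy: f32) -> f32 {
    energy.max(f32::EPSILON).ln()
}
```

A constant frame becomes all zeros after DC removal, which is a quick sanity check when comparing against a reference implementation.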
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Non-autoregressive ASR model with custom fbank (Hamming/dB scale), LFR stacking, mean-only CMVN, and @@-subword symbol table decoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
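As an illustration of the `@@`-subword decoding mentioned above, here is a minimal sketch that assumes the common convention where a trailing `@@` means "this piece continues into the next token". The helper name `join_subwords` is hypothetical, and the spacing rule (a space between complete words) is an assumption that would not suit CJK tokens.

```rust
/// Join subword tokens; a trailing "@@" glues the token to the one after it.
/// Assumption: tokens without the marker end a word and are followed by a space.
fn join_subwords(tokens: &[&str]) -> String {
    let mut out = String::new();
    // True while the next token must be glued to the previous one
    // (also true at the start, so no leading space is written).
    let mut glue_next = true;
    for tok in tokens {
        let (piece, continues) = match tok.strip_suffix("@@") {
            Some(stem) => (stem, true),
            None => (*tok, false),
        };
        if !glue_next {
            out.push(' ');
        }
        out.push_str(piece);
        glue_next = continues;
    }
    out
}
```

With this convention, `["hel@@", "lo", "wor@@", "ld"]` joins to `hello world`.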
Implements ZipformerCtcModel with SpeechModel trait, Kaldi fbank feature extraction, and CTC greedy decode using BbpeSymbolTable. Supports both standard model.onnx naming and sherpa-onnx directory-scan fallback. Rejects streaming models that contain cached_* inputs at load time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
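For readers unfamiliar with CTC greedy decoding, the idea is: take the argmax of each frame, collapse consecutive repeats, and drop blanks. The sketch below shows that technique on plain vectors; `ctc_greedy_decode` and its signature are illustrative and not taken from this PR.

```rust
/// CTC greedy decode over per-frame scores (one inner Vec per frame).
/// Returns token IDs with repeats collapsed and blanks removed.
fn ctc_greedy_decode(logits: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut prev: Option<usize> = None;
    for frame in logits {
        // Argmax of this frame; empty frames are skipped.
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i);
        let Some(id) = best else { continue };
        // Emit only when the label changes and is not the blank symbol.
        if id != blank_id && prev != Some(id) {
            out.push(id);
        }
        prev = Some(id);
    }
    out
}
```

Note that a blank between two identical labels keeps both, so the per-frame sequence `1 1 0 1 2 2 0` (blank = 0) decodes to `1 1 2`.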
Three-session RNN-T architecture (encoder, decoder, joiner) with greedy search decoding. Auto-detects I/O names and model file naming conventions. Rejects streaming models at load time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
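The greedy search over the three sessions can be sketched generically as below. The decoder and joiner are passed in as closures standing in for the ONNX sessions; the name `rnnt_greedy_search`, the closure signatures, and the `max_symbols_per_frame` cap are assumptions made for illustration, not the PR's actual interface.

```rust
/// Index of the largest value, or None for an empty slice.
fn argmax(xs: &[f32]) -> Option<usize> {
    xs.iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Greedy transducer search: for each encoder frame, keep emitting the best
/// non-blank symbol (re-running the decoder each time) until blank wins.
fn rnnt_greedy_search<D, J>(
    encoder_out: &[Vec<f32>],
    blank_id: usize,
    max_symbols_per_frame: usize,
    mut decoder: D,
    mut joiner: J,
) -> Vec<usize>
where
    D: FnMut(&[usize]) -> Vec<f32>,         // hypothesis so far -> decoder output
    J: FnMut(&[f32], &[f32]) -> Vec<f32>,   // (encoder frame, decoder output) -> logits
{
    let mut hyp: Vec<usize> = Vec::new();
    let mut dec_out = decoder(&hyp[..]);
    for frame in encoder_out {
        for _ in 0..max_symbols_per_frame {
            let logits = joiner(&frame[..], &dec_out[..]);
            match argmax(&logits) {
                Some(id) if id != blank_id => {
                    hyp.push(id);
                    dec_out = decoder(&hyp[..]);
                }
                // Blank (or empty logits): move on to the next frame.
                _ => break,
            }
        }
    }
    hyp
}
```

The cap on symbols per frame guards against a joiner that never predicts blank.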
Implements PunctModel backed by CT-Transformer ONNX, with sliding-window inference (20-token chunks, 2-token overlap) and smart CJK/ASCII punctuation selection. Adds independent `punct` feature gate and updates `all`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
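To make the sliding-window layout concrete, here is a small sketch that computes the token ranges for a given window and overlap. It only shows how 20-token chunks with a 2-token overlap tile a sequence; how predictions in the overlapping tokens are merged is not described in the source and is left out. The name `sliding_windows` is hypothetical.

```rust
use std::ops::Range;

/// Token index ranges for fixed-size windows that overlap by `overlap` tokens.
fn sliding_windows(len: usize, window: usize, overlap: usize) -> Vec<Range<usize>> {
    assert!(window > overlap, "window must be larger than overlap");
    let step = window - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + window).min(len);
        out.push(start..end);
        if end == len {
            break; // the last window reached the end of the sequence
        }
        start += step;
    }
    out
}
```

For 45 tokens with a window of 20 and an overlap of 2, this yields `0..20`, `18..38`, and `36..45`.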
Add example binaries and integration tests for ParaformerModel, ZipformerCtcModel, and ZipformerTransducerModel following the existing gigaam/sense_voice patterns. Tests skip gracefully when model files are absent; examples accept positional args and --int8 flag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
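One simple way to get the "skip gracefully" behaviour is to resolve the model directory first and return early from the test when it is missing. The helper below is a sketch of that pattern; `model_dir_or_skip` is a hypothetical name and the PR's tests may do this differently.

```rust
use std::path::Path;

/// Returns the model directory if it exists; otherwise logs a note and
/// returns None so the calling test can return early instead of failing.
fn model_dir_or_skip(dir: &str) -> Option<&Path> {
    let path = Path::new(dir);
    if path.is_dir() {
        Some(path)
    } else {
        eprintln!("skipping: model directory {dir} not found");
        None
    }
}
```

Inside a test, `let Some(dir) = model_dir_or_skip("models/...") else { return; };` keeps CI green on machines without the model files.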
- Fix loop variable indexing in kaldi_fbank.rs
- Apply cargo fmt to all new files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CT-Transformer punctuation model expects int32 input tensors, but the code was casting token IDs from i32 to i64. Use i32 directly for both input_array and length_array. Also make output extraction flexible (try i64 first, fall back to i32) since different model versions may output different types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CT-Transformer punct model outputs float32 logits with shape [batch, seq_len, num_classes=6], not pre-argmaxed integers. Apply argmax along the last axis to get punctuation class IDs. Fall back to i64/i32 extraction for models that output pre-argmaxed values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
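The argmax step described here can be sketched on a flat, row-major buffer as follows. The sketch assumes the logits for shape `[batch, seq_len, num_classes]` are laid out contiguously with the class axis last; `argmax_last_axis` is an illustrative name, not the function in this PR.

```rust
/// Argmax along the last axis of a row-major buffer.
/// Returns one class ID per (batch, token) position, in the same order.
fn argmax_last_axis(logits: &[f32], num_classes: usize) -> Vec<usize> {
    assert!(num_classes > 0 && logits.len() % num_classes == 0);
    logits
        .chunks_exact(num_classes)
        .map(|row| {
            let mut best = 0;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i; // strictly greater: ties keep the lowest index
                }
            }
            best
        })
        .collect()
}
```

With `num_classes = 6`, a buffer holding two tokens produces two class IDs.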
Reduce log noise during normal operation:
- ONNX session model input/output tensor info → DEBUG
- BBPE encoding detection → DEBUG
- Punct model token count and input names → DEBUG
- Zipformer model file discovery → DEBUG
Error and warning logs (model load failures, inference errors) remain at WARN/ERROR level for visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update: v0.3.5 - Punct Model Fixes, GPU Accel, Log Cleanup
What changed since initial PR
Recommended Usage Pattern
Engines that output punctuated text (no external punct needed):
Engines that output raw text without punctuation (need external punct model):
Auto-detect + apply pattern:
use std::path::Path;

let result = engine.transcribe(&audio, &TranscribeOptions::default())?;
// Check whether the output already contains full-width (CJK) or ASCII punctuation
let has_punct = result.text.chars().any(|c| matches!(c,
    ',' | '。' | '?' | '!' | ';' | ',' | '.' | '?' | '!' | ';'
));
if !has_punct && !result.text.is_empty() {
    // Load the external punctuation model only when it is actually needed
    let mut punct = PunctModel::new(Path::new("models/punct-model/"))?;
    let punctuated = punct.add_punctuation(&result.text);
    // Use punctuated text
}
Notes & Caveats
|
- Remove llm_postprocess module (not yet ported, broke example build)
- Remove stale docs and plan files
- Fix clippy skip(0) warning in punct.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Summary
Add sherpa-onnx speech recognition engines, neural punctuation restoration, GPU acceleration, and upgrade core to v0.3.5.
New Engines
- paraformer: ParaformerModel
- zipformer-ctc: ZipformerCtcModel
- zipformer-transducer: ZipformerTransducerModel
- punct: PunctModel
- accel: module
Punctuation Model: Usage
ASR engines fall into two categories:
Already punctuated (skip punct model): Whisper, SenseVoice
Raw text, no punctuation (need punct model): Zipformer, Paraformer, GigaAM
Recommended pattern — auto-detect and apply:
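A minimal, self-contained sketch of the detection step, with the punctuation model abstracted as a closure so the logic can be exercised without model files. The names `has_punctuation` and `punctuate_if_needed` are hypothetical; in real use the closure would call something like `PunctModel::add_punctuation`.

```rust
/// True if the text already contains full-width (CJK) or ASCII punctuation.
fn has_punctuation(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c,
            ',' | '。' | '?' | '!' | ';' | ',' | '.' | '?' | '!' | ';'
        )
    })
}

/// Apply `add_punctuation` only when the text is non-empty and unpunctuated.
/// `add_punctuation` stands in for the external punctuation model.
fn punctuate_if_needed<F>(text: &str, add_punctuation: F) -> String
where
    F: FnOnce(&str) -> String,
{
    if text.is_empty() || has_punctuation(text) {
        text.to_string()
    } else {
        add_punctuation(text)
    }
}
```

Checking for existing punctuation first avoids loading or running the punctuation model for engines whose output is already punctuated.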
GPU Acceleration
Key Implementation Details
- `encoder-epoch-34-avg-19.int8.onnx`, `encoder.int8.onnx`, etc.)
- `bbpe.model` file presence
- `SpeechModel` trait
Tested Models
Notes & Caveats
- `ZipformerTransducerModel`, `ZipformerCtcModel`, `ParaformerModel` run full-audio offline inference. Streaming model files may load but are not properly supported; use offline variants only.
- `add_punctuation()` processes text in 20-token windows with 2-token overlap. For realtime preview, callers should manage their own caching/anchoring strategy externally.
- `int32` tensors. Output is `float32` logits `[batch, seq_len, 6]`; the library handles argmax internally.
Test plan
- `cargo check --features all`: clean
🤖 Generated with Claude Code