feat: add Paraformer, Zipformer, Punctuation, GPU acceleration (v0.3.5) by zhuzhuyule · Pull Request #42 · cjpais/transcribe-rs

zhuzhuyule · 2026-02-27T10:39:06Z

Updated 2026-03-30: Added punct model I/O fixes (int32 input, f32 argmax output), GPU accel module, log noise reduction. See latest comment for details.

Summary

Add sherpa-onnx speech recognition engines, neural punctuation restoration, GPU acceleration, and upgrade core to v0.3.5.

New Engines

Feature Flag	Component	Type
`paraformer`	`ParaformerModel`	Single ONNX, non-autoregressive
`zipformer-ctc`	`ZipformerCtcModel`	Single ONNX, CTC greedy decode
`zipformer-transducer`	`ZipformerTransducerModel`	3 ONNX (encoder/decoder/joiner), RNN-T greedy search
`punct`	`PunctModel`	CT-Transformer punctuation restoration (zh+en)
(core)	`accel` module	GPU device enumeration and accelerator selection
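
As an illustration of the "CTC greedy decode" entry in the table, here is a minimal sketch of the general technique: take the argmax per frame, collapse consecutive repeats, and drop blanks. The function name, the `Vec<Vec<f32>>` layout, and the blank ID are assumptions for the example, not the crate's actual API.

```rust
/// Greedy CTC decoding over per-frame scores (hypothetical helper, not the crate's API).
/// `logits[t][v]` is the score of vocabulary entry `v` at frame `t`.
fn ctc_greedy_decode(logits: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut tokens = Vec::new();
    let mut prev: Option<usize> = None;
    for frame in logits {
        // Argmax over the vocabulary for this frame.
        let best = frame
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i);
        // Skip empty frames entirely.
        let Some(best) = best else { continue };
        // Emit only when the label is not blank and differs from the previous frame.
        if best != blank_id && Some(best) != prev {
            tokens.push(best);
        }
        prev = Some(best);
    }
    tokens
}
```

A blank between two identical labels separates them, so the frame sequence `1 1 <blank> 1` decodes to two tokens rather than one.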

Punctuation Model — Usage

ASR engines fall into two categories:

Already punctuated (skip punct model): Whisper, SenseVoice
Raw text, no punctuation (need punct model): Zipformer, Paraformer, GigaAM
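
If the caller already knows which engine produced the text, the two categories above could also be encoded statically. This is only a sketch: `EngineKind` and `needs_punct_model` are hypothetical names introduced for illustration and are not part of the crate.

```rust
/// Hypothetical engine tag; the variants mirror the two categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EngineKind {
    Whisper,
    SenseVoice,
    Zipformer,
    Paraformer,
    GigaAm,
}

/// Returns true for engines that emit raw, unpunctuated text.
fn needs_punct_model(kind: EngineKind) -> bool {
    matches!(
        kind,
        EngineKind::Zipformer | EngineKind::Paraformer | EngineKind::GigaAm
    )
}
```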

Recommended pattern — auto-detect and apply:

use std::path::Path;
use transcribe_rs::punct::PunctModel;

// 1. Transcribe
let result = engine.transcribe(&audio, &TranscribeOptions::default())?;

// 2. Check if output already has punctuation (Chinese or ASCII)
let has_punct = result.text.chars().any(|c| matches!(c,
    '，' | '。' | '？' | '！' | '；' | ',' | '.' | '?' | '!' | ';'
));

// 3. Apply punct model only when needed
if !has_punct && !result.text.is_empty() {
    let mut punct = PunctModel::new(Path::new("models/punct-model/"))?;
    let punctuated = punct.add_punctuation(&result.text);
    // Use `punctuated` in place of `result.text` from here on
}
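
The detection step can be factored into a small helper that is testable without loading any model. The sketch below is an assumption-laden illustration: `has_punctuation` and `punctuate_if_needed` are hypothetical names, and the closure stands in for whatever punctuation backend the caller uses.

```rust
/// Returns true if `text` already contains common Chinese or ASCII punctuation.
fn has_punctuation(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c,
            '，' | '。' | '？' | '！' | '；' | ',' | '.' | '?' | '!' | ';'
        )
    })
}

/// Calls `punctuate` only for non-empty text that has no punctuation yet;
/// otherwise returns the input unchanged.
fn punctuate_if_needed<F>(text: &str, punctuate: F) -> String
where
    F: FnOnce(&str) -> String,
{
    if text.is_empty() || has_punctuation(text) {
        text.to_string()
    } else {
        punctuate(text)
    }
}
```

One limitation of this heuristic: a period inside a number such as "3.5" counts as punctuation, so such text would skip the punct model.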

GPU Acceleration

use transcribe_rs::accel;
use transcribe_rs::whisper_cpp::gpu::list_gpu_devices;

// Set accelerator preference
accel::set_whisper_accelerator(accel::WhisperAccelerator::Auto);
accel::set_ort_accelerator(accel::OrtAccelerator::Auto);

// Enumerate GPUs and select the first one, if any
let devices = list_gpu_devices();
if let Some(device) = devices.first() {
    accel::set_whisper_gpu_device(device.id);
}

// Query available accelerators
let available = accel::OrtAccelerator::available(); // ["auto", "cpu", "cuda", ...]

Key Implementation Details

Auto-detect model files: Handles varying naming conventions (encoder-epoch-34-avg-19.int8.onnx, encoder.int8.onnx, etc.)
Auto-detect token encoding: BBPE vs standard BPE, detected via bbpe.model file presence
Mixed quantization: int8 encoder + fp32 decoder handled transparently
Unified API: All engines implement the SpeechModel trait
External post-processing: Punctuation is caller-controlled, not embedded in engines
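
To make the file auto-detection and mixed-quantization points concrete, here is a rough sketch of how a directory listing might be scanned. The matching rule (component prefix plus `.onnx` suffix, with `.int8.onnx` marking quantized files) and both function names are assumptions for illustration; the crate's actual logic may differ.

```rust
/// Picks the ONNX file for a component such as "encoder" from a list of file names.
/// Prefers the requested precision and falls back to whatever variant exists,
/// which is how an int8 encoder can end up paired with an fp32 decoder.
fn pick_model_file<'a>(
    files: &[&'a str],
    component: &str,
    prefer_int8: bool,
) -> Option<&'a str> {
    let mut candidates: Vec<&'a str> = files
        .iter()
        .copied()
        .filter(|f| f.starts_with(component) && f.ends_with(".onnx"))
        .collect();
    // Sort so the choice does not depend on directory iteration order.
    candidates.sort();
    candidates
        .iter()
        .copied()
        .find(|f| f.ends_with(".int8.onnx") == prefer_int8)
        .or_else(|| candidates.first().copied())
}

/// BBPE is assumed when a `bbpe.model` file sits next to the model files.
fn uses_bbpe(files: &[&str]) -> bool {
    files.iter().any(|f| *f == "bbpe.model")
}
```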

Tested Models

Model	Language	Engine	Status
sherpa-onnx-paraformer-zh-2025-10-07	Chinese	Paraformer	✅
sherpa-onnx-zipformer-ctc-small-zh-int8	Chinese	CTC	✅
sherpa-onnx-zipformer-zh-en-2023-11-22	Chinese+English	Transducer	✅
sherpa-onnx-zipformer-vi-30M-int8	Vietnamese	Transducer	✅
sherpa-onnx-zipformer-ru-int8	Russian	Transducer	✅
sherpa-onnx-zipformer-korean-2024-06-24	Korean	Transducer	✅
punct-ct-transformer-zh-en-int8	Chinese+English	Punct	✅

Notes & Caveats

Offline only: ZipformerTransducerModel, ZipformerCtcModel, ParaformerModel run full-audio offline inference. Streaming model files may load but are not properly supported — use offline variants only.
Punct model is stateless: add_punctuation() processes text in 20-token windows with 2-token overlap. For realtime preview, callers should manage their own caching/anchoring strategy externally.
CT-Transformer int8 recommended: 62MB, ~50ms per sentence. Full-precision (266MB) is marginally better but 3x slower.
Punct I/O types: Input must be int32 tensors. Output is float32 logits [batch, seq_len, 6] — the library handles argmax internally.
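
For reference, the windowing described above (20-token windows, 2-token overlap) implies a stride of 18 tokens. The sketch below only computes the window boundaries; how the overlapping predictions are merged is not stated in this PR, so it is left out. `window_ranges` is a hypothetical helper, not a crate function.

```rust
/// Splits `len` tokens into half-open `(start, end)` windows of at most `window`
/// tokens, where consecutive windows share `overlap` tokens.
fn window_ranges(len: usize, window: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(window > overlap, "window must be larger than overlap");
    let stride = window - overlap;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + window).min(len);
        ranges.push((start, end));
        if end == len {
            break; // the last window reached the end of the input
        }
        start += stride;
    }
    ranges
}
```

With these settings, a 45-token input yields the windows `(0, 20)`, `(18, 38)`, and `(36, 45)`.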

Test plan

cargo check --features all — clean
Transcription verified across 6+ models in 5 languages
Punctuation verified (Chinese + English, int8 and full models)
GPU enumeration tested on macOS (Metal)

🤖 Generated with Claude Code

cjpais · 2026-02-27T11:33:06Z

Let's gooooo! Thank you i will test this and try and pull it in soon

zhuzhuyule · 2026-02-27T12:54:38Z

I just casually whipped up a table:
https://www.myvibe.so/zhangfan/sherpa-onnx-asr-models

zhuzhuyule · 2026-02-28T03:02:22Z

Closing to recreate from a dedicated feature branch (this PR's head was fork/main which now contains unrelated changes).

cjpais · 2026-03-01T04:41:59Z

Okay, when I'm looking at this, it's becoming increasingly obvious we need to significantly modify the codebase. I am going to do this, and then let's get these in. If you don't mind waiting and rebasing on top of this, it would be great.

Basically I want to separate things out into engines

whisper.cpp
whisperfile
onnx
mlx?
ggml?

or similar, so we can then implement models per engine as well. i think this will be a much better way forward, but will require some better documentation. I think we can get something going like auto model porting as well from a given base implementation (usually hf transformers). We can potentially try and support the transformers implementations too, but largely I'm not super focused on that for the moment.

zhuzhuyule · 2026-03-01T16:27:57Z

Before discovering your project, I actually used the sherpa-rs-sys crate, which worked exceptionally well. It not only supported streaming transcription but also allowed the integration of a wider range of models. The only drawback was the third-party code signing issue we encountered during project installation—this arose because we utilized third-party dynamic libraries in the project.

You may want to try out the forked branch I built based on your 0.6.8 version:

I initially added support for cloud-based speech models due to the low hardware specifications of my device at the time.
Later on, I further implemented streaming models for real-time transcription.
Additional features include a demo showcase, transcription history analysis, and more.
I also fully refactored the UI/UX of the application.

I originally intended to submit a PR for these changes, but ultimately abandoned the idea due to the extensive scope of the modifications.

You can check out this branch here: https://github.com/zhuzhuyule/Votype/tree/votype

zhuzhuyule · 2026-03-01T16:40:50Z

> Okay, when I'm looking at this, it's becoming increasingly obvious we need to significantly modify the codebase. I am going to do this, and then let's get these in. If you don't mind waiting and rebasing on top of this, it would be great.
>
> Basically I want to separate things out into engines
>
> whisper.cpp
>
> whisperfile
> ...

Thanks for the plan! I totally support the idea of separating engines — it makes the architecture much cleaner.

One thing I'd like to share: I chose sherpa-onnx specifically because it already supports a huge variety of languages and models (100+ languages with Paraformer/Zipformer). While it may not match the quality of the latest SOTA models, it's practically "good enough" for most use cases and covers far more languages than whisper.cpp alone.

This makes me wonder: should broad language coverage be a high-priority goal for this project? If so, onnx (via sherpa-onnx) might deserve some extra attention in the new engine architecture.

The main downside of sherpa-onnx is that sherpa-rs-sys can be a bit tricky to install. Do you have any thoughts on how to handle that in the new setup? Or maybe there's a cleaner way to package the sherpa dependencies?

cjpais · 2026-03-02T04:05:02Z

Largely I love sherpa-onnx as well, and have used it in other projects. I mostly didn't pull it in due to dep issues I ran into when trying to use it in Handy. And at this point AI can more or less reimplement inference engines based on another reference. Basically it's possible to automate porting from transformers, or sherpa-onnx more or less, and at the moment that seems to be a better solution to me. Just because of all these dep nightmares. I would rather contain the dependencies to a known tree and build from that.

Broad language coverage is a goal for sure.

Perhaps the bindings to sherpa are just not very good and there's a better way to build/distribute them. I've just not taken a deep look yet. But since most everything is onnx anyway, porting is fairly straightforward, and I honestly prefer it this way. There's probably fairly low-hanging fruit in terms of automating this pipeline too.

Point at a transformers model and output:

onnx
mlx
ggml
burn
candle
etc....

Not just in terms of porting weights to the respective formats, but also automating the actual inference code generation too into a variety of languages. You could imagine this being done for Rust (like here), C/C++, Golang, Swift, JS/TS, etc.

And have the logits verified. I think this is reasonable enough to do, and is a direction I'm thinking a lot about. If you want to help, would love to discuss further

cjpais · 2026-03-13T13:19:41Z

@zhuzhuyule for what it's worth I did the base level refactor. would love if you want to move this code into the new format. should be fairly straightforward I imagine.

kakapt · 2026-03-19T01:37:40Z

The zipformer models support my native language, would love this feature to be merged!

csukuangfj · 2026-03-22T04:53:57Z

I suggest that you use
https://crates.io/crates/sherpa-onnx

You can find doc at
https://k2-fsa.github.io/sherpa/onnx/rust-api/install.html

and examples at
https://github.com/k2-fsa/sherpa-onnx/tree/master/rust-api-examples

cjpais · 2026-03-22T08:00:38Z

I think this is also probably the way forward, @csukuangfj. I just need to test that it plays nicely with the Handy CI/CD at this point. That almost certainly was one of the original blockers.

Thanks for all the work you and your team do, sherpa-onnx is wonderful. It was very fun playing with it on some RK based boards recently, and using the NPU :)

For what it's worth if we add sherpa-onnx, which we probably should, it should be a new engine type. We may still choose to implement the ONNX ourselves, but being able to use the upstream would be much nicer on average. Also quite frankly more trustworthy than our implementations until we get better validation/verification of our own implementations

Port compute_fbank_kaldi from backup branch as compute_kaldi_fbank with KaldiFbankConfig (sample_rate u32, Povey window, DC removal, natural log, negative high_freq Kaldi convention). Registered in features::mod.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Implements ZipformerCtcModel with SpeechModel trait, Kaldi fbank feature extraction, and CTC greedy decode using BbpeSymbolTable. Supports both standard model.onnx naming and sherpa-onnx directory-scan fallback. Rejects streaming models that contain cached_* inputs at load time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three-session RNN-T architecture (encoder, decoder, joiner) with greedy search decoding. Auto-detects I/O names and model file naming conventions. Rejects streaming models at load time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements PunctModel backed by CT-Transformer ONNX, with sliding-window inference (20-token chunks, 2-token overlap) and smart CJK/ASCII punctuation selection. Adds independent `punct` feature gate and updates `all`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add example binaries and integration tests for ParaformerModel, ZipformerCtcModel, and ZipformerTransducerModel following the existing gigaam/sense_voice patterns. Tests skip gracefully when model files are absent; examples accept positional args and --int8 flag. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The CT-Transformer punctuation model expects int32 input tensors, but the code was casting token IDs from i32 to i64. Use i32 directly for both input_array and length_array. Also make output extraction flexible (try i64 first, fall back to i32) since different model versions may output different types. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The CT-Transformer punct model outputs float32 logits with shape [batch, seq_len, num_classes=6], not pre-argmaxed integers. Apply argmax along the last axis to get punctuation class IDs. Fall back to i64/i32 extraction for models that output pre-argmaxed values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reduce log noise during normal operation:
- ONNX session model input/output tensor info → DEBUG
- BBPE encoding detection → DEBUG
- Punct model token count and input names → DEBUG
- Zipformer model file discovery → DEBUG

Error and warning logs (model load failures, inference errors) remain at WARN/ERROR level for visibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zhuzhuyule · 2026-03-30T00:43:37Z

Update: v0.3.5 — Punct Model Fixes, GPU Accel, Log Cleanup

What changed since initial PR

Punct model I/O fix: Input tensors now use int32 (was incorrectly cast to int64). Output extraction handles float32 logits via argmax (the model outputs [batch, seq_len, 6] logits, not pre-argmaxed integers).
GPU acceleration module (accel): set_whisper_accelerator(), set_ort_accelerator(), set_whisper_gpu_device(), and list_gpu_devices() for runtime GPU selection.
Log noise reduction: ONNX session tensor info, BBPE detection, punct token loading, and model file discovery logs downgraded from INFO → DEBUG.
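
The argmax step mentioned in the punct I/O fix can be sketched as follows for a single batch item, assuming the logits arrive as a flat row-major `[seq_len, num_classes]` buffer. `argmax_classes` is a hypothetical helper written for illustration and does not reflect the crate's internal code.

```rust
/// Converts flat row-major logits of shape `[seq_len, num_classes]` into one
/// class ID per token by taking the argmax along the last axis.
fn argmax_classes(logits: &[f32], num_classes: usize) -> Vec<usize> {
    assert!(num_classes > 0, "num_classes must be positive");
    assert!(
        logits.len() % num_classes == 0,
        "logits length must be a multiple of num_classes"
    );
    logits
        .chunks_exact(num_classes)
        .map(|row| {
            // Keep the first index on ties by using a strict comparison.
            let mut best = 0;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i;
                }
            }
            best
        })
        .collect()
}
```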

Recommended Usage Pattern

Engines that output punctuated text (no external punct needed):

Whisper (all variants)
SenseVoice

Engines that output raw text without punctuation (need external punct model):

Zipformer Transducer
Zipformer CTC
Paraformer
GigaAM

Auto-detect + apply pattern:

use std::path::Path;
use transcribe_rs::punct::PunctModel;

let result = engine.transcribe(&audio, &TranscribeOptions::default())?;

// Check if output already has punctuation
let has_punct = result.text.chars().any(|c| matches!(c,
    '，' | '。' | '？' | '！' | '；' | ',' | '.' | '?' | '!' | ';'
));

if !has_punct && !result.text.is_empty() {
    let mut punct = PunctModel::new(Path::new("models/punct-model/"))?;
    let punctuated = punct.add_punctuation(&result.text);
    // Use punctuated text
}

Notes & Caveats

Streaming models not supported: Current ZipformerTransducerModel / ZipformerCtcModel / ParaformerModel load the full audio and run offline inference. Streaming model files (filename containing streaming) will load but may produce incorrect results — callers should use offline model variants only.
Punct model is stateless per-call: add_punctuation() processes text in 20-token windows with 2-token overlap. For realtime preview, callers should manage their own caching/anchoring strategy.
CT-Transformer int8 model recommended: 62MB, fast (~50ms for typical sentences). The full-precision model (266MB) gives marginally better accuracy but is 3x slower.

zhuzhuyule closed this Feb 28, 2026

zhuzhuyule mentioned this pull request Feb 28, 2026

feat: add Paraformer engine with punctuation support #44

Closed

4 tasks

zhuzhuyule reopened this Feb 28, 2026

zhuzhuyule changed the title ~~feat: add Paraformer engine with punctuation support~~ feat: add Paraformer, Zipformer CTC & Transducer engines with punctuation support Feb 28, 2026

zhuzhuyule mentioned this pull request Feb 28, 2026

feat: add Zipformer CTC and Transducer engines #43

Closed

4 tasks

zhuzhuyule force-pushed the main branch from 43afdfe to 182b879 Compare March 4, 2026 13:40

zhuzhuyule mentioned this pull request Mar 4, 2026

refactor: decouple punct from ASR engines zhuzhuyule/transcribe-rs#1

Closed

4 tasks

zhuzhuyule and others added 11 commits March 29, 2026 18:28

feat: add BBPE symbol table for Icefall/sherpa-onnx models

1d0f5f4

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add Paraformer ONNX engine

8081c16

Non-autoregressive ASR model with custom fbank (Hamming/dB scale), LFR stacking, mean-only CMVN, and @@-subword symbol table decoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: fix clippy warnings and cargo fmt

e796f50

- Fix loop variable indexing in kaldi_fbank.rs
- Apply cargo fmt to all new files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zhuzhuyule force-pushed the main branch from 4e6c03b to 006726c Compare March 30, 2026 00:32

zhuzhuyule changed the title ~~feat: add Paraformer, Zipformer CTC & Transducer engines with punctuation support~~ feat: add Paraformer, Zipformer, Punctuation, GPU acceleration (v0.3.5) Mar 30, 2026

chore: remove accidentally committed files and fix clippy

bfab8ce

- Remove llm_postprocess module (not yet ported, broke example build)
- Remove stale docs and plan files
- Fix clippy skip(0) warning in punct.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

zhuzhuyule wants to merge 12 commits into cjpais:main from zhuzhuyule:main

4 participants
