Summary
IBM's Granite-4.0-1B-Speech (~2B params, Apache 2.0) achieves state-of-the-art English ASR with 5.52 WER on the OpenASR leaderboard, roughly 2 points better than Whisper Large V3 (~7.4 WER). It supports 7 languages, including Portuguese, and runs at ~280x real time on GPU.
This model is worth evaluating as a future alternative or complement to our current Whisper.cpp-based transcription pipeline.
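For context on the WER figures above: word error rate is the word-level edit distance (substitutions + insertions + deletions) between the reference transcript and the hypothesis, divided by the number of reference words, reported as a percentage — so 5.52 means roughly one error every 18 words. A minimal sketch of the metric (not the leaderboard's exact normalization, which also lowercases and strips punctuation):

```python
# Word Error Rate (WER): word-level Levenshtein distance between the
# reference and hypothesis, divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25 (1 sub / 4 words)
```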
Current Blockers
The following requirements must be met before integration is practical:
- No native C/C++ inference runtime exists; the nearest option is the `ort` Rust crate (ONNX Runtime bindings), which is significantly more complex than our current whisper-rs setup.
- No streaming/chunked inference support, unlike Whisper.cpp.
When to Revisit
This issue should be revisited if any of the following occur:
- IBM or the community releases a lightweight C/C++ inference runtime
- Streaming/chunked inference support is added to the model
- A Rust crate wrapping Granite Speech inference becomes available
- Meetily's architecture changes to support a server-side transcription backend (where Python/vLLM would be acceptable)
Key Comparisons
| Factor | Granite 4.0 1B Speech | Whisper.cpp (current) |
| --- | --- | --- |
| English WER | 5.52 | ~7.4 (Large V3) |
| Languages | 7 | 99+ |
| Native C/C++ runtime | None | Yes |
| Streaming support | No | Yes |
| Memory (fp16) | ~4 GB | ~1.5 GB (Large V3) |
| License | Apache 2.0 | MIT |
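One way to read the 280x real-time figure in the summary: the model processes 280 seconds of audio per second of wall-clock GPU time, so a one-hour recording would transcribe in under 13 seconds. A quick sketch of that arithmetic:

```python
# Real-time factor (RTF here as "Nx real time"): seconds of audio
# processed per second of compute. Wall-clock time = duration / factor.
def transcription_time(audio_seconds: float, speedup: float) -> float:
    return audio_seconds / speedup

one_hour = 3600.0
print(transcription_time(one_hour, 280.0))  # ≈ 12.86 s for an hour of audio
print(transcription_time(one_hour, 1.0))    # 3600 s at exactly real time
```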
References