Live speech-to-text streaming on Apple Silicon. One command. No API keys. No cloud.
TextStream turns your Mac's microphone into a live transcription endpoint. It runs Qwen3-ASR on-device through MLX, filters noise with Silero VAD, and streams text over SSE at localhost:7890/stream. Any app, script, or frontend can subscribe and get words as they are spoken, with roughly 2% word error rate on LibriSpeech clean (2.11% for the default 0.6B model), no API keys, and zero cost.
Cloud speech APIs charge per minute and add network latency. Whisper runs offline but is not real-time out of the box: it transcribes fixed 30-second windows rather than a live stream. There is no simple way to get a local, streaming transcription endpoint that any process on your machine can read from.
TextStream fills that gap. One pip install, one command, and every app on your machine has access to a live transcript stream — for free.
Build voice-controlled tools. Add live captions to your app. Record meeting notes that write themselves. Pipe speech into your IDE. Whatever needs ears, point it at the stream.

    pip install textstream-asr

Requirements: macOS on Apple Silicon (M1/M2/M3/M4), Python 3.10+.

    textstream                       # start transcribing, opens browser UI
    textstream --no-browser          # headless: SSE server only
    textstream --engine qwen-1.7b    # larger model, lower word error rate
    textstream --vad-threshold 0.5   # stricter voice detection (default 0.4)

Python:

    import json
    import urllib.request

    # Subscribe to the SSE endpoint and print each transcription update.
    req = urllib.request.Request("http://localhost:7890/stream")
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            line = line.decode().strip()
            if line.startswith("data: "):  # SSE payload lines carry the JSON event
                event = json.loads(line[6:])
                if event["type"] == "stream":
                    print(event["finalized"], event["draft"])

JavaScript:

    const src = new EventSource("http://localhost:7890/stream");
    src.onmessage = (e) => {
      const { finalized, draft } = JSON.parse(e.data);
      console.log(finalized, draft);
    };

Every --interval seconds (default 2.5), TextStream drains the mic buffer and runs Silero VAD on the chunk. If speech is detected, the chunk goes to Qwen3-ASR's streaming decoder. The model returns stable (finalized) text and speculative (draft) text. Stable text gets persisted to disk and broadcast to all SSE subscribers.
If the model hallucinates on noise that slips past VAD, a pattern filter catches it and resets the stream. With VAD active, this almost never fires.
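The source does not say which patterns the filter looks for. As a rough illustration only, one common heuristic is to flag text in which a short phrase repeats several times back to back. The function name `looks_repetitive` and both thresholds below are hypothetical and are not taken from TextStream.

```python
def looks_repetitive(text: str, max_ngram: int = 4, min_repeats: int = 4) -> bool:
    """Return True if any phrase of 1..max_ngram words occurs
    min_repeats or more times in a row (hypothetical heuristic)."""
    words = text.lower().split()
    for n in range(1, max_ngram + 1):
        # Only start positions that leave room for min_repeats copies.
        for start in range(len(words) - n * min_repeats + 1):
            phrase = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == phrase
                for k in range(1, min_repeats)
            ):
                return True
    return False
```

A caller could run a check like this on each decoder output and reset its own state when it returns True; how TextStream itself resets the stream is not documented here.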

    Microphone → Audio Buffer → Silero VAD → Qwen3-ASR (MLX) → SSE Stream
                                    ↓ (no speech)
                                Skip chunk
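The gating step in the diagram can be sketched as a threshold check on a per-chunk speech probability. This is a minimal sketch, not TextStream's code: `gate_chunks` and `fake_prob` are hypothetical names, the scorer is a stand-in for Silero VAD, and whether the real comparison is `>=` or `>` is an assumption.

```python
from typing import Callable, Iterable, List, Sequence


def gate_chunks(
    chunks: Iterable[Sequence[float]],
    speech_prob: Callable[[Sequence[float]], float],
    threshold: float = 0.4,  # same value as the documented --vad-threshold default
) -> List[Sequence[float]]:
    """Keep only the chunks whose speech probability reaches the threshold."""
    kept = []
    for chunk in chunks:
        if speech_prob(chunk) >= threshold:
            kept.append(chunk)  # these would go on to the ASR decoder
        # otherwise the chunk is skipped, as in the diagram
    return kept


def fake_prob(chunk: Sequence[float]) -> float:
    """Stand-in scorer for demonstration: mean absolute amplitude, capped at 1."""
    return min(1.0, sum(abs(x) for x in chunk) / max(len(chunk), 1))
```

Raising the threshold makes the gate stricter, which is the effect the `--vad-threshold 0.5` example describes.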
| Endpoint | Description |
|---|---|
| `GET /stream` | SSE stream: `{"type":"stream","finalized":"...","draft":"..."}` |
| `GET /engine` | Current engine info |
| `GET /switch?engine=qwen-1.7b` | Hot-swap model without restart |
| `GET /pause` | Pause mic capture |
| `GET /resume` | Resume mic capture |
| `GET /stop` | Shutdown server |
| `GET /` | Built-in browser UI |
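For scripting against these endpoints, two small helpers may be useful: one that builds control URLs and one that extracts JSON payloads from SSE lines. This is a sketch under assumptions: `control_url` and `parse_sse` are hypothetical names, the base URL assumes the default port, and the response bodies of the control endpoints are not documented here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:7890"  # assumes the default --port


def control_url(action: str, **params: str) -> str:
    """Build the URL for an endpoint such as pause, resume, or switch."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}/{action}{query}"


def parse_sse(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Blank lines and lines that are not data fields are ignored.
    """
    for raw in lines:
        line = raw.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())
```

With the server running, `urllib.request.urlopen(control_url("pause"))` would send the pause request, and `control_url("switch", engine="qwen-1.7b")` produces the hot-swap URL from the table.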
| Model | LibriSpeech clean | LibriSpeech other | Params |
|---|---|---|---|
| Qwen3-ASR 0.6B (default) | 2.11% | 4.55% | 600M |
| Qwen3-ASR 1.7B | 1.63% | 3.38% | 1.7B |
| Whisper-large-v3 | 1.51% | 3.97% | 1.5B |
| GPT-4o-Transcribe | 1.39% | 3.75% | -- |
Source: Qwen3-ASR Technical Report
| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~0.06 (16x faster than real-time) |
| MLX vs PyTorch | ~4x faster on Apple Silicon |
| VAD latency | <1ms per 32ms audio chunk |
| Time to first token | ~92ms |
Source: mlx-qwen3-asr benchmarks, Silero VAD metrics
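Real-time factor is processing time divided by audio duration, so the figures above can be checked with a few lines of arithmetic. The helper names are illustrative only, and the 2.5-second chunk is simply the default `--interval`; actual per-chunk time will vary with hardware and model.

```python
def speedup(rtf: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return 1.0 / rtf


def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time for a chunk of the given duration."""
    return audio_seconds * rtf


print(round(speedup(0.06), 1))                  # 16.7
print(round(processing_seconds(2.5, 0.06), 2))  # 0.15
```

At an RTF of about 0.06, a 2.5-second chunk would take roughly 0.15 seconds to transcribe, which is consistent with the "16x faster than real-time" figure.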
- RAM: ~1.2 GB for 0.6B model, ~3 GB for 1.7B
- CPU/GPU: Runs on the GPU via the MLX Metal backend. Minimal CPU overhead.
- Disk: Models cached by HuggingFace Hub (~1.2 GB / 3.4 GB first download)
- Battery: Comparable to background music playback. MLX is designed for Apple Silicon power efficiency.
- Real-time streaming ASR via Server-Sent Events at localhost:7890/stream
- Qwen3-ASR on MLX: ~2% WER, 16x faster than real-time on Apple Silicon
- Silero VAD filters silence and noise before transcription runs
- Hot-swap models between 0.6B and 1.7B without restarting the server
- Built-in browser UI for quick visual monitoring
- Hallucination filter catches and resets repetitive model output
- Auto-saves transcripts to ~/Documents/textstream/transcripts/
- Zero dependencies on cloud services: runs entirely on-device
| Flag | Default | Description |
|---|---|---|
| `--port` | `7890` | HTTP server port |
| `--engine` | `qwen` | `qwen` (0.6B) or `qwen-1.7b` |
| `--interval` | `2.5` | Seconds between transcription updates |
| `--vad-threshold` | `0.4` | Silero VAD speech probability threshold |
| `--no-browser` | -- | Do not open browser on start |
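When launching TextStream from another program, the flags above can be assembled into an argument list. This is a minimal sketch: `build_command` is a hypothetical helper, and it assumes each flag takes its value as the next argument, as in the command-line examples earlier.

```python
from typing import List, Optional


def build_command(
    port: Optional[int] = None,
    engine: Optional[str] = None,
    interval: Optional[float] = None,
    vad_threshold: Optional[float] = None,
    no_browser: bool = False,
) -> List[str]:
    """Assemble a textstream argv list; omitted options keep their defaults."""
    cmd = ["textstream"]
    if port is not None:
        cmd += ["--port", str(port)]
    if engine is not None:
        cmd += ["--engine", engine]
    if interval is not None:
        cmd += ["--interval", str(interval)]
    if vad_threshold is not None:
        cmd += ["--vad-threshold", str(vad_threshold)]
    if no_browser:
        cmd.append("--no-browser")
    return cmd
```

The result can be passed to `subprocess.Popen`, for example `subprocess.Popen(build_command(no_browser=True))` for a headless server.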
Transcripts are saved to ~/Documents/textstream/transcripts/YYYY-MM-DD/.
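To locate a given day's transcripts from a script, the dated directory can be derived from the path above. The names `TRANSCRIPT_ROOT` and `transcript_dir` are illustrative, and the file names and formats inside each directory are not documented here, so the sketch stops at the directory level.

```python
from datetime import date
from pathlib import Path

TRANSCRIPT_ROOT = Path.home() / "Documents" / "textstream" / "transcripts"


def transcript_dir(day: date) -> Path:
    """Directory holding the transcripts for one day (named YYYY-MM-DD)."""
    return TRANSCRIPT_ROOT / day.isoformat()
```

For example, `sorted(transcript_dir(date.today()).glob("*"))` would list whatever files were saved today, if the directory exists.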
- MLX — Apple's ML framework for Apple Silicon
- mlx-qwen3-asr — Qwen3-ASR ported to MLX
- silero-vad-lite — Voice activity detection (~2 MB, bundles ONNX runtime)
- sounddevice — PortAudio bindings for mic capture
- NumPy
Contributions are welcome. See CONTRIBUTING.md for guidelines.
Built by Boris Djordjevic at 199 Biotechnologies | Paperfoot AI
