
Add Voxtral Realtime STT support via voxmlx #8

Open
vr000m wants to merge 9 commits into pipecat-ai:main from vr000m:feature/voxmlx-stt-support

Conversation


vr000m commented Feb 14, 2026

Summary

  • Add Voxtral Realtime STT support via voxmlx, optimized for Apple Silicon (MLX)
  • Segmented mode (STT_SERVICE=voxtral): Batch transcription after VAD detects end of speech. ~0.30s per utterance at 480ms delay.
  • Streaming mode (STT_SERVICE=voxtral-streaming): Incremental encode/decode during speech via background thread. ~0.16s from end-of-turn to final transcription.
  • Configurable delay via VOXTRAL_DELAY_MS env var (default 480, must be multiple of 80)
  • Add .env.example with all supported environment variables and STT/TTS configuration docs to README

Changes

New files

  • src/pipecat_mcp_server/processors/voxtral_stt.py — Segmented STT (extends SegmentedSTTService)
  • src/pipecat_mcp_server/processors/voxtral_streaming_stt.py — Streaming STT (extends STTService) with background thread for MLX operations
  • .env.example — Documented environment variables for STT, TTS, and transport configuration

Modified files

  • src/pipecat_mcp_server/agent.py — STT service selection via STT_SERVICE env var, _parse_voxtral_delay_ms() validation, load_dotenv(override=False)
  • src/pipecat_mcp_server/agent_ipc.py — load_dotenv(override=False) to avoid overwriting explicit env vars
  • pyproject.toml — Add voxmlx dependency (macOS only)
  • .gitignore — Add uv.lock, add !.env.example exception
  • README.md — Add Configuration section with STT/TTS options tables
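The STT_SERVICE dispatch in agent.py can be illustrated with a small sketch. The function name `resolve_stt_backend` and the dict-based signature are illustrative only, not the actual code:

```python
def resolve_stt_backend(env: dict) -> str:
    """Map the STT_SERVICE env var to a backend name.

    Unset or unrecognized values fall back to the existing Whisper MLX
    default, so current deployments are unaffected.
    """
    service = env.get("STT_SERVICE", "").strip().lower()
    if service in ("voxtral", "voxtral-streaming"):
        return service
    return "whisper"
```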

Architecture (streaming mode)

Pipeline (async):                          Background thread (sync):
  AudioRawFrame                             while running:
    → process_audio_frame()                   drain audio_in_queue → pending_audio
      → audio_in_queue.put(samples)           encode (mel_step → encode_step)
      → drain token_out_queue                 decode (prefill → decode_steps)
        → push InterimTranscriptionFrame      token_out_queue.put(token_id or EOS)
        → push TranscriptionFrame             sleep(0.01)
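The diagram above reduces to a runnable toy: an event-loop side feeds `audio_in_queue`, and a background thread drains it and pushes results to `token_out_queue`. In this sketch the encode/decode steps are replaced with a pass-through and an EOS sentinel placed in the audio queue stands in for the VAD-stop flush — it illustrates the threading pattern, not the actual service code.

```python
import queue
import threading
import time

EOS = object()  # sentinel marking end of an utterance

class StreamingWorker:
    """Toy async↔sync bridge using two thread-safe queues."""

    def __init__(self):
        self.audio_in_queue = queue.Queue()
        self.token_out_queue = queue.Queue()
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        pending_audio = []
        while self._running:
            try:
                # Drain everything queued since the last pass.
                while True:
                    chunk = self.audio_in_queue.get_nowait()
                    if chunk is EOS:
                        # Stand-in for prefill/decode of remaining audio;
                        # the real worker streams tokens during speech.
                        for token in pending_audio:
                            self.token_out_queue.put(token)
                        self.token_out_queue.put(EOS)
                        pending_audio = []
                    else:
                        pending_audio.append(chunk)
            except queue.Empty:
                pass
            time.sleep(0.01)

    def stop(self):
        self._running = False
        self._thread.join()
```

The async side never blocks: it uses `put` on the way in and non-blocking drains on the way out, which is the point of keeping the GPU-bound work on its own thread.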

Key design decisions

  • Background thread for MLX ops (GPU-bound, would block async event loop)
  • queue.Queue for thread-safe async↔sync communication
  • _flush_until_eos() polls token queue after VAD stop to ensure final transcript is emitted
  • mx.clear_cache() after every mx.eval() to prevent MLX memory pool leaks
  • O(1) incremental token decoding via _partial_text accumulator
  • Thread crash handling via _THREAD_ERROR sentinel
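The O(1) decoding decision can be sketched as an accumulator that appends each new token's text piece instead of re-detokenizing the whole sequence. Real tokenizers complicate this with multi-token BPE pieces; this is a simplified illustration of the `_partial_text` idea, not the PR's code:

```python
class IncrementalDecoder:
    """Append each new token's text piece to a running string.

    Cost per token is O(1), versus O(n) for re-decoding the full
    token list every time a token arrives.
    """

    def __init__(self, detokenize):
        self._detokenize = detokenize  # callable: token_id -> text piece
        self._partial_text = ""

    def push(self, token_id) -> str:
        self._partial_text += self._detokenize(token_id)
        return self._partial_text

    def reset(self) -> str:
        """Return the finished utterance and clear state for the next one."""
        text, self._partial_text = self._partial_text, ""
        return text
```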

Test plan

macOS (Apple Silicon)

  • Set STT_SERVICE=voxtral — verify segmented mode transcribes correctly
  • Set STT_SERVICE=voxtral-streaming — verify streaming mode emits interim + final transcriptions
  • Verify memory stability over multiple utterances (no growth beyond ~65MB)
  • Test interruption mid-sentence — verify clean state reset
  • Test invalid VOXTRAL_DELAY_MS values — verify validation and rounding
  • Default (no STT_SERVICE) — verify Whisper MLX fallback still works

Non-darwin (Linux)

  • Verify default STT (WhisperSTTService with faster-distil-whisper-large-v3) still works
  • Verify voxmlx dependency is not pulled in (conditional on sys_platform == 'darwin')
  • Verify setting STT_SERVICE=voxtral or voxtral-streaming fails gracefully on Linux (import error from voxmlx/MLX)
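The darwin-only dependency can be expressed with a PEP 508 environment marker; the exact table this lives in within the PR's pyproject.toml may differ, but the marker itself matches the `sys_platform == 'darwin'` condition above:

```toml
# pyproject.toml (sketch) — actual section/extra name may differ
dependencies = [
    "voxmlx; sys_platform == 'darwin'",
]
```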

The child process created via multiprocessing.spawn doesn't inherit
the parent's .env file. Load dotenv early in run_pipecat_process()
so environment variables like DAILY_API_KEY are available before
the Pipecat runner reads them.

Add VoxtralSTTService as an opt-in alternative to Whisper on macOS.
Set STT_SERVICE=voxtral to use Mistral's Voxtral Realtime 4B model
for local speech-to-text on Apple Silicon. Whisper remains the default
to preserve backwards compatibility.

Uses the voxmlx library which provides a lightweight MLX-optimized
implementation. Model, tokenizer, and prompt tokens are cached across
transcription calls to avoid redundant work.

Switch from the 6-bit quantized model to the original full precision
mistralai/Voxtral-Mini-4B-Realtime-2602 for best transcription quality.

Allow overriding the transcription delay via VOXTRAL_DELAY_MS
environment variable (default 480ms). Lower delay (e.g. 160ms)
gives faster response at slightly reduced accuracy.

New VoxtralStreamingSTTService extends STTService (not SegmentedSTTService)
for incremental encode/decode during speech. Uses a background thread with
two queues for thread-safe audio/token communication.

Set STT_SERVICE=voxtral-streaming to enable. Uses the same VOXTRAL_DELAY_MS
env var as segmented mode.

Includes memory leak fixes: mx.clear_cache() after encode steps, explicit
KV cache release on reset, and pending_audio cap.

- Revert daily from base dependency to optional (pyproject.toml)
- Add hasattr guards on KV cache release in reset_state()
- Add _parse_voxtral_delay_ms() validation for env var
- Rewrite _drain_token_queue for O(1) incremental decoding
- Replace run_stt noop pattern with `if False: yield`
- Add try/except + _THREAD_ERROR sentinel for thread crash handling
- Remove uv.lock from tracking, add to .gitignore
- Fix dropped transcript on end-of-utterance: poll _drain_token_queue
  via _flush_until_eos() after VADUserStoppedSpeakingFrame so the final
  TranscriptionFrame is emitted even when no more audio frames arrive
- Change load_dotenv(override=True) to override=False in agent.py and
  agent_ipc.py so explicit env vars are not overwritten by .env
vr000m requested a review from aconchillo February 15, 2026 22:39