
Add Voxtral Realtime STT support via voxmlx #8

Open
vr000m wants to merge 9 commits into pipecat-ai:main from vr000m:feature/voxmlx-stt-support

Conversation


vr000m commented Feb 14, 2026

Summary

  • Add Voxtral Realtime STT support via voxmlx, optimized for Apple Silicon (MLX)
  • Segmented mode (STT_SERVICE=voxtral): Batch transcription after VAD detects end of speech. ~0.30s per utterance at 480ms delay.
  • Streaming mode (STT_SERVICE=voxtral-streaming): Incremental encode/decode during speech via background thread. ~0.16s from end-of-turn to final transcription.
  • Configurable delay via VOXTRAL_DELAY_MS env var (default 480, must be multiple of 80)
  • Add .env.example with all supported environment variables and STT/TTS configuration docs to README

Changes

New files

  • src/pipecat_mcp_server/processors/voxtral_stt.py — Segmented STT (extends SegmentedSTTService)
  • src/pipecat_mcp_server/processors/voxtral_streaming_stt.py — Streaming STT (extends STTService) with background thread for MLX operations
  • .env.example — Documented environment variables for STT, TTS, and transport configuration

Modified files

  • src/pipecat_mcp_server/agent.py — STT service selection via STT_SERVICE env var, _parse_voxtral_delay_ms() validation, load_dotenv(override=False)
  • src/pipecat_mcp_server/agent_ipc.py — load_dotenv(override=False) to avoid overwriting explicit env vars
  • pyproject.toml — Add voxmlx dependency (macOS only)
  • .gitignore — Add uv.lock, add !.env.example exception
  • README.md — Add Configuration section with STT/TTS options tables
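The STT_SERVICE dispatch in agent.py can be illustrated with a small sketch. The function name `resolve_stt_backend` and the dict-based signature are illustrative only, not the actual code:

```python
def resolve_stt_backend(env: dict) -> str:
    """Map the STT_SERVICE env var to a backend name.

    Unset or unrecognized values fall back to the existing Whisper MLX
    default, so current deployments are unaffected.
    """
    service = env.get("STT_SERVICE", "").strip().lower()
    if service in ("voxtral", "voxtral-streaming"):
        return service
    return "whisper"
```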

Architecture (streaming mode)

Pipeline (async):                          Background thread (sync):
  AudioRawFrame                             while running:
    → process_audio_frame()                   drain audio_in_queue → pending_audio
      → audio_in_queue.put(samples)           encode (mel_step → encode_step)
      → drain token_out_queue                 decode (prefill → decode_steps)
        → push InterimTranscriptionFrame      token_out_queue.put(token_id or EOS)
        → push TranscriptionFrame             sleep(0.01)
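The diagram above reduces to a runnable toy: an event-loop side feeds `audio_in_queue`, and a background thread drains it and pushes results to `token_out_queue`. In this sketch the encode/decode steps are replaced with a pass-through and an EOS sentinel placed in the audio queue stands in for the VAD-stop flush — it illustrates the threading pattern, not the actual service code.

```python
import queue
import threading
import time

EOS = object()  # sentinel marking end of an utterance

class StreamingWorker:
    """Toy async↔sync bridge using two thread-safe queues."""

    def __init__(self):
        self.audio_in_queue = queue.Queue()
        self.token_out_queue = queue.Queue()
        self._running = True
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        pending_audio = []
        while self._running:
            try:
                # Drain everything queued since the last pass.
                while True:
                    chunk = self.audio_in_queue.get_nowait()
                    if chunk is EOS:
                        # Stand-in for prefill/decode of remaining audio;
                        # the real worker streams tokens during speech.
                        for token in pending_audio:
                            self.token_out_queue.put(token)
                        self.token_out_queue.put(EOS)
                        pending_audio = []
                    else:
                        pending_audio.append(chunk)
            except queue.Empty:
                pass
            time.sleep(0.01)

    def stop(self):
        self._running = False
        self._thread.join()
```

The async side never blocks: it uses `put` on the way in and non-blocking drains on the way out, which is the point of keeping the GPU-bound work on its own thread.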

Key design decisions

  • Background thread for MLX ops (GPU-bound, would block async event loop)
  • queue.Queue for thread-safe async↔sync communication
  • _flush_until_eos() polls token queue after VAD stop to ensure final transcript is emitted
  • mx.clear_cache() after every mx.eval() to prevent MLX memory pool leaks
  • O(1) incremental token decoding via _partial_text accumulator
  • Thread crash handling via _THREAD_ERROR sentinel
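The O(1) decoding decision can be sketched as an accumulator that appends each new token's text piece instead of re-detokenizing the whole sequence. Real tokenizers complicate this with multi-token BPE pieces; this is a simplified illustration of the `_partial_text` idea, not the PR's code:

```python
class IncrementalDecoder:
    """Append each new token's text piece to a running string.

    Cost per token is O(1), versus O(n) for re-decoding the full
    token list every time a token arrives.
    """

    def __init__(self, detokenize):
        self._detokenize = detokenize  # callable: token_id -> text piece
        self._partial_text = ""

    def push(self, token_id) -> str:
        self._partial_text += self._detokenize(token_id)
        return self._partial_text

    def reset(self) -> str:
        """Return the finished utterance and clear state for the next one."""
        text, self._partial_text = self._partial_text, ""
        return text
```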

Test plan

macOS (Apple Silicon)

  • Set STT_SERVICE=voxtral — verify segmented mode transcribes correctly
  • Set STT_SERVICE=voxtral-streaming — verify streaming mode emits interim + final transcriptions
  • Verify memory stability over multiple utterances (no growth beyond ~65MB)
  • Test interruption mid-sentence — verify clean state reset
  • Test invalid VOXTRAL_DELAY_MS values — verify validation and rounding
  • Default (no STT_SERVICE) — verify Whisper MLX fallback still works

Non-darwin (Linux)

  • Verify default STT (WhisperSTTService with faster-distil-whisper-large-v3) still works
  • Verify voxmlx dependency is not pulled in (conditional on sys_platform == 'darwin')
  • Verify setting STT_SERVICE=voxtral or voxtral-streaming fails gracefully on Linux (import error from voxmlx/MLX)
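The darwin-only dependency can be expressed with a PEP 508 environment marker; the exact table this lives in within the PR's pyproject.toml may differ, but the marker itself matches the `sys_platform == 'darwin'` condition above:

```toml
# pyproject.toml (sketch) — actual section/extra name may differ
dependencies = [
    "voxmlx; sys_platform == 'darwin'",
]
```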

The child process created via multiprocessing.spawn doesn't inherit
the parent's .env file. Load dotenv early in run_pipecat_process()
so environment variables like DAILY_API_KEY are available before
the Pipecat runner reads them.

Add VoxtralSTTService as an opt-in alternative to Whisper on macOS.
Set STT_SERVICE=voxtral to use Mistral's Voxtral Realtime 4B model
for local speech-to-text on Apple Silicon. Whisper remains the default
to preserve backwards compatibility.

Uses the voxmlx library which provides a lightweight MLX-optimized
implementation. Model, tokenizer, and prompt tokens are cached across
transcription calls to avoid redundant work.

Switch from the 6-bit quantized model to the original full precision
mistralai/Voxtral-Mini-4B-Realtime-2602 for best transcription quality.

Allow overriding the transcription delay via VOXTRAL_DELAY_MS
environment variable (default 480ms). Lower delay (e.g. 160ms)
gives faster response at slightly reduced accuracy.

New VoxtralStreamingSTTService extends STTService (not SegmentedSTTService)
for incremental encode/decode during speech. Uses a background thread with
two queues for thread-safe audio/token communication.

Set STT_SERVICE=voxtral-streaming to enable. Uses the same VOXTRAL_DELAY_MS
env var as segmented mode.

Includes memory leak fixes: mx.clear_cache() after encode steps, explicit
KV cache release on reset, and pending_audio cap.

- Revert daily from base dependency to optional (pyproject.toml)
- Add hasattr guards on KV cache release in reset_state()
- Add _parse_voxtral_delay_ms() validation for env var
- Rewrite _drain_token_queue for O(1) incremental decoding
- Replace run_stt noop pattern with `if False: yield`
- Add try/except + _THREAD_ERROR sentinel for thread crash handling
- Remove uv.lock from tracking, add to .gitignore
- Fix dropped transcript on end-of-utterance: poll _drain_token_queue
  via _flush_until_eos() after VADUserStoppedSpeakingFrame so the final
  TranscriptionFrame is emitted even when no more audio frames arrive
- Change load_dotenv(override=True) to override=False in agent.py and
  agent_ipc.py so explicit env vars are not overwritten by .env
vr000m requested a review from aconchillo February 15, 2026 22:39