Add Voxtral Realtime STT support via voxmlx#8
Open
vr000m wants to merge 9 commits into pipecat-ai:main from
The child process created via multiprocessing.spawn doesn't inherit the parent's .env file. Load dotenv early in run_pipecat_process() so environment variables like DAILY_API_KEY are available before the Pipecat runner reads them.
Add VoxtralSTTService as an opt-in alternative to Whisper on macOS. Set STT_SERVICE=voxtral to use Mistral's Voxtral Realtime 4B model for local speech-to-text on Apple Silicon. Whisper remains the default to preserve backwards compatibility. Uses the voxmlx library which provides a lightweight MLX-optimized implementation. Model, tokenizer, and prompt tokens are cached across transcription calls to avoid redundant work.
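A minimal sketch of opt-in selection with a backwards-compatible default. The function name `select_stt_service`, the mapping, and the accepted value `whisper` are assumptions for illustration; the actual selection code in `agent.py` may be structured differently.

```python
import os

# Hypothetical mapping from STT_SERVICE values to service class names.
# Only "voxtral" and "voxtral-streaming" are values named in this PR;
# "whisper" as an explicit value is an assumption.
KNOWN_STT_SERVICES = {
    "whisper": "WhisperSTTService",
    "voxtral": "VoxtralSTTService",
    "voxtral-streaming": "VoxtralStreamingSTTService",
}


def select_stt_service(env=None):
    """Return the service class name for the configured STT_SERVICE value."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to Whisper, the existing default.
    choice = (env.get("STT_SERVICE") or "whisper").strip().lower()
    if choice not in KNOWN_STT_SERVICES:
        raise ValueError(f"Unsupported STT_SERVICE value: {choice!r}")
    return KNOWN_STT_SERVICES[choice]
```

Passing a plain dict instead of `os.environ` keeps the selection logic easy to exercise without touching the real environment.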
Switch from the 6-bit quantized model to the original full precision mistralai/Voxtral-Mini-4B-Realtime-2602 for best transcription quality.
This reverts commit e455e9d.
Allow overriding the transcription delay via VOXTRAL_DELAY_MS environment variable (default 480ms). Lower delay (e.g. 160ms) gives faster response at slightly reduced accuracy.
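The PR summary states that the delay defaults to 480 and must be a multiple of 80, and the test plan mentions validation and rounding. The sketch below shows one way such parsing could work. The function name `parse_delay_ms`, the round-half-up rule, and the fallback to the default on invalid input are all assumptions, not necessarily the behavior of the PR's `_parse_voxtral_delay_ms()`.

```python
DEFAULT_DELAY_MS = 480  # default stated in the PR
STEP_MS = 80            # the PR requires a multiple of 80


def parse_delay_ms(raw, default=DEFAULT_DELAY_MS, step=STEP_MS):
    """Parse a VOXTRAL_DELAY_MS-style string into a positive multiple of step."""
    if raw is None or not raw.strip():
        return default
    try:
        value = int(raw.strip())
    except ValueError:
        # Assumption: non-numeric input falls back to the default.
        return default
    if value <= 0:
        return default
    # Assumption: round to the nearest multiple (halves round up),
    # and never go below a single step.
    rounded = (value + step // 2) // step * step
    return max(step, rounded)


print(parse_delay_ms("160"))  # already a multiple of 80
print(parse_delay_ms("200"))  # rounded to a neighboring multiple
```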
New VoxtralStreamingSTTService extends STTService (not SegmentedSTTService) for incremental encode/decode during speech. Uses a background thread with two queues for thread-safe audio/token communication. Set STT_SERVICE=voxtral-streaming to enable. Uses the same VOXTRAL_DELAY_MS env var as segmented mode. Includes memory leak fixes: mx.clear_cache() after encode steps, explicit KV cache release on reset, and pending_audio cap.
- Revert daily from base dependency to optional (pyproject.toml)
- Add hasattr guards on KV cache release in reset_state()
- Add _parse_voxtral_delay_ms() validation for env var
- Rewrite _drain_token_queue for O(1) incremental decoding
- Replace run_stt noop pattern with `if False: yield`
- Add try/except + _THREAD_ERROR sentinel for thread crash handling
- Remove uv.lock from tracking, add to .gitignore
- Fix dropped transcript on end-of-utterance: poll _drain_token_queue via _flush_until_eos() after VADUserStoppedSpeakingFrame so the final TranscriptionFrame is emitted even when no more audio frames arrive
- Change load_dotenv(override=True) to override=False in agent.py and agent_ipc.py so explicit env vars are not overwritten by .env
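The end-of-utterance fix can be illustrated with a small synchronous sketch: a non-blocking drain appends only newly arrived tokens to an accumulator, and a flush loop keeps polling until an end-of-stream marker arrives or a timeout expires. The class `TranscriptAccumulator`, the `EOS` marker, and the timeout values are assumptions; the PR's `_drain_token_queue` and `_flush_until_eos()` run inside an async service and likely differ in detail.

```python
import queue
import time

EOS = object()  # hypothetical end-of-stream marker on the token queue


class TranscriptAccumulator:
    def __init__(self, token_q):
        self.token_q = token_q
        self.partial_text = ""
        self.finished = False

    def drain(self):
        """Append tokens that are ready right now; never blocks."""
        while True:
            try:
                token = self.token_q.get_nowait()
            except queue.Empty:
                return
            if token is EOS:
                self.finished = True
                return
            # Only new tokens are processed; earlier text is not re-decoded.
            self.partial_text += token

    def flush_until_eos(self, timeout_s=2.0, poll_s=0.01):
        """Poll after speech stops so the final text is not dropped."""
        deadline = time.monotonic() + timeout_s
        while not self.finished and time.monotonic() < deadline:
            self.drain()
            if not self.finished:
                time.sleep(poll_s)
        return self.partial_text
```

Without the flush loop, draining would only happen when a new audio frame arrives, so tokens produced after the last frame would never be read. An async version would await a sleep instead of blocking.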
Summary
- Segmented mode (`STT_SERVICE=voxtral`): Batch transcription after VAD detects end of speech. ~0.30s per utterance at 480ms delay.
- Streaming mode (`STT_SERVICE=voxtral-streaming`): Incremental encode/decode during speech via a background thread. ~0.16s from end-of-turn to final transcription.
- `VOXTRAL_DELAY_MS` env var (default 480, must be a multiple of 80)
- `.env.example` with all supported environment variables, and STT/TTS configuration docs added to README
Changes
New files
- `src/pipecat_mcp_server/processors/voxtral_stt.py`: Segmented STT (extends `SegmentedSTTService`)
- `src/pipecat_mcp_server/processors/voxtral_streaming_stt.py`: Streaming STT (extends `STTService`) with background thread for MLX operations
- `.env.example`: Documented environment variables for STT, TTS, and transport configuration
Modified files
- `src/pipecat_mcp_server/agent.py`: STT service selection via `STT_SERVICE` env var, `_parse_voxtral_delay_ms()` validation, `load_dotenv(override=False)`
- `src/pipecat_mcp_server/agent_ipc.py`: `load_dotenv(override=False)` to avoid overwriting explicit env vars
- `pyproject.toml`: Add `voxmlx` dependency (macOS only)
- `.gitignore`: Add `uv.lock`, add `!.env.example` exception
- `README.md`: Add Configuration section with STT/TTS options tables
Architecture (streaming mode)
Key design decisions
- `queue.Queue` for thread-safe async↔sync communication
- `_flush_until_eos()` polls the token queue after VAD stop to ensure the final transcript is emitted
- `mx.clear_cache()` after every `mx.eval()` to prevent MLX memory pool leaks
- `_partial_text` accumulator for incremental decoding
- `_THREAD_ERROR` sentinel for thread crash handling
Test plan
macOS (Apple Silicon)
- `STT_SERVICE=voxtral`: verify segmented mode transcribes correctly
- `STT_SERVICE=voxtral-streaming`: verify streaming mode emits interim + final transcriptions
- `VOXTRAL_DELAY_MS` values: verify validation and rounding
- Default (`STT_SERVICE` unset): verify Whisper MLX fallback still works
Non-darwin (Linux)
- Default STT (`WhisperSTTService` with `faster-distil-whisper-large-v3`) still works
- `voxmlx` dependency is not pulled in (conditional on `sys_platform == 'darwin'`)
- `STT_SERVICE=voxtral` or `voxtral-streaming` fails gracefully on Linux (import error from voxmlx/MLX)