Offline speech-to-text service using Mistral Voxtral models, exposed over the Wyoming protocol for Home Assistant integration.
- Python 3.13, managed with `uv`
- Key deps: `transformers`, `torch`, `wyoming`, `mistral-common[audio]`
- Build: setuptools; entry point `voxtral-wyoming` -> `voxtral_wyoming.server:cli`
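Under setuptools that entry point would be declared roughly like this in `pyproject.toml` (a sketch based on the mapping above, not a copy of the actual file):

```toml
[project.scripts]
voxtral-wyoming = "voxtral_wyoming.server:cli"
```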
```
src/voxtral_wyoming/
├── __init__.py     # Package version
├── server.py       # CLI entry point, async Wyoming TCP server
├── audio.py        # Audio utilities (clamping, WAV saving, PCM helpers)
└── transcriber/
    ├── __init__.py
    ├── base.py     # ITranscriber protocol + TranscriptionResult dataclass
    └── voxtral.py  # VoxtralTranscriber impl (gen1 + gen2 auto-detection)
```
- `Dockerfile`, `docker-compose.yml`, `docker-compose.gpu.yml` for containerized deployment
- `.env.example` documents all environment variables with defaults
- `examples/client_sample.py` for testing
- `server.py`: Async TCP server implementing the Wyoming ASR protocol (Describe/Transcribe/AudioStart/AudioChunk/AudioStop events). Config loaded from env vars via `python-dotenv`.
- `transcriber/base.py`: `ITranscriber` Protocol; accepts PCM16 mono bytes and returns a `TranscriptionResult`.
- `transcriber/voxtral.py`: Loads the Voxtral model via HuggingFace `transformers` (not the alternative vLLM path). Auto-detects model generation at load time (gen2 `VoxtralRealtimeForConditionalGeneration` vs gen1 `VoxtralForConditionalGeneration`). Supports two transcription modes: transcribe-only (default) and chat mode with a system prompt.
- Device auto-detection: CUDA > MPS > CPU, with CPU fallback on failure.
- dtype auto-detected from model files unless overridden via the `DATA_TYPE` env var.
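A minimal sketch of what the `transcriber/base.py` contract might look like; only `ITranscriber` and `TranscriptionResult` come from the source, the method name, fields, and the toy implementation are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class TranscriptionResult:
    """Result of one transcription call; field names are illustrative."""
    text: str
    language: str | None = None


@runtime_checkable
class ITranscriber(Protocol):
    """Accepts PCM16 mono bytes, returns a TranscriptionResult."""

    def transcribe(self, pcm16: bytes, sample_rate: int = 16_000) -> TranscriptionResult:
        ...


class EchoTranscriber:
    """Toy implementation used here only to show the Protocol contract."""

    def transcribe(self, pcm16: bytes, sample_rate: int = 16_000) -> TranscriptionResult:
        return TranscriptionResult(text=f"{len(pcm16)} bytes at {sample_rate} Hz")
```

Because `ITranscriber` is a `Protocol`, any class with a matching `transcribe` method satisfies it structurally; `VoxtralTranscriber` never needs to inherit from it.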
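The CUDA > MPS > CPU preference order with CPU fallback can be sketched like this (a simplified stand-in, not the project's actual code):

```python
def pick_device() -> str:
    """Return the preferred torch device string: CUDA > MPS > CPU.

    Falls back to CPU if torch is missing or probing a backend fails.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    try:
        if torch.cuda.is_available():
            return "cuda"
        if torch.backends.mps.is_available():
            return "mps"
    except Exception:
        pass  # any probing error falls through to CPU
    return "cpu"
```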
```
uv venv && source .venv/bin/activate && uv sync
voxtral-wyoming               # uses .env
voxtral-wyoming path/to.env   # custom env file
```
- All config via environment variables (no CLI flags beyond an optional env file path)
- Audio format: PCM16 mono, little-endian, default 16kHz
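The clamping/PCM helpers in `audio.py` plausibly include a conversion like this from float samples to little-endian PCM16 (a stdlib-only sketch; the actual helper names and signatures are assumptions):

```python
import struct


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian PCM16 bytes.

    Out-of-range values are clamped first so struct.pack never overflows.
    """
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clamp to the valid range
        ints.append(int(s * 32767))  # scale to signed 16-bit
    return struct.pack(f"<{len(ints)}h", *ints)
```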
- Model is eagerly loaded at startup to avoid slow first request
- No tests exist yet in the repository; feel free to add them