AGENTS.md

Offline Speech-to-Text service using Mistral Voxtral models with Wyoming protocol for Home Assistant integration.

Tech Stack

  • Python 3.13, managed with uv
  • Key deps: transformers, torch, wyoming, mistral-common[audio]
  • Build: setuptools, entry point voxtral-wyoming -> voxtral_wyoming.server:cli

Project Structure

src/voxtral_wyoming/
  __init__.py          # Package version
  server.py            # CLI entry point, async Wyoming TCP server
  audio.py             # Audio utilities (clamping, WAV saving, PCM helpers)
  transcriber/
    __init__.py
    base.py            # ITranscriber protocol + TranscriptionResult dataclass
    voxtral.py         # VoxtralTranscriber impl (gen1 + gen2 auto-detection)
  • Dockerfile, docker-compose.yml, docker-compose.gpu.yml for containerized deployment
  • .env.example documents all environment variables with defaults
  • examples/client_sample.py for testing

Architecture

  • server.py: Async TCP server implementing Wyoming ASR protocol (Describe/Transcribe/AudioStart/AudioChunk/AudioStop events). Config loaded from env vars via python-dotenv.
  • transcriber/base.py: ITranscriber Protocol — accepts PCM16 mono bytes, returns TranscriptionResult.
  • transcriber/voxtral.py: Loads the Voxtral model via HuggingFace transformers rather than vLLM. Auto-detects the model generation at load time (gen2 uses VoxtralRealtimeForConditionalGeneration, gen1 uses VoxtralForConditionalGeneration). Supports two transcription modes: transcribe-only (default) and chat mode with a system prompt.
  • Device auto-detection: CUDA > MPS > CPU, with CPU fallback on failure.
  • dtype auto-detected from model files unless overridden via DATA_TYPE env var.
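
As a rough illustration of the transcriber interface and the device priority described above, here is a minimal sketch. The field names on TranscriptionResult, the transcribe method name and signature, the EchoTranscriber stand-in, and the pick_device helper are all assumptions made for this example; the actual definitions in transcriber/base.py and transcriber/voxtral.py may differ.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class TranscriptionResult:
    # Hypothetical fields; the real dataclass in base.py may differ.
    text: str
    language: Optional[str] = None


class ITranscriber(Protocol):
    # Hypothetical method name and signature.
    def transcribe(self, pcm16_mono: bytes, sample_rate: int = 16000) -> TranscriptionResult:
        ...


class EchoTranscriber:
    """Stand-in that satisfies the protocol structurally, without loading a model."""

    def transcribe(self, pcm16_mono: bytes, sample_rate: int = 16000) -> TranscriptionResult:
        n_samples = len(pcm16_mono) // 2  # 2 bytes per PCM16 sample
        seconds = n_samples / sample_rate
        return TranscriptionResult(text=f"<{seconds:.2f}s of audio>")


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device following the CUDA > MPS > CPU order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

In real code the two flags would presumably come from torch.cuda.is_available() and torch.backends.mps.is_available(); they are passed in as plain booleans here so the sketch runs without torch installed.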

Dev Setup

uv venv && source .venv/bin/activate && uv sync
voxtral-wyoming              # uses .env
voxtral-wyoming path/to.env  # custom env file
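
The two invocations above suggest that the CLI takes at most one positional argument. A minimal sketch of that resolution, plus reading the DATA_TYPE override, might look like the following. The function names and DEFAULT_ENV_FILE are hypothetical and not taken from server.py, and the real project loads the file with python-dotenv, which this stdlib-only sketch leaves out.

```python
import os
from typing import Mapping, Optional, Sequence

DEFAULT_ENV_FILE = ".env"  # assumed default, matching "uses .env" above


def resolve_env_file(argv: Sequence[str]) -> str:
    """Return the env file path: the first positional argument, else .env."""
    return argv[1] if len(argv) > 1 else DEFAULT_ENV_FILE


def read_dtype_override(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return DATA_TYPE if set and non-empty; None means auto-detect."""
    value = env.get("DATA_TYPE", "").strip()
    return value or None
```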

Key Conventions

  • All config via environment variables (no CLI flags beyond optional env file path)
  • Audio format: PCM16 mono, little-endian, default 16kHz
  • The model is loaded eagerly at startup to avoid a slow first request
  • No tests exist in the repository yet; adding tests is welcome
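
To make the audio convention concrete, here is a small stdlib-only sketch that builds PCM16 mono little-endian samples at 16 kHz and wraps them in a WAV container. The helper names sine_pcm16 and pcm16_to_wav_bytes are hypothetical and are not the functions in audio.py; it can be handy for producing test input.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # default rate named in the conventions above


def sine_pcm16(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> bytes:
    """Generate a half-amplitude sine tone as PCM16 mono, little-endian."""
    n = int(rate * seconds)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n)
    )
    # "<h" packs one signed 16-bit integer in little-endian byte order.
    return b"".join(struct.pack("<h", s) for s in samples)


def pcm16_to_wav_bytes(pcm: bytes, rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw PCM16 mono bytes in a WAV container, in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16 bits = 2 bytes per sample
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

Each sample occupies two bytes, so half a second of audio at 16 kHz is 8000 samples, or 16000 bytes of raw PCM.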

External Documentation