A low-latency, fully on-device voice and vision assistant for Reachy Mini Lite powered by NVIDIA Jetson. Everything runs locally with GPU acceleration — no cloud, no API keys, no internet required at runtime.
Current target: Jetson Orin Nano 8GB (JetPack 6.x, Python 3.10)
AGX Orin and Thor support is planned — see Roadmap.
Speak to Reachy Mini and it responds using a vision-language model that sees through its camera. The robot moves its head and antennas while it talks, and you can watch everything live through a browser-based UI.
[Mic] → [Silero VAD] → [faster-whisper STT] ──┐
[USB Camera] → [Frame Ring Buffer] ────────────┼→ [VLM stream] → [TTS stream] → [Speaker + Robot]
└→ [Web UI via WebSocket]
| Mode | Entry Point | Description |
|---|---|---|
| Vision Chat | python3 run_vision_chat.py |
Camera + VLM + voice (terminal only) |
| Web Vision Chat | python3 run_web_vision_chat.py |
Same as above + browser UI at :8090 |
| Voice Chat | python3 run_voice_chat.py |
Text LLM + optional RAG (no camera) |
| Text Chat | python3 main.py chat -t |
Interactive text chat (no mic/speaker) |
| CLI | python3 main.py ask "..." |
Single question, one-shot answer |
| Component | Library | Acceleration | Notes |
|---|---|---|---|
| VLM | llama.cpp (Docker) | GPU | Cosmos-Reason2-2B GGUF, OpenAI-compatible API |
| LLM | llama.cpp (Docker) | GPU | Gemma 3 1B for text-only mode |
| STT | faster-whisper | GPU (CUDA) | CTranslate2 with CUDA, small.en default |
| TTS | Kokoro ONNX | GPU (CUDA) | Natural voices, subprocess-isolated (see License Notes) |
| VAD | Silero VAD | CPU | Neural VAD, far better than energy-only |
| Camera | OpenCV V4L2 | CPU | 3 fps ring buffer, configurable resolution |
| Robot | Reachy Mini SDK | USB | Head pose, antennas, wake/sleep |
| RAG | ChromaDB + llama.cpp | GPU | bge-small-en-v1.5 embeddings (voice chat only) |
| Web UI | FastAPI + WebSocket | CPU | Live video, conversation stream, system stats |
- NVIDIA Jetson Orin Nano (8GB) with JetPack 6.x, Python 3.10, Docker + NVIDIA runtime
- Reachy Mini Lite connected via USB
- NVMe SSD recommended for swap and model storage
See SETUP.md for the full installation guide — hardware setup, dependencies, Python packages, model downloads, and troubleshooting.
This is the recommended mode — VLM + camera + voice + browser dashboard:
Terminal 1 — Start the VLM server:
NP=1 ./run_llama_cpp.sh Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q4_K_MWait until you see llama server listening at http://0.0.0.0:8080.
Terminal 2 — Start the assistant:
source venv/bin/activate
python3 run_web_vision_chat.pyOpen http://<jetson-ip>:8090 in a browser to see the live UI with camera feed, conversation log, and system stats. The robot listens through its microphone and responds via VLM + TTS.
Press Ctrl+C once to exit cleanly (robot will go to sleep position).
Same pipeline without the web UI:
NP=1 ./run_llama_cpp.sh Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q4_K_M
# In another terminal:
source venv/bin/activate
python3 run_vision_chat.pyFor text-only conversations with optional RAG:
./run_llama_cpp.sh ggml-org/gemma-3-1b-it-GGUF:Q8_0
# For RAG, also start the embedding server:
./run_llama_embedding.sh ggml-org/bge-small-en-v1.5-Q8_0-GGUF:Q8_0
# In another terminal:
source venv/bin/activate
python3 run_voice_chat.py # with RAG
python3 run_voice_chat.py --no-rag # without RAGpython3 main.py chat -t # interactive text chat
python3 main.py ask "What is the Jetson Orin?" # single question
python3 main.py info # system info
python3 main.py rag-status # RAG index status
python3 main.py rag-search "GPU specs" # search the knowledge basepython3 scripts/test_reachy_movement.py# Stop the LLM/VLM Docker container:
docker stop assistant-llm
# Stop the embedding server (if running):
docker stop assistant-embedThe web UI (run_web_vision_chat.py) provides a real-time dashboard accessible from any browser on the same network:
- Live camera feed at 10 fps (independent of the 3 fps VLM ring buffer)
- Conversation log with streaming VLM responses
- Push-to-talk button (starts muted, click to unmute)
- System stats — CPU, GPU, RAM usage
- Config panel — displays active settings
- Platform detection — shows the specific Jetson model
Access at http://<jetson-ip>:8090. The web UI adds minimal overhead (~5 MB RAM).
All settings live in config/settings.yaml. Edit this file to tune behavior:
| Section | What It Controls |
|---|---|
llm |
LLM server URL, model, temperature, max tokens, system prompts |
stt |
Whisper model size, CUDA device, beam size |
tts |
Voice, speed, language, chunking |
audio |
Sample rate, input device |
vad |
Silero threshold, silence duration, utterance filters |
vision |
Camera resolution, capture FPS, frames per query, VLM system prompt, few-shot examples |
reachy |
Robot connection, daemon behavior, wake/sleep on start/exit |
web |
UI FPS, host, port |
rag |
Embedding backend, knowledge directory, retrieval settings |
For developers adding new config fields, see app/config.py — typed dataclasses that define the schema and fallback defaults. The YAML always wins at runtime; the dataclass default is used if a key is missing from YAML.
reachy-mini-jetson-assistant/
├── app/
│ ├── pipeline.py # Audio I/O, VAD, TTS streaming, mic recording
│ ├── config.py # Configuration dataclasses + YAML loader
│ ├── llm.py # LLM/VLM client (OpenAI-compatible, multimodal)
│ ├── stt.py # faster-whisper speech-to-text
│ ├── tts.py # TTS client (spawns subprocess worker)
│ ├── tts_worker.py # TTS subprocess (Kokoro + GPL deps, isolated)
│ ├── camera.py # USB webcam ring buffer (OpenCV, V4L2)
│ ├── reachy.py # Reachy Mini connection, daemon management
│ ├── web.py # FastAPI + WebSocket server for browser UI
│ ├── monitor.py # System resource monitoring (CPU/GPU/RAM)
│ ├── rag.py # ChromaDB + embeddings retrieval
│ ├── audio.py # PulseAudio / ALSA device helpers
│ └── cli.py # Typer CLI (chat, ask, rag-*)
├── config/
│ └── settings.yaml # All runtime configuration
├── static/
│ └── index.html # Web UI (single-file HTML/CSS/JS)
├── scripts/
│ ├── bench_ttft.py # VLM TTFT benchmark
│ ├── test_reachy_movement.py # Robot movement test
│ └── test_vlm_prompts.py # VLM prompt experiments
├── knowledge_base/ # Markdown docs for RAG
├── models/ # Local GGUF models (gitignored)
├── voices/ # TTS voice files (gitignored)
├── run_web_vision_chat.py # Vision chat + web UI (recommended)
├── run_vision_chat.py # Vision chat (terminal only)
├── run_voice_chat.py # Voice chat with optional RAG
├── run_llama_cpp.sh # Docker LLM/VLM server launcher
├── run_llama_embedding.sh # Docker embedding server launcher
├── main.py # CLI entry point
└── requirements.txt # Python dependencies
| Metric | Value |
|---|---|
| STT latency | ~0.7s (small.en, beam=1) |
| VLM TTFT (warm cache) | ~6–8s (Cosmos-Reason2-2B Q4_K_M) |
| VLM TTFT (cold) | ~8–10s |
| TTS latency (first chunk) | ~0.3s (Kokoro GPU) |
| End-to-end (speak → robot responds) | ~8–12s |
| Peak RAM | ~7.5 GB (STT + VLM + TTS + camera + web UI) |
The VLM vision encoder prefill is the primary bottleneck on Orin Nano. Flash attention (-fa on) and KV cache prefix reuse (--cache-reuse 256) are enabled in run_llama_cpp.sh to minimize repeated work across queries.
- Orin Nano 8GB — full pipeline validated
- Web UI with live camera, conversation log, push-to-talk
- Kokoro TTS GPU acceleration
- Silero VAD for robust speech detection
- KV cache reuse + flash attention for faster VLM TTFT
- AGX Orin — larger models (Cosmos-Reason2-7B, Gemma 3 4B), higher resolution, multi-turn context
- Thor — real-time VLM, multi-camera, extended context windows
- Multi-turn conversation memory
- Wake word detection (hands-free activation)
- Gesture recognition via camera
- Multi-language support
Contributions for AGX Orin and Thor testing are welcome.
See SETUP.md for common issues and fixes.
| Resource | Link |
|---|---|
| Getting Started | huggingface.co/docs/reachy_mini |
| Reachy Mini Lite Setup | Lite Guide |
| Python SDK Docs | SDK Reference |
| Quickstart | First Behavior |
| AI Integrations | LLMs, Apps, HF Spaces |
| Core Concepts | Architecture & Coordinates |
| Code Examples | github.com/pollen-robotics/reachy_mini/examples |
| Community Apps | Hugging Face Spaces |
| Discord | Join the Community |
| Troubleshooting | FAQ Guide |
This project uses Kokoro ONNX for text-to-speech. Kokoro ONNX itself is MIT-licensed, but it depends on:
- phonemizer-fork — GPL-3.0 (text-to-phoneme conversion)
- espeak-ng — GPL-3.0 (speech synthesis library loaded by
espeakng-loader)
To avoid loading GPL-licensed code into the same process as NVIDIA CUDA libraries, TTS runs in a separate subprocess (app/tts_worker.py). The main application process never imports kokoro-onnx, phonemizer-fork, or espeak-ng — it communicates with the TTS worker via JSON over stdin/stdout. This is the same process-boundary isolation pattern used by the llama.cpp VLM backend (which runs in a separate Docker container).
All other dependencies use permissive licenses (MIT, BSD-3, Apache-2.0). See THIRD-PARTY-NOTICES.md for the full list.
We welcome community contributions. Please see CONTRIBUTING.md for guidelines, including the Developer Certificate of Origin (DCO) sign-off requirement.
Apache 2.0 — see LICENSE for details.

