Skip to content

NVIDIA-AI-IOT/reachy-mini-jetson-assistant

Repository files navigation

Reachy Mini Jetson Assistant

Reachy Mini Lite    x    NVIDIA Jetson

A low-latency, fully on-device voice and vision assistant for Reachy Mini Lite powered by NVIDIA Jetson. Everything runs locally with GPU acceleration — no cloud, no API keys, no internet required at runtime.

Current target: Jetson Orin Nano 8GB (JetPack 6.x, Python 3.10)

AGX Orin and Thor support is planned — see Roadmap.

What It Does

Speak to Reachy Mini and it responds using a vision-language model that sees through its camera. The robot moves its head and antennas while it talks, and you can watch everything live through a browser-based UI.

[Mic] → [Silero VAD] → [faster-whisper STT] ──┐
[USB Camera] → [Frame Ring Buffer] ────────────┼→ [VLM stream] → [TTS stream] → [Speaker + Robot]
                                               └→ [Web UI via WebSocket]

Demo

Reachy Mini Jetson Assistant — Web UI

Supported Modes

Mode Entry Point Description
Vision Chat python3 run_vision_chat.py Camera + VLM + voice (terminal only)
Web Vision Chat python3 run_web_vision_chat.py Same as above + browser UI at :8090
Voice Chat python3 run_voice_chat.py Text LLM + optional RAG (no camera)
Text Chat python3 main.py chat -t Interactive text chat (no mic/speaker)
CLI python3 main.py ask "..." Single question, one-shot answer

Stack

Component Library Acceleration Notes
VLM llama.cpp (Docker) GPU Cosmos-Reason2-2B GGUF, OpenAI-compatible API
LLM llama.cpp (Docker) GPU Gemma 3 1B for text-only mode
STT faster-whisper GPU (CUDA) CTranslate2 with CUDA, small.en default
TTS Kokoro ONNX GPU (CUDA) Natural voices, subprocess-isolated (see License Notes)
VAD Silero VAD CPU Neural VAD, far better than energy-only
Camera OpenCV V4L2 CPU 3 fps ring buffer, configurable resolution
Robot Reachy Mini SDK USB Head pose, antennas, wake/sleep
RAG ChromaDB + llama.cpp GPU bge-small-en-v1.5 embeddings (voice chat only)
Web UI FastAPI + WebSocket CPU Live video, conversation stream, system stats

Prerequisites

  • NVIDIA Jetson Orin Nano (8GB) with JetPack 6.x, Python 3.10, Docker + NVIDIA runtime
  • Reachy Mini Lite connected via USB
  • NVMe SSD recommended for swap and model storage

Setup

See SETUP.md for the full installation guide — hardware setup, dependencies, Python packages, model downloads, and troubleshooting.

Usage

Quick Start (Vision Chat with Web UI)

This is the recommended mode — VLM + camera + voice + browser dashboard:

Terminal 1 — Start the VLM server:

NP=1 ./run_llama_cpp.sh Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q4_K_M

Wait until you see llama server listening at http://0.0.0.0:8080.

Terminal 2 — Start the assistant:

source venv/bin/activate
python3 run_web_vision_chat.py

Open http://<jetson-ip>:8090 in a browser to see the live UI with camera feed, conversation log, and system stats. The robot listens through its microphone and responds via VLM + TTS.

Press Ctrl+C once to exit cleanly (robot will go to sleep position).

Vision Chat (Terminal Only)

Same pipeline without the web UI:

NP=1 ./run_llama_cpp.sh Kbenkhaled/Cosmos-Reason2-2B-GGUF:Q4_K_M
# In another terminal:
source venv/bin/activate
python3 run_vision_chat.py

Voice Chat (Text LLM, No Camera)

For text-only conversations with optional RAG:

./run_llama_cpp.sh ggml-org/gemma-3-1b-it-GGUF:Q8_0
# For RAG, also start the embedding server:
./run_llama_embedding.sh ggml-org/bge-small-en-v1.5-Q8_0-GGUF:Q8_0

# In another terminal:
source venv/bin/activate
python3 run_voice_chat.py           # with RAG
python3 run_voice_chat.py --no-rag  # without RAG

CLI Commands

python3 main.py chat -t                        # interactive text chat
python3 main.py ask "What is the Jetson Orin?"  # single question
python3 main.py info                            # system info
python3 main.py rag-status                      # RAG index status
python3 main.py rag-search "GPU specs"          # search the knowledge base

Test Robot Movement

python3 scripts/test_reachy_movement.py

Stopping

# Stop the LLM/VLM Docker container:
docker stop assistant-llm

# Stop the embedding server (if running):
docker stop assistant-embed

Web UI

The web UI (run_web_vision_chat.py) provides a real-time dashboard accessible from any browser on the same network:

  • Live camera feed at 10 fps (independent of the 3 fps VLM ring buffer)
  • Conversation log with streaming VLM responses
  • Push-to-talk button (starts muted, click to unmute)
  • System stats — CPU, GPU, RAM usage
  • Config panel — displays active settings
  • Platform detection — shows the specific Jetson model

Access at http://<jetson-ip>:8090. The web UI adds minimal overhead (~5 MB RAM).

Configuration

All settings live in config/settings.yaml. Edit this file to tune behavior:

Section What It Controls
llm LLM server URL, model, temperature, max tokens, system prompts
stt Whisper model size, CUDA device, beam size
tts Voice, speed, language, chunking
audio Sample rate, input device
vad Silero threshold, silence duration, utterance filters
vision Camera resolution, capture FPS, frames per query, VLM system prompt, few-shot examples
reachy Robot connection, daemon behavior, wake/sleep on start/exit
web UI FPS, host, port
rag Embedding backend, knowledge directory, retrieval settings

For developers adding new config fields, see app/config.py — typed dataclasses that define the schema and fallback defaults. The YAML always wins at runtime; the dataclass default is used if a key is missing from YAML.

Project Structure

reachy-mini-jetson-assistant/
├── app/
│   ├── pipeline.py          # Audio I/O, VAD, TTS streaming, mic recording
│   ├── config.py            # Configuration dataclasses + YAML loader
│   ├── llm.py               # LLM/VLM client (OpenAI-compatible, multimodal)
│   ├── stt.py               # faster-whisper speech-to-text
│   ├── tts.py               # TTS client (spawns subprocess worker)
│   ├── tts_worker.py        # TTS subprocess (Kokoro + GPL deps, isolated)
│   ├── camera.py            # USB webcam ring buffer (OpenCV, V4L2)
│   ├── reachy.py            # Reachy Mini connection, daemon management
│   ├── web.py               # FastAPI + WebSocket server for browser UI
│   ├── monitor.py           # System resource monitoring (CPU/GPU/RAM)
│   ├── rag.py               # ChromaDB + embeddings retrieval
│   ├── audio.py             # PulseAudio / ALSA device helpers
│   └── cli.py               # Typer CLI (chat, ask, rag-*)
├── config/
│   └── settings.yaml        # All runtime configuration
├── static/
│   └── index.html           # Web UI (single-file HTML/CSS/JS)
├── scripts/
│   ├── bench_ttft.py        # VLM TTFT benchmark
│   ├── test_reachy_movement.py   # Robot movement test
│   └── test_vlm_prompts.py  # VLM prompt experiments
├── knowledge_base/          # Markdown docs for RAG
├── models/                  # Local GGUF models (gitignored)
├── voices/                  # TTS voice files (gitignored)
├── run_web_vision_chat.py   # Vision chat + web UI (recommended)
├── run_vision_chat.py       # Vision chat (terminal only)
├── run_voice_chat.py        # Voice chat with optional RAG
├── run_llama_cpp.sh         # Docker LLM/VLM server launcher
├── run_llama_embedding.sh   # Docker embedding server launcher
├── main.py                  # CLI entry point
└── requirements.txt         # Python dependencies

Performance Notes (Orin Nano 8GB)

Metric Value
STT latency ~0.7s (small.en, beam=1)
VLM TTFT (warm cache) ~6–8s (Cosmos-Reason2-2B Q4_K_M)
VLM TTFT (cold) ~8–10s
TTS latency (first chunk) ~0.3s (Kokoro GPU)
End-to-end (speak → robot responds) ~8–12s
Peak RAM ~7.5 GB (STT + VLM + TTS + camera + web UI)

The VLM vision encoder prefill is the primary bottleneck on Orin Nano. Flash attention (-fa on) and KV cache prefix reuse (--cache-reuse 256) are enabled in run_llama_cpp.sh to minimize repeated work across queries.

Roadmap

  • Orin Nano 8GB — full pipeline validated
  • Web UI with live camera, conversation log, push-to-talk
  • Kokoro TTS GPU acceleration
  • Silero VAD for robust speech detection
  • KV cache reuse + flash attention for faster VLM TTFT
  • AGX Orin — larger models (Cosmos-Reason2-7B, Gemma 3 4B), higher resolution, multi-turn context
  • Thor — real-time VLM, multi-camera, extended context windows
  • Multi-turn conversation memory
  • Wake word detection (hands-free activation)
  • Gesture recognition via camera
  • Multi-language support

Contributions for AGX Orin and Thor testing are welcome.

Troubleshooting

See SETUP.md for common issues and fixes.

Reachy Mini Resources

Resource Link
Getting Started huggingface.co/docs/reachy_mini
Reachy Mini Lite Setup Lite Guide
Python SDK Docs SDK Reference
Quickstart First Behavior
AI Integrations LLMs, Apps, HF Spaces
Core Concepts Architecture & Coordinates
Code Examples github.com/pollen-robotics/reachy_mini/examples
Community Apps Hugging Face Spaces
Discord Join the Community
Troubleshooting FAQ Guide

License Notes

This project uses Kokoro ONNX for text-to-speech. Kokoro ONNX itself is MIT-licensed, but it depends on:

  • phonemizer-fork — GPL-3.0 (text-to-phoneme conversion)
  • espeak-ng — GPL-3.0 (speech synthesis library loaded by espeakng-loader)

To avoid loading GPL-licensed code into the same process as NVIDIA CUDA libraries, TTS runs in a separate subprocess (app/tts_worker.py). The main application process never imports kokoro-onnx, phonemizer-fork, or espeak-ng — it communicates with the TTS worker via JSON over stdin/stdout. This is the same process-boundary isolation pattern used by the llama.cpp VLM backend (which runs in a separate Docker container).

All other dependencies use permissive licenses (MIT, BSD-3, Apache-2.0). See THIRD-PARTY-NOTICES.md for the full list.

Contributing

We welcome community contributions. Please see CONTRIBUTING.md for guidelines, including the Developer Certificate of Origin (DCO) sign-off requirement.

License

Apache 2.0 — see LICENSE for details.

About

Reachy Mini Conversational Pipeline on Jetson

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors