Reachy Claw turns Reachy Mini into a fully local desktop AI robot that can listen, think, speak, emote, and react in real time.
Powered by local voice infrastructure and compatible with both Ollama and OpenClaw, it brings AI interaction out of the screen and into a physical robot experience.
git clone https://github.com/suharvest/reachy-claw.git && cd reachy-claw
uv sync --extra dev --extra audio
ollama pull qwen3.5:2b
uv run reachy-clawThat's it — the robot talks using a local LLM via Ollama. Open http://localhost:8640 for the dashboard.
For tool use and multi-agent capabilities, you can also use OpenClaw as the AI backend. See Configuration for details.
The built-in web dashboard at :8640 is the control center — switch modes, tune parameters, and monitor everything in real time.
Conversation mode — The robot listens, responds, and reacts. Face tracking follows whoever is speaking. Barge-in lets you interrupt mid-sentence. With VLM enabled, the robot can see — ask "what's in front of you?" and it uses its camera to describe the scene.
Monologue mode — The robot mumbles to itself about what it sees — no user interaction needed. Great for exhibitions and demos. Vision-driven emotions only, no motor jitter.
Switch between modes with one click in the dashboard.
| Category | Settings |
|---|---|
| Mode | Conversation / Monologue toggle |
| Voice | Speaker voice, pitch, speed, volume |
| Prompts | System prompts for both modes — fully editable |
| Motor | Enable/disable, preset (sensitive / moderate / smart) |
| Audio | VAD threshold, energy threshold — tune for your environment |
| Vision | VLM (camera vision) on/off, barge-in on/off |
| Memory | Conversation history depth (0 = stateless) |
| Faces | Register/delete faces, import/export face DB, live camera enroll |
All settings persist across restarts via runtime-overrides.yaml.
- Fully local pipeline — Paraformer ASR + Matcha TTS via sherpa-onnx, runs on Jetson / RK3588 / any CUDA device, no cloud required
- Two AI modes — Ollama for zero-setup local LLM, or OpenClaw for tool use and multi-agent; switch with one click
- Emotion-driven motion — 14 distinct emotions mapped to head movements and antenna expressions
- Face tracking — MediaPipe-powered gaze following so the robot looks at whoever is speaking
- Streaming TTS — sentence-level streaming for low-latency responses
- Vision (VLM tool calling) — Ask "what do you see?" and the robot uses Ollama's native tool calling to grab a camera frame, run a vision model (e.g. Qwen 3.5 2B), and describe the scene — all on-device
- Robust barge-in — 3-layer noise filtering (energy gate + VAD threshold + cooldown) for reliable interrupts even in noisy environments; toggle on/off live from the dashboard
- Web dashboard — live video, ASR transcript, face management, and all settings in one place
- Pluggable backends — swap STT/TTS/VAD/LLM without changing code (Paraformer, Matcha, Whisper, ElevenLabs, Ollama, OpenClaw, and more)
- Plugin architecture — Motion, Conversation, FaceTracker, Dashboard, and VisionClient as independent plugins
- Flexible deployment — Docker Compose on Jetson, standalone, simulator, or via Reachy Mini daemon
- Quickstart
- Dashboard
- Key Features
- Latency
- AI Stack
- Architecture
- Edge Speech Service
- Running as a Reachy Mini App
- Configuration
- Speech Backends
- Installation
- Scripts
- OpenClaw Skill
- Development
- Key Files
- Acknowledgements
Measured on Jetson Orin NX with CUDA acceleration:
| Stage | Engine | Latency |
|---|---|---|
| Speech detection (VAD) | Silero VAD | ~10ms |
| Speech-to-text | Paraformer streaming (sherpa-onnx) | ~50ms TTFT |
| Text-to-speech | Matcha-TTS + Vocos (sherpa-onnx) | ~60ms TTFT |
| ASR + TTS combined | ~110ms |
Full voice-to-voice latency (including LLM inference) is typically under 1 second with a 2B parameter model.
Everything runs locally on edge hardware — no cloud APIs required. Here's the full model inventory:
| Domain | Component | Model | Accelerator | Notes |
|---|---|---|---|---|
| Voice AI | Voice Activity Detection | Silero VAD | CPU (ONNX) | ~10ms latency |
| Speech-to-Text (streaming) | Paraformer / Zipformer | CUDA | Paraformer: zh+en ~50ms TTFT; Zipformer: en ~50ms TTFT | |
| Text-to-Speech | Matcha-TTS + Vocos / Kokoro | CUDA | Matcha: zh+en ~60ms TTFT; Kokoro: en ~130ms TTFT | |
| LLM | Conversational AI | Qwen 3.5 2B | CUDA (Ollama) | On-device; swappable to any Ollama/OpenClaw model |
| Vision AI | Face Detection | SCRFD-2.5D | TensorRT | Real-time multi-face detection |
| Face Embedding | ArcFace MBF (W600K) | TensorRT | Face recognition & re-identification | |
| Emotion Detection | EfficientNet-B0 (AFEW) | TensorRT | 8-class facial emotion classification | |
| Face Tracking (client) | MediaPipe Face Detection | CPU | Gaze following on the robot side (as fallback) |
The STT/TTS pipeline is powered by Jetson Voice — a standalone Docker service built on sherpa-onnx that handles ASR and TTS inference with CUDA acceleration. VAD runs locally on the client side (reachy-claw) via ONNX Runtime — no network round-trip needed for speech detection. See Edge Speech Service for setup.
The vision pipeline runs in the vision-trt container using TensorRT-optimized models for face detection, embedding, and emotion recognition.
Reachy Claw is the client-side runtime. The AI backend is swappable — Ollama or OpenClaw:
┌─────────────────────────────────────────────────────────────────────┐
│ reachy-claw (this project) │
│ │
│ Plugins: │
│ ConversationPlugin — mic → VAD → STT → [LLM] → TTS → speaker │
│ MotionPlugin — emotions, head tracking, idle animations │
│ FaceTrackerPlugin — MediaPipe face detection → head gaze │
│ │
│ Shared state: │
│ HeadTargetBus — fuses face/DOA/neutral head targets │
│ EmotionMapper — 14 emotions → head+antenna poses │
│ HeadWobbler — speech-driven head micro-movements │
└──────┬──────────────────┬───────────────┬────────────┬──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Reachy Mini SDK Speech Service Ollama OpenClaw Gateway
(reachy-mini) (jetson-voice) (local LLM) (openclaw)
─── OR ───
The [LLM] block in the pipeline is a pluggable interface: OllamaClient calls Ollama's HTTP API directly; DesktopRobotClient connects to OpenClaw via WebSocket. Both implement the same streaming response protocol, so the rest of the pipeline (STT, TTS, motion, barge-in) works identically.
| Project | Role | Default Port | Protocol |
|---|---|---|---|
| reachy-claw (this) | Client runtime — plugins, STT/TTS pipeline, robot control | — | — |
| Ollama | Local LLM inference (optional, for Ollama mode) | :11434 |
HTTP |
| OpenClaw | AI gateway — LLM, tools, multi-turn conversation (optional, for OpenClaw mode) | :18789 (gateway) |
HTTP |
| ↳ desktop-robot extension | WebSocket bridge for voice assistants | :18790 |
ws://.../desktop-robot |
| Jetson Voice | Edge speech — Paraformer ASR + Matcha TTS (sherpa-onnx, CUDA) | :8621 |
HTTP REST |
| reachy-mini daemon | Robot hardware — SDK access, motor control | :8000 (dev) / :38001 (Jetson) |
Zenoh :7447 |
| Plugin | Purpose | Depends on |
|---|---|---|
ConversationPlugin |
Full conversation loop: mic → VAD → STT → LLM (Ollama or OpenClaw) → TTS → speaker. Handles barge-in, emotions, and robot commands. | Ollama or OpenClaw, Speech service |
MotionPlugin |
Executes emotion expressions (head + antenna), fused head tracking, idle animations. | Reachy Mini SDK |
FaceTrackerPlugin |
MediaPipe face detection → HeadTarget published to HeadTargetBus. | Camera |
DashboardPlugin |
Web dashboard with live video, ASR transcript, robot state, and settings UI. | — |
VisionClientPlugin |
Connects to vision-trt via ZMQ for face/emotion data and smile capture events. | vision-trt |
reachy-claw connects to OpenClaw's desktop-robot extension via WebSocket.
Passive emotion channel — The LLM includes [emotion:happy] tags in its responses. The extension strips them from the text stream and sends a separate emotion message. reachy-claw receives it and immediately queues the expression — zero extra round-trip.
LLM output: "[emotion:happy] Sure, I'd love to help!"
→ extension sends: {"type": "emotion", "emotion": "happy"}
→ extension sends: {"type": "stream_delta", "text": "Sure, I'd love to help!"}
→ reachy-claw: queue_emotion("happy") → MotionPlugin executes head+antenna pose
Active robot commands — User says "do a dance", LLM calls a reachy_* tool, extension forwards it as robot_command for client-side execution:
LLM tool call: reachy_dance({dance_name: "celebrate"})
→ extension sends: {"type": "robot_command", "action": "dance", "params": {...}, "commandId": "..."}
→ reachy-claw: execute dance routine → send robot_result back
Available actions: move_head, move_antennas, play_emotion, dance, capture_image, status. See action-skill/SKILL.md for the full tool interface.
Microphone → VAD (Silero) → Speech detected?
→ STT (Paraformer streaming) → text
→ LLM (Ollama HTTP or OpenClaw WebSocket) → AI response (streaming)
→ Sentence splitter → TTS (Matcha, streaming) → Speaker
→ EmotionMapper → Robot head/antennas
Camera → MediaPipe → HeadTarget → HeadTargetBus → Robot head
TTS audio → HeadWobbler → speech roll/pitch/yaw → Robot head
Barge-in: energy gate → VAD probability → confirm frames → interrupt TTS → new turn
On Jetson, all services run as containers with network_mode: host. A single docker-compose.yml manages everything — OpenClaw is an opt-in profile:
# Ollama mode (default — 2 containers)
docker compose up -d
# OpenClaw mode (3 containers)
docker compose --profile openclaw up -d
# + Vision (face detection, emotion recognition, smile capture)
docker compose --profile vision up -d
# OpenClaw + Vision
docker compose --profile openclaw --profile vision up -d| Container | Profile | Ports | Purpose |
|---|---|---|---|
reachy-daemon |
always | :38001 (FastAPI), :7447 (Zenoh) |
Robot hardware control |
reachy-claw |
always | :8640 (dashboard), :8641 (API) |
Conversation + dashboard |
speech (external) |
always | :8621 |
STT/TTS (sherpa-onnx) |
openclaw-gateway |
openclaw |
:18789, :18790 |
AI gateway (LLM + tools) |
vision-trt |
vision |
:8630 (API), :8631 (ZMQ), :8632 (stream) |
TensorRT face/emotion pipeline |
One-click deploy from your laptop:
cd deploy/jetson
./deploy.sh # Ollama mode
./deploy.sh --openclaw # OpenClaw mode
./deploy.sh --setup-openclaw # OpenClaw + first-time extension setupAuto-launch the dashboard in fullscreen on Jetson boot:
# Install from your laptop (remote)
bash deploy/jetson/kiosk/install.sh user@jetson-ip
# Or install locally on the Jetson
bash deploy/jetson/kiosk/install.shThis installs a GNOME autostart entry that waits for the dashboard to be ready, then opens Chromium in kiosk mode.
See deploy/jetson/ for Dockerfiles and compose config.
The current pinned image tag is reachy-claw:v1.21 (see
CHANGELOG.md for what changed).
The speech service is maintained as a standalone project: Jetson Voice — a Docker-based voice stack that runs Paraformer ASR and Matcha TTS on Jetson (or any CUDA device) with sherpa-onnx.
# On your edge device (e.g. Jetson Orin NX, JetPack 6.2)
git clone https://github.com/suharvest/jetson-local-voice.git
cd jetson-local-voice
docker compose build && docker compose up -d
curl http://localhost:8000/healthThen point reachy-claw at it:
uv run reachy-claw \
--stt paraformer-streaming \
--tts matcha \
--speech-url http://<device-ip>:8000See the Jetson Voice README for full API docs, benchmarks, model comparison, and patched sherpa-onnx details.
This project can run in two ways:
uv run reachy-claw --gateway-host 192.168.1.100The app manages the Reachy Mini connection itself.
The project registers as a Reachy Mini app via the reachy_mini_apps entry point. Install it into the daemon's environment:
pip install /path/to/reachy-clawThen start via the daemon API:
# List available apps
curl http://localhost:8000/apps/list-available
# Start
curl http://localhost:8000/apps/start-app/reachy_claw
# Stop
curl http://localhost:8000/apps/stop-current-appOr run directly as a Reachy Mini app (daemon must be running):
python -m reachy_claw.reachy_appIn daemon mode, the Reachy Mini connection is managed by the daemon and passed to the app.
Most settings can be changed live through the Dashboard — no restart needed. For initial setup or headless deployment, use the YAML config file.
Configuration is layered (highest priority wins):
CLI args > Environment variables > Dashboard overrides > YAML config file > Defaults
Dashboard changes are saved to runtime-overrides.yaml automatically and persist across restarts.
Copy the example and edit:
cp reachy-claw.example.yaml reachy-claw.yamlThe app auto-detects config files in this order:
./reachy-claw.yamlor./reachy-claw.yml(current directory)~/.reachy-claw/config.yaml
Or specify explicitly:
reachy-claw --config /path/to/config.yaml
# or
export REACHY_CLAW_CONFIG=/path/to/config.yamlExample: Ollama mode (simplest — no gateway needed):
llm:
backend: ollama
model: qwen3.5:2b
max_history: 5
vad:
backend: sileroExample: OpenClaw mode (full features, edge deployment with Jetson):
llm:
backend: gateway # default — use OpenClaw
gateway:
host: 192.168.1.100
port: 18790
stt:
backend: paraformer-streaming
speech_service_url: http://192.168.1.50:8621
tts:
backend: matcha
speech_service_url: http://192.168.1.50:8621
matcha:
speaker_id: 0
speed: 1.2
vad:
backend: silero
behavior:
wake_word: hey reachy
play_emotions: true
vision:
tracker: mediapipe
camera_index: 0See reachy-claw.example.yaml for the full list of 70+ options.
| Variable | Description |
|---|---|
OPENCLAW_HOST |
Gateway host (default: 127.0.0.1) |
OPENCLAW_PORT |
Gateway port (default: 18790) |
OPENCLAW_TOKEN |
Gateway auth token |
OPENCLAW_PATH |
WebSocket path (default: /desktop-robot) |
STT_BACKEND |
STT backend (default: paraformer-streaming) |
TTS_BACKEND |
TTS backend (default: matcha) |
SPEECH_SERVICE_URL |
Remote speech service URL |
WHISPER_MODEL |
Whisper model size: tiny, base, small, medium, large |
WAKE_WORD |
Wake word to activate listening |
OPENCLAW_OPENAI_TOKEN / OPENAI_API_KEY |
OpenAI API key (for --stt openai) |
REACHY_CLAW_CONFIG |
Path to YAML config file |
Backend-specific env vars are auto-generated from each backend's Settings class (e.g. MATCHA_SPEAKER_ID, MATCHA_SPEED, SENSEVOICE_LANGUAGE).
ElevenLabs TTS:
REACHY_ELEVENLABS_API_KEYorELEVENLABS_API_KEY(required)REACHY_ELEVENLABS_VOICE_IDorELEVENLABS_VOICE_ID(optional)REACHY_ELEVENLABS_MODEL_IDorELEVENLABS_MODEL_ID(optional)REACHY_ELEVENLABS_OUTPUT_FORMATorELEVENLABS_OUTPUT_FORMAT(optional)
-c, --config Path to YAML config file
-v, --verbose Debug logging
--gateway-host OpenClaw host (default: 127.0.0.1)
--gateway-port OpenClaw port (default: 18790)
--gateway-token Auth token
--reachy-mode auto | localhost_only | network
--stt (choices auto-detected from registry)
--whisper-model tiny | base | small | medium | large
--tts (choices auto-detected from registry)
--tts-voice Voice ID (backend-specific)
--tts-model Model path (for Piper)
--speech-url Remote speech service URL
--audio-device Input device name
--vad silero | energy
--wake-word Wake phrase
--no-emotions Disable emotion animations
--no-idle Disable idle animations
--no-barge-in Disable barge-in
--no-face-tracking Disable face tracking
--tracker-type mediapipe | none
--camera-index Camera device index
--standalone Run without gateway
--demo Run robot movement demo and exit
Backends are discovered automatically via the @register_tts / @register_stt / @register_vad decorators in backend_registry.py.
| Name | Type | Description |
|---|---|---|
paraformer-streaming |
Remote (sherpa-onnx) | Default. Streaming ASR, bilingual zh+en, ~50ms TTFT |
sensevoice |
Remote (sherpa-onnx) | Offline ASR, 5 languages (zh/en/ja/ko/yue) |
whisper |
Local | OpenAI Whisper |
faster-whisper |
Local | CTranslate2-optimized Whisper |
openai |
Cloud | OpenAI Whisper API |
| Name | Type | Description |
|---|---|---|
matcha |
Remote (sherpa-onnx) | Default. Matcha-TTS + Vocos, best Chinese quality, ~60ms TTFT |
kokoro |
Remote (sherpa-onnx) | Kokoro TTS, multilingual |
elevenlabs |
Cloud | ElevenLabs API |
macos-say |
Local | macOS built-in say command |
piper |
Local | Piper neural TTS |
none |
-- | Dummy (prints text, no audio) |
| Name | Type | Description |
|---|---|---|
silero |
Local (ONNX) | Default. Silero VAD, accurate speech detection |
energy |
Local | Simple RMS energy threshold |
Create a class in tts.py or stt.py with the decorator -- that's it:
@register_tts("my-backend")
class MyTTS(TTSBackend):
"""My custom TTS backend."""
# Optional: declare backend-specific config fields
class Settings:
api_key: str = ""
voice_type: str = "default"
def __init__(self, base_url="http://localhost:8000", api_key="", voice_type="default"):
self._base_url = base_url
self._api_key = api_key
self._voice_type = voice_type
async def synthesize(self, text: str) -> str:
# Call your API, return path to temp audio file
...This automatically provides:
--tts my-backendCLI optiontts.my_backend_api_key/tts.my_backend_voice_typein YAML configMY_BACKEND_API_KEY/MY_BACKEND_VOICE_TYPEenvironment variablesconfig.my_backend_api_key/config.my_backend_voice_typeconfig attributes
- Python 3.10+
- Reachy Mini SDK (
reachy-mini) - For edge speech: a CUDA device running the Jetson Voice speech service (see Edge Speech Service)
uv syncDevelopment install:
uv sync --extra dev- local mic input:
uv sync --extra audio - local faster transcription:
uv sync --extra local-stt - OpenAI cloud transcription:
uv sync --extra cloud-stt - Reachy vision extras:
uv sync --extra vision - MediaPipe face tracking:
uv sync --extra mediapipe-vision - MuJoCo simulator:
uv sync --extra sim
| Script | Purpose |
|---|---|
scripts/run_sim.sh |
Launch MuJoCo simulator + gateway check + start app (no face tracking) |
scripts/run_real.sh |
Pre-flight checks (gateway, Jetson, Reachy gRPC) + start with edge speech |
scripts/run_sim_daemon.py |
MuJoCo simulator daemon (macOS, GStreamer shim) |
The action-skill/SKILL.md describes the robot control tools available to OpenClaw's LLM. Give this file to your OpenClaw agent so it knows what it can do:
reachy_move_head/reachy_move_antennas— direct motor controlreachy_play_emotion— 14 emotions expressed as head+antenna posesreachy_dance— 5 choreographed routines (nod, wiggle, celebrate, curious_look, lobster)reachy_capture_image— camera snapshotreachy_status— connection state and live positions
No Python code in this directory — it's a pure interface description. Execution happens in reachy-claw's ConversationPlugin via the robot_command WebSocket channel.
uv sync --extra dev
uv run pytest
uv tool run ruff check .| File | Purpose |
|---|---|
src/reachy_claw/main.py |
CLI entrypoint |
src/reachy_claw/app.py |
ReachyClawApp orchestrator |
src/reachy_claw/reachy_app.py |
Reachy Mini daemon app adapter |
src/reachy_claw/gateway.py |
OpenClaw WebSocket protocol (emotion + robot_command channels) |
src/reachy_claw/llm.py |
Ollama LLM client (HTTP streaming, conversation history) |
src/reachy_claw/plugins/conversation_plugin.py |
Conversation loop + robot command handlers |
src/reachy_claw/plugins/motion_plugin.py |
Emotion execution + head tracking fusion |
src/reachy_claw/plugins/face_tracker_plugin.py |
MediaPipe face detection → HeadTarget |
src/reachy_claw/motion/emotion_mapper.py |
14 emotions → head+antenna poses |
src/reachy_claw/motion/dances.py |
5 dance routines (choreographed movement sequences) |
src/reachy_claw/motion/head_target.py |
HeadTargetBus — fused head target from multiple sources |
src/reachy_claw/motion/head_wobbler.py |
Speech-driven head micro-movements |
src/reachy_claw/backend_registry.py |
Auto-discovery registry for STT/TTS/VAD backends |
src/reachy_claw/config.py |
Configuration (YAML + env + CLI, 70+ options) |
src/reachy_claw/plugins/dashboard_plugin.py |
Web dashboard — video proxy, WebSocket state, settings |
src/reachy_claw/plugins/vision_client_plugin.py |
Vision-TRT ZMQ client — face/emotion/capture relay |
action-skill/SKILL.md |
OpenClaw tool interface for LLM-driven robot control |
deploy/jetson/ |
Dockerfiles + compose for Jetson deployment |
deploy/jetson/kiosk/ |
Kiosk autostart scripts for exhibition displays |
This project was originally based on clawd-reachy-mini by Artur Skowronski.
- OpenClaw -- AI gateway framework
- sherpa-onnx -- speech inference engine (Paraformer ASR, Matcha TTS, Silero VAD)
- Pollen Robotics -- Reachy Mini hardware and SDK
- MediaPipe -- face detection and tracking
- OpenAI Whisper -- speech-to-text
- ElevenLabs / Piper -- text-to-speech engines
