Voice Assistant (Go + sherpa-onnx)

A real-time voice assistant that runs entirely locally, implemented in Go and using sherpa-onnx for speech recognition and synthesis.

This is my first foray into the world of AI-powered voice assistants. As a fan of hyper-efficient code and with edge devices in mind, I’m avoiding Python and instead building the assistant in Go for learning purposes and to gain experience developing applications with CoreML and CUDA support.

For the LLM processing, I'm relying on Ollama, as it works perfectly for this use case, even though I'm just scratching the surface of what I can do with it.

Features

  • Voice Activity Detection (VAD): Silero-VAD for accurate speech boundary detection
  • Speech-to-Text (STT): Pluggable backend (--stt-backend); ships with Whisper multilingual model for high-quality transcription (99 languages)
  • Text-to-Speech (TTS): Pluggable backend (--tts-backend); ships with Kokoro for natural-sounding voice synthesis with emotional expression
  • LLM Integration: Ollama API for conversational AI with agentic tool calling
  • Agentic Tools: Weather information and web search capabilities
  • Low Latency: Optimized for real-time conversation
  • Interrupt Support: Optional stop playback when user starts speaking
  • Wake Word: Optional wake word activation
  • Hardware Acceleration: Auto-detected CoreML (macOS) and CUDA (Linux)
  • Multilingual: Both STT and TTS support multiple languages (English, Spanish, French, German, etc.)
  • Live Translation: Zero-code configuration for real-time language translation
  • Configurable Temperature: Adjustable LLM temperature for translation vs. conversational tasks
  • Shared Assets: Models stored in ~/.voice-assistant

Cross-Platform Support

This implementation supports multiple platforms with hardware acceleration:

Platform | STT Provider | TTS Provider | Notes
macOS (Intel) | coreml | coreml | Full CoreML acceleration
macOS (Apple Silicon) | coreml | coreml | ANE for STT and TTS
Linux (NVIDIA GPU) | cuda | cuda | Full GPU acceleration
Linux (Jetson SOC) | cuda | cuda | Jetson GPU (Nano, Orin, Xavier)
Linux (CPU only) | cpu | cpu | CPU multi-threading

Providers are auto-detected at runtime based on your platform. Kokoro TTS supports full CoreML acceleration on macOS and CUDA on Linux.

You can override providers with --provider (global), --stt-provider, and --tts-provider flags.

Tested Hardware

This solution has been designed and tested on the following platforms:

Device | CPU | Memory | Audio Device | Notes
Apple Mac Mini M4 | Apple M4 (10-core) | 16GB unified | AirPods Pro | Full CoreML acceleration (ANE)
NVIDIA Jetson Orin Nano Super | ARM Cortex-A78AE | 8GB unified | AirPods Pro | Full CUDA acceleration

⚡ Running on Jetson Orin Nano? See JETSON_OPTIMIZATION.md for memory optimization strategies to run larger models on 8GB devices.

Minimum Hardware Requirements

  • Memory: 8GB minimum (unified memory recommended)
  • Storage: ~2GB for models
  • Audio: Bluetooth audio devices (tested with AirPods Pro) or USB/built-in microphone and speakers
  • GPU/Accelerator: Apple Silicon (M1/M2/M3/M4) with ANE, or NVIDIA GPU with CUDA support

Architecture

flowchart LR
    subgraph Pipeline
        A[🎤 Audio Capture<br/>malgo] --> B[🗣️ VAD + STT<br/>--stt-backend]
        B --> C[🧠 LLM<br/>Ollama]
        C --> D[📢 TTS<br/>--tts-backend]
        D --> E[🔊 Playback<br/>malgo]
        C -.->|Tool Calls| F[🔧 Tools<br/>Weather & Search]
        F -.->|Results| C
    end
    
    E -.->|Interrupt Flag<br/>speech detected| A
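
For orientation, the stages above can be thought of as goroutines connected by channels. The sketch below is illustrative only; the function names and the Utterance type are assumptions, not the actual code in cmd/assistant/main.go.

package pipeline

import (
    "context"
    "fmt"
)

// Utterance is one VAD-delimited speech segment (illustrative type).
type Utterance struct{ Samples []float32 }

// runPipeline wires the stages from the diagram: capture feeds utterances,
// which flow through STT, the LLM, and finally TTS/playback.
func runPipeline(ctx context.Context,
    capture func(ctx context.Context, out chan<- Utterance),
    transcribe func(Utterance) (string, error),
    ask func(ctx context.Context, prompt string) (string, error),
    speak func(ctx context.Context, text string) error,
) {
    utterances := make(chan Utterance, 1)
    go capture(ctx, utterances) // 🎤 audio capture + VAD feed this channel

    for {
        select {
        case <-ctx.Done():
            return
        case u := <-utterances:
            text, err := transcribe(u) // 🗣️ STT
            if err != nil || text == "" {
                continue
            }
            reply, err := ask(ctx, text) // 🧠 LLM (may call tools internally)
            if err != nil {
                fmt.Println("llm error:", err)
                continue
            }
            _ = speak(ctx, reply) // 📢 TTS → 🔊 playback (interruptible)
        }
    }
}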

Prerequisites

  • Go 1.26 or later
  • CGO enabled (CGO_ENABLED=1)
  • Ollama running locally with a model loaded

Installing Go

macOS (Homebrew):

brew install go

macOS / Linux (official installer):

# Download and install the latest Go release from https://go.dev/dl/
# Example for Linux arm64 (adjust version and arch as needed):
curl -OL https://go.dev/dl/go1.26.1.linux-arm64.tar.gz
sudo tar -C /usr/local -xzf go1.26.1.linux-arm64.tar.gz
export PATH=$PATH:/usr/local/go/bin   # add to ~/.bashrc or ~/.zshrc

Verify the installation:

go version  # should print go1.26.0 or later

Platform-Specific Requirements

macOS:

  • Xcode Command Line Tools: xcode-select --install
  • CoreML is automatically available on macOS 10.13+

Linux (CPU):

  • ALSA development libraries: sudo apt install libasound2-dev

Linux (NVIDIA CUDA):

  • ALSA development libraries: sudo apt install libasound2-dev
  • NVIDIA GPU with CUDA support
  • NVIDIA Driver 450.80.02+
  • CUDA Toolkit 11.0+: sudo apt install nvidia-cuda-toolkit
  • cuDNN 8.0+ (optional, for optimal performance)

To verify CUDA is available:

nvidia-smi  # Should show your GPU
nvcc --version  # Should show CUDA version

Installation

1. Download Models

Build the application first (see step 2), then run the built-in setup command to download required models (default: ~900MB total):

./voice-assistant --setup

This downloads:

  • Silero-VAD: Voice activity detection model
  • Whisper tiny: Multilingual speech recognition model (int8 quantized, 99 languages)
  • Kokoro v1.0: Multilingual text-to-speech model with natural voices

Setup Options:

# Force re-download even if files exist
./voice-assistant --setup --force

# Custom model directory
./voice-assistant --setup --model-dir /custom/path

# Combine with a different Whisper model size
./voice-assistant --setup --stt-model small

The setup command is idempotent — it won't re-download existing files unless --force is used.

Choosing STT Model:

Model | Memory | Download | WER | Speed | Best For
tiny | ~390MB | 111MB | ~5.0% | 32x realtime | Jetson, edge devices
base | ~740MB | 198MB | ~3.4% | 16x realtime | Balanced accuracy/speed
small | ~2.4GB | 610MB | ~2.2% | 6x realtime | Desktop, best accuracy

2. Build the Application

./scripts/build.sh

Or manually:

CGO_ENABLED=1 go build -o voice-assistant ./cmd/assistant

The build automatically selects the correct platform-specific sherpa-onnx bindings:

  • macOS: Uses sherpa-onnx-go-macos with CoreML support
  • Linux: Uses sherpa-onnx-go-linux (CPU-only by default)

Building with CUDA Support (Linux)

The default sherpa-onnx-go-linux package includes CPU-only binaries. For true CUDA/GPU acceleration on NVIDIA hardware (including Jetson devices), you need to build sherpa-onnx from source with CUDA enabled.

The build script handles this automatically:

# Auto-detect: builds with CUDA if GPU and CUDA toolkit are found
./scripts/build.sh

# Force CUDA build (requires CUDA toolkit)
./scripts/build.sh --cuda

# Force CPU-only build (skip CUDA even if available)
./scripts/build.sh --cpu

CUDA Build Requirements:

  • NVIDIA GPU (discrete or Jetson SOC)
  • CUDA Toolkit (or JetPack for Jetson)
  • CMake 3.13+
  • Git
  • C++ compiler (gcc/g++)

The build script will:

  1. Clone sherpa-onnx source (once)
  2. Build with -DSHERPA_ONNX_ENABLE_GPU=ON
  3. Install to .sherpa-onnx-cuda/ in your project
  4. Link your build against the CUDA-enabled libraries

First build takes ~10-20 minutes depending on your hardware. Subsequent builds use the cached sherpa-onnx libraries.

Verify CUDA is working:

./run-voice-assistant.sh
# Should show: ⚡ STT acceleration: cuda, TTS acceleration: cuda
# Should NOT show: "Please compile with -DSHERPA_ONNX_ENABLE_GPU=ON" warnings

3. Start Ollama

Make sure Ollama is running with a model that supports tool calling:

# Pull the recommended model (supports tool calling + multilingual)
ollama pull qwen2.5:1.5b

# Start a chat to keep the model loaded
ollama run qwen2.5:1.5b

Note: The default model has changed from gemma3:1b to qwen2.5:1.5b to support agentic tool calling for weather and web search while keeping memory usage low.

4. Run the Assistant

macOS or Linux (CPU):

./voice-assistant

Linux with CUDA (recommended):

./run-voice-assistant.sh

The wrapper script automatically:

  • Sets up LD_LIBRARY_PATH for CUDA libraries
  • Detects Jetson hardware and pre-loads Ollama model to prevent memory fragmentation
  • Extracts model from command line args for proper pre-loading

Jetson Orin Nano Super users: See JETSON_OPTIMIZATION.md for memory optimization details.

Selecting STT / TTS Backend

The assistant supports pluggable STT and TTS backends via --stt-backend and --tts-backend. Each backend interprets --stt-model and --tts-voice in its own way.

# Defaults (equivalent to not passing the flags)
./voice-assistant --stt-backend whisper --tts-backend kokoro

Currently available backends:

  • STT: whisper (default)
  • TTS: kokoro (default)

To add a new backend, implement the Transcriber/Synthesizer interface and register it in the factory (see internal/stt/stt.go and internal/tts/tts.go).
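
As a rough illustration, a new backend could look like the sketch below. The Transcriber interface shape and the factory signature shown here are assumptions; check internal/stt/stt.go for the real definitions.

package stt

import "fmt"

// Assumed interface shape; see internal/stt/stt.go for the real definition.
type Transcriber interface {
    Transcribe(samples []float32, sampleRate int) (string, error)
    Close() error
}

// myBackend is a stand-in for a new engine (e.g., a different ONNX model).
type myBackend struct{ modelPath string }

func (b *myBackend) Transcribe(samples []float32, sampleRate int) (string, error) {
    // Run inference here and return the recognized text.
    return "", fmt.Errorf("not implemented")
}

func (b *myBackend) Close() error { return nil }

// newTranscriber mimics the factory switch keyed on --stt-backend; the real
// factory already handles the built-in "whisper" value.
func newTranscriber(backend, modelPath string) (Transcriber, error) {
    switch backend {
    case "mybackend": // hypothetical value passed via --stt-backend
        return &myBackend{modelPath: modelPath}, nil
    default:
        return nil, fmt.Errorf("unknown STT backend %q", backend)
    }
}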

Selecting STT Model

Use the --stt-model flag to choose the STT model:

# Use Whisper tiny model (default, recommended for most devices)
./voice-assistant --stt-model tiny

# Use Whisper base model (better accuracy, more memory)
./voice-assistant --stt-model base

# Use Whisper small model (best accuracy, requires more memory)
./voice-assistant --stt-model small

Model Comparison:

Model | Memory | Accuracy (WER) | Speed | Use Case
tiny | ~390MB | ~5.0% | 32x RT | Jetson, Raspberry Pi, low-memory devices
base | ~740MB | ~3.4% | 16x RT | Balanced for most systems
small | ~2.4GB | ~2.2% | 6x RT | Desktop systems, best quality

For Jetson Orin Nano (8GB unified memory), tiny is critical to avoid OOM errors. See JETSON_OPTIMIZATION.md for details.

Agentic Capabilities

The voice assistant includes agentic tool calling powered by Ollama's function calling support. The LLM can proactively use tools to answer questions about current information it doesn't know.

Available Tools

🌤️ Weather Tool

  • Get current weather for any location worldwide
  • Supports city-based queries: "What's the weather in Tokyo?"
  • Automatic IP-based geolocation: "What's the weather here?"
  • Uses Open-Meteo API (no API key required)
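
For illustration, a weather tool along these lines can be a single HTTP call to Open-Meteo. This is a minimal sketch, not the project's actual implementation; it assumes the classic current_weather=true query parameter and fixed coordinates (a real tool would first geocode the city or use IP geolocation).

package tools

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// CurrentWeather fetches current conditions from Open-Meteo (no API key needed).
func CurrentWeather(lat, lon float64) (string, error) {
    url := fmt.Sprintf(
        "https://api.open-meteo.com/v1/forecast?latitude=%.4f&longitude=%.4f&current_weather=true",
        lat, lon)
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var body struct {
        CurrentWeather struct {
            Temperature float64 `json:"temperature"`
            Windspeed   float64 `json:"windspeed"`
        } `json:"current_weather"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return "", err
    }
    return fmt.Sprintf("Temperature is %.0f°C with wind at %.0f km/h.",
        body.CurrentWeather.Temperature, body.CurrentWeather.Windspeed), nil
}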

🔍 Web Search Tool

  • Search the web for current information, news, facts, and events
  • Two backends:
    • SearXNG (recommended): Privacy-respecting metasearch engine
    • DuckDuckGo (fallback): Automatic fallback when SearXNG unavailable
  • Returns top 3 results formatted for voice output

Required LLM Model

Tool calling requires models that support function calling. The default model has been changed to qwen2.5:1.5b which supports:

  • ✅ Multi-lingual conversations (15+ languages)
  • ✅ Tool/function calling
  • ✅ Fast and memory-efficient (~1GB)

⚠️ Tool Calling Accuracy Warning
Tool calling accuracy depends heavily on model size. While smaller models (1.5b, 3b) support function calling, they have reduced accuracy in determining when and how to use tools. The 7B models provide significantly better tool usage decisions. For memory-constrained devices like Jetson Orin Nano, this is a known trade-off between memory usage and tool calling reliability.

# Pull the default model (one time)
ollama pull qwen2.5:1.5b

# Or use larger models for better quality
ollama pull qwen2.5:3b   # ~2GB, better quality
ollama pull qwen2.5:7b   # ~4.9GB, best quality, excellent tool calling

Other compatible models:

  • qwen2.5:3b - Better quality (~2GB)
  • qwen2.5:7b - Best tool-calling quality (~4.9GB)
  • mistral:7b - Alternative with tool support

Usage Examples

Weather queries:

User: "What's the weather in Paris?"
Assistant: [Uses weather tool] "The weather for Paris, Île-de-France, FR: 
           Temperature is 12°C, feels like 10°C. Humidity is 75 percent."

User: "What's the weather here?" 
Assistant: [Uses weather tool with IP geolocation] "The weather for Chapel Hill..."

Web search queries:

User: "Who won the Super Bowl this year?"
Assistant: [Uses search tool] "The Kansas City Chiefs defeated..."

User: "What's the latest news about AI?"
Assistant: [Uses search tool] "Recent developments include..."

General conversation:

User: "Tell me a joke"
Assistant: [No tools needed] "Why did the scarecrow win an award?..."

Optional: SearXNG Setup

For privacy-focused web search, you can run your own SearXNG instance locally:

1. Configuration files

The repository includes pre-configured files in searxng/:

  • settings.yml - Optimized for minimal memory usage with Bing search
  • docker-compose.yml - Resource limits for edge devices (Jetson, etc.)

2. Start SearXNG with Docker Compose:

cd searxng

# If starting for the first time or after a stop:
docker compose up -d

# If container already exists (to restart):
docker compose restart

# Check status:
docker compose ps

cd ..

3. Verify SearXNG is working:

curl "http://localhost:8080/search?q=test&format=json"

4. Run voice assistant with SearXNG:

./voice-assistant -searxng-url http://localhost:8080

5. Managing SearXNG:

cd searxng

# Stop (keeps container, quick restart):
docker compose stop

# Start stopped container:
docker compose start

# Restart running container:
docker compose restart

# Stop and remove container:
docker compose down

# View logs:
docker compose logs -f

Notes:

  • SearXNG is optional - the assistant falls back to DuckDuckGo if not configured
  • Configuration optimized for speed and minimal resource usage (~384MB RAM, 1 CPU core)
  • Supports multilingual queries (matches Whisper's 99-language support)
  • Currently configured with Bing search engine for best API reliability
  • For Jetson Orin Nano optimization, see JETSON_OPTIMIZATION.md

How Tool Calling Works

  1. User asks a question requiring external information
  2. LLM decides which tool(s) to call (or none)
  3. Tools execute and return results
  4. LLM synthesizes a natural response from tool results
  5. TTS speaks the final answer

The system uses an agentic loop: LLM → Tool Calls → Tool Results → LLM → Final Answer. This happens automatically with no user intervention.
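
A condensed sketch of that loop against Ollama's /api/chat endpoint (with streaming disabled) is shown below. The request and response field names follow Ollama's documented tool-calling schema, and the executeTool dispatcher is a hypothetical stand-in for the weather and search tools.

package llm

import (
    "bytes"
    "encoding/json"
    "net/http"
)

type message struct {
    Role      string     `json:"role"`
    Content   string     `json:"content"`
    ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

type toolCall struct {
    Function struct {
        Name      string         `json:"name"`
        Arguments map[string]any `json:"arguments"`
    } `json:"function"`
}

// chatWithTools runs the agentic loop: ask → execute tool calls → ask again,
// until the model returns a final answer with no tool calls.
func chatWithTools(host, model string, msgs []message, tools []any,
    executeTool func(name string, args map[string]any) string) (string, error) {

    for {
        payload, _ := json.Marshal(map[string]any{
            "model": model, "messages": msgs, "tools": tools, "stream": false,
        })
        resp, err := http.Post(host+"/api/chat", "application/json", bytes.NewReader(payload))
        if err != nil {
            return "", err
        }
        var out struct {
            Message message `json:"message"`
        }
        err = json.NewDecoder(resp.Body).Decode(&out)
        resp.Body.Close()
        if err != nil {
            return "", err
        }

        msgs = append(msgs, out.Message)
        if len(out.Message.ToolCalls) == 0 {
            return out.Message.Content, nil // final answer, handed to TTS
        }
        for _, tc := range out.Message.ToolCalls {
            result := executeTool(tc.Function.Name, tc.Function.Arguments)
            msgs = append(msgs, message{Role: "tool", Content: result})
        }
    }
}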

Multi-Language Support

Both Whisper (STT) and Kokoro (TTS) support multiple languages. The assistant can understand and respond in Spanish, French, Italian, Portuguese, Japanese, Chinese, and more.

How It Works

  1. Speech Recognition (STT): Set your language with -stt-language (e.g., es for Spanish)

  2. Text-to-Speech (TTS): Voice language is automatically detected from the voice name prefix (a short sketch of this mapping follows this list):

    • ef_*/em_* → Spanish (es)
    • ff_* → French (fr)
    • hf_*/hm_* → Hindi (hi)
    • if_*/im_* → Italian (it)
    • jf_*/jm_* → Japanese (ja)
    • pf_*/pm_* → Portuguese BR (pt-br)
    • af_*/am_* → American English
    • bf_*/bm_* → British English
    • zf_*/zm_* → Chinese (Mandarin)
  3. LLM: Use a multilingual model like qwen2.5:1.5b or larger for proper language matching
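
A minimal sketch of that prefix-to-language lookup, with assumed language codes for the two English variants:

package tts

import "strings"

// voiceLanguage derives the language code from a Kokoro voice name prefix,
// e.g. "ef_dora" → "es", "bf_emma" → "en-gb". Unknown prefixes fall back to "en-us".
func voiceLanguage(voice string) string {
    prefixes := map[string]string{
        "ef": "es", "em": "es", // Spanish
        "ff": "fr", // French
        "hf": "hi", "hm": "hi", // Hindi
        "if": "it", "im": "it", // Italian
        "jf": "ja", "jm": "ja", // Japanese
        "pf": "pt-br", "pm": "pt-br", // Brazilian Portuguese
        "af": "en-us", "am": "en-us", // American English
        "bf": "en-gb", "bm": "en-gb", // British English
        "zf": "zh", "zm": "zh", // Mandarin Chinese
    }
    if i := strings.Index(voice, "_"); i > 0 {
        if lang, ok := prefixes[voice[:i]]; ok {
            return lang
        }
    }
    return "en-us"
}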

Complete Spanish Example

To use the assistant entirely in Spanish:

# 1. Pull a multilingual LLM model (one time)
ollama pull qwen2.5:1.5b

# 2. Run with Spanish speech recognition + Spanish TTS voice
./voice-assistant \
  -ollama-model qwen2.5:1.5b \
  -stt-language es \
  -tts-voice ef_dora \
  -tts-speaker-id 28

What happens:

  • You speak in Spanish → Whisper transcribes it
  • Qwen2.5 responds in Spanish (it automatically detects the input language)
  • Kokoro synthesizes the response with the Spanish female voice (ef_dora)

Available Languages & Voices

Whisper supports 99 languages. Here are the most common with their Kokoro TTS voices:

Language | STT Code | TTS Voice Options | Speaker IDs
Spanish | es | ef_dora (female), em_alex (male) | 28, 29
French | fr | ff_siwis (female) | 33
Italian | it | if_*, im_* voices | varies
Portuguese | pt | pf_*, pm_* voices | varies
Japanese | ja | jf_*, jm_* voices | varies
Chinese | zh | zf_*, zm_* voices | varies
Hindi | hi | hf_*, hm_* voices | varies
English (US) | en | af_bella, am_michael, etc. | 2, 16, ...
English (UK) | en | bf_emma, bm_george, etc. | 21, 26, ...

For all 53 available voices: ./voice-assistant --list-voices

Multilingual LLM Models

The default qwen2.5:1.5b model provides excellent multilingual support. For even better quality:

Model | Size | Languages | Best For
qwen2.5:1.5b | ~1GB | Good for 15+ languages | Default, fast
qwen2.5:3b | ~2GB | Excellent for 15+ languages | Better quality
aya-expanse:8b | ~4.9GB | Purpose-built for 23+ languages | Best quality
gemma2:2b | ~1.6GB | Better than gemma3:1b | Alternative

More Examples

# French (automatic language in response)
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language fr \
  -tts-voice ff_siwis \
  -tts-speaker-id 33

# Auto-detect input language (English, Spanish, French, etc.)
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language auto \
  -tts-voice af_bella \
  -tts-speaker-id 2

# Japanese
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language ja \
  -tts-voice jf_* \
  -tts-speaker-id <id>

Note: Qwen models automatically respond in the same language as your input without needing to modify the system prompt.

Examples

Basic usage (always listening):

# macOS or Linux CPU
./voice-assistant

# Linux with CUDA
./run-voice-assistant.sh

With wake word:

./run-voice-assistant.sh -wake-word "hey assistant"

Custom Ollama model:

./voice-assistant -ollama-model "mistral:7b"

Faster speech:

./voice-assistant -tts-speed 1.2

Verbose mode for debugging:

./voice-assistant -verbose

Force CPU-only inference (disable GPU):

./voice-assistant -provider cpu

Force CUDA on Linux (if auto-detect fails):

./run-voice-assistant.sh -provider cuda

Live Translation Use Case

The voice assistant can be configured as a real-time translator without changing a single line of code. By combining multilingual STT, strategic system prompts, and cross-language TTS, you can create a live translation device.

How It Works

  1. Input Language (STT): Whisper transcribes speech in the source language
  2. Translation (LLM): System prompt instructs the model to translate to target language
  3. Output Language (TTS): Kokoro synthesizes the translation in the target language
  4. Temperature Control: Lower temperature (0.1-0.3) ensures deterministic, accurate translations

Spanish → English Translation

# 1. Pull a multilingual LLM (one time)
ollama pull qwen2.5:3b

# 2. Run the translator
./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language es \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a Spanish-to-English translator. Translate the following Spanish text to natural English. Output only the English translation without any Spanish words or explanations. NEVER use markdown, asterisks, underscores, backticks, brackets, code blocks, bullet points, or special characters."

What happens:

  • You speak in Spanish: "Hola, ¿cómo estás?"
  • Whisper transcribes: "Hola, ¿cómo estás?"
  • Qwen translates: "Hello, how are you?"
  • Kokoro speaks in English: "Hello, how are you?"

English → Spanish Translation

./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language en \
  --tts-voice ef_dora \
  --tts-speaker-id 28 \
  --temperature 0.2 \
  --system-prompt "You are an English-to-Spanish translator. Translate the following English text to natural Spanish. Output only the Spanish translation without any English words or explanations. NEVER use markdown, asterisks, underscores, backticks, brackets, code blocks, bullet points, or special characters."

Other Language Combinations

French → English:

./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language fr \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a French-to-English translator. Translate the following French text to natural English. Output only the English translation. NEVER use markdown or special formatting."

Japanese → English:

./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language ja \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a Japanese-to-English translator. Translate the following Japanese text to natural English. Output only the English translation. NEVER use markdown or special formatting."

Key Configuration Parameters

Parameter | Purpose | Translation Value
--stt-language | Source language for transcription | es, fr, ja, etc.
--tts-voice | Target language voice | af_bella (English), ef_dora (Spanish), etc.
--temperature | Translation consistency | 0.1-0.3 (lower = more deterministic)
--system-prompt | Translation instructions | Must explicitly state "translate only"
--ollama-model | Multilingual model | qwen2.5:3b or aya-expanse:8b

Why Lower Temperature Matters

  • Temperature 0.7 (default): Model may mix languages or add conversational elements
    • Example: "Hola, ¿y tú? How are you?" (mixed Spanish/English)
  • Temperature 0.2: Model provides deterministic, accurate translations
    • Example: "Hello, how are you?" (pure English)

Lower temperature reduces creativity and increases consistency, which is ideal for translation tasks.

Recommended Models for Translation

Model | Size | Best For | Translation Quality
qwen2.5:3b | ~2GB | General translation | Excellent
aya-expanse:8b | ~4.9GB | Best quality | Superior (purpose-built for multilingual)
qwen2.5:1.5b | ~1GB | Resource-constrained devices | Good

Interrupt Mode: Handling Acoustic Feedback

The assistant supports two modes for managing playback interruption when speech is detected:

Understanding the Problem

When using headsets (headphones + microphone), the system works perfectly: the microphone only captures your voice, so interrupting playback when you speak is straightforward.

However, with open mic/speaker setups (external speakers + separate microphone), the assistant's own voice output can be captured by the microphone, causing unwanted self-interruption. This is known as acoustic feedback or acoustic echo.

Available Modes

Use the --interrupt-mode flag to select the appropriate behavior for your audio setup:

always Mode (Best for Headsets)

./voice-assistant -interrupt-mode always

  • Use when: Using headphones or headset
  • Behavior: Immediately interrupts playback when speech is detected
  • Advantage: Natural conversation flow, can interrupt the assistant mid-sentence
  • Limitation: Will self-interrupt with open speakers (assistant's voice triggers VAD)

wait Mode (Best for Open Speakers) - Default

./voice-assistant -interrupt-mode wait

  • Use when: Using external speakers with separate microphone
  • Behavior: Pauses microphone capture during playback, resumes after with configurable delay
  • Advantage: Prevents acoustic feedback and self-interruption
  • Limitation: Cannot interrupt assistant mid-sentence, must wait for response to complete
  • Delay: Use -post-playback-delay-ms 300 to adjust resume delay (default 300ms)
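
The sketch below illustrates the difference between the two modes with a hypothetical controller; the real logic lives in the audio and TTS processors and may be structured differently.

package audio

import (
    "sync/atomic"
    "time"
)

// Controller illustrates the two interrupt modes. speaking is set while TTS
// playback is active; the capture loop consults CaptureEnabled before
// forwarding microphone audio to the VAD.
type Controller struct {
    mode     string // "always" or "wait"
    speaking atomic.Bool
    stopPlay func() // hypothetical hook that aborts current playback
    delay    time.Duration // e.g. --post-playback-delay-ms
}

// PlaybackStarted marks the assistant as speaking.
func (c *Controller) PlaybackStarted() { c.speaking.Store(true) }

// OnSpeechDetected is called by the VAD when user speech starts.
func (c *Controller) OnSpeechDetected() {
    if c.mode == "always" && c.speaking.Load() {
        c.stopPlay() // interrupt the assistant mid-sentence
    }
    // In "wait" mode nothing happens: capture is paused during playback anyway.
}

// CaptureEnabled tells the capture loop whether to forward microphone audio.
func (c *Controller) CaptureEnabled() bool {
    return c.mode == "always" || !c.speaking.Load()
}

// PlaybackFinished re-enables capture after the configured post-playback delay.
func (c *Controller) PlaybackFinished() {
    time.Sleep(c.delay)
    c.speaking.Store(false)
}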

Example Usage

# For headset users (natural interruption)
./voice-assistant -interrupt-mode always

# For open mic/speaker setup (prevent feedback)
./voice-assistant -interrupt-mode wait -post-playback-delay-ms 500

# Optimize audio buffer for wired/built-in audio (lower latency)
./voice-assistant -audio-buffer-ms 20

# Default buffer works best for Bluetooth devices (100ms)
./voice-assistant  # Uses 100ms buffer by default

Audio Buffer Configuration

The audio buffer size affects latency and compatibility with different audio devices:

Buffer Size | Best For | Latency | Notes
100ms (default) | Bluetooth devices | Higher | Prevents distortion with AirPods, etc.
20ms | Wired/USB/Built-in | Low | More responsive, real-time feel
50ms | Mixed use | Medium | Balance between latency and stability

Usage:

# For Bluetooth devices (default, recommended for AirPods)
./voice-assistant

# For wired or built-in audio
./voice-assistant -audio-buffer-ms 20

Why this matters: Bluetooth audio has inherent latency (100-200ms), so using a small buffer (20ms) can cause audio underruns and distortion. The 100ms default matches Bluetooth's characteristics.
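
As a rough worked example (assuming 16 kHz mono capture, which is typical for Whisper/VAD pipelines), buffer duration maps to sample frames like this:

package audio

// framesPerBuffer converts a buffer duration in milliseconds to sample frames.
// At 16 kHz mono: 100 ms -> 1600 frames, 20 ms -> 320 frames.
func framesPerBuffer(sampleRate, bufferMs int) int {
    return sampleRate * bufferMs / 1000
}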

Technical Background

Why is this a problem?

  • Voice activity detection (VAD) analyzes audio energy and spectral features
  • The assistant's synthesized voice has similar characteristics to human speech
  • Without isolation, VAD cannot distinguish between user speech and playback

Why not use echo cancellation?

  • Acoustic Echo Cancellation (AEC) requires significant computational resources
  • Cross-platform AEC libraries have varying quality and platform-specific implementations
  • On Linux, system-level solutions (PipeWire, PulseAudio) can provide AEC with proper configuration
  • On macOS, Core Audio's VoiceProcessingIO provides AEC but requires platform-specific integration

The wait mode provides a simple, reliable solution that works consistently across all platforms without additional computational overhead.

Adding TTS Voices

Kokoro TTS includes multiple voices in a single model. You can change voices using the -tts-voice and -tts-speaker-id flags:

Available Kokoro Voices (English)

American Voices:

Name | Speaker ID | Quality | Description
af_heart | 3 | A | American female, flagship voice
af_bella | 2 | A- | American female, high quality (default)
af_nicole | 6 | B- | American female, good quality
af_sarah | 9 | C+ | American female, warm
af_sky | 10 | C- | American female, youthful
am_adam | 11 | F+ | American male, basic quality
am_michael | 16 | C+ | American male, medium quality

British Voices:

Name | Speaker ID | Quality | Description
bf_emma | 21 | B- | British female, recommended
bf_isabella | 22 | C | British female, medium quality
bm_george | 26 | C | British male, medium quality
bm_lewis | 27 | D+ | British male, basic quality

For a complete list of all 53 voices (including Spanish, French, Japanese, and more), run:

./voice-assistant --list-voices

Changing Voices

# Use British male voice (George)
./voice-assistant -tts-voice bm_george -tts-speaker-id 26

# Use American female voice (Nicole)
./voice-assistant -tts-voice af_nicole -tts-speaker-id 6

Viewing Available Voices

To see all 53 available Kokoro voices with their speaker IDs, quality grades, and descriptions:

# List all voices
./voice-assistant --list-voices

# Get details for a specific voice
./voice-assistant --voice-info af_bella

Project Structure

voice-assistant/
├── cmd/
│   └── assistant/
│       └── main.go           # Main entry point, pipeline orchestration
├── internal/
│   ├── audio/
│   │   ├── capture.go        # Microphone audio capture (malgo)
│   │   └── playback.go       # Audio playback with interrupt support
│   ├── config/
│   │   └── config.go         # CLI flags and configuration
│   ├── llm/
│   │   └── client.go         # Ollama API client
│   ├── setup/
│   │   ├── download.go       # HTTP download and tar.bz2 extraction helpers
│   │   └── setup.go          # --setup orchestration (model download & verification)
│   ├── sherpa/
│   │   ├── sherpa_darwin.go  # macOS-specific sherpa-onnx bindings (CoreML)
│   │   └── sherpa_linux.go   # Linux-specific sherpa-onnx bindings (CUDA)
│   ├── stt/
│   │   ├── stt.go            # VoiceDetector, Transcriber interfaces + factory
│   │   ├── silero.go         # Silero VAD implementation
│   │   ├── whisper.go        # Whisper transcription implementation
│   │   └── processor.go      # STT processing goroutine
│   └── tts/
│       ├── tts.go            # Synthesizer interface + factory
│       ├── kokoro.go         # Kokoro TTS implementation
│       ├── text.go           # Sentence splitting utilities
│       └── processor.go      # TTS playback pipeline goroutine
├── scripts/
│   └── build.sh              # Build script with CUDA support
├── go.mod
└── README.md

Models

Component | Model | Size | Purpose
VAD | Silero-VAD | ~2MB | Speech boundary detection
STT | Whisper tiny (int8, default) | ~111MB | Speech recognition
TTS | Kokoro v1.0 | ~311MB | Expressive voice synthesis

Alternative Models

You can select a different STT model with --stt-model:

STT alternatives:

  • tiny - Fastest, ~5% WER (default)
  • base - Balance of speed/accuracy
  • small - Higher accuracy, slower

TTS voices (Kokoro built-in):

  • af_bella (speaker ID 2) - American female, high quality (default)
  • af_heart (speaker ID 3) - American female, flagship voice
  • bf_emma (speaker ID 21) - British female, recommended
  • am_adam (speaker ID 11) - American male

For all 53 voices across 9 languages, run: ./voice-assistant --list-voices

Latency Considerations

This implementation uses OfflineRecognizer (batch processing) rather than OnlineRecognizer (streaming) because:

  1. VAD pre-segments audio: The Silero-VAD detects complete utterances before transcription
  2. Whisper accuracy: Whisper performs best on complete audio segments
  3. Practical latency: The VAD adds minimal delay (~250ms silence detection), and Whisper processes quickly on modern hardware
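
In code, the batch pattern looks roughly like the sketch below; the hooks are assumed shapes rather than the project's exact interfaces (see internal/stt for the real ones).

package stt

// processUtterances shows the batch (offline) pattern: buffer audio until the
// VAD reports end-of-speech, then transcribe the whole segment at once.
// detectEnd and transcribe are assumed hooks, not the project's exact API.
func processUtterances(
    chunks <-chan []float32, // e.g. 20-100 ms capture buffers
    detectEnd func(chunk []float32) bool, // Silero-VAD: true once silence follows speech
    transcribe func(segment []float32) (string, error), // Whisper OfflineRecognizer
    emit func(text string),
) {
    var segment []float32
    for chunk := range chunks {
        segment = append(segment, chunk...)
        if !detectEnd(chunk) {
            continue // still inside (or before) an utterance
        }
        if text, err := transcribe(segment); err == nil && text != "" {
            emit(text) // hand the full transcript to the LLM stage
        }
        segment = segment[:0] // reset for the next utterance
    }
}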

For even lower latency, you could:

  • Use a streaming model (Zipformer, Paraformer) with OnlineRecognizer
  • Reduce the VAD silence threshold
  • Use the English-only tiny.en Whisper model if you only need English

Hardware Acceleration Details

CoreML (macOS)

CoreML is Apple's machine learning framework that automatically leverages:

  • Apple Neural Engine (ANE) on M1/M2/M3/M4 chips for maximum efficiency
  • GPU acceleration on Intel Macs with discrete graphics
  • CPU fallback when specialized hardware is unavailable

No additional installation required - CoreML is built into macOS.

CUDA (Linux)

NVIDIA CUDA enables GPU-accelerated inference on Linux, supporting both discrete GPUs and Jetson SOC devices.

Supported Hardware:

  • Discrete NVIDIA GPUs (GTX 10xx series or newer)
  • NVIDIA Jetson devices (Nano, Orin, Xavier, AGX)

Requirements for Discrete GPUs:

  • NVIDIA Driver 450.80.02 or later
  • CUDA Toolkit 11.0 or later

Installation (Ubuntu/Debian):

# Install NVIDIA driver (if not already installed)
sudo apt install nvidia-driver-535

# Install CUDA toolkit
sudo apt install nvidia-cuda-toolkit

# Verify installation
nvidia-smi
nvcc --version

Jetson Devices: Jetson devices (Nano, Orin, Xavier) come with JetPack SDK which includes CUDA support out of the box. The auto-detection will recognize Jetson devices via:

  • /dev/nvhost-gpu device
  • /etc/nv_tegra_release file
  • Tegra identifiers in device tree

Important: The build script automatically selects the correct ONNX Runtime version based on your CUDA version. Different Jetson boards require different versions:

Jetson Device | JetPack | CUDA Version | ONNX Runtime
Jetson Nano B01 | 4.x | CUDA 10.2 | 1.11.0
Jetson Orin NX | 5.x | CUDA 11.4 | 1.16.0
Jetson Orin (JetPack 6.x) | 6.x | CUDA 12.2 | 1.18.0
Jetson Orin (JetPack 6.2+) | 6.2+ | CUDA 12.6 | 1.18.1

The build script detects your CUDA version automatically and downloads the matching ONNX Runtime.

Running on Jetson:

# Build with CUDA support
./scripts/build.sh --cuda

# Run using the wrapper script (recommended - sets up library paths)
./run-voice-assistant.sh

# Or run directly if paths are configured
./voice-assistant

If you see errors like libcublas.so.X: cannot open shared object file, it means there's a CUDA version mismatch. The wrapper script sets up LD_LIBRARY_PATH to help resolve this.

Optional: Install cuDNN for optimal performance:

# Download cuDNN from NVIDIA (requires account)
# https://developer.nvidia.com/cudnn
sudo dpkg -i cudnn-local-repo-*.deb
sudo apt update
sudo apt install libcudnn8

The voice assistant will automatically detect CUDA availability and use GPU acceleration.

VS Code Development Setup

When developing this cross-platform project in VS Code, gopls (the Go language server) may show errors for platform-specific code that doesn't apply to your current OS. For example, on macOS you might see [linux,amd64] errors for Linux-specific imports.

This project uses Go build constraints (//go:build darwin / //go:build linux) to provide platform-specific sherpa-onnx bindings. By default, gopls may analyze files for all platforms, causing spurious errors for code that won't run on your current OS. Setting GOOS and GOARCH for gopls (for example via the Go extension's go.toolsEnvVars setting in your VS Code workspace settings) tells it to analyze only your platform's files.

Upgrading Dependencies

Version Compatibility Overview

This project uses sherpa-onnx for speech processing. Version management differs by platform:

Platform | How It Works | Version Check
macOS | Uses pre-built sherpa-onnx-go-macos bindings | Automatic (handled by bindings)
Linux (CPU) | Uses pre-built sherpa-onnx-go-linux bindings | Automatic (handled by bindings)
Linux (CUDA) | Compiles sherpa-onnx from source | Manual sync required

CUDA Build Version Requirements

For Linux CUDA builds, the following files must stay in sync:

File | What to Update | Current Value
go.mod | sherpa-onnx-go-linux and sherpa-onnx-go-macos versions | v1.12.x
scripts/build.sh | SHERPA_VERSION variable | v1.12.x

The build script includes a sanity check that fails with clear instructions if versions mismatch.

ONNX Runtime Compatibility Matrix

The build script automatically selects the correct ONNX Runtime version based on your CUDA version:

CUDA Version | ONNX Runtime | Use Case
10.2.x | 1.11.0 | Jetson Nano (JetPack 4.x)
11.4.x | 1.16.0 | Jetson Orin NX (JetPack 5.x)
11.x | 1.16.0 | Generic CUDA 11
12.2.x | 1.18.0 | CUDA 12.2 with cuDNN8
12.6.x+ | 1.18.1 | JetPack 6.2+ (cuDNN9)
12.x | 1.18.1 | Generic CUDA 12

Upgrade Procedure

  1. Check for new releases:

    • Review the sherpa-onnx releases (https://github.com/k2-fsa/sherpa-onnx/releases) and the matching sherpa-onnx-go-linux / sherpa-onnx-go-macos tags
  2. Update go.mod:

    go get github.com/k2-fsa/sherpa-onnx-go-linux@vX.Y.Z
    go get github.com/k2-fsa/sherpa-onnx-go-macos@vX.Y.Z
    go mod tidy
  3. Update build script:

    • Edit scripts/build.sh
    • Update SHERPA_VERSION="vX.Y.Z" to match
  4. Test on macOS (easy path):

    ./scripts/build.sh
    ./voice-assistant
  5. Test on Linux with CUDA:

    ./scripts/build.sh --clean --cuda
    ./run-voice-assistant.sh
    • Watch for ABI mismatch errors or "Please compile with -DSHERPA_ONNX_ENABLE_GPU=ON" warnings
    • Verify provider shows cuda not cpu
  6. If CUDA build fails:

    • Check if the ONNX Runtime version mapping needs updating
    • Review sherpa-onnx release notes for breaking changes
    • The ONNX Runtime mapping in scripts/build.sh may need new entries for newer CUDA versions

Portable Deployment

For CUDA builds on Linux, the runtime libraries are installed to ~/.voice-assistant/go/lib/. This enables portable deployment:

On the build machine:

./scripts/build.sh --cuda

To deploy to another machine:

  1. Copy the ~/.voice-assistant/go directory (contains CUDA libraries)
  2. Copy the ~/.voice-assistant/models directory (shared model files)
  3. Copy the voice-assistant binary
  4. Copy the run-voice-assistant.sh wrapper script

On the target machine:

# Ensure CUDA toolkit is installed, then run:
./run-voice-assistant.sh

The wrapper script automatically sets LD_LIBRARY_PATH to find libraries in ~/.voice-assistant/go/lib/.

macOS Note: On macOS with CoreML, the Go binary is statically linked and doesn't require runtime libraries. Just copy the binary and ~/.voice-assistant/models/ directory.

Troubleshooting

"Failed to create VAD" or "Failed to create offline recognizer"

  • Run initial setup to download required models: ./voice-assistant --setup (use --force to re-download if needed)
  • Ensure the model directory exists and is writable (default: ~/.voice-assistant/models/, or as set via --model-dir)
  • Check that model paths and any --model-dir override match your configuration
  • Verify sherpa-onnx is properly installed

"Cannot reach Ollama"

  • Start Ollama: ollama serve
  • Load a model: ollama run qwen2.5:1.5b
  • Check the host URL matches: -ollama-host http://localhost:11434

No audio capture

  • Check microphone permissions (macOS: System Preferences → Privacy → Microphone)
  • Verify microphone is connected and working
  • Try running with -verbose to see audio processing logs

Build errors with CGO

  • Ensure CGO is enabled: export CGO_ENABLED=1
  • Install required system libraries for your platform

CUDA errors on Linux

  • Verify NVIDIA driver: nvidia-smi
  • Check CUDA version: nvcc --version
  • Try forcing CPU mode: ./voice-assistant -provider cpu
  • Ensure CUDA libraries are in LD_LIBRARY_PATH

CoreML errors on macOS

  • Ensure macOS 10.13 or later
  • Try forcing CPU mode: ./voice-assistant -provider cpu

Acknowledgments

This project builds upon the excellent work of many open-source libraries and models:

Core Libraries & Models

  • sherpa-onnx - Speech recognition and synthesis framework (Apache-2.0)
  • Silero VAD - Voice activity detection model (MIT)
  • OpenAI Whisper - Multilingual speech recognition model (MIT)
  • Kokoro - Expressive neural text-to-speech model (MIT/Apache-2.0)
  • Ollama - Local LLM inference engine (MIT)

Go Implementation Dependencies

Apache-2.0:

Unlicense:

BSD-3-Clause:

  • Go standard library

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

All dependencies use permissive licenses (MIT, Apache-2.0, BSD-3-Clause, Unlicense) that are compatible with the Apache-2.0 License.
