A real-time voice assistant that runs entirely locally, implemented in Go using sherpa-onnx for speech recognition and synthesis.
This is my first foray into the world of AI-powered voice assistants. As a fan of hyper-efficient code and with edge devices in mind, I’m avoiding Python and instead building the assistant in Go for learning purposes and to gain experience developing applications with CoreML and CUDA support.
For the LLM processing, I'm relying on Ollama, as it works perfectly for this use case, even though I'm just scratching the surface of what I can do with it.
- Voice Activity Detection (VAD): Silero-VAD for accurate speech boundary detection
- Speech-to-Text (STT): Pluggable backend (`--stt-backend`); ships with Whisper multilingual model for high-quality transcription (99 languages)
- Text-to-Speech (TTS): Pluggable backend (`--tts-backend`); ships with Kokoro for natural-sounding voice synthesis with emotional expression
- LLM Integration: Ollama API for conversational AI with agentic tool calling
- Agentic Tools: Weather information and web search capabilities
- Low Latency: Optimized for real-time conversation
- Interrupt Support: Optional stop playback when user starts speaking
- Wake Word: Optional wake word activation
- Hardware Acceleration: Auto-detected CoreML (macOS) and CUDA (Linux)
- Multilingual: Both STT and TTS support multiple languages (English, Spanish, French, German, etc.)
- Live Translation: Zero-code configuration for real-time language translation
- Configurable Temperature: Adjustable LLM temperature for translation vs. conversational tasks
- Shared Assets: Models stored in `~/.voice-assistant`
This implementation supports multiple platforms with hardware acceleration:
| Platform | STT Provider | TTS Provider | Notes |
|---|---|---|---|
| macOS (Intel) | coreml | coreml | Full CoreML acceleration |
| macOS (Apple Silicon) | coreml | coreml | ANE for STT and TTS |
| Linux (NVIDIA GPU) | cuda | cuda | Full GPU acceleration |
| Linux (Jetson SOC) | cuda | cuda | Jetson GPU (Nano, Orin, Xavier) |
| Linux (CPU only) | cpu | cpu | CPU multi-threading |
Providers are auto-detected at runtime based on your platform. Kokoro TTS supports full CoreML acceleration on macOS and CUDA on Linux.
You can override providers with --provider (global), --stt-provider, and --tts-provider flags.
This solution has been designed and tested on the following platforms:
| Device | CPU | Memory | Audio Device | Notes |
|---|---|---|---|---|
| Apple Mac Mini M4 | Apple M4 (10-core) | 16GB unified | AirPods Pro | Full CoreML acceleration (ANE) |
| NVIDIA Jetson Orin Nano Super | ARM Cortex-A78AE | 8GB unified | AirPods Pro | Full CUDA acceleration |
⚡ Running on Jetson Orin Nano? See JETSON_OPTIMIZATION.md for memory optimization strategies to run larger models on 8GB devices.
- Memory: 8GB minimum (unified memory recommended)
- Storage: ~2GB for models
- Audio: Bluetooth audio devices (tested with AirPods Pro) or USB/built-in microphone and speakers
- GPU/Accelerator: Apple Silicon (M1/M2/M3/M4) with ANE, or NVIDIA GPU with CUDA support
```mermaid
flowchart LR
    subgraph Pipeline
        A[🎤 Audio Capture<br/>malgo] --> B[🗣️ VAD + STT<br/>--stt-backend]
        B --> C[🧠 LLM<br/>Ollama]
        C --> D[📢 TTS<br/>--tts-backend]
        D --> E[🔊 Playback<br/>malgo]
        C -.->|Tool Calls| F[🔧 Tools<br/>Weather & Search]
        F -.->|Results| C
    end
    E -.->|Interrupt Flag<br/>speech detected| A
```
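In Go terms, the diagram corresponds to a set of goroutines connected by channels, with the playback stage able to signal an interrupt back to capture. The sketch below shows that wiring shape with stand-in stages; the names and the simplified string-based channels are illustrative assumptions, not the actual structure of cmd/assistant/main.go.

```go
package main

import (
	"context"
	"fmt"
)

// Schematic pipeline: each stage is a goroutine that reads from its input
// channel and writes to its output channel, mirroring the diagram above.
func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	utterances := make(chan string) // stand-in for the VAD+STT output (transcripts)
	replies := make(chan string)    // stand-in for the LLM output (text to speak)

	// STT stage (stand-in): emits transcripts as they are recognized.
	go func() {
		defer close(utterances)
		utterances <- "what's the weather in Tokyo"
	}()

	// LLM stage (stand-in): turns transcripts into replies.
	go func() {
		defer close(replies)
		for u := range utterances {
			select {
			case replies <- "You said: " + u:
			case <-ctx.Done():
				return
			}
		}
	}()

	// TTS/playback stage (stand-in): consumes replies.
	for r := range replies {
		fmt.Println("speak:", r)
	}
}
```

The real pipeline passes audio buffers and transcripts through similar channels, one goroutine per stage (see internal/stt/processor.go and internal/tts/processor.go).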
- Go 1.26 or later
- CGO enabled (`CGO_ENABLED=1`)
- Ollama running locally with a model loaded
macOS (Homebrew):
```bash
brew install go
```
macOS / Linux (official installer):
```bash
# Download and install the latest Go release from https://go.dev/dl/
# Example for Linux arm64 (adjust version and arch as needed):
curl -OL https://go.dev/dl/go1.26.1.linux-arm64.tar.gz
sudo tar -C /usr/local -xzf go1.26.1.linux-arm64.tar.gz
export PATH=$PATH:/usr/local/go/bin  # add to ~/.bashrc or ~/.zshrc
```
Verify the installation:
```bash
go version  # should print go1.26.0 or later
```
macOS:
- Xcode Command Line Tools: `xcode-select --install`
- CoreML is automatically available on macOS 10.13+
Linux (CPU):
- ALSA development libraries: `sudo apt install libasound2-dev`
Linux (NVIDIA CUDA):
- ALSA development libraries: `sudo apt install libasound2-dev`
- NVIDIA GPU with CUDA support
- NVIDIA Driver 450.80.02+
- CUDA Toolkit 11.0+: `sudo apt install nvidia-cuda-toolkit`
- cuDNN 8.0+ (optional, for optimal performance)
To verify CUDA is available:
```bash
nvidia-smi     # Should show your GPU
nvcc --version # Should show CUDA version
```
Build the application first (see step 2), then run the built-in setup command to download required models (default: ~900MB total):
```bash
./voice-assistant --setup
```
This downloads:
- Silero-VAD: Voice activity detection model
- Whisper tiny: Multilingual speech recognition model (int8 quantized, 99 languages)
- Kokoro v1.0: Multilingual text-to-speech model with natural voices
Setup Options:
```bash
# Force re-download even if files exist
./voice-assistant --setup --force

# Custom model directory
./voice-assistant --setup --model-dir /custom/path

# Combine with a different Whisper model size
./voice-assistant --setup --stt-model small
```
The setup command is idempotent; it won't re-download existing files unless `--force` is used.
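Under the hood, an idempotent setup step is essentially a stat-then-skip check before each download. The snippet below is a minimal sketch of that idea; it is not the actual code in internal/setup, and the function name and URL are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// downloadIfMissing fetches url into destDir unless the file already exists,
// matching the "skip unless --force" behaviour described above.
func downloadIfMissing(url, destDir string, force bool) error {
	dest := filepath.Join(destDir, filepath.Base(url))
	if _, err := os.Stat(dest); err == nil && !force {
		fmt.Println("exists, skipping:", dest) // already downloaded
		return nil
	}
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("download %s: %s", url, resp.Status)
	}
	out, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body) // stream to disk
	return err
}

func main() {
	// Hypothetical model URL, for illustration only.
	_ = downloadIfMissing("https://example.com/model.tar.bz2", os.TempDir(), false)
}
```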
Choosing STT Model:
| Model | Memory | Download | WER | Speed | Best For |
|---|---|---|---|---|---|
| `tiny` | ~390MB | 111MB | ~5.0% | 32x realtime | Jetson, edge devices |
| `base` | ~740MB | 198MB | ~3.4% | 16x realtime | Balanced accuracy/speed |
| `small` | ~2.4GB | 610MB | ~2.2% | 6x realtime | Desktop, best accuracy |
```bash
./scripts/build.sh
```
Or manually:
```bash
CGO_ENABLED=1 go build -o voice-assistant ./cmd/assistant
```
The build automatically selects the correct platform-specific sherpa-onnx bindings:
- macOS: Uses `sherpa-onnx-go-macos` with CoreML support
- Linux: Uses `sherpa-onnx-go-linux` (CPU-only by default)
The default sherpa-onnx-go-linux package includes CPU-only binaries. For true CUDA/GPU acceleration on NVIDIA hardware (including Jetson devices), you need to build sherpa-onnx from source with CUDA enabled.
The build script handles this automatically:
```bash
# Auto-detect: builds with CUDA if GPU and CUDA toolkit are found
./scripts/build.sh

# Force CUDA build (requires CUDA toolkit)
./scripts/build.sh --cuda

# Force CPU-only build (skip CUDA even if available)
./scripts/build.sh --cpu
```
CUDA Build Requirements:
- NVIDIA GPU (discrete or Jetson SOC)
- CUDA Toolkit (or JetPack for Jetson)
- CMake 3.13+
- Git
- C++ compiler (gcc/g++)
The build script will:
- Clone sherpa-onnx source (once)
- Build with `-DSHERPA_ONNX_ENABLE_GPU=ON`
- Install to `.sherpa-onnx-cuda/` in your project
- Link your build against the CUDA-enabled libraries
First build takes ~10-20 minutes depending on your hardware. Subsequent builds use the cached sherpa-onnx libraries.
Verify CUDA is working:
```bash
./run-voice-assistant.sh
# Should show: ⚡ STT acceleration: cuda, TTS acceleration: cuda
# Should NOT show: "Please compile with -DSHERPA_ONNX_ENABLE_GPU=ON" warnings
```
Make sure Ollama is running with a model that supports tool calling:
```bash
# Pull the recommended model (supports tool calling + multilingual)
ollama pull qwen2.5:1.5b

# Start a chat to keep the model loaded
ollama run qwen2.5:1.5b
```
Note: The default model has changed from gemma3:1b to qwen2.5:1.5b to support agentic tool calling for weather and web search while keeping memory usage low.
macOS or Linux (CPU):
```bash
./voice-assistant
```
Linux with CUDA (recommended):
```bash
./run-voice-assistant.sh
```
The wrapper script automatically:
- Sets up `LD_LIBRARY_PATH` for CUDA libraries
- Detects Jetson hardware and pre-loads the Ollama model to prevent memory fragmentation
- Extracts the model from command line args for proper pre-loading
Jetson Orin Nano Super users: See JETSON_OPTIMIZATION.md for memory optimization details.
The assistant supports pluggable STT and TTS backends via --stt-backend and --tts-backend. Each backend interprets --stt-model and --tts-voice in its own way.
```bash
# Defaults (equivalent to not passing the flags)
./voice-assistant --stt-backend whisper --tts-backend kokoro
```
Currently available backends:
- STT: `whisper` (default)
- TTS: `kokoro` (default)
To add a new backend, implement the Transcriber/Synthesizer interface and register it in the factory (see internal/stt/stt.go and internal/tts/tts.go).
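As a rough sketch of what a new backend plug-in could look like (the interface shape and factory function below are illustrative assumptions; the real definitions live in internal/stt/stt.go and internal/tts/tts.go):

```go
package main

import "fmt"

// Transcriber is a stand-in for the interface defined in internal/stt/stt.go;
// the real method set may differ.
type Transcriber interface {
	Transcribe(samples []float32, sampleRate int) (string, error)
}

// echoBackend is a hypothetical new STT backend used only for illustration.
type echoBackend struct{}

func (echoBackend) Transcribe(samples []float32, sampleRate int) (string, error) {
	return fmt.Sprintf("(%d samples at %d Hz)", len(samples), sampleRate), nil
}

// newTranscriber sketches the factory switch selected by --stt-backend.
func newTranscriber(backend string) (Transcriber, error) {
	switch backend {
	case "echo":
		return echoBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown STT backend %q", backend)
	}
}

func main() {
	stt, err := newTranscriber("echo")
	if err != nil {
		panic(err)
	}
	text, _ := stt.Transcribe(make([]float32, 16000), 16000)
	fmt.Println(text)
}
```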
Use the --stt-model flag to choose the STT model:
```bash
# Use Whisper tiny model (default, recommended for most devices)
./voice-assistant --stt-model tiny

# Use Whisper base model (better accuracy, more memory)
./voice-assistant --stt-model base

# Use Whisper small model (best accuracy, requires more memory)
./voice-assistant --stt-model small
```
Model Comparison:
| Model | Memory | Accuracy (WER) | Speed | Use Case |
|---|---|---|---|---|
| `tiny` | ~390MB | ~5.0% | 32x RT | Jetson, Raspberry Pi, low-memory devices |
| `base` | ~740MB | ~3.4% | 16x RT | Balanced for most systems |
| `small` | ~2.4GB | ~2.2% | 6x RT | Desktop systems, best quality |
For Jetson Orin Nano (8GB unified memory), tiny is critical to avoid OOM errors. See JETSON_OPTIMIZATION.md for details.
The voice assistant includes agentic tool calling powered by Ollama's function calling support. The LLM can proactively use tools to answer questions about current information it doesn't know.
🌤️ Weather Tool
- Get current weather for any location worldwide
- Supports city-based queries: "What's the weather in Tokyo?"
- Automatic IP-based geolocation: "What's the weather here?"
- Uses Open-Meteo API (no API key required)
🔍 Web Search Tool
- Search the web for current information, news, facts, and events
- Two backends:
  - SearXNG (recommended): Privacy-respecting metasearch engine
  - DuckDuckGo (fallback): Automatic fallback when SearXNG unavailable
- Returns top 3 results formatted for voice output (a minimal query sketch follows below)
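For reference, fetching and trimming results from SearXNG's JSON endpoint (the same `/search?q=...&format=json` URL used in the verification step later) might look like the sketch below. The JSON field names are assumptions about SearXNG's response format, not this project's actual search client.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// searxResult mirrors the fields assumed to be present in each SearXNG result.
type searxResult struct {
	Title   string `json:"title"`
	URL     string `json:"url"`
	Content string `json:"content"`
}

// searchSearXNG queries the JSON API and keeps the top three results,
// which is roughly the shape the TTS-friendly output described above needs.
func searchSearXNG(baseURL, query string) ([]searxResult, error) {
	endpoint := fmt.Sprintf("%s/search?q=%s&format=json", baseURL, url.QueryEscape(query))
	resp, err := http.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var payload struct {
		Results []searxResult `json:"results"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		return nil, err
	}
	if len(payload.Results) > 3 {
		payload.Results = payload.Results[:3] // top 3, as read aloud by TTS
	}
	return payload.Results, nil
}

func main() {
	results, err := searchSearXNG("http://localhost:8080", "latest AI news")
	if err != nil {
		fmt.Println("search failed:", err)
		return
	}
	for _, r := range results {
		fmt.Printf("%s - %s\n", r.Title, r.Content)
	}
}
```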
Tool calling requires models that support function calling. The default model has been changed to qwen2.5:1.5b which supports:
- ✅ Multi-lingual conversations (15+ languages)
- ✅ Tool/function calling
- ✅ Fast and memory-efficient (~1GB)
⚠️ Tool Calling Accuracy Warning
Tool calling accuracy depends heavily on model size. While smaller models (1.5b, 3b) support function calling, they have reduced accuracy in determining when and how to use tools. The 7B models provide significantly better tool usage decisions. For memory-constrained devices like Jetson Orin Nano, this is a known trade-off between memory usage and tool calling reliability.
```bash
# Pull the default model (one time)
ollama pull qwen2.5:1.5b

# Or use larger models for better quality
ollama pull qwen2.5:3b   # ~2GB, better quality
ollama pull qwen2.5:7b   # ~4.9GB, best quality, excellent tool calling
```
Other compatible models:
- `qwen2.5:1.5b` - Smaller/faster (~1GB)
- `qwen2.5:7b` - Better quality (~4.7GB)
- `mistral:7b` - Alternative with tool support
Weather queries:
User: "What's the weather in Paris?"
Assistant: [Uses weather tool] "The weather for Paris, Île-de-France, FR:
Temperature is 12°C, feels like 10°C. Humidity is 75 percent."
User: "What's the weather here?"
Assistant: [Uses weather tool with IP geolocation] "The weather for Chapel Hill..."
Web search queries:
User: "Who won the Super Bowl this year?"
Assistant: [Uses search tool] "The Kansas City Chiefs defeated..."
User: "What's the latest news about AI?"
Assistant: [Uses search tool] "Recent developments include..."
General conversation:
User: "Tell me a joke"
Assistant: [No tools needed] "Why did the scarecrow win an award?..."
For privacy-focused web search, you can run your own SearXNG instance locally:
1. Configuration files
The repository includes pre-configured files in searxng/:
- `settings.yml` - Optimized for minimal memory usage with Bing search
- `docker-compose.yml` - Resource limits for edge devices (Jetson, etc.)
2. Start SearXNG with Docker Compose:
```bash
cd searxng

# If starting for the first time or after a stop:
docker compose up -d

# If container already exists (to restart):
docker compose restart

# Check status:
docker compose ps

cd ..
```
3. Verify SearXNG is working:
```bash
curl "http://localhost:8080/search?q=test&format=json"
```
4. Run voice assistant with SearXNG:
```bash
# Go version
./voice-assistant -searxng-url http://localhost:8080

# Rust version
./target/release/voice-assistant --searxng-url http://localhost:8080
```
5. Managing SearXNG:
```bash
cd searxng

# Stop (keeps container, quick restart):
docker compose stop

# Start stopped container:
docker compose start

# Restart running container:
docker compose restart

# Stop and remove container:
docker compose down

# View logs:
docker compose logs -f
```
Notes:
- SearXNG is optional - the assistant falls back to DuckDuckGo if not configured
- Configuration optimized for speed and minimal resource usage (~384MB RAM, 1 CPU core)
- Supports multilingual queries (matches Whisper's 99-language support)
- Currently configured with Bing search engine for best API reliability
- For Jetson Orin Nano optimization, see JETSON_OPTIMIZATION.md
- User asks a question requiring external information
- LLM decides which tool(s) to call (or none)
- Tools execute and return results
- LLM synthesizes a natural response from tool results
- TTS speaks the final answer
The system uses an agentic loop: LLM → Tool Calls → Tool Results → LLM → Final Answer. This happens automatically with no user intervention.
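A stripped-down version of that loop against Ollama's /api/chat endpoint could look like the following. The request/response shapes follow Ollama's chat API (non-streaming, with message.tool_calls), but the message types, the omitted tools schema, and the runTool dispatcher are simplified assumptions rather than the project's internal/llm client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"`
	ToolCalls []toolCall `json:"tool_calls,omitempty"`
}

type toolCall struct {
	Function struct {
		Name      string         `json:"name"`
		Arguments map[string]any `json:"arguments"`
	} `json:"function"`
}

// chat posts one turn to Ollama's /api/chat endpoint (non-streaming).
func chat(host string, msgs []message) (message, error) {
	body, _ := json.Marshal(map[string]any{
		"model":    "qwen2.5:1.5b",
		"messages": msgs,
		"stream":   false,
		// A real client would also send the "tools" schema here.
	})
	resp, err := http.Post(host+"/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		return message{}, err
	}
	defer resp.Body.Close()
	var out struct {
		Message message `json:"message"`
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out.Message, err
}

// runTool is a placeholder for the weather/search tool dispatch.
func runTool(tc toolCall) string {
	return "stub result for " + tc.Function.Name
}

func main() {
	host := "http://localhost:11434"
	msgs := []message{{Role: "user", Content: "What's the weather in Tokyo?"}}

	// Agentic loop: keep feeding tool results back until no tools are requested.
	for i := 0; i < 5; i++ {
		reply, err := chat(host, msgs)
		if err != nil {
			fmt.Println("chat failed:", err)
			return
		}
		msgs = append(msgs, reply)
		if len(reply.ToolCalls) == 0 {
			fmt.Println("assistant:", reply.Content) // final answer → TTS
			return
		}
		for _, tc := range reply.ToolCalls {
			msgs = append(msgs, message{Role: "tool", Content: runTool(tc)})
		}
	}
}
```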
Both Whisper (STT) and Kokoro (TTS) support multiple languages. The assistant can understand and respond in Spanish, French, Italian, Portuguese, Japanese, Chinese, and more.
- Speech Recognition (STT): Set your language with `-stt-language` (e.g., `es` for Spanish)
- Text-to-Speech (TTS): Voice language is automatically detected from the voice name prefix (a lookup sketch follows after this list):
  - `ef_*` / `em_*` → Spanish (es)
  - `ff_*` → French (fr)
  - `hf_*` / `hm_*` → Hindi (hi)
  - `if_*` / `im_*` → Italian (it)
  - `jf_*` / `jm_*` → Japanese (ja)
  - `pf_*` / `pm_*` → Portuguese BR (pt-br)
  - `af_*` / `am_*` → American English
  - `bf_*` / `bm_*` → British English
  - `zf_*` / `zm_*` → Chinese (Mandarin)
- LLM: Use a multilingual model like `qwen2.5:1.5b` or larger for proper language matching
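Because the language comes purely from the voice name prefix, the mapping boils down to a small lookup table. The sketch below is illustrative only; the function name and the English/Chinese language codes are assumptions, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// languageForVoice maps a Kokoro voice name prefix to a language code,
// following the prefix table above. Unknown prefixes fall back to English.
func languageForVoice(voice string) string {
	prefixes := map[string]string{
		"ef_": "es", "em_": "es",
		"ff_": "fr",
		"hf_": "hi", "hm_": "hi",
		"if_": "it", "im_": "it",
		"jf_": "ja", "jm_": "ja",
		"pf_": "pt-br", "pm_": "pt-br",
		"af_": "en-us", "am_": "en-us",
		"bf_": "en-gb", "bm_": "en-gb",
		"zf_": "zh", "zm_": "zh",
	}
	for prefix, lang := range prefixes {
		if strings.HasPrefix(voice, prefix) {
			return lang
		}
	}
	return "en-us"
}

func main() {
	fmt.Println(languageForVoice("ef_dora"))  // es
	fmt.Println(languageForVoice("ff_siwis")) // fr
}
```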
To use the assistant entirely in Spanish:
```bash
# 1. Pull a multilingual LLM model (one time)
ollama pull qwen2.5:1.5b

# 2. Run with Spanish speech recognition + Spanish TTS voice
./voice-assistant \
  -ollama-model qwen2.5:1.5b \
  -stt-language es \
  -tts-voice ef_dora \
  -tts-speaker-id 28
```
What happens:
- You speak in Spanish → Whisper transcribes it
- Qwen2.5 responds in Spanish (it automatically detects the input language)
- Kokoro synthesizes the response with the Spanish female voice (ef_dora)
Whisper supports 99 languages. Here are the most common with their Kokoro TTS voices:
| Language | STT Code | TTS Voice Options | Speaker IDs |
|---|---|---|---|
| Spanish | es | ef_dora (female), em_alex (male) | 28, 29 |
| French | fr | ff_siwis (female) | 33 |
| Italian | it | if_*, im_* voices | varies |
| Portuguese | pt | pf_*, pm_* voices | varies |
| Japanese | ja | jf_*, jm_* voices | varies |
| Chinese | zh | zf_*, zm_* voices | varies |
| Hindi | hi | hf_*, hm_* voices | varies |
| English (US) | en | af_bella, am_michael, etc. | 2, 16, ... |
| English (UK) | en | bf_emma, bm_george, etc. | 21, 26, ... |
For all 53 available voices: ./voice-assistant --list-voices
The default qwen2.5:1.5b model provides excellent multilingual support. For even better quality:
| Model | Size | Languages | Best For |
|---|---|---|---|
| qwen2.5:1.5b ⭐ | ~1GB | Good for 15+ languages | Default, fast |
| qwen2.5:3b | ~2GB | Excellent for 15+ languages | Better quality |
| aya-expanse:8b | ~4.9GB | Purpose-built for 23+ languages | Best quality |
| gemma2:2b | ~1.6GB | Better than gemma3:1b | Alternative |
```bash
# French (automatic language in response)
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language fr \
  -tts-voice ff_siwis \
  -tts-speaker-id 33

# Auto-detect input language (English, Spanish, French, etc.)
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language auto \
  -tts-voice af_bella \
  -tts-speaker-id 2

# Japanese
./voice-assistant \
  -ollama-model qwen2.5:3b \
  -stt-language ja \
  -tts-voice jf_* \
  -tts-speaker-id <id>
```
Note: Qwen models automatically respond in the same language as your input without needing to modify the system prompt.
Basic usage (always listening):
```bash
# macOS or Linux CPU
./voice-assistant

# Linux with CUDA
./run-voice-assistant.sh
```
With wake word:
```bash
./run-voice-assistant.sh -wake-word "hey assistant"
```
Custom Ollama model:
```bash
./voice-assistant -ollama-model "mistral:7b"
```
Faster speech:
```bash
./voice-assistant -tts-speed 1.2
```
Verbose mode for debugging:
```bash
./voice-assistant -verbose
```
Force CPU-only inference (disable GPU):
```bash
./voice-assistant -provider cpu
```
Force CUDA on Linux (if auto-detect fails):
```bash
./run-voice-assistant.sh -provider cuda
```
The voice assistant can be configured as a real-time translator without changing a single line of code. By combining multilingual STT, strategic system prompts, and cross-language TTS, you can create a live translation device.
- Input Language (STT): Whisper transcribes speech in the source language
- Translation (LLM): System prompt instructs the model to translate to target language
- Output Language (TTS): Kokoro synthesizes the translation in the target language
- Temperature Control: Lower temperature (0.1-0.3) ensures deterministic, accurate translations
```bash
# 1. Pull a multilingual LLM (one time)
ollama pull qwen2.5:3b

# 2. Run the translator
./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language es \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a Spanish-to-English translator. Translate the following Spanish text to natural English. Output only the English translation without any Spanish words or explanations. NEVER use markdown, asterisks, underscores, backticks, brackets, code blocks, bullet points, or special characters."
```
What happens:
- You speak in Spanish: "Hola, ¿cómo estás?"
- Whisper transcribes: "Hola, ¿cómo estás?"
- Qwen translates: "Hello, how are you?"
- Kokoro speaks in English: "Hello, how are you?"
```bash
./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language en \
  --tts-voice ef_dora \
  --tts-speaker-id 28 \
  --temperature 0.2 \
  --system-prompt "You are an English-to-Spanish translator. Translate the following English text to natural Spanish. Output only the Spanish translation without any English words or explanations. NEVER use markdown, asterisks, underscores, backticks, brackets, code blocks, bullet points, or special characters."
```
French → English:
```bash
./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language fr \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a French-to-English translator. Translate the following French text to natural English. Output only the English translation. NEVER use markdown or special formatting."
```
Japanese → English:
```bash
./voice-assistant \
  --ollama-model qwen2.5:3b \
  --stt-language ja \
  --tts-voice af_bella \
  --tts-speaker-id 2 \
  --temperature 0.2 \
  --system-prompt "You are a Japanese-to-English translator. Translate the following Japanese text to natural English. Output only the English translation. NEVER use markdown or special formatting."
```
| Parameter | Purpose | Translation Value |
|---|---|---|
| `--stt-language` | Source language for transcription | es, fr, ja, etc. |
| `--tts-voice` | Target language voice | af_bella (English), ef_dora (Spanish), etc. |
| `--temperature` | Translation consistency | 0.1-0.3 (lower = more deterministic) |
| `--system-prompt` | Translation instructions | Must explicitly state "translate only" |
| `--ollama-model` | Multilingual model | qwen2.5:3b or aya-expanse:8b |
- Temperature 0.7 (default): Model may mix languages or add conversational elements
- Example: "Hola, ¿y tú? How are you?" (mixed Spanish/English)
- Temperature 0.2: Model provides deterministic, accurate translations
- Example: "Hello, how are you?" (pure English)
Lower temperature reduces creativity and increases consistency, which is ideal for translation tasks.
| Model | Size | Best For | Translation Quality |
|---|---|---|---|
| qwen2.5:3b ⭐ | ~2GB | General translation | Excellent |
| aya-expanse:8b | ~4.9GB | Best quality | Superior (purpose-built for multilingual) |
| qwen2.5:1.5b | ~1GB | Resource-constrained devices | Good |
The assistant supports two modes for managing playback interruption when speech is detected:
When using headsets (headphones + microphone), the system works perfectly: the microphone only captures your voice, so interrupting playback when you speak is straightforward.
However, with open mic/speaker setups (external speakers + separate microphone), the assistant's own voice output can be captured by the microphone, causing unwanted self-interruption. This is known as acoustic feedback or acoustic echo.
Use the --interrupt-mode flag to select the appropriate behavior for your audio setup:
```bash
./voice-assistant -interrupt-mode always
```
- Use when: Using headphones or headset
- Behavior: Immediately interrupts playback when speech is detected
- Advantage: Natural conversation flow, can interrupt the assistant mid-sentence
- Limitation: Will self-interrupt with open speakers (assistant's voice triggers VAD)
```bash
./voice-assistant -interrupt-mode wait
```
- Use when: Using external speakers with separate microphone
- Behavior: Pauses microphone capture during playback, resumes after with configurable delay
- Advantage: Prevents acoustic feedback and self-interruption
- Limitation: Cannot interrupt assistant mid-sentence, must wait for response to complete
- Delay: Use `-post-playback-delay-ms 300` to adjust resume delay (default 300ms)
```bash
# For headset users (natural interruption)
./voice-assistant -interrupt-mode always

# For open mic/speaker setup (prevent feedback)
./voice-assistant -interrupt-mode wait -post-playback-delay-ms 500

# Optimize audio buffer for wired/built-in audio (lower latency)
./voice-assistant -audio-buffer-ms 20

# Default buffer works best for Bluetooth devices (100ms)
./voice-assistant # Uses 100ms buffer by default
```
The audio buffer size affects latency and compatibility with different audio devices:
| Buffer Size | Best For | Latency | Notes |
|---|---|---|---|
| 100ms (default) | Bluetooth devices | Higher | Prevents distortion with AirPods, etc. |
| 20ms | Wired/USB/Built-in | Low | More responsive, real-time feel |
| 50ms | Mixed use | Medium | Balance between latency and stability |
Usage:
```bash
# For Bluetooth devices (default, recommended for AirPods)
./voice-assistant

# For wired or built-in audio
./voice-assistant -audio-buffer-ms 20
```
Why this matters: Bluetooth audio has inherent latency (100-200ms), so using a small buffer (20ms) can cause audio underruns and distortion. The 100ms default matches Bluetooth's characteristics.
Why is this a problem?
- Voice activity detection (VAD) analyzes audio energy and spectral features
- The assistant's synthesized voice has similar characteristics to human speech
- Without isolation, VAD cannot distinguish between user speech and playback
Why not use echo cancellation?
- Acoustic Echo Cancellation (AEC) requires significant computational resources
- Cross-platform AEC libraries have varying quality and platform-specific implementations
- On Linux, system-level solutions (PipeWire, PulseAudio) can provide AEC with proper configuration
- On macOS, Core Audio's VoiceProcessingIO provides AEC but requires platform-specific integration
The wait mode provides a simple, reliable solution that works consistently across all platforms without additional computational overhead.
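Conceptually, the two modes differ only in what happens while the assistant is speaking: always checks an interrupt flag during playback, while wait mutes capture until playback plus the post-playback delay has finished. The following is a rough sketch of that control flow; the type and field names are illustrative assumptions, not the project's playback code.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type assistant struct {
	interruptMode     string        // "always" or "wait"
	postPlaybackDelay time.Duration // e.g. -post-playback-delay-ms 300
	speechDetected    atomic.Bool   // set by the VAD goroutine
	captureEnabled    atomic.Bool   // read by the capture callback
}

// playResponse plays synthesized audio chunks, honouring the interrupt mode.
func (a *assistant) playResponse(chunks [][]float32) {
	if a.interruptMode == "wait" {
		a.captureEnabled.Store(false) // mute the mic so TTS can't trigger the VAD
		defer func() {
			time.Sleep(a.postPlaybackDelay) // resume delay after playback
			a.captureEnabled.Store(true)
		}()
	}
	for _, chunk := range chunks {
		if a.interruptMode == "always" && a.speechDetected.Load() {
			fmt.Println("user spoke: stopping playback")
			return
		}
		_ = chunk // hand the chunk to the playback device here
	}
}

func main() {
	a := &assistant{interruptMode: "wait", postPlaybackDelay: 300 * time.Millisecond}
	a.captureEnabled.Store(true)
	a.playResponse([][]float32{make([]float32, 1600)})
}
```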
Kokoro TTS includes multiple voices in a single model. You can change voices using the -tts-voice and -tts-speaker-id flags:
American Voices:
| Name | Speaker ID | Quality | Description |
|---|---|---|---|
| `af_heart` | 3 | A | American female, flagship voice |
| `af_bella` | 2 | A- | American female, high quality (default) |
| `af_nicole` | 6 | B- | American female, good quality |
| `af_sarah` | 9 | C+ | American female, warm |
| `af_sky` | 10 | C- | American female, youthful |
| `am_adam` | 11 | F+ | American male, basic quality |
| `am_michael` | 16 | C+ | American male, medium quality |
British Voices:
| Name | Speaker ID | Quality | Description |
|---|---|---|---|
| `bf_emma` | 21 | B- | British female, recommended |
| `bf_isabella` | 22 | C | British female, medium quality |
| `bm_george` | 26 | C | British male, medium quality |
| `bm_lewis` | 27 | D+ | British male, basic quality |
For a complete list of all 53 voices (including Spanish, French, Japanese, and more), run:
```bash
./voice-assistant --list-voices
```
```bash
# Use British male voice (George)
./voice-assistant -tts-voice bm_george -tts-speaker-id 26

# Use American female voice (Nicole)
./voice-assistant -tts-voice af_nicole -tts-speaker-id 6
```
To see all 53 available Kokoro voices with their speaker IDs, quality grades, and descriptions:
```bash
# List all voices
./voice-assistant --list-voices

# Get details for a specific voice
./voice-assistant --voice-info af_bella
```
```
voice-assistant/
├── cmd/
│ └── assistant/
│ └── main.go # Main entry point, pipeline orchestration
├── internal/
│ ├── audio/
│ │ ├── capture.go # Microphone audio capture (malgo)
│ │ └── playback.go # Audio playback with interrupt support
│ ├── config/
│ │ └── config.go # CLI flags and configuration
│ ├── llm/
│ │ └── client.go # Ollama API client
│ ├── setup/
│ │ ├── download.go # HTTP download and tar.bz2 extraction helpers
│ │ └── setup.go # --setup orchestration (model download & verification)
│ ├── sherpa/
│ │ ├── sherpa_darwin.go # macOS-specific sherpa-onnx bindings (CoreML)
│ │ └── sherpa_linux.go # Linux-specific sherpa-onnx bindings (CUDA)
│ ├── stt/
│ │ ├── stt.go # VoiceDetector, Transcriber interfaces + factory
│ │ ├── silero.go # Silero VAD implementation
│ │ ├── whisper.go # Whisper transcription implementation
│ │ └── processor.go # STT processing goroutine
│ └── tts/
│ ├── tts.go # Synthesizer interface + factory
│ ├── kokoro.go # Kokoro TTS implementation
│ ├── text.go # Sentence splitting utilities
│ └── processor.go # TTS playback pipeline goroutine
├── scripts/
│ └── build.sh # Build script with CUDA support
├── go.mod
└── README.md
```
| Component | Model | Size | Purpose |
|---|---|---|---|
| VAD | Silero-VAD | ~2MB | Speech boundary detection |
| STT | Whisper tiny (int8, default) | ~111MB | Speech recognition |
| TTS | Kokoro v1.0 | ~311MB | Expressive voice synthesis |
You can select a different STT model with --stt-model:
STT alternatives:
- `tiny` - Fastest, ~5% WER (default)
- `base` - Balance of speed/accuracy
- `small` - Higher accuracy, slower
TTS voices (Kokoro built-in):
- `af_bella` (speaker ID 2) - American female, high quality (default)
- `af_heart` (speaker ID 3) - American female, flagship voice
- `bf_emma` (speaker ID 21) - British female, recommended
- `am_adam` (speaker ID 11) - American male
For all 53 voices across 9 languages, run: ./voice-assistant --list-voices
This implementation uses OfflineRecognizer (batch processing) rather than OnlineRecognizer (streaming) because:
- VAD pre-segments audio: The Silero-VAD detects complete utterances before transcription
- Whisper accuracy: Whisper performs best on complete audio segments
- Practical latency: The VAD adds minimal delay (~250ms silence detection), and Whisper processes quickly on modern hardware (see the sketch after this list)
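Put together, the flow is: accumulate samples while the VAD reports speech, then hand the completed segment to the offline recognizer once silence is detected. Below is a schematic version using stand-in interfaces; the method names are assumptions, not the actual definitions in internal/stt.

```go
package main

import "fmt"

// VoiceDetector and Transcriber are stand-ins for the interfaces in
// internal/stt; the method names here are assumptions for illustration.
type VoiceDetector interface {
	IsSpeech(frame []float32) bool
}

type Transcriber interface {
	Transcribe(segment []float32, sampleRate int) (string, error)
}

// transcribeSegments buffers frames while speech is detected and sends each
// completed utterance to the offline (batch) recognizer. The real pipeline
// waits ~250ms of silence before closing a segment; this sketch closes it
// on the first silent frame to keep the logic short.
func transcribeSegments(frames <-chan []float32, vad VoiceDetector, stt Transcriber) {
	var segment []float32
	inSpeech := false
	for frame := range frames {
		if vad.IsSpeech(frame) {
			inSpeech = true
			segment = append(segment, frame...)
			continue
		}
		if inSpeech { // silence after speech: the utterance is complete
			if text, err := stt.Transcribe(segment, 16000); err == nil {
				fmt.Println("transcript:", text)
			}
			segment, inSpeech = nil, false
		}
	}
}

func main() {} // wiring of the real VAD/Whisper implementations omitted
```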
For even lower latency, you could:
- Use a streaming model (Zipformer, Paraformer) with OnlineRecognizer
- Reduce VAD silence threshold
- Use smaller Whisper model (tiny.en)
CoreML is Apple's machine learning framework that automatically leverages:
- Apple Neural Engine (ANE) on M1/M2/M3/M4 chips for maximum efficiency
- GPU acceleration on Intel Macs with discrete graphics
- CPU fallback when specialized hardware is unavailable
No additional installation required - CoreML is built into macOS.
NVIDIA CUDA enables GPU-accelerated inference on Linux, supporting both discrete GPUs and Jetson SOC devices.
Supported Hardware:
- Discrete NVIDIA GPUs (GTX 10xx series or newer)
- NVIDIA Jetson devices (Nano, Orin, Xavier, AGX)
Requirements for Discrete GPUs:
- NVIDIA Driver 450.80.02 or later
- CUDA Toolkit 11.0 or later
Installation (Ubuntu/Debian):
```bash
# Install NVIDIA driver (if not already installed)
sudo apt install nvidia-driver-535

# Install CUDA toolkit
sudo apt install nvidia-cuda-toolkit

# Verify installation
nvidia-smi
nvcc --version
```
Jetson Devices: Jetson devices (Nano, Orin, Xavier) come with JetPack SDK which includes CUDA support out of the box. The auto-detection will recognize Jetson devices via (a minimal check is sketched below):
- `/dev/nvhost-gpu` device
- `/etc/nv_tegra_release` file
- Tegra identifiers in device tree
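A check along those lines can be as simple as probing for the markers above. The sketch below is illustrative, not the project's actual detection code:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isJetson reports whether the host looks like an NVIDIA Jetson device,
// using the markers listed above.
func isJetson() bool {
	if _, err := os.Stat("/dev/nvhost-gpu"); err == nil {
		return true
	}
	if _, err := os.Stat("/etc/nv_tegra_release"); err == nil {
		return true
	}
	// The device-tree model string contains a Tegra/Jetson identifier on Jetson boards.
	if model, err := os.ReadFile("/proc/device-tree/model"); err == nil {
		m := strings.ToLower(string(model))
		return strings.Contains(m, "tegra") || strings.Contains(m, "jetson")
	}
	return false
}

func main() {
	fmt.Println("Jetson detected:", isJetson())
}
```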
Important: The build script automatically selects the correct ONNX Runtime version based on your CUDA version. Different Jetson boards require different versions:
| Jetson Device | JetPack | CUDA Version | ONNX Runtime |
|---|---|---|---|
| Jetson Nano B01 | 4.x | CUDA 10.2 | 1.11.0 |
| Jetson Orin NX | 5.x | CUDA 11.4 | 1.16.0 |
| Jetson Orin (JetPack 6.x) | 6.x | CUDA 12.2 | 1.18.0 |
| Jetson Orin (JetPack 6.2+) | 6.2+ | CUDA 12.6 | 1.18.1 |
The build script detects your CUDA version automatically and downloads the matching ONNX Runtime.
Running on Jetson:
```bash
# Build with CUDA support
./scripts/build.sh --cuda

# Run using the wrapper script (recommended - sets up library paths)
./run-voice-assistant.sh

# Or run directly if paths are configured
./voice-assistant
```
If you see errors like `libcublas.so.X: cannot open shared object file`, it means there's a CUDA version mismatch. The wrapper script sets up `LD_LIBRARY_PATH` to help resolve this.
Optional: Install cuDNN for optimal performance:
```bash
# Download cuDNN from NVIDIA (requires account)
# https://developer.nvidia.com/cudnn
sudo dpkg -i cudnn-local-repo-*.deb
sudo apt update
sudo apt install libcudnn8
```
The voice assistant will automatically detect CUDA availability and use GPU acceleration.
When developing this cross-platform project in VS Code, gopls (the Go language server) may show errors for platform-specific code that doesn't apply to your current OS. For example, on macOS you might see [linux,amd64] errors for Linux-specific imports.
This project uses Go build constraints (//go:build darwin / //go:build linux) to provide platform-specific sherpa-onnx bindings. By default, gopls may analyze files for all platforms, causing spurious errors for code that won't run on your current OS. Setting GOOS and GOARCH tells gopls to analyze only for your platform.
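For reference, the pattern is the standard Go one: two files guarded by //go:build tags, each backed by the matching sherpa-onnx bindings. The following are schematic file skeletons (contents simplified, not the project's actual sources):

```go
// internal/sherpa/sherpa_darwin.go (schematic)
//go:build darwin

package sherpa

// On macOS this file would import the CoreML-capable bindings,
// github.com/k2-fsa/sherpa-onnx-go-macos, and re-export the types
// the rest of the project uses.
```

```go
// internal/sherpa/sherpa_linux.go (schematic)
//go:build linux

package sherpa

// On Linux the same names are backed by github.com/k2-fsa/sherpa-onnx-go-linux
// (CPU by default, or the CUDA-enabled build produced by scripts/build.sh).
```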
This project uses sherpa-onnx for speech processing. Version management differs by platform:
| Platform | How It Works | Version Check |
|---|---|---|
| macOS | Uses pre-built sherpa-onnx-go-macos bindings |
Automatic (handled by bindings) |
| Linux (CPU) | Uses pre-built sherpa-onnx-go-linux bindings |
Automatic (handled by bindings) |
| Linux (CUDA) | Compiles sherpa-onnx from source | Manual sync required |
For Linux CUDA builds, the following files must stay in sync:
| File | What to Update | Current Value |
|---|---|---|
| `go.mod` | sherpa-onnx-go-linux and sherpa-onnx-go-macos versions | v1.12.x |
| `scripts/build.sh` | `SHERPA_VERSION` variable | v1.12.x |
The build script includes a sanity check that fails with clear instructions if versions mismatch.
The build script automatically selects the correct ONNX Runtime version based on your CUDA version:
| CUDA Version | ONNX Runtime | Use Case |
|---|---|---|
| 10.2.x | 1.11.0 | Jetson Nano (JetPack 4.x) |
| 11.4.x | 1.16.0 | Jetson Orin NX (JetPack 5.x) |
| 11.x | 1.16.0 | Generic CUDA 11 |
| 12.2.x | 1.18.0 | CUDA 12.2 with cuDNN8 |
| 12.6.x+ | 1.18.1 | JetPack 6.2+ (cuDNN9) |
| 12.x | 1.18.1 | Generic CUDA 12 |
- Check for new releases:
  - Visit sherpa-onnx releases
  - Visit sherpa-onnx-go-linux
- Update `go.mod`:
  ```bash
  go get github.com/k2-fsa/sherpa-onnx-go-linux@vX.Y.Z
  go get github.com/k2-fsa/sherpa-onnx-go-macos@vX.Y.Z
  go mod tidy
  ```
- Update the build script:
  - Edit `scripts/build.sh`
  - Update `SHERPA_VERSION="vX.Y.Z"` to match
- Test on macOS (easy path):
  ```bash
  ./scripts/build.sh
  ./voice-assistant
  ```
- Test on Linux with CUDA:
  ```bash
  ./scripts/build.sh --clean --cuda
  ./run-voice-assistant.sh
  ```
  - Watch for ABI mismatch errors or "Please compile with -DSHERPA_ONNX_ENABLE_GPU=ON" warnings
  - Verify provider shows `cuda`, not `cpu`
- If the CUDA build fails:
  - Check if the ONNX Runtime version mapping needs updating
  - Review sherpa-onnx release notes for breaking changes
  - The ONNX Runtime mapping in `scripts/build.sh` may need new entries for newer CUDA versions
For CUDA builds on Linux, the runtime libraries are installed to ~/.voice-assistant/go/lib/. This enables portable deployment:
On the build machine:
```bash
./scripts/build.sh --cuda
```
To deploy to another machine:
- Copy the `~/.voice-assistant/go` directory (contains CUDA libraries)
- Copy the `~/.voice-assistant/models` directory (shared model files)
- Copy the `voice-assistant` binary
- Copy the `run-voice-assistant.sh` wrapper script

On the target machine:
```bash
# Ensure CUDA toolkit is installed, then run:
./run-voice-assistant.sh
```
The wrapper script automatically sets `LD_LIBRARY_PATH` to find libraries in `~/.voice-assistant/go/lib/`.
macOS Note: On macOS with CoreML, the Go binary is statically linked and doesn't require runtime libraries. Just copy the binary and ~/.voice-assistant/models/ directory.
- Run initial setup to download required models: `./voice-assistant --setup` (use `--force` to re-download if needed)
- Ensure the model directory exists and is writable (default: `~/.voice-assistant/models/`, or as set via `--model-dir`)
- Check that model paths and any `--model-dir` override match your configuration
- Verify sherpa-onnx is properly installed
- Start Ollama: `ollama serve`
- Load a model: `ollama run qwen2.5:1.5b`
- Check the host URL matches: `-ollama-host http://localhost:11434`
- Check microphone permissions (macOS: System Preferences → Privacy → Microphone)
- Verify microphone is connected and working
- Try running with `-verbose` to see audio processing logs
- Ensure CGO is enabled: `export CGO_ENABLED=1`
- Install required system libraries for your platform
- Verify NVIDIA driver: `nvidia-smi`
- Check CUDA version: `nvcc --version`
- Try forcing CPU mode: `./voice-assistant -provider cpu`
- Ensure CUDA libraries are in `LD_LIBRARY_PATH`
- Ensure macOS 10.13 or later
- Try forcing CPU mode: `./voice-assistant -provider cpu`
This project builds upon the excellent work of many open-source libraries and models:
- sherpa-onnx - Speech recognition and synthesis framework (Apache-2.0)
- Silero VAD - Voice activity detection model (MIT)
- OpenAI Whisper - Multilingual speech recognition model (MIT)
- Kokoro - Expressive neural text-to-speech model (MIT/Apache-2.0)
- Ollama - Local LLM inference engine (MIT)
Apache-2.0:
Unlicense:
BSD-3-Clause:
- Go standard library
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
All dependencies use permissive licenses (MIT, Apache-2.0, BSD-3-Clause, Unlicense) that are compatible with the Apache-2.0 License.