Voice Agent

A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.

Uses Workers AI for all models — zero external API keys required:

  • STT: Deepgram Nova 3 (@cf/deepgram/nova-3)
  • TTS: Deepgram Aura (@cf/deepgram/aura-1)
  • VAD: Pipecat Smart Turn v2 (@cf/pipecat-ai/smart-turn-v2)
  • LLM: Kimi K2.5 (@cf/moonshotai/kimi-k2.5)
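
All four models are invoked through the same Workers AI binding. A minimal sketch of the LLM call (the `messages` shape is the standard Workers AI text-generation input; the audio models each take model-specific inputs):

```ts
// A sketch, not the repo's actual code. env.AI is the Workers AI binding.
const stream = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  messages: [
    { role: "system", content: "You are a helpful voice assistant." },
    { role: "user", content: "Set a reminder for five minutes from now." },
  ],
  stream: true, // stream tokens so TTS can start before generation finishes
});
```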

Run it

```sh
npm install
npm run dev
```

No API keys needed — all AI models run via the Workers AI binding.
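
Because everything runs on Workers AI, the only bindings the Worker needs are the AI binding and the Durable Object itself. A minimal `wrangler.jsonc` sketch (names are illustrative and may not match the repo's actual config):

```jsonc
{
  // Illustrative config; check the repo's wrangler file for the real values.
  "name": "voice-agent",
  "main": "src/server.ts",
  "compatibility_date": "2025-01-01",
  "ai": { "binding": "AI" },
  "durable_objects": {
    "bindings": [{ "name": "VoiceAgent", "class_name": "VoiceAgent" }]
  },
  "migrations": [{ "tag": "v1", "new_sqlite_classes": ["VoiceAgent"] }]
}
```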

How it works

```
Browser                          Durable Object (VoiceAgent)
┌──────────┐   binary WS frames   ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │   JSON: end_of_speech │ VAD (smart-turn-v2)      │
│          │ ────────────────────► │   ↓                      │
│          │                       │ STT (nova-3)             │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM (kimi-k2.5)          │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
              single WebSocket connection
```
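
The JSON control messages in the diagram are small. A sketch of their shapes (anything beyond `end_of_speech` and `transcript` is illustrative, not the repo's actual types):

```ts
// Control messages share the WebSocket with the audio. Binary frames carry
// 16kHz Int16 PCM upstream and MP3 downstream; JSON frames carry events.
type ClientEvent =
  | { type: "end_of_speech" }; // the client-side silence detector fired

type ServerEvent =
  | { type: "transcript"; text: string }; // STT result for the finished turn
```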
  1. Browser captures mic audio via AudioWorklet and downsamples it to 16kHz mono PCM (see the worklet sketch after this list)
  2. PCM streams to the Agent over the single WebSocket connection as binary frames
  3. Client-side silence detection (500ms) triggers end-of-speech
  4. Server-side VAD (smart-turn-v2) confirms the user finished speaking
  5. Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS (see the pipeline sketch after this list)
  6. TTS audio streams back per-sentence as MP3 while the LLM is still generating
  7. Browser decodes and plays audio; user can interrupt at any time
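
Step 1 in code, roughly: an AudioWorklet processor that converts the mic's Float32 samples to 16kHz Int16 PCM. This is a sketch using naive decimation; a real implementation would low-pass filter before downsampling to avoid aliasing:

```ts
// pcm-capture.worklet.ts: a sketch of step 1. Runs in the AudioWorklet
// scope, where `sampleRate` is the audio context's rate.
class PCMCaptureProcessor extends AudioWorkletProcessor {
  private readonly ratio = sampleRate / 16000; // e.g. 3 at a 48kHz context

  process(inputs: Float32Array[][]): boolean {
    const input = inputs[0]?.[0];
    if (!input) return true;
    const out = new Int16Array(Math.floor(input.length / this.ratio));
    for (let i = 0; i < out.length; i++) {
      // Pick every ratio-th sample and convert Float32 [-1, 1] to Int16.
      const s = Math.max(-1, Math.min(1, input[Math.floor(i * this.ratio)]));
      out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    // The main thread forwards each buffer as a binary WebSocket frame.
    this.port.postMessage(out.buffer, [out.buffer]);
    return true;
  }
}
registerProcessor("pcm-capture", PCMCaptureProcessor);
```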
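
And steps 4 through 6 on the server, sketched with hypothetical wrappers (`isTurnComplete`, `transcribe`, `chatStream`, `synthesize`, `sentences`) around `env.AI.run`, since each model has its own input and output schema:

```ts
// Env is the Worker's binding type; only the AI binding matters here.
interface Env { AI: { run(model: string, inputs: unknown): Promise<unknown> } }
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical wrappers around env.AI.run, one per model:
declare function isTurnComplete(env: Env, pcm: Uint8Array): Promise<boolean>; // @cf/pipecat-ai/smart-turn-v2
declare function transcribe(env: Env, pcm: Uint8Array): Promise<string>;      // @cf/deepgram/nova-3
declare function chatStream(env: Env, msgs: Msg[]): AsyncIterable<string>;    // @cf/moonshotai/kimi-k2.5
declare function synthesize(env: Env, text: string): Promise<ArrayBuffer>;    // @cf/deepgram/aura-1
declare function sentences(t: AsyncIterable<string>): AsyncGenerator<string>; // see Features below

async function handleTurn(env: Env, ws: WebSocket, pcm: Uint8Array, history: Msg[]) {
  // 4. Server-side VAD confirms the user actually finished speaking.
  if (!(await isTurnComplete(env, pcm))) return;

  // 5. Transcribe, tell the client what was heard, then stream the LLM reply.
  const userText = await transcribe(env, pcm);
  ws.send(JSON.stringify({ type: "transcript", text: userText }));
  const tokens = chatStream(env, [...history, { role: "user", content: userText }]);

  // 6. Synthesize sentence by sentence so playback starts before generation ends.
  for await (const sentence of sentences(tokens)) {
    ws.send(await synthesize(env, sentence)); // binary MP3 frame to the client
  }
}
```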

Features

  • Streaming TTS — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated (see the chunking sketch below).
  • Interruption handling — speak over the agent to cut it off mid-sentence. The client detects sustained speech during playback and aborts the server pipeline (sketched below).
  • Server-side VAD — smart-turn-v2 validates end-of-speech after client-side silence detection, reducing false triggers on mid-sentence pauses.
  • Conversation persistence — all messages are stored in SQLite and survive restarts, so the agent remembers previous conversations (sketched below).
  • Agent tools — the LLM can call get_current_time, set_reminder, and get_weather during conversation (tool definitions sketched below).
  • Proactive scheduling — reminders set by voice fire on schedule and are spoken to connected clients, or saved to history if none are connected (sketched below).
  • useVoiceAgent hook — the client uses the agents/voice-react hook, which encapsulates all audio infrastructure in ~10 lines of setup (usage sketched below).
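
The sentence chunking behind streaming TTS can be as small as an async generator over the token stream. A sketch; the repo's actual splitting heuristics may be more careful about abbreviations and minimum chunk length:

```ts
// Split an LLM token stream on sentence-ending punctuation followed by
// whitespace, yielding complete sentences as soon as they appear.
async function* sentences(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  let buf = "";
  for await (const t of tokens) {
    buf += t;
    let m: RegExpMatchArray | null;
    while ((m = buf.match(/^([\s\S]*?[.!?])\s/))) {
      yield m[1].trim();            // emit a finished sentence for TTS
      buf = buf.slice(m[0].length); // keep the unfinished remainder
    }
  }
  if (buf.trim()) yield buf.trim(); // flush the trailing fragment
}
```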
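
Interruption handling maps naturally onto an AbortController held by the Durable Object. This sketch assumes the pipeline threads the signal through each stage; names are illustrative:

```ts
let current: AbortController | null = null;

// Called when the client reports sustained speech during playback.
function onInterrupt() {
  current?.abort(); // stop LLM + TTS mid-sentence
}

async function startTurn(run: (signal: AbortSignal) => Promise<void>) {
  current?.abort();            // a new user turn also cancels old playback
  current = new AbortController();
  try {
    await run(current.signal); // pipeline checks the signal between stages
  } catch (e) {
    if ((e as Error).name !== "AbortError") throw e; // swallow clean aborts
  }
}
```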
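
Conversation persistence maps directly onto the SQLite storage API of a SQLite-backed Durable Object. The schema here is a sketch, not the repo's actual tables:

```ts
export class VoiceAgent {
  constructor(private ctx: DurableObjectState) {
    // Create the history table once; survives restarts with the DO.
    ctx.storage.sql.exec(`
      CREATE TABLE IF NOT EXISTS messages (
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        created_at INTEGER NOT NULL
      )`);
  }

  saveMessage(role: string, content: string) {
    this.ctx.storage.sql.exec(
      "INSERT INTO messages (role, content, created_at) VALUES (?, ?, ?)",
      role, content, Date.now(),
    );
  }

  // Replayed into the LLM context so the agent remembers past sessions.
  history(): { role: string; content: string }[] {
    return this.ctx.storage.sql
      .exec("SELECT role, content FROM messages ORDER BY created_at")
      .toArray() as { role: string; content: string }[];
  }
}
```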
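
The three tools, sketched in the JSON-schema shape that Workers AI function calling accepts; parameter names are illustrative:

```ts
const tools = [
  {
    name: "get_current_time",
    description: "Get the current date and time",
    parameters: { type: "object", properties: {}, required: [] },
  },
  {
    name: "set_reminder",
    description: "Schedule a spoken reminder",
    parameters: {
      type: "object",
      properties: {
        text: { type: "string", description: "What to say when it fires" },
        delaySeconds: { type: "number", description: "Seconds from now" },
      },
      required: ["text", "delaySeconds"],
    },
  },
  {
    name: "get_weather",
    description: "Get the current weather for a location",
    parameters: {
      type: "object",
      properties: { location: { type: "string" } },
      required: ["location"],
    },
  },
];
```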
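
Proactive reminders can be expressed with the Durable Object alarm API; the repo may use the Agents SDK scheduler instead, so treat this as a sketch. `synthesize` is the hypothetical TTS wrapper from the pipeline sketch, and `this.env` is assumed to hold the AI binding:

```ts
// Methods on the VoiceAgent class above.
async setReminder(text: string, delaySeconds: number): Promise<void> {
  await this.ctx.storage.put("reminder", text);
  await this.ctx.storage.setAlarm(Date.now() + delaySeconds * 1000);
}

// Durable Object alarm handler: fires on schedule even with no client attached.
async alarm(): Promise<void> {
  const text = (await this.ctx.storage.get<string>("reminder")) ?? "";
  const sockets = this.ctx.getWebSockets();
  if (sockets.length > 0) {
    const mp3 = await synthesize(this.env, `Reminder: ${text}`);
    for (const ws of sockets) ws.send(mp3);               // speak it live
  } else {
    this.saveMessage("assistant", `Reminder: ${text}`);   // save to history
  }
}
```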
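
Finally, the client setup with useVoiceAgent. The import path comes from this README; the option and return names below are guesses, not the hook's documented API:

```tsx
import { useVoiceAgent } from "agents/voice-react";

// A sketch of the ~10-line client setup; option/return names are illustrative.
export function VoiceButton() {
  const { isListening, start, stop } = useVoiceAgent({ agent: "voice-agent" });
  return (
    <button onClick={() => (isListening ? stop() : start())}>
      {isListening ? "Stop" : "Talk"}
    </button>
  );
}
```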