A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.
Uses Workers AI for all models — zero external API keys required:
- STT: Deepgram Nova 3 (`@cf/deepgram/nova-3`)
- TTS: Deepgram Aura (`@cf/deepgram/aura-1`)
- VAD: Pipecat Smart Turn v2 (`@cf/pipecat-ai/smart-turn-v2`)
- LLM: Kimi K2.5 (`@cf/moonshotai/kimi-k2.5`)
```sh
npm install
npm run dev
```

No API keys needed: all AI models run via the Workers AI binding.
```
Browser                                 Durable Object (VoiceAgent)

┌──────────┐   binary WS frames    ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │  JSON: end_of_speech  │ VAD (smart-turn-v2)      │
│          │ ────────────────────► │   ↓                      │
│          │                       │ STT (nova-3)             │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM (kimi-k2.5)          │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
                single WebSocket connection
```
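Because PCM audio and control messages share the single WebSocket, the server must branch on frame type: binary frames carry raw audio, text frames carry JSON events. A minimal sketch of that demultiplexing step (the `Frame` type and function name are illustrative assumptions, not the project's actual code):

```typescript
// Hypothetical demux for the single-WebSocket protocol described above:
// binary frames are raw 16-bit little-endian PCM, text frames are JSON
// control messages such as {"type":"end_of_speech"}.
type Frame =
  | { kind: "audio"; pcm: Int16Array }
  | { kind: "control"; event: string; payload: Record<string, unknown> };

function parseFrame(data: string | ArrayBuffer): Frame {
  if (typeof data === "string") {
    // Text frame: parse the JSON control message and surface its type
    const msg = JSON.parse(data) as { type: string } & Record<string, unknown>;
    return { kind: "control", event: msg.type, payload: msg };
  }
  // Binary frame: view the bytes as 16-bit PCM samples
  return { kind: "audio", pcm: new Int16Array(data) };
}
```

A Durable Object's `webSocketMessage` handler receives exactly this `string | ArrayBuffer` union, so a single branch like this can route audio into the buffer and events into the pipeline.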
- Browser captures mic audio via AudioWorklet, downsamples to 16kHz mono PCM
- PCM streams to the Agent over the existing WebSocket connection (binary frames)
- Client-side silence detection (500ms) triggers end-of-speech
- Server-side VAD (smart-turn-v2) confirms the user finished speaking
- Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS
- TTS audio streams back per-sentence as MP3 while the LLM is still generating
- Browser decodes and plays audio; user can interrupt at any time
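The capture step above can be sketched as follows. This is a simplified illustration, not the project's actual worklet code: the 48 kHz hardware rate, the function name, and the crude averaging filter are all assumptions (a real implementation would low-pass filter before decimating):

```typescript
// Hypothetical sketch: AudioWorklet delivers Float32 samples at the
// hardware rate (assumed 48 kHz); the wire format is 16 kHz mono
// 16-bit PCM. Each group of 3 input samples is averaged into one
// output sample as a crude anti-aliasing measure.
function downsampleTo16k(input: Float32Array, inputRate = 48000): Int16Array {
  const ratio = inputRate / 16000; // 3 for 48 kHz input
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Average the source samples that map onto this output sample
    let sum = 0;
    const start = Math.floor(i * ratio);
    for (let j = 0; j < ratio; j++) sum += input[start + j] ?? 0;
    const sample = sum / ratio;
    // Clamp to [-1, 1] and scale to the signed 16-bit range
    const clamped = Math.max(-1, Math.min(1, sample));
    out[i] = Math.round(clamped * 0x7fff);
  }
  return out;
}
```

The resulting `Int16Array.buffer` is what goes out as a binary WebSocket frame.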
- Streaming TTS — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated.
- Interruption handling — speak over the agent to cut it off mid-sentence. The client detects sustained speech during playback and aborts the server pipeline.
- Server-side VAD — `smart-turn-v2` validates end-of-speech after client silence detection, reducing false triggers on mid-sentence pauses.
- Conversation persistence — all messages are stored in SQLite and survive restarts. The agent remembers previous conversations.
- Agent tools — the LLM can call `get_current_time`, `set_reminder`, and `get_weather` during conversation.
- Proactive scheduling — reminders set via voice fire on schedule and are spoken to connected clients (or saved to history if disconnected).
- `useVoiceAgent` hook — the client uses the `agents/voice-react` hook, which encapsulates all audio infrastructure in ~10 lines of setup.
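The sentence-chunking step that makes streaming TTS possible can be sketched like this. It is shown as a synchronous generator for brevity (the real pipeline would consume an async LLM token stream), and the function name and regex are illustrative assumptions:

```typescript
// Hypothetical sketch of sentence chunking: LLM tokens accumulate in a
// buffer, and each completed sentence is yielded as soon as its
// terminating punctuation arrives, so TTS can synthesize sentence 1
// while the LLM is still generating sentence 2.
function* sentences(tokens: Iterable<string>): Generator<string> {
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    // Emit every complete sentence: terminal punctuation followed by whitespace
    let match: RegExpMatchArray | null;
    while ((match = buffer.match(/^(.*?[.!?])\s+/s)) !== null) {
      yield match[1];
      buffer = buffer.slice(match[0].length);
    }
  }
  // Flush whatever trails after the last terminator
  if (buffer.trim()) yield buffer.trim();
}
```

Feeding it a token stream like `["Hello ", "there. How", " are you?"]` yields `"Hello there."` before the second sentence has fully arrived, which is what lets playback begin mid-generation.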