Voice Agent

A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.

Uses Workers AI for all models — zero external API keys required:

  • STT: Deepgram Nova 3 (@cf/deepgram/nova-3)
  • TTS: Deepgram Aura (@cf/deepgram/aura-1)
  • VAD: Pipecat Smart Turn v2 (@cf/pipecat-ai/smart-turn-v2)
  • LLM: Kimi K2.5 (@cf/moonshotai/kimi-k2.5)
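
All four models are invoked through the same Workers AI binding. A minimal sketch of the LLM call (the `messages` shape is the standard Workers AI text-generation input; the audio models each take model-specific inputs):

```ts
// A sketch, not the repo's actual code. env.AI is the Workers AI binding.
const stream = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  messages: [
    { role: "system", content: "You are a helpful voice assistant." },
    { role: "user", content: "Set a reminder for five minutes from now." },
  ],
  stream: true, // stream tokens so TTS can start before generation finishes
});
```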

Run it

```sh
npm install
npm run dev
```

No API keys needed — all AI models run via the Workers AI binding.
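
Because everything runs on Workers AI, the only bindings the Worker needs are the AI binding and the Durable Object itself. A minimal `wrangler.jsonc` sketch (names are illustrative and may not match the repo's actual config):

```jsonc
{
  // Illustrative config; check the repo's wrangler file for the real values.
  "name": "voice-agent",
  "main": "src/server.ts",
  "compatibility_date": "2025-01-01",
  "ai": { "binding": "AI" },
  "durable_objects": {
    "bindings": [{ "name": "VoiceAgent", "class_name": "VoiceAgent" }]
  },
  "migrations": [{ "tag": "v1", "new_sqlite_classes": ["VoiceAgent"] }]
}
```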

How it works

```
Browser                          Durable Object (VoiceAgent)
┌──────────┐   binary WS frames   ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │   JSON: end_of_speech │ VAD (smart-turn-v2)      │
│          │ ────────────────────► │   ↓                      │
│          │                       │ STT (nova-3)             │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM (kimi-k2.5)          │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
              single WebSocket connection
```
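
The JSON control messages in the diagram are small. A sketch of their shapes (anything beyond `end_of_speech` and `transcript` is illustrative, not the repo's actual types):

```ts
// Control messages share the WebSocket with the audio. Binary frames carry
// 16kHz Int16 PCM upstream and MP3 downstream; JSON frames carry events.
type ClientEvent =
  | { type: "end_of_speech" }; // the client-side silence detector fired

type ServerEvent =
  | { type: "transcript"; text: string }; // STT result for the finished turn
```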
  1. Browser captures mic audio via AudioWorklet and downsamples it to 16kHz mono PCM (see the worklet sketch after this list)
  2. PCM streams to the Agent over the single WebSocket connection as binary frames
  3. Client-side silence detection (500ms) triggers end-of-speech
  4. Server-side VAD (smart-turn-v2) confirms the user finished speaking
  5. Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS (see the pipeline sketch after this list)
  6. TTS audio streams back per-sentence as MP3 while the LLM is still generating
  7. Browser decodes and plays audio; user can interrupt at any time
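
Step 1 in code, roughly: an AudioWorklet processor that converts the mic's Float32 samples to 16kHz Int16 PCM. This is a sketch using naive decimation; a real implementation would low-pass filter before downsampling to avoid aliasing:

```ts
// pcm-capture.worklet.ts: a sketch of step 1. Runs in the AudioWorklet
// scope, where `sampleRate` is the audio context's rate.
class PCMCaptureProcessor extends AudioWorkletProcessor {
  private readonly ratio = sampleRate / 16000; // e.g. 3 at a 48kHz context

  process(inputs: Float32Array[][]): boolean {
    const input = inputs[0]?.[0];
    if (!input) return true;
    const out = new Int16Array(Math.floor(input.length / this.ratio));
    for (let i = 0; i < out.length; i++) {
      // Pick every ratio-th sample and convert Float32 [-1, 1] to Int16.
      const s = Math.max(-1, Math.min(1, input[Math.floor(i * this.ratio)]));
      out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    // The main thread forwards each buffer as a binary WebSocket frame.
    this.port.postMessage(out.buffer, [out.buffer]);
    return true;
  }
}
registerProcessor("pcm-capture", PCMCaptureProcessor);
```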
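
And steps 4 through 6 on the server, sketched with hypothetical wrappers (`isTurnComplete`, `transcribe`, `chatStream`, `synthesize`, `sentences`) around `env.AI.run`, since each model has its own input and output schema:

```ts
// Env is the Worker's binding type; only the AI binding matters here.
interface Env { AI: { run(model: string, inputs: unknown): Promise<unknown> } }
type Msg = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical wrappers around env.AI.run, one per model:
declare function isTurnComplete(env: Env, pcm: Uint8Array): Promise<boolean>; // @cf/pipecat-ai/smart-turn-v2
declare function transcribe(env: Env, pcm: Uint8Array): Promise<string>;      // @cf/deepgram/nova-3
declare function chatStream(env: Env, msgs: Msg[]): AsyncIterable<string>;    // @cf/moonshotai/kimi-k2.5
declare function synthesize(env: Env, text: string): Promise<ArrayBuffer>;    // @cf/deepgram/aura-1
declare function sentences(t: AsyncIterable<string>): AsyncGenerator<string>; // see Features below

async function handleTurn(env: Env, ws: WebSocket, pcm: Uint8Array, history: Msg[]) {
  // 4. Server-side VAD confirms the user actually finished speaking.
  if (!(await isTurnComplete(env, pcm))) return;

  // 5. Transcribe, tell the client what was heard, then stream the LLM reply.
  const userText = await transcribe(env, pcm);
  ws.send(JSON.stringify({ type: "transcript", text: userText }));
  const tokens = chatStream(env, [...history, { role: "user", content: userText }]);

  // 6. Synthesize sentence by sentence so playback starts before generation ends.
  for await (const sentence of sentences(tokens)) {
    ws.send(await synthesize(env, sentence)); // binary MP3 frame to the client
  }
}
```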

Features

  • Streaming TTS — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated (see the chunking sketch below).
  • Interruption handling — speak over the agent to cut it off mid-sentence. The client detects sustained speech during playback and aborts the server pipeline (sketched below).
  • Server-side VAD — smart-turn-v2 validates end-of-speech after client-side silence detection, reducing false triggers on mid-sentence pauses.
  • Conversation persistence — all messages are stored in SQLite and survive restarts, so the agent remembers previous conversations (sketched below).
  • Agent tools — the LLM can call get_current_time, set_reminder, and get_weather during conversation (tool definitions sketched below).
  • Proactive scheduling — reminders set by voice fire on schedule and are spoken to connected clients, or saved to history if none are connected (sketched below).
  • useVoiceAgent hook — the client uses the agents/voice-react hook, which encapsulates all audio infrastructure in ~10 lines of setup (usage sketched below).
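
The sentence chunking behind streaming TTS can be as small as an async generator over the token stream. A sketch; the repo's actual splitting heuristics may be more careful about abbreviations and minimum chunk length:

```ts
// Split an LLM token stream on sentence-ending punctuation followed by
// whitespace, yielding complete sentences as soon as they appear.
async function* sentences(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  let buf = "";
  for await (const t of tokens) {
    buf += t;
    let m: RegExpMatchArray | null;
    while ((m = buf.match(/^([\s\S]*?[.!?])\s/))) {
      yield m[1].trim();            // emit a finished sentence for TTS
      buf = buf.slice(m[0].length); // keep the unfinished remainder
    }
  }
  if (buf.trim()) yield buf.trim(); // flush the trailing fragment
}
```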
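
Interruption handling maps naturally onto an AbortController held by the Durable Object. This sketch assumes the pipeline threads the signal through each stage; names are illustrative:

```ts
let current: AbortController | null = null;

// Called when the client reports sustained speech during playback.
function onInterrupt() {
  current?.abort(); // stop LLM + TTS mid-sentence
}

async function startTurn(run: (signal: AbortSignal) => Promise<void>) {
  current?.abort();            // a new user turn also cancels old playback
  current = new AbortController();
  try {
    await run(current.signal); // pipeline checks the signal between stages
  } catch (e) {
    if ((e as Error).name !== "AbortError") throw e; // swallow clean aborts
  }
}
```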
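
Conversation persistence maps directly onto the SQLite storage API of a SQLite-backed Durable Object. The schema here is a sketch, not the repo's actual tables:

```ts
export class VoiceAgent {
  constructor(private ctx: DurableObjectState) {
    // Create the history table once; survives restarts with the DO.
    ctx.storage.sql.exec(`
      CREATE TABLE IF NOT EXISTS messages (
        role TEXT NOT NULL,
        content TEXT NOT NULL,
        created_at INTEGER NOT NULL
      )`);
  }

  saveMessage(role: string, content: string) {
    this.ctx.storage.sql.exec(
      "INSERT INTO messages (role, content, created_at) VALUES (?, ?, ?)",
      role, content, Date.now(),
    );
  }

  // Replayed into the LLM context so the agent remembers past sessions.
  history(): { role: string; content: string }[] {
    return this.ctx.storage.sql
      .exec("SELECT role, content FROM messages ORDER BY created_at")
      .toArray() as { role: string; content: string }[];
  }
}
```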
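
The three tools, sketched in the JSON-schema shape that Workers AI function calling accepts; parameter names are illustrative:

```ts
const tools = [
  {
    name: "get_current_time",
    description: "Get the current date and time",
    parameters: { type: "object", properties: {}, required: [] },
  },
  {
    name: "set_reminder",
    description: "Schedule a spoken reminder",
    parameters: {
      type: "object",
      properties: {
        text: { type: "string", description: "What to say when it fires" },
        delaySeconds: { type: "number", description: "Seconds from now" },
      },
      required: ["text", "delaySeconds"],
    },
  },
  {
    name: "get_weather",
    description: "Get the current weather for a location",
    parameters: {
      type: "object",
      properties: { location: { type: "string" } },
      required: ["location"],
    },
  },
];
```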
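
Proactive reminders can be expressed with the Durable Object alarm API; the repo may use the Agents SDK scheduler instead, so treat this as a sketch. `synthesize` is the hypothetical TTS wrapper from the pipeline sketch, and `this.env` is assumed to hold the AI binding:

```ts
// Methods on the VoiceAgent class above.
async setReminder(text: string, delaySeconds: number): Promise<void> {
  await this.ctx.storage.put("reminder", text);
  await this.ctx.storage.setAlarm(Date.now() + delaySeconds * 1000);
}

// Durable Object alarm handler: fires on schedule even with no client attached.
async alarm(): Promise<void> {
  const text = (await this.ctx.storage.get<string>("reminder")) ?? "";
  const sockets = this.ctx.getWebSockets();
  if (sockets.length > 0) {
    const mp3 = await synthesize(this.env, `Reminder: ${text}`);
    for (const ws of sockets) ws.send(mp3);               // speak it live
  } else {
    this.saveMessage("assistant", `Reminder: ${text}`);   // save to history
  }
}
```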
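
Finally, the client setup with useVoiceAgent. The import path comes from this README; the option and return names below are guesses, not the hook's documented API:

```tsx
import { useVoiceAgent } from "agents/voice-react";

// A sketch of the ~10-line client setup; option/return names are illustrative.
export function VoiceButton() {
  const { isListening, start, stop } = useVoiceAgent({ agent: "voice-agent" });
  return (
    <button onClick={() => (isListening ? stop() : start())}>
      {isListening ? "Stop" : "Talk"}
    </button>
  );
}
```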