Self-hosted AI voice agent for DIDWW SIP trunks. Point a DIDWW phone number at your own server and callers talk to a Google Gemini Live agent — speech in, speech out, in real time.
This is a complete, working example of using DIDWW two-way SIP trunks and DIDs to build voice AI agents: SIP signalling, RTP media, codec negotiation, transcoding, and a realtime bridge to a multimodal LLM — all self-hosted, with no SaaS voice platform in the path.
Works out of the box. With a DIDWW trunk and a Gemini API key, a fresh clone answers real phone calls using a built-in demo agent — no database, no second service. The demo agent can even explain how it works while you're on the call. See docs/QUICKSTART.md.
    PSTN caller
        │  dials your DIDWW number
        ▼
    DIDWW two-way SIP trunk
        │  SIP INVITE + RTP media
        ▼
    ┌──────────────────────────────────────────────────┐
    │  your server                                     │
    │                                                  │
    │   drachtio ─────▶ agent.js ──── WebSocket ───▶  Gemini Live
    │  (SIP server)     (Node.js)                      (Google)
    │       │                           │
    │       └──▶ rtpengine ◀────────────┘
    │          (RTP media / transcoding)
    └──────────────────────────────────────────────────┘
- drachtio — SIP server, runs in Docker.
- rtpengine — RTP/SRTP media and transcoding, runs in Docker.
- Caddy — TLS reverse proxy with automatic Let's Encrypt certificates.
- Node.js agent — bridges RTP audio to the Gemini Live API over a WebSocket, and handles tools, transcripts and call control.
A plain G.711 inbound call is bridged by the Node agent directly; rtpengine is only needed for codec transcoding, outbound calls, conferences and WhatsApp. See docs/ARCHITECTURE.md.
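To make the "bridged directly" path concrete, here is a minimal sketch of the first step any such bridge has to perform: extracting the payload from an RTP packet (RFC 3550) and expanding G.711 mu-law bytes to 16-bit linear PCM. The function names `parseRtp`, `ulawToLinear` and `decodePcmu` are illustrative assumptions, not identifiers taken from agent.js, and the real agent may structure this differently.

```javascript
// Sketch only: function names are hypothetical, not taken from agent.js.

// Parse the fixed RTP header (RFC 3550) and return the payload slice.
function parseRtp(packet) {
  if (packet.length < 12) throw new Error('packet shorter than RTP header');
  const version = packet[0] >> 6;
  if (version !== 2) throw new Error('unsupported RTP version ' + version);
  const hasPadding = (packet[0] & 0x20) !== 0;
  const hasExtension = (packet[0] & 0x10) !== 0;
  const csrcCount = packet[0] & 0x0f;

  // Skip the 12-byte fixed header plus any CSRC identifiers.
  let offset = 12 + csrcCount * 4;
  if (hasExtension) {
    // Extension header: 16-bit profile id, then 16-bit length in 32-bit words.
    const words = packet.readUInt16BE(offset + 2);
    offset += 4 + words * 4;
  }
  // When the padding bit is set, the last byte holds the padding length.
  const end = hasPadding ? packet.length - packet[packet.length - 1] : packet.length;

  return {
    payloadType: packet[1] & 0x7f,
    marker: (packet[1] & 0x80) !== 0,
    sequence: packet.readUInt16BE(2),
    timestamp: packet.readUInt32BE(4),
    ssrc: packet.readUInt32BE(8),
    payload: packet.subarray(offset, end),
  };
}

// Expand one G.711 mu-law byte to a signed 16-bit linear sample.
function ulawToLinear(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return (u & 0x80) ? -magnitude : magnitude;
}

// Decode a whole PCMU payload (one byte per sample) to PCM16.
function decodePcmu(payload) {
  const pcm = new Int16Array(payload.length);
  for (let i = 0; i < payload.length; i++) pcm[i] = ulawToLinear(payload[i]);
  return pcm;
}
```

A 20 ms PCMU packet carries 160 payload bytes and therefore decodes to 160 samples at 8 kHz. The decoded audio would still need resampling to whatever rate the Gemini Live API expects before it is sent over the WebSocket.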
- Inbound voice agent — Gemini Live, natural barge-in, tool calling, and a live transcript of the call.
- Built-in demo agent — answers calls with zero external services; edit one file (server/demo-config.js) to make it your own.
- Outbound calling — place PSTN calls out through the trunk.
- 3-leg conferences — bridge two phones, with an optional AI co-listener.
- WhatsApp Business Calling bridge — optional, advanced.
- Pluggable per-caller config — connect your own prompt/tool service for dynamic, per-caller agents.
- Full host provisioning — reproduce the whole server from scratch on a fresh Ubuntu 24.04 host.
- Codec support — G.711, G.722, L16 out of the box; EVS / AMR-WB / AMR-NB via an optional custom rtpengine image.
- Diagnostics — a SIP echo test and a call-forward utility.
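The codec list above implies a negotiation step when an INVITE arrives. The following is a rough sketch of how a codec could be picked from an SDP offer; it is not the project's actual logic. The function name `chooseCodec` and the default preference order are assumptions. The static payload type numbers (0, 8, 9) come from RFC 3551.

```javascript
// Sketch only: chooseCodec and its preference order are hypothetical.

// Static payload types from RFC 3551. G.722 is advertised with an 8000 Hz
// RTP clock even though it samples audio at 16 kHz.
const STATIC_TYPES = { 0: 'PCMU/8000', 8: 'PCMA/8000', 9: 'G722/8000' };

function chooseCodec(sdp, preferred = ['G722/8000', 'PCMU/8000', 'PCMA/8000']) {
  const lines = sdp.split(/\r?\n/);

  // m=audio <port> <proto> <pt> <pt> ...
  const media = lines.find((line) => line.startsWith('m=audio '));
  if (!media) return null;
  const offered = media.trim().split(/\s+/).slice(3).map(Number);

  // Map payload type numbers to "name/clock", static types first,
  // then anything declared through a=rtpmap (needed for dynamic types).
  // Simplification: rtpmap lines from other media sections are not filtered out.
  const names = new Map();
  for (const pt of offered) {
    if (STATIC_TYPES[pt]) names.set(pt, STATIC_TYPES[pt]);
  }
  for (const line of lines) {
    const match = /^a=rtpmap:(\d+)\s+([^\s/]+\/\d+)/.exec(line);
    if (match) names.set(Number(match[1]), match[2]);
  }

  // Walk our preference list and return the first codec the peer offered.
  for (const wanted of preferred) {
    for (const pt of offered) {
      const name = names.get(pt);
      if (name && name.toUpperCase() === wanted.toUpperCase()) {
        return { payloadType: pt, codec: name };
      }
    }
  }
  return null; // nothing in common: a transcoder or a SIP 488 would be needed
}
```

Returning `null` is the point where a deployment like this one would hand the call to rtpengine for transcoding rather than bridging it directly.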
- Get a server (Ubuntu 24.04, public IPv4), a DIDWW DID + two-way SIP trunk, and a Google Gemini API key.
- Provision the host and deploy the agent — provision/ has the scripts.
- Fill in .env (only five values are needed for the demo).
- Point your DIDWW number at the server and call it.
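Because a missing .env value is a common reason a first call fails, a startup check along these lines can help. This is a sketch under stated assumptions: the variable names below are placeholders invented for illustration, and the helper `missingEnv` is hypothetical. The real names are in .env.example and docs/CONFIGURATION.md.

```javascript
// Sketch only: the names in EXAMPLE_REQUIRED are placeholders, not the
// project's real variables. Take the real list from .env.example.
const EXAMPLE_REQUIRED = ['EXAMPLE_PUBLIC_IP', 'EXAMPLE_API_KEY'];

// Return the names that are unset or blank in the given environment object.
function missingEnv(names, env = process.env) {
  return names.filter((name) => !env[name] || env[name].trim() === '');
}

// Possible use at startup:
//   const missing = missingEnv(EXAMPLE_REQUIRED);
//   if (missing.length > 0) {
//     console.error('Missing configuration: ' + missing.join(', '));
//     process.exit(1);
//   }
```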
The step-by-step walkthrough is in docs/QUICKSTART.md; the DIDWW side is in docs/DIDWW-SETUP.md.
    server/                 Node.js application
      agent.js              inbound agent + outbound + conferences + control API
      webhook.js            WhatsApp / WABA media bridge
      demo-config.js        built-in demo agent (used when no config service is set)
      echo-test.js          RTP echo reflector — trunk smoke test
      call-forward.js       SIP B2BUA call forwarder
      g722.js               G.722 codec
      rtpengine.js          rtpengine NG-protocol client
      trunk-register.js     SIP REGISTER keep-alive (registration trunks only)
    provision/              host provisioning for a fresh Ubuntu 24.04 server
      10-base.sh · 20-harden.sh · 30-containers.sh · 40-ufw.sh
      Caddyfile · drachtio.conf.xml · *.service · rtpengine-evs/Dockerfile
    tools/                  diagnostics (rtpengine bridge probe)
    docs/                   documentation
    .env.example            configuration template — copy to .env
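The layout notes that demo-config.js is used when no config service is set. One way such a fallback could work is sketched below. Everything specific here is an assumption made for illustration: the function name `resolveAgentConfig`, the `caller` query parameter, the JSON response shape, and the choice to fall back to the demo config when the service fails. docs/ADVANCED.md describes the real contract.

```javascript
// Sketch only: the service contract (query parameter, JSON body) is hypothetical.
async function resolveAgentConfig(callerNumber, options) {
  const {
    serviceUrl,                    // e.g. read from an environment variable
    demoConfig,                    // the built-in fallback object
    fetchImpl = globalThis.fetch,  // injectable for testing; global in Node 18+
    timeoutMs = 2000,
  } = options;

  // No service configured: behave like the out-of-the-box demo.
  if (!serviceUrl) return demoConfig;

  try {
    const url = new URL(serviceUrl);
    url.searchParams.set('caller', callerNumber);
    const response = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    if (!response.ok) throw new Error('config service returned ' + response.status);
    // Per-caller values override the demo defaults key by key.
    return { ...demoConfig, ...(await response.json()) };
  } catch (err) {
    // Assumed policy: a slow or failing service should not drop the call.
    return demoConfig;
  }
}
```

The short timeout matters because the lookup sits on the call-setup path: the caller hears silence until the configuration is resolved.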
- A server — Ubuntu 24.04 with a public IPv4. Around 1 vCPU / 1 GB RAM handles a handful of concurrent calls.
- A DIDWW account — at least one DID and a two-way SIP trunk. See docs/DIDWW-SETUP.md.
- A Google Gemini API key — from Google AI Studio.
- (Optional) a Deepgram API key for higher-quality live transcription.
| Doc | What it covers |
|---|---|
| QUICKSTART.md | Fastest path to a talking agent |
| DIDWW-SETUP.md | Buy a DID, create a two-way trunk, wire it up |
| ARCHITECTURE.md | How SIP, RTP and Gemini fit together |
| CONFIGURATION.md | Every environment variable |
| DEPLOYMENT.md | Full production deployment reference |
| ADVANCED.md | Outbound, conferences, WhatsApp, custom config service |
This stack terminates SIP and RTP from the public internet — telephony is a
target for toll fraud and abuse. The provisioning scripts firewall SIP/RTP to
your carrier's IP ranges, and the agent enforces per-call and concurrency caps.
Read SECURITY.md before exposing it, and generate fresh
secrets — the values in .env.example and provision/drachtio.conf.xml are
placeholders.
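The per-call and concurrency caps mentioned above can be pictured with a small limiter like the one below. This is an illustrative sketch, not the agent's actual enforcement code: the class name `CallLimiter`, its options and its method names are all hypothetical.

```javascript
// Sketch only: CallLimiter is a hypothetical helper, not part of agent.js.
class CallLimiter {
  constructor({ maxConcurrent, maxDurationMs }) {
    this.maxConcurrent = maxConcurrent;
    this.maxDurationMs = maxDurationMs;
    this.active = new Map(); // callId -> duration timer
  }

  // Try to admit a call. Returns false when the concurrency cap is reached,
  // in which case the caller would reject the INVITE (for example with 503).
  admit(callId, onTimeout) {
    if (this.active.has(callId) || this.active.size >= this.maxConcurrent) {
      return false;
    }
    // Per-call cap: force a hangup callback once the call runs too long.
    const timer = setTimeout(() => {
      this.active.delete(callId);
      onTimeout(callId);
    }, this.maxDurationMs);
    timer.unref(); // do not keep the process alive just for this timer
    this.active.set(callId, timer);
    return true;
  }

  // Call on BYE / hangup so the slot is freed and the timer is cancelled.
  release(callId) {
    const timer = this.active.get(callId);
    if (timer === undefined) return;
    clearTimeout(timer);
    this.active.delete(callId);
  }
}
```

Capping both the number of simultaneous calls and the length of each one bounds the worst-case cost if a trunk credential or firewall rule is ever bypassed, which is the core of toll-fraud damage control.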
MIT — see LICENSE. See NOTICE for third-party components (drachtio, rtpengine, Caddy) and an important note on patent-encumbered codecs (EVS / AMR). Not affiliated with or endorsed by DIDWW, Google or Meta.