A production-ready Pipecat voice-agent server: Sarvam STT → Azure OpenAI LLM → Sarvam TTS. Speaks 11 Indic languages + Indian English natively.
# 60 seconds to a working Hindi voice bot:
git clone https://github.com/dpkdhingra91/pipecat-sarvam-azure-starter
cd pipecat-sarvam-azure-starter
cp .env.example .env && vi .env # fill SARVAM_API_KEY + AZURE_OPENAI_*
docker compose up
# Open http://localhost:7860 → click Start → speak Hindi.That's it. No JSON parsing, no protocol code, no torch download.
Sarvam's Indic STT/TTS is the best you can get for Hindi, Tamil, Telugu, Bengali, etc. — natural-sounding, low-latency, code-switching-aware. Azure OpenAI gives you GPT-4o-class LLMs with a relaxed RAI policy you can configure for spontaneous speech. Pipecat ties them together.
But: nobody has open-sourced this combination as a working starter. The Pipecat docs show Daily/Cartesia. Sarvam's docs show curl examples. You spend 2-3 days wiring up /connect handshake, turn-gate state machine, Azure timeout config, RTVI event routing, audio sample-rate alignment.
This repo is that work, finished and sanitized. Clone, drop in your API keys, talk to a Hindi voice bot in 60 seconds.
Browser FastAPI Pipecat pipeline
┌─────────┐ POST ┌─────────────┐ ┌──────────────────────────────┐
│ index │ /connect ──▶│ /connect │ │ ws_transport.input() │
│ .html │ │ /store- │ │ ↓ │
│ │ │ context │ │ Sarvam STT (16 kHz mono PCM) │
│ @pipecat│ │ /health │ │ ↓ │
│ /client │ WSS ───────▶│ /ws │ ────▶ │ user_aggregator │
│ -js │ └─────────────┘ │ ↓ │
│ │ │ ResilientAzureLLMService │
│ │ ◀───── 24 kHz PCM audio ──────────│ ↓ (12s timeout, 4 retries) │
│ │ ◀───── RTVI envelopes (JSON) ─────│ Sarvam TTS (24 kHz mono PCM) │
└─────────┘ │ ↓ │
│ ws_transport.output() │
│ ↓ │
│ BotSpeakingObserver ─▶ Gate │
└──────────────────────────────┘
A separate TurnGate state machine ensures the user can only speak when:
(a) the client says its mic is ready, and
(b) the client says the bot's audio has finished playing, and
(c) we're not currently in a bot turn.
No echo cancellation problems. No "user interrupts bot before bot finishes" race.
| Service | Where | Notes |
|---|---|---|
| Sarvam | https://dashboard.sarvam.ai → API Keys | Free tier covers a few hours of conversation |
| Azure OpenAI | Azure portal → OpenAI resource → Keys + Endpoint | Need a deployment name (e.g. gpt-4o-mini) |
cp .env.example .env
# Edit .env — set SARVAM_API_KEY, AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT.# Docker (recommended)
docker compose up
# Or directly with Python
pip install -e .
python -m server.mainOpen http://localhost:7860 — the included browser client picks a Hindi tutor persona by default. Click Start, allow mic access, talk.
curl -s -X POST http://localhost:7860/connect \
-H 'Content-Type: application/json' \
-d @examples/english-support.json | jq| File | Persona | Language | What it does |
|---|---|---|---|
hindi-tutor.json |
tutor | 🇮🇳 Hindi | Conversational Hindi practice — corrects gently by modeling correct phrasing |
english-support.json |
support | 🇮🇳 Indian English | SaaS customer-support agent with ticket-logging fallback |
tamil-journal.json |
journal | 🇮🇳 Tamil | Voice journal companion — listens warmly, asks gentle follow-ups |
Add your own persona in 3 lines of code — see examples/README.md.
| Component | What | Source |
|---|---|---|
| Pipecat pipeline | Sarvam STT → Azure LLM → Sarvam TTS, with turn-gate orchestration | server/bot.py |
| ResilientAzureLLMService | Azure OpenAI client with bounded timeout + 4 retries | server/bot.py |
| TurnGate state machine | Server-authoritative turn control — no echo, no race conditions | server/bot.py |
| BotSpeakingObserver | Drives the turn gate from Pipecat's frame stream | server/bot.py |
| FastAPI server | /connect, /store-context, /ws, /health endpoints |
server/main.py |
| Voice mapping | 11 Indic languages + Indian English, per-language voice override via env | server/voice_config.py |
| System prompt builder | 3 example personas (tutor, support, journal) + extension point | server/system_prompts.py |
| Browser client | Plain HTML + ES modules, uses @pipecat-ai/client-js from CDN |
client/index.html |
| Reconnect support | POST /store-context → resume conversation on fresh WebSocket |
server/main.py |
| Docker + compose | One-command deploy | Dockerfile |
Out of the box: English (Indian), Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, Odia.
Per-language voice override via TTS_VOICE_<CODE> env var. The default voice is shubh (Sarvam's friendly default v3 speaker) — swap to anushka, karun, etc. for variety.
TTS_VOICE_HI=anushka
TTS_VOICE_TA=karunIf you want to swap providers later:
| Component | Default | Swap candidates | Code touch |
|---|---|---|---|
| STT | Sarvam saaras:v3 | Deepgram, Google, Whisper, AssemblyAI, Azure Speech | One line in bot.py |
| LLM | Azure OpenAI | OpenAI, Anthropic, Google Gemini, Together, vLLM | One line + import |
| TTS | Sarvam bulbul:v3 | Cartesia, ElevenLabs, Azure Speech, OpenAI TTS | One line in bot.py |
| Transport | FastAPI WebSocket | Daily WebRTC, LiveKit, Twilio | Substantial — different framework |
Pipecat has services for all of these. The Sarvam + Azure combo is what's wired here because that's what's missing from the OSS ecosystem.
| Failure mode | What this repo does |
|---|---|
| Azure OpenAI stalls (>12s) | First-token timeout fires, SDK retries up to 4x with exponential backoff |
| Azure RAI false-positive content filter | Drop in pipecat-content-filter-fallback — 1 line |
| Sarvam TTS silently returns no audio | Drop in pipecat-outbound-audio-counter — logs [tts_silent_fail] |
| Client WS drops mid-conversation | POST /store-context saves messages, reconnect via /connect?sid=<hex> resumes |
| User starts speaking over the bot | TurnGate keeps user lane closed until client confirms audio drained |
Client never sends client:playback_drained |
Fallback timer (8s default) force-opens the gate |
All optional, all have sane defaults:
# Required
SARVAM_API_KEY=
AZURE_OPENAI_KEY=
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
# Server
PORT=7860
LOG_LEVEL=INFO
CORS_ORIGINS=*
# Azure LLM resilience
LLM_REQUEST_TIMEOUT_S=12
LLM_CONNECT_TIMEOUT_S=3
LLM_MAX_RETRIES=4
# Turn-gate timing
POST_PLAYBACK_GRACE_MS=150
POST_PLAYBACK_FALLBACK_S=8.0
# Sarvam model overrides
SARVAM_STT_MODEL=saaras:v3
SARVAM_TTS_MODEL=bulbul:v3
# Per-language voice overrides
TTS_VOICE_HI=anushka
TTS_VOICE_TA=karun
# ... TTS_VOICE_TE, TTS_VOICE_KN, etc.| Option | Setup time | Indic-language support | Vendor lock-in | Customization |
|---|---|---|---|---|
| This repo | 60 sec to docker run | ★★★ Native, all 11 Indic langs | None (swap providers in 1 line) | Full source, MIT |
| Vapi | 5 min to dashboard | ★ English mostly, limited Indic | Heavy — closed source, hosted only | Limited (templates only) |
| Retell AI | 5 min to dashboard | ★ English mostly | Heavy — closed source, hosted only | Limited |
| Raw Pipecat + write your own | 2–3 days | Depends on your STT/TTS choice | None | Full |
| LiveKit Agents | 30 min | ★★ Via Cartesia/Deepgram | Low | Full but bigger framework |
The closed-source platforms (Vapi, Retell) are great if you want a hosted dashboard and don't care about source. This repo is for the case where you're going to deploy this on your own VM with your own keys and your own data flow — and you want to read every line of pipeline code.
I extracted a few specialized pieces of this codebase into their own repos. Mix and match:
- 🔌
voice-agent-qa— Python client to drive Pipecat servers programmatically. Use it for nightly smoke tests of this starter. - 🛡️
pipecat-content-filter-fallback— Catches Azure OpenAI RAI false positives and replaces them with a fallback turn. - 💾
pipecat-transcript-checkpoint— Per-turn transcript persistence. Survives dropped calls. - 📡
pipecat-ws-protocol-docs— Independent reference for the Pipecat WebSocket protocol (for client implementers in Python/Go/Rust). - 🎙️
pipecat-bot-speaking-observer— Standalone version of the turn-gate observer. - 📊
pipecat-outbound-audio-counter— TTS silent-failure detector.
- Daily WebRTC transport variant (for browser-without-ws-proxy deploys)
- Per-turn cost telemetry (token + audio second accounting)
- Optional Whisper post-pass for higher-quality transcripts
- Test suite via
voice-agent-qa - Cartesia + OpenAI variant (
pipecat-cartesia-openai-starter)
Star + watch if any of these matter to you — PRs welcome.
MIT — see LICENSE.
- Pipecat — the framework that makes all of this possible.
- Sarvam AI — for the only Indic STT/TTS that actually sounds natural.
- Built and battle-tested in production at AI Interview Agents.