Skip to content

dpkdhingra91/pipecat-sarvam-azure-starter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pipecat-sarvam-azure-starter

A production-ready Pipecat voice-agent server: Sarvam STT → Azure OpenAI LLM → Sarvam TTS. Speaks 11 Indic languages + Indian English natively.

License: MIT Python 3.10+ Pipecat 0.0.50+ Docker PRs welcome

# 60 seconds to a working Hindi voice bot:
git clone https://github.com/dpkdhingra91/pipecat-sarvam-azure-starter
cd pipecat-sarvam-azure-starter
cp .env.example .env  &&  vi .env   # fill SARVAM_API_KEY + AZURE_OPENAI_*
docker compose up

# Open http://localhost:7860 → click Start → speak Hindi.

That's it. No JSON parsing, no protocol code, no torch download.


Why this exists

Sarvam's Indic STT/TTS is the best you can get for Hindi, Tamil, Telugu, Bengali, etc. — natural-sounding, low-latency, code-switching-aware. Azure OpenAI gives you GPT-4o-class LLMs with a relaxed RAI policy you can configure for spontaneous speech. Pipecat ties them together.

But: nobody has open-sourced this combination as a working starter. The Pipecat docs show Daily/Cartesia. Sarvam's docs show curl examples. You spend 2-3 days wiring up /connect handshake, turn-gate state machine, Azure timeout config, RTVI event routing, audio sample-rate alignment.

This repo is that work, finished and sanitized. Clone, drop in your API keys, talk to a Hindi voice bot in 60 seconds.


Architecture

  Browser                 FastAPI                Pipecat pipeline
  ┌─────────┐  POST       ┌─────────────┐       ┌──────────────────────────────┐
  │ index   │ /connect ──▶│ /connect    │       │ ws_transport.input()         │
  │ .html   │             │ /store-     │       │   ↓                          │
  │         │             │  context    │       │ Sarvam STT (16 kHz mono PCM) │
  │ @pipecat│             │ /health     │       │   ↓                          │
  │ /client │ WSS ───────▶│ /ws         │ ────▶ │ user_aggregator              │
  │ -js     │             └─────────────┘       │   ↓                          │
  │         │                                   │ ResilientAzureLLMService     │
  │         │ ◀───── 24 kHz PCM audio ──────────│   ↓ (12s timeout, 4 retries) │
  │         │ ◀───── RTVI envelopes (JSON) ─────│ Sarvam TTS (24 kHz mono PCM) │
  └─────────┘                                   │   ↓                          │
                                                │ ws_transport.output()        │
                                                │   ↓                          │
                                                │ BotSpeakingObserver ─▶ Gate  │
                                                └──────────────────────────────┘

A separate TurnGate state machine ensures the user can only speak when: (a) the client says its mic is ready, and (b) the client says the bot's audio has finished playing, and (c) we're not currently in a bot turn.

No echo cancellation problems. No "user interrupts bot before bot finishes" race.


Quickstart

1. Get API keys

Service Where Notes
Sarvam https://dashboard.sarvam.ai → API Keys Free tier covers a few hours of conversation
Azure OpenAI Azure portal → OpenAI resource → Keys + Endpoint Need a deployment name (e.g. gpt-4o-mini)

2. Configure

cp .env.example .env
# Edit .env — set SARVAM_API_KEY, AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT.

3. Run

# Docker (recommended)
docker compose up

# Or directly with Python
pip install -e .
python -m server.main

Open http://localhost:7860 — the included browser client picks a Hindi tutor persona by default. Click Start, allow mic access, talk.

4. Try the other examples

curl -s -X POST http://localhost:7860/connect \
  -H 'Content-Type: application/json' \
  -d @examples/english-support.json | jq

Examples

File Persona Language What it does
hindi-tutor.json tutor 🇮🇳 Hindi Conversational Hindi practice — corrects gently by modeling correct phrasing
english-support.json support 🇮🇳 Indian English SaaS customer-support agent with ticket-logging fallback
tamil-journal.json journal 🇮🇳 Tamil Voice journal companion — listens warmly, asks gentle follow-ups

Add your own persona in 3 lines of code — see examples/README.md.


What's included

Component What Source
Pipecat pipeline Sarvam STT → Azure LLM → Sarvam TTS, with turn-gate orchestration server/bot.py
ResilientAzureLLMService Azure OpenAI client with bounded timeout + 4 retries server/bot.py
TurnGate state machine Server-authoritative turn control — no echo, no race conditions server/bot.py
BotSpeakingObserver Drives the turn gate from Pipecat's frame stream server/bot.py
FastAPI server /connect, /store-context, /ws, /health endpoints server/main.py
Voice mapping 11 Indic languages + Indian English, per-language voice override via env server/voice_config.py
System prompt builder 3 example personas (tutor, support, journal) + extension point server/system_prompts.py
Browser client Plain HTML + ES modules, uses @pipecat-ai/client-js from CDN client/index.html
Reconnect support POST /store-context → resume conversation on fresh WebSocket server/main.py
Docker + compose One-command deploy Dockerfile

Languages supported

Out of the box: English (Indian), Hindi, Tamil, Telugu, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, Odia.

Per-language voice override via TTS_VOICE_<CODE> env var. The default voice is shubh (Sarvam's friendly default v3 speaker) — swap to anushka, karun, etc. for variety.

TTS_VOICE_HI=anushka
TTS_VOICE_TA=karun

Service matrix

If you want to swap providers later:

Component Default Swap candidates Code touch
STT Sarvam saaras:v3 Deepgram, Google, Whisper, AssemblyAI, Azure Speech One line in bot.py
LLM Azure OpenAI OpenAI, Anthropic, Google Gemini, Together, vLLM One line + import
TTS Sarvam bulbul:v3 Cartesia, ElevenLabs, Azure Speech, OpenAI TTS One line in bot.py
Transport FastAPI WebSocket Daily WebRTC, LiveKit, Twilio Substantial — different framework

Pipecat has services for all of these. The Sarvam + Azure combo is what's wired here because that's what's missing from the OSS ecosystem.


Resilience built in

Failure mode What this repo does
Azure OpenAI stalls (>12s) First-token timeout fires, SDK retries up to 4x with exponential backoff
Azure RAI false-positive content filter Drop in pipecat-content-filter-fallback — 1 line
Sarvam TTS silently returns no audio Drop in pipecat-outbound-audio-counter — logs [tts_silent_fail]
Client WS drops mid-conversation POST /store-context saves messages, reconnect via /connect?sid=<hex> resumes
User starts speaking over the bot TurnGate keeps user lane closed until client confirms audio drained
Client never sends client:playback_drained Fallback timer (8s default) force-opens the gate

Configuration reference

All optional, all have sane defaults:

# Required
SARVAM_API_KEY=
AZURE_OPENAI_KEY=
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# Server
PORT=7860
LOG_LEVEL=INFO
CORS_ORIGINS=*

# Azure LLM resilience
LLM_REQUEST_TIMEOUT_S=12
LLM_CONNECT_TIMEOUT_S=3
LLM_MAX_RETRIES=4

# Turn-gate timing
POST_PLAYBACK_GRACE_MS=150
POST_PLAYBACK_FALLBACK_S=8.0

# Sarvam model overrides
SARVAM_STT_MODEL=saaras:v3
SARVAM_TTS_MODEL=bulbul:v3

# Per-language voice overrides
TTS_VOICE_HI=anushka
TTS_VOICE_TA=karun
# ... TTS_VOICE_TE, TTS_VOICE_KN, etc.

Why this vs. alternatives

Option Setup time Indic-language support Vendor lock-in Customization
This repo 60 sec to docker run ★★★ Native, all 11 Indic langs None (swap providers in 1 line) Full source, MIT
Vapi 5 min to dashboard ★ English mostly, limited Indic Heavy — closed source, hosted only Limited (templates only)
Retell AI 5 min to dashboard ★ English mostly Heavy — closed source, hosted only Limited
Raw Pipecat + write your own 2–3 days Depends on your STT/TTS choice None Full
LiveKit Agents 30 min ★★ Via Cartesia/Deepgram Low Full but bigger framework

The closed-source platforms (Vapi, Retell) are great if you want a hosted dashboard and don't care about source. This repo is for the case where you're going to deploy this on your own VM with your own keys and your own data flow — and you want to read every line of pipeline code.


Related projects

I extracted a few specialized pieces of this codebase into their own repos. Mix and match:


Roadmap

  • Daily WebRTC transport variant (for browser-without-ws-proxy deploys)
  • Per-turn cost telemetry (token + audio second accounting)
  • Optional Whisper post-pass for higher-quality transcripts
  • Test suite via voice-agent-qa
  • Cartesia + OpenAI variant (pipecat-cartesia-openai-starter)

Star + watch if any of these matter to you — PRs welcome.


License

MIT — see LICENSE.

Credits

  • Pipecat — the framework that makes all of this possible.
  • Sarvam AI — for the only Indic STT/TTS that actually sounds natural.
  • Built and battle-tested in production at AI Interview Agents.

About

Production-ready Pipecat voice-agent server: Sarvam STT + Azure OpenAI + Sarvam TTS. Indic-language native. 60 sec to docker run.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors