An AI-powered voice calling agent for e-commerce that helps customers find products through natural phone conversations. Built with Twilio, ElevenLabs, Groq LLM, and ChromaDB.
This project is a Voice-based AI Shopping Assistant that allows customers to call a phone number and shop for products using natural conversation. Here's how we built it step-by-step:
I sourced my product catalog from the Walmart Products Dataset:
- 📦 Dataset: Walmart Dataset Samples
- Contains product names, descriptions, prices, categories, and more
- Raw CSV file stored in
DataCleaning/walmart-products.csv
The raw data needed cleaning before it could be used effectively:
- Removed duplicate products and null values
- Normalized price formats (removed
$symbols, converted to float) - Extracted brand names from product titles
- Categorized products (Laptops, Smartphones, etc.)
- Output:
DataCleaning/cleaned_data.csv
Converted cleaned data into searchable embeddings:
- Used HuggingFace Embeddings (
all-MiniLM-L6-v2) to create vector representations - Stored in ChromaDB with both content and metadata
- Page Content: Product name + description (for semantic search)
- Metadata: Price, Brand, Category (for filtering)
This allows us to search products semantically ("show me gaming laptops") while also filtering by budget, brand, or category.
Built a hybrid retriever that merges semantic dense embeddings with BM25 sparse keyword searches using Reciprocal Rank Fusion (RRF):
- User says "Samsung phone under 20000"
- System extracts:
brand=Samsung,category=Smartphone,budget=20000 - Hybrid query: Dense vectors + BM25 index +
$andfilters - Automatically caches query embeddings and handles case-insensitive metadata variations.
Implemented two types of memory for natural conversations:
- Conversation Memory: Remembers last 10 exchanges in the call
- User Preferences: Tracks budget, brand, and category mentioned by user
Connected everything to a fast LLM for response generation:
- Uses Groq with
llama-3.1-8b-instant(super fast - 500+ tokens/sec) - Takes: Retrieved products + Conversation history + User preferences + Current query
- Generates: Short, voice-friendly responses (1-2 sentences)
Final integration for multi-modal interaction:
- Twilio Voice: Handles incoming calls and speech-to-text
- ElevenLabs: Converts AI responses to natural human-like voice
- Chat Web UI: A sleek, dark-themed frontend (
static/index.html) using the new/api/chatendpoint. - FastAPI: Backend server that orchestrates Twilio webhooks, LLM queries, and the frontend server.
Result: Customer calls or types in the web interface → interacts naturally → Gets AI response synthesized back seamlessly! 🛒
┌─────────────────────────────────────────────────────────────────────┐
│ VOICE E-COMMERCE AGENT │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 📞 TWILIO VOICE │
│ ├── Incoming call handling │
│ ├── Speech-to-Text (built-in) │
│ └── Webhook endpoints │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🧠 PROCESSING PIPELINE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Retriever │→ │ Memory │→ │ LLM Engine │→ │ Response │ │
│ │ (ChromaDB) │ │ (Session) │ │ (Groq) │ │ Generation │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 🎙 ELEVENLABS TTS │
│ ├── Natural voice synthesis │
│ ├── Turbo model (low latency) │
│ └── Audio streaming to Twilio │
└─────────────────────────────────────────────────────────────────────┘
- 📞 Phone-based shopping - Customers call and shop via voice
- 🎯 Smart product search - Semantic search with filters (brand, category, budget)
- 🧠 Conversation memory - Remembers context within the call
- 👤 User preferences - Tracks budget, brand, category preferences
- 🗣️ Natural voice - ElevenLabs for human-like responses
- ⚡ Low latency - Groq LLM (500+ tokens/sec) + ElevenLabs Turbo
| Component | Technology | Purpose |
|---|---|---|
| Voice Gateway | Twilio Voice | Phone calls, STT |
| Text-to-Speech | ElevenLabs | Natural voice synthesis |
| LLM | Groq (Llama 3.1 8B) | Response generation |
| Vector DB | ChromaDB | Product embeddings & search |
| Embeddings | HuggingFace (all-MiniLM-L6-v2) | Text embeddings |
| Backend | FastAPI | API server |
| Data Processing | Pandas | Data cleaning |
git clone https://github.com/yourusername/voice_ecomm.git
cd voice_ecomm
# Using uv (recommended)
uv sync
# Or using pip
pip install -r requirements.txtCreate .env file:
# LLM
GROQ_API_KEY=your_groq_api_key
# Voice
ELEVEN_LABS=your_elevenlabs_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key # Required for streaming voice
# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
# Twilio (voice calls + browser voice)
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_auth_token
TWILIO_API_KEY_SID=your_twilio_api_key_sid
TWILIO_API_KEY_SECRET=your_twilio_api_key_secret
TWILIO_TWIML_APP_SID=your_twiml_app_sid
# Public base URL for Twilio callbacks / audio playback (ngrok or HTTPS domain)
PUBLIC_BASE_URL=https://your-ngrok-url.ngrok.io# Run the data cleaning notebook
jupyter notebook DataCleaning/file.ipynbThe cleaning process:
- Removes duplicates and null values
- Normalizes price formats
- Extracts brand names
- Categorizes products
- Outputs
cleaned_data.csv
python Ingestion/chroma.pyThis creates vector embeddings with metadata:
product_name- Product titledescription- Product descriptionprice- Numeric price (for filtering)brand- Brand name (for filtering)category- Product category (for filtering)
# Start FastAPI server
python app.py# In another terminal
ngrok http 8000- Go to Twilio Console
- Buy a phone number (or use trial)
- Configure Voice webhook:
- URL:
https://your-ngrok-url.ngrok.io/webhooks/voice/incoming - Method: POST
- URL:
- Save and call the number!
If DEEPGRAM_API_KEY is set, incoming calls use Twilio Media Streams with real-time STT + streaming TTS:
- Ensure your public URL supports wss:// (ngrok works).
- Keep the Voice webhook pointing to:
https://your-ngrok-url.ngrok.io/webhooks/voice/incoming
- Twilio will open a WebSocket to:
wss://your-ngrok-url.ngrok.io/ws/twilio
The web voice button now uses your browser microphone + Deepgram streaming STT + ElevenLabs streaming TTS (no Twilio SDK). It requires:
DEEPGRAM_API_KEYPUBLIC_BASE_URL(for audio URLs and Twilio callback parity)
Open the homepage and click Start live voice.
- Create a TwiML App in Twilio Console.
- Set the TwiML App Voice Request URL to:
https://your-ngrok-url.ngrok.io/webhooks/voice/incoming
- Create a Twilio API Key (not the Auth Token).
- Add these to
.env:TWILIO_API_KEY_SID,TWILIO_API_KEY_SECRET,TWILIO_TWIML_APP_SID
- Ensure
PUBLIC_BASE_URLis set to the same public URL so Twilio can fetch/audio/....
Metadata stored:
| Field | Type | Example | Use |
|---|---|---|---|
price |
float | 499.99 | Budget filtering ($lte) |
brand |
string | "Samsung" | Brand filtering |
category |
string | "Laptop" | Category filtering |
Call Flow:
📞 Incoming Call
│
▼
┌─────────────────┐
│ POST / │ ──→ Welcome message (ElevenLabs)
└─────────────────┘
│
▼
┌─────────────────┐
│ Twilio STT │ ──→ User speaks, transcribed
└─────────────────┘
│
▼
┌─────────────────┐
│ POST /process- │
│ speech │ ──→ Retrieve → LLM → ElevenLabs → Play
└─────────────────┘
│
▼
(Loop until hangup)
| Endpoint | Method | Description |
|---|---|---|
/ |
POST | Twilio webhook - incoming call |
/incoming-call |
POST | Alias for incoming call |
/process-speech |
POST | Process user speech, return AI response |
/audio/{filename} |
GET | Serve ElevenLabs audio files |
/ws/twilio |
WebSocket | Twilio Media Streams (streaming voice) |
/ws/web-voice |
WebSocket | Browser live voice (streaming) |
python test_pipeline.pyThis tests:
- ✅ Retriever with filters
- ✅ LLM response generation
- ✅ Memory updates
- ✅ Interactive chat mode
- Start server:
python app.py - Start ngrok:
ngrok http 8000 - Update Twilio webhook
- Call your Twilio number
| Component | Latency |
|---|---|
| Twilio STT | ~1s |
| ChromaDB Retrieval | ~100ms |
| Groq LLM | ~200ms |
| ElevenLabs TTS | ~500ms |
| Total Response Time | ~2s |
MIT License
PRs welcome! Please read contributing guidelines first.