Skip to content

AkshitMaheshwari/VoiceBased_Ecomm

Repository files navigation

🛒 Voice-Based E-commerce Agent

An AI-powered voice calling agent for e-commerce that helps customers find products through natural phone conversations. Built with Twilio, ElevenLabs, Groq LLM, and ChromaDB.

Python Twilio ElevenLabs Groq


📖 Project Overview

This project is a Voice-based AI Shopping Assistant that allows customers to call a phone number and shop for products using natural conversation. Here's how we built it step-by-step:

🔄 Complete Pipeline

1️⃣ Data Collection

I sourced my product catalog from the Walmart Products Dataset:

  • 📦 Dataset: Walmart Dataset Samples
  • Contains product names, descriptions, prices, categories, and more
  • Raw CSV file stored in DataCleaning/walmart-products.csv

2️⃣ Data Cleaning & Preprocessing

The raw data needed cleaning before it could be used effectively:

  • Removed duplicate products and null values
  • Normalized price formats (removed $ symbols, converted to float)
  • Extracted brand names from product titles
  • Categorized products (Laptops, Smartphones, etc.)
  • Output: DataCleaning/cleaned_data.csv

3️⃣ Vector Database Setup (ChromaDB)

Converted cleaned data into searchable embeddings:

  • Used HuggingFace Embeddings (all-MiniLM-L6-v2) to create vector representations
  • Stored in ChromaDB with both content and metadata
  • Page Content: Product name + description (for semantic search)
  • Metadata: Price, Brand, Category (for filtering)

This allows us to search products semantically ("show me gaming laptops") while also filtering by budget, brand, or category.

4️⃣ Retriever with Smart Filtering

Built a hybrid retriever that merges semantic dense embeddings with BM25 sparse keyword searches using Reciprocal Rank Fusion (RRF):

  • User says "Samsung phone under 20000"
  • System extracts: brand=Samsung, category=Smartphone, budget=20000
  • Hybrid query: Dense vectors + BM25 index + $and filters
  • Automatically caches query embeddings and handles case-insensitive metadata variations.

5️⃣ Session Memory

Implemented two types of memory for natural conversations:

  • Conversation Memory: Remembers last 10 exchanges in the call
  • User Preferences: Tracks budget, brand, and category mentioned by user

6️⃣ LLM Engine (Groq)

Connected everything to a fast LLM for response generation:

  • Uses Groq with llama-3.1-8b-instant (super fast - 500+ tokens/sec)
  • Takes: Retrieved products + Conversation history + User preferences + Current query
  • Generates: Short, voice-friendly responses (1-2 sentences)

7️⃣ Voice & Web UI Integration

Final integration for multi-modal interaction:

  • Twilio Voice: Handles incoming calls and speech-to-text
  • ElevenLabs: Converts AI responses to natural human-like voice
  • Chat Web UI: A sleek, dark-themed frontend (static/index.html) using the new /api/chat endpoint.
  • FastAPI: Backend server that orchestrates Twilio webhooks, LLM queries, and the frontend server.

Result: Customer calls or types in the web interface → interacts naturally → Gets AI response synthesized back seamlessly! 🛒


🏗 Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        VOICE E-COMMERCE AGENT                        │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│  📞 TWILIO VOICE                                                     │
│  ├── Incoming call handling                                          │
│  ├── Speech-to-Text (built-in)                                       │
│  └── Webhook endpoints                                               │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│  🧠 PROCESSING PIPELINE                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐ │
│  │  Retriever  │→ │   Memory    │→ │  LLM Engine │→ │   Response  │ │
│  │  (ChromaDB) │  │ (Session)   │  │   (Groq)    │  │  Generation │ │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│  🎙 ELEVENLABS TTS                                                   │
│  ├── Natural voice synthesis                                         │
│  ├── Turbo model (low latency)                                       │
│  └── Audio streaming to Twilio                                       │
└─────────────────────────────────────────────────────────────────────┘

✨ Features

  • 📞 Phone-based shopping - Customers call and shop via voice
  • 🎯 Smart product search - Semantic search with filters (brand, category, budget)
  • 🧠 Conversation memory - Remembers context within the call
  • 👤 User preferences - Tracks budget, brand, category preferences
  • 🗣️ Natural voice - ElevenLabs for human-like responses
  • Low latency - Groq LLM (500+ tokens/sec) + ElevenLabs Turbo

🛠 Tech Stack

Component Technology Purpose
Voice Gateway Twilio Voice Phone calls, STT
Text-to-Speech ElevenLabs Natural voice synthesis
LLM Groq (Llama 3.1 8B) Response generation
Vector DB ChromaDB Product embeddings & search
Embeddings HuggingFace (all-MiniLM-L6-v2) Text embeddings
Backend FastAPI API server
Data Processing Pandas Data cleaning

🚀 Setup Guide

1. Clone & Install

git clone https://github.com/yourusername/voice_ecomm.git
cd voice_ecomm

# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt

2. Environment Variables

Create .env file:

# LLM
GROQ_API_KEY=your_groq_api_key

# Voice
ELEVEN_LABS=your_elevenlabs_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key  # Required for streaming voice

# Embeddings
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Twilio (voice calls + browser voice)
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_auth_token
TWILIO_API_KEY_SID=your_twilio_api_key_sid
TWILIO_API_KEY_SECRET=your_twilio_api_key_secret
TWILIO_TWIML_APP_SID=your_twiml_app_sid

# Public base URL for Twilio callbacks / audio playback (ngrok or HTTPS domain)
PUBLIC_BASE_URL=https://your-ngrok-url.ngrok.io

3. Data Pipeline

Step 1: Data Cleaning

# Run the data cleaning notebook
jupyter notebook DataCleaning/file.ipynb

The cleaning process:

  • Removes duplicates and null values
  • Normalizes price formats
  • Extracts brand names
  • Categorizes products
  • Outputs cleaned_data.csv

Step 2: Ingest to ChromaDB

python Ingestion/chroma.py

This creates vector embeddings with metadata:

  • product_name - Product title
  • description - Product description
  • price - Numeric price (for filtering)
  • brand - Brand name (for filtering)
  • category - Product category (for filtering)

4. Run the Server

# Start FastAPI server
python app.py

5. Expose with Ngrok

# In another terminal
ngrok http 8000

6. Configure Twilio

  1. Go to Twilio Console
  2. Buy a phone number (or use trial)
  3. Configure Voice webhook:
    • URL: https://your-ngrok-url.ngrok.io/webhooks/voice/incoming
    • Method: POST
  4. Save and call the number!

7. Streaming Voice (Phone Calls)

If DEEPGRAM_API_KEY is set, incoming calls use Twilio Media Streams with real-time STT + streaming TTS:

  1. Ensure your public URL supports wss:// (ngrok works).
  2. Keep the Voice webhook pointing to:
    • https://your-ngrok-url.ngrok.io/webhooks/voice/incoming
  3. Twilio will open a WebSocket to:
    • wss://your-ngrok-url.ngrok.io/ws/twilio

8. Browser Voice (Web UI, no Twilio SDK)

The web voice button now uses your browser microphone + Deepgram streaming STT + ElevenLabs streaming TTS (no Twilio SDK). It requires:

  • DEEPGRAM_API_KEY
  • PUBLIC_BASE_URL (for audio URLs and Twilio callback parity)

Open the homepage and click Start live voice.

9. Twilio Browser Calls (Optional)

  1. Create a TwiML App in Twilio Console.
  2. Set the TwiML App Voice Request URL to:
    • https://your-ngrok-url.ngrok.io/webhooks/voice/incoming
  3. Create a Twilio API Key (not the Auth Token).
  4. Add these to .env:
    • TWILIO_API_KEY_SID, TWILIO_API_KEY_SECRET, TWILIO_TWIML_APP_SID
  5. Ensure PUBLIC_BASE_URL is set to the same public URL so Twilio can fetch /audio/....

Metadata stored:

Field Type Example Use
price float 499.99 Budget filtering ($lte)
brand string "Samsung" Brand filtering
category string "Laptop" Category filtering

Call Flow:

📞 Incoming Call
      │
      ▼
┌─────────────────┐
│  POST /         │ ──→ Welcome message (ElevenLabs)
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ Twilio STT      │ ──→ User speaks, transcribed
└─────────────────┘
      │
      ▼
┌─────────────────┐
│ POST /process-  │
│ speech          │ ──→ Retrieve → LLM → ElevenLabs → Play
└─────────────────┘
      │
      ▼
   (Loop until hangup)

🔌 API Endpoints

Endpoint Method Description
/ POST Twilio webhook - incoming call
/incoming-call POST Alias for incoming call
/process-speech POST Process user speech, return AI response
/audio/{filename} GET Serve ElevenLabs audio files
/ws/twilio WebSocket Twilio Media Streams (streaming voice)
/ws/web-voice WebSocket Browser live voice (streaming)

🧪 Testing

Text-based Testing (without calling)

python test_pipeline.py

This tests:

  • ✅ Retriever with filters
  • ✅ LLM response generation
  • ✅ Memory updates
  • ✅ Interactive chat mode

Live Call Testing

  1. Start server: python app.py
  2. Start ngrok: ngrok http 8000
  3. Update Twilio webhook
  4. Call your Twilio number

📊 Performance

Component Latency
Twilio STT ~1s
ChromaDB Retrieval ~100ms
Groq LLM ~200ms
ElevenLabs TTS ~500ms
Total Response Time ~2s

📝 License

MIT License


🤝 Contributing

PRs welcome! Please read contributing guidelines first.

About

An AI-powered voice calling agent for e-commerce that helps customers find products through natural phone conversations. Built with Twilio, ElevenLabs, Groq LLM, and ChromaDB.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors