A containerized AI platform that creates and interacts with digital twin personas using RAG (Retrieval-Augmented Generation), local speech-to-text transcription, voice cloning, and multi-language support.
- AI Personas: Create and interact with customizable AI personas with distinct personalities
- Local Speech-to-Text: Offline Whisper-based transcription (English + Urdu)
- Voice Cloning: Clone voices and generate speech using AllVoiceLab API
- RAG System: Retrieval-Augmented Generation for context-aware responses
- Agentic AI Models: Two intelligent agents for advanced document research and conversational planning
- Multi-turn Chat: Maintain conversation history and context
- Multi-language Support: English and Urdu language detection and processing
- Document Management: Upload and ingest documents for RAG
- Text-to-Speech: Generate speech with cloned voices
- Backend: FastAPI (Python)
- Frontend: Streamlit
- Speech Processing: OpenAI Whisper (base model, CPU-optimized)
- Voice Cloning: AllVoiceLab API
- Vector Store: ChromaDB
- LLM Integration: Groq API (llama-3.3-70b-versatile)
- Database: SQLite
- Containerization: Docker & Docker Compose
- Audio Capture: PyAudio with PortAudio
```
Echo-Persona/
├── echo/                          # Main application
│   ├── app/
│   │   ├── main.py                # FastAPI entry point
│   │   ├── api/                   # API route handlers
│   │   │   ├── chat.py            # Chat endpoints
│   │   │   ├── documents.py       # Document management
│   │   │   ├── personas.py        # Persona CRUD operations
│   │   │   ├── speech.py          # Speech-to-text endpoints
│   │   │   ├── voice.py           # Voice cloning endpoints
│   │   │   └── agents.py          # Agentic AI endpoints
│   │   ├── agents/                # Agentic AI models
│   │   │   ├── base_agent.py      # Base agent class
│   │   │   ├── tools.py           # Agent tools
│   │   │   ├── document_research_agent.py      # Document research agent
│   │   │   └── conversational_planning_agent.py # Conversational planning agent
│   │   ├── core/                  # Core utilities
│   │   │   ├── config.py          # Configuration management
│   │   │   └── logging.py         # Logging setup
│   │   ├── db/                    # Database layer
│   │   │   ├── database.py        # DB connection & session
│   │   │   ├── models.py          # SQLAlchemy models (includes VoiceClone)
│   │   │   └── init_db.py         # Database initialization
│   │   ├── models/
│   │   │   └── schemas.py         # Pydantic models
│   │   ├── rag/                   # RAG pipeline
│   │   │   ├── embeddings.py      # Embedding generation
│   │   │   ├── generation.py      # LLM response generation
│   │   │   ├── ingestion.py       # Document ingestion
│   │   │   ├── pipeline.py        # RAG pipeline orchestration
│   │   │   ├── retrieval.py       # Vector store retrieval
│   │   │   └── vectorstore.py     # ChromaDB integration
│   │   ├── speech/                # Speech processing
│   │   │   ├── transcriber.py     # Whisper transcriber
│   │   │   ├── audio_capture.py   # Microphone audio capture
│   │   │   └── streaming.py       # Real-time streaming
│   │   └── voice/                 # Voice cloning
│   │       └── allvoicelab_client.py # AllVoiceLab API client
│   ├── frontend/                  # Streamlit UI
│   │   ├── app.py                 # Main Streamlit app
│   │   ├── speech_input.py        # Speech input component
│   │   └── voice_cloning.py       # Voice cloning component
│   ├── tests/                     # Unit tests
│   ├── docker-compose.yml         # Container orchestration
│   ├── Dockerfile                 # Multi-stage build
│   ├── requirements.txt           # Python dependencies
│   ├── pytest.ini                 # Pytest configuration
│   ├── demo_voice.py              # Voice cloning demo script
│   └── .env.example               # Environment variables template
├── data/
│   └── chroma/                    # ChromaDB vector store (persistent)
├── logs/                          # Application logs
├── README.md                      # Main documentation
├── VOICE_CLONING_GUIDE.md         # Detailed voice cloning guide
└── VOICE_CLONING_SUMMARY.md       # Voice cloning implementation details
```
- Docker & Docker Compose (v2.0+)
- 4GB+ RAM available
- 10GB+ disk space (for models and containers)
- AllVoiceLab API key (for voice cloning)
- Environment variables configured (see below)
- Clone the repository

```
git clone <repository-url>
cd Echo-Persona
```

- Configure environment variables

Create a `.env` file in the `echo/` directory:
```
# LLM Configuration
LLM_PROVIDER=groq
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_API_KEY=your_groq_api_key_here

# Optional: OpenAI configuration (if using OpenAI instead of Groq)
OPENAI_API_KEY=your_openai_key_here

# Voice Cloning (AllVoiceLab)
ALLVOICELAB_API_KEY=your_allvoicelab_api_key_here

# Google Search API (for document retrieval)
GOOGLE_API_KEY=your_google_api_key_here

# Hugging Face API (for embeddings)
HUGGINGFACE_API_KEY=your_hf_api_key_here
```

- Start the application

```
cd echo/
docker-compose up -d
```

The application will be available at:
- Frontend (Streamlit): http://localhost:8501
- API Documentation: http://localhost:8081/docs
- API: http://localhost:8081
- Access the application
- Open http://localhost:8501 in your browser
- Create a persona
- Start chatting or clone a voice!
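As an illustration of how the environment variables above are typically consumed, here is a minimal sketch of a settings loader. The real `app/core/config.py` may be structured differently; the `Settings` class and defaults shown here are assumptions:

```python
import os

class Settings:
    """Illustrative settings loader -- the real app/core/config.py may differ."""
    def __init__(self, env=None):
        env = os.environ if env is None else env
        self.llm_provider = env.get("LLM_PROVIDER", "groq")
        self.groq_model = env.get("GROQ_MODEL", "llama-3.3-70b-versatile")
        self.groq_api_key = env.get("GROQ_API_KEY", "")
        self.allvoicelab_api_key = env.get("ALLVOICELAB_API_KEY", "")

# Missing keys fall back to defaults, so the app can boot without every key set.
settings = Settings({"GROQ_API_KEY": "gsk_demo"})
print(settings.groq_model)  # llama-3.3-70b-versatile
```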
- Go to the Create page
- Fill in persona details:
  - Name: Persona identifier
  - Description: What they are known for
  - Personality Traits: Key characteristics
  - Speaking Style: How they communicate
  - Background: Biography and context
  - Knowledge Base: Upload documents or provide text
- Click "Create Persona"
- Select a persona from dropdown
- Choose input method:
- Text: Type your message
- Voice: Record audio (English or Urdu)
- Press Enter or click Send
- The AI will respond with context from the knowledge base
- Use the Stop button to interrupt generation
- Get an API key
  - Visit https://allvoicelab.com
  - Sign up and get your free API key
- Configure
  - Add `ALLVOICELAB_API_KEY=your_api_key_here` to `.env`
- Test

```
cd echo
python demo_voice.py
```
- Go to the Voice Cloning page
- Upload voice sample: Provide a 10-30 second clear audio file (WAV, MP3, M4A, OGG)
- Clone voice: Enter voice name and click "Clone Voice"
- Generate speech:
- Enter text you want to speak
- Select the cloned voice
- Adjust settings (speed, stability, similarity)
- Click "Generate Speech"
- Listen & Download: Play audio in browser or download MP3/WAV
- Multiple voice samples per persona
- Adjustable speed (0.5 - 2.0x)
- Stability slider (0 - 1)
- Similarity enhancement
- Multi-format output (MP3, WAV)
- In-browser playback
- Download generated audio
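The speed and stability ranges listed above can be enforced client-side before a TTS request. A hypothetical helper (not part of the app's actual API):

```python
def normalize_tts_params(speed: float, stability: float) -> dict:
    """Clamp TTS parameters to the ranges the UI exposes:
    speed 0.5-2.0x, stability 0-1. Illustrative helper only."""
    return {
        "speed": min(max(speed, 0.5), 2.0),
        "stability": min(max(stability, 0.0), 1.0),
    }

print(normalize_tts_params(3.0, -0.2))  # {'speed': 2.0, 'stability': 0.0}
```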
- Go to the Documents page
- Upload PDFs or text files
- Documents are automatically ingested into ChromaDB
- Context is retrieved during conversations
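Conceptually, ingestion splits each document into overlapping chunks before embedding them into ChromaDB. A minimal sketch of that step (the chunk sizes here are illustrative, not the values `ingestion.py` actually uses):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks, roughly how a RAG
    ingestion step prepares documents before embedding."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap preserves context across chunk boundaries so a sentence split in two still appears whole in at least one chunk.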
- Microphone Recording: Click to start/stop recording
- Language Selection: Auto-detect or select English/Urdu
- Transcription: Click "Transcribe" to convert speech to text
- Real-time Display: See transcribed text with confidence score
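For the base64 transcription endpoint, a request body can be assembled as below. The field names are assumptions and should be checked against the Swagger UI at `/docs`:

```python
import base64
import json

def build_transcribe_payload(audio_bytes: bytes, language: str = "auto") -> str:
    # Field names are illustrative; verify against the API docs at /docs.
    payload = {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "language": language,
    }
    return json.dumps(payload)

body = build_transcribe_payload(b"\x00\x01fake-audio", language="ur")
decoded = base64.b64decode(json.loads(body)["audio_base64"])
```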
Echo now includes two powerful agentic AI models that use tools and multi-step reasoning:
Intelligently researches and synthesizes information from documents using:
- Multi-step research: Plans research strategy and executes multiple iterations
- Multi-query search: Uses multiple related queries for comprehensive coverage
- Content analysis: Analyzes document content for specific information
- Synthesis: Combines information from multiple sources into coherent answers
Use Cases:
- Complex questions requiring information from multiple documents
- Research tasks that need thorough investigation
- Questions that benefit from multiple search angles
API Endpoint: POST /api/agents/document-research
Example Request:
```
{
  "persona_id": 1,
  "message": "What are the main themes in my documents about machine learning?",
  "session_id": "optional_session_id"
}
```

Plans multi-step conversations and uses tools strategically:
- Conversation planning: Analyzes context and plans responses
- Intelligent tool selection: Decides when to search for information
- Context-aware responses: Maintains natural conversation flow
- Persona voice: Responds in the persona's authentic style
Use Cases:
- Natural conversations that may need document lookups
- Questions requiring context from previous messages
- Maintaining persona personality while accessing knowledge base
API Endpoint: POST /api/agents/conversational-planning
Example Request:
{
"persona_id": 1,
"message": "Tell me about my favorite hobbies",
"session_id": "conversation_session_123"
}Both agents feature:
- Tool-based execution: Use document search and analysis tools
- Multi-step reasoning: Plan and execute complex tasks
- Transparent reasoning: Provide step-by-step reasoning logs
- Error handling: Gracefully handle failures
- Performance tracking: Execution time and tool usage metrics
```
# Document Research Agent
curl -X POST "http://localhost:8081/api/agents/document-research" \
  -H "Content-Type: application/json" \
  -d '{
    "persona_id": 1,
    "message": "What are the key points about AI in my documents?"
  }'

# Conversational Planning Agent
curl -X POST "http://localhost:8081/api/agents/conversational-planning" \
  -H "Content-Type: application/json" \
  -d '{
    "persona_id": 1,
    "message": "What did I say about my career goals?",
    "session_id": "session_123"
  }'

# List available agents
curl "http://localhost:8081/api/agents/available"
```

```
cd echo/
docker-compose build
```

Image sizes (CPU-optimized):
- `echo-api`: ~3.3GB
- `echo-frontend`: ~3.3GB
- `echo-speech-input`: ~3.3GB
```
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down
```

- API Service (port 8081)
  - FastAPI backend
  - Health checks every 30s
  - Uvicorn ASGI server
- Frontend Service (port 8501)
  - Streamlit web interface
  - Depends on API health
- Speech Input Service (port 5000)
  - Optional microphone capture
  - Real-time audio streaming
- `GET /health` - Application health status

- `GET /api/personas` - List all personas
- `POST /api/personas` - Create new persona
- `GET /api/personas/{id}` - Get persona details
- `PUT /api/personas/{id}` - Update persona
- `DELETE /api/personas/{id}` - Delete persona

- `POST /api/chat/message` - Send message to persona
- `GET /api/chat/history` - Get conversation history
- `DELETE /api/chat/session` - Clear session

- `GET /api/speech/languages` - Supported languages
- `POST /api/speech/transcribe/file` - Transcribe uploaded audio
- `POST /api/speech/transcribe` - Transcribe base64 audio
- `POST /api/speech/detect-language` - Detect audio language
- `WS /api/speech/ws/transcribe` - WebSocket real-time transcription

- `POST /api/voice/clone` - Clone voice from audio sample
- `POST /api/voice/tts` - Generate speech with cloned voice
- `GET /api/voice/personas/{id}/voices` - List cloned voices for persona
- `DELETE /api/voice/voices/{id}` - Delete cloned voice
- `GET /api/voice/health` - Voice service health check

- `POST /api/documents/upload` - Upload document
- `GET /api/documents` - List documents
- `DELETE /api/documents/{id}` - Delete document

- `POST /api/agents/document-research` - Document Research Agent for intelligent document research
- `POST /api/agents/conversational-planning` - Conversational Planning Agent for context-aware conversations
- `GET /api/agents/available` - List all available agentic AI models
- Interactive Swagger UI: http://localhost:8081/docs
- ReDoc: http://localhost:8081/redoc
```
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run API locally
cd echo
uvicorn app.main:app --reload

# In another terminal, run Streamlit
streamlit run frontend/app.py
```

```
cd echo/
pytest
pytest --cov=app tests/  # With coverage
```

```
cd echo/
python demo_voice.py
```

- `requirements.txt`: All Python dependencies
- `Dockerfile`: Multi-stage build for optimized images
- `docker-compose.yml`: Service orchestration
- `app/core/config.py`: Configuration management
- `app/rag/pipeline.py`: RAG orchestration
- `app/voice/allvoicelab_client.py`: Voice cloning client
- `app/agents/`: Agentic AI models (Document Research, Conversational Planning)
- `app/agents/tools.py`: Reusable agent tools
- `frontend/voice_cloning.py`: Voice UI component
- Docker Hub account (free at https://hub.docker.com)
- Docker login configured
- Log in to Docker Hub

```
docker login
# Enter your Docker Hub username and password
```

- Tag images (replace `<username>` with your Docker Hub username)

```
cd echo/

# Tag API image
docker tag echo-api <username>/echo-persona-api:latest
docker tag echo-api <username>/echo-persona-api:1.0.0

# Tag Frontend image
docker tag echo-frontend <username>/echo-persona-frontend:latest
docker tag echo-frontend <username>/echo-persona-frontend:1.0.0

# Tag Speech service image
docker tag echo-speech-input <username>/echo-persona-speech:latest
docker tag echo-speech-input <username>/echo-persona-speech:1.0.0
```

- Push to Docker Hub

```
# Push API
docker push <username>/echo-persona-api:latest
docker push <username>/echo-persona-api:1.0.0

# Push Frontend
docker push <username>/echo-persona-frontend:latest
docker push <username>/echo-persona-frontend:1.0.0

# Push Speech service
docker push <username>/echo-persona-speech:latest
docker push <username>/echo-persona-speech:1.0.0
```

- Go to https://hub.docker.com/r/<username>/
- Your images should be listed as public repositories

Your team members can pull and run the images:

```
# Pull images
docker pull <username>/echo-persona-api:latest
docker pull <username>/echo-persona-frontend:latest
docker pull <username>/echo-persona-speech:latest

# Create docker-compose.yml with pulled images
# (modify the image names in docker-compose.yml to point to your Docker Hub repos)

# Run the application
docker-compose up -d
```

- Size: Base model (~139MB)
- Languages: 99 languages including English & Urdu
- Accuracy: ~80-90% depending on audio quality
- Speed: CPU ~30-60 seconds per minute of audio
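From the figures above, expected CPU processing time can be estimated before submitting a long recording. A trivial illustration:

```python
def estimate_transcription_seconds(audio_minutes: float) -> tuple[float, float]:
    """Rough CPU estimate from the figures above: ~30-60 s of processing
    per minute of audio with the Whisper base model."""
    return (30.0 * audio_minutes, 60.0 * audio_minutes)

# A 2.5-minute clip should take roughly 75-150 seconds on CPU.
low, high = estimate_transcription_seconds(2.5)
```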
- Provider: Groq API (free tier available)
- Model: llama-3.3-70b-versatile
- Context Window: 8K tokens
- Response Time: ~2-5 seconds (via Groq API)
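With a bounded context window, older turns eventually have to be dropped from the conversation history. A sketch of one common trimming strategy, using a words-times-1.3 token approximation (a rule of thumb, not the app's actual accounting):

```python
def trim_history(messages: list[str], max_tokens: int = 8000) -> list[str]:
    """Keep the most recent messages that fit a token budget, walking
    backward from the newest turn. Token counts are approximated."""
    kept, used = [], 0.0
    for msg in reversed(messages):
        cost = len(msg.split()) * 1.3  # rough tokens-per-word heuristic
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```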
- Provider: AllVoiceLab
- Voice Quality: High-quality natural speech
- Audio Sample Required: 10-30 seconds of clear audio
- Supported Formats: WAV, MP3, M4A, OGG
- Output Formats: MP3, WAV
- Processing Time: ~5-15 seconds per request
- Model: Hugging Face `sentence-transformers`
- Embedding dimension: 384-768
- Vector Store: ChromaDB with persistent storage
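Retrieval ranks stored chunks by vector similarity between the query embedding and each chunk embedding; cosine similarity is the standard metric:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```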
- Store in `.env` file (not committed to git)
- Use environment variables in production
- Rotate keys regularly
- Keep AllVoiceLab API key confidential
- SQLite used for development (not production-ready)
- For production, migrate to PostgreSQL
- Currently no authentication (add as needed)
- For production, implement JWT or OAuth2
- Ensure compliance with voice cloning regulations
- Get consent before cloning someone's voice
- Use for authorized purposes only
- CPU-only PyTorch (reduced size from ~8.5GB to ~3.3GB)
- Multi-stage Docker builds
- Whisper base model (good speed/accuracy trade-off on CPU)
- ChromaDB in-memory caching
- AllVoiceLab cloud processing for voice cloning
- Use PostgreSQL instead of SQLite
- Add Redis caching layer
- Implement request rate limiting
- Add API authentication
- Use GPU for faster transcription/inference
- Cache generated voice files
```
# Restart frontend container
docker-compose restart frontend

# Check logs
docker-compose logs frontend
```

- Increase timeout in `frontend/speech_input.py` (currently 180s)
- Whisper model takes time to load on first use
- Subsequent requests are faster
```
# Check API health
curl http://localhost:8081/health

# Check if API is running
docker-compose ps
```

- Check AllVoiceLab API key in `.env`
- Use 10-30 second clear audio samples
- Supported formats: WAV, MP3, M4A, OGG
- Verify audio quality (minimize background noise)
- Set `ALLVOICELAB_API_KEY` in the `.env` file
- Restart backend: `docker-compose restart`
- Verify key is active on the AllVoiceLab website
- Reduce Whisper model size or use the `tiny` model
- Close other applications
- Increase Docker memory allocation
- Voice cloning uses cloud processing (minimal local memory impact)
```
# LLM Provider (groq or openai)
LLM_PROVIDER=groq

# Groq Configuration
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_API_KEY=gsk_xxxxxxxxxxxxx

# OpenAI Configuration (optional)
OPENAI_API_KEY=sk-xxxxxxxxxxxxx

# Voice Cloning (AllVoiceLab)
ALLVOICELAB_API_KEY=your_api_key_here

# Google API (for search)
GOOGLE_API_KEY=xxxxxxxxxxxxx

# Hugging Face (for embeddings)
HUGGINGFACE_API_KEY=hf_xxxxxxxxxxxxx

# Database paths (in containers)
SQLITE_DATABASE_PATH=/app/data/echo.db
CHROMA_PERSIST_DIRECTORY=/app/data/chroma
UPLOAD_DIRECTORY=/app/data/uploads
```
- Create a feature branch: `git checkout -b feature/feature-name`
- Commit changes: `git commit -am 'Add feature'`
- Push to branch: `git push origin feature/feature-name`
- Submit a pull request
For issues or questions:
- Check the troubleshooting section
- Review API documentation at http://localhost:8081/docs
- Check container logs: `docker-compose logs`
- For voice cloning issues, see VOICE_CLONING_GUIDE.md
- Agentic AI Models: Two intelligent agents for document research and conversational planning
- Document Research Agent: Multi-step research with tool-based execution
- Conversational Planning Agent: Context-aware conversations with intelligent tool selection
- Agent Tools: Reusable tools for document search, analysis, and multi-query search
- Agent API Endpoints: RESTful API for agent execution
- Transparent Reasoning: Step-by-step reasoning logs from agents
- Voice cloning with AllVoiceLab API
- Text-to-speech with cloned voices
- Voice management per persona
- Multiple output formats (MP3, WAV)
- Adjustable speech parameters (speed, stability, similarity)
- Audio file download functionality
- Basic persona creation and chat
- Local Whisper speech-to-text
- RAG with ChromaDB
- Multi-language support (English, Urdu)
- Docker containerization
- Streamlit UI with stop button
- CPU-optimized images (~3.3GB each)
Last Updated: December 8, 2025
