Talk to Your Docs is a production-grade Retrieval-Augmented Generation (RAG) microservice built for MLOps practitioners.
It ingests PDFs, cleans and chunks text, indexes embeddings into Qdrant, performs deep retrieval with FlashRank reranking, and uses an LLM (Groq / GPT-OSS-20B) to answer queries grounded in source documents.
What's new in this version:

- Langfuse v3 Support - Full compatibility with the latest Langfuse SDK
- Prometheus + Grafana - Production monitoring stack
- Improved Architecture - Separated UI and API concerns
- Enhanced Docker Compose - Multi-service orchestration
- Better Error Handling - Graceful fallbacks for observability
Project structure:

```
Talk_to_Your_Docs_RAG_System/
├── .github/
│   └── workflows/
│       └── ci.yml              # GitHub Actions CI/CD
├── evaluation/
│   ├── evaluate.py             # Ragas evaluation script
│   └── report.csv              # Latest evaluation results
├── images/                     # Screenshots for README
├── k8s/                        # Kubernetes manifests
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── qdrant-statefulset.yaml
│   └── qdrant-pvc.yaml
├── opt/                        # FlashRank model cache
├── qdrant_db/                  # Local Qdrant persistence
├── src/
│   ├── app.py                  # FastAPI application (UPDATED v3)
│   ├── config.py               # Configuration (UPDATED)
│   ├── ingestion.py            # PDF processing (UPDATED v3)
│   ├── main.py                 # FastAPI entry point
│   └── rag.py                  # RAG engine core (UPDATED v3)
├── ui/
│   └── streamlit_app.py        # Streamlit UI (UPDATED v3)
├── tests/                      # Unit tests
├── .dockerignore
├── .env                        # Environment variables
├── .env.example
├── .gitignore
├── docker-compose.yml          # Multi-service setup (UPDATED)
├── Dockerfile                  # Python 3.11 image (UPDATED)
├── Dockerfile.qdrant           # Custom Qdrant image
├── Makefile                    # Development commands (UPDATED)
├── prometheus.yml              # Prometheus config (NEW)
├── requirements.txt            # Dependencies (Langfuse v3)
├── requirements-dev.txt        # Dependencies (local dev)
└── README.md                   # This file
```
Tech stack:

- Python 3.11 - Main runtime
- FastAPI - REST API (`/chat`, `/ingest`, `/feedback`, `/health`)
- Streamlit - Interactive UI for demos
- Qdrant - Vector database (port 6333)
- FlashRank - Cross-encoder reranker
- LLM for generation:
  - Groq - Ultra-fast inference platform
  - GPT-OSS - LLM models
- Langfuse v3 - Tracing & observability with compatibility layer
- Prometheus - Metrics collection (port 9090)
- Grafana - Metrics visualization (port 3000)
- Ragas - Automated RAG evaluation
- Docker Compose - Multi-container orchestration
- Kubernetes - Production deployment
Key features:

- Page-aware PDF ingestion with metadata preservation
- Intelligent text cleaning (hyphenation, citations, null bytes)
- Chunk deduplication via MD5 hashing
- Multi-query generation for better recall
- Deep retrieval (k=50) + FlashRank reranking (top-7)
- Strict prompt templates to reduce hallucinations
- Chat history support for conversational context
- Trace IDs - Every answer links to a Langfuse trace
- Feedback loop - Thumbs up/down for continuous improvement
- Prometheus metrics - Latency, throughput, errors
- Grafana dashboards - Real-time monitoring
- Background ingestion - Non-blocking PDF processing
- Graceful fallbacks - Robust error handling
Prerequisites:

- Docker & Docker Compose
- Python 3.11+
- Groq API key
- Langfuse account
Quick start with Docker Compose:

```bash
# 1. Clone repository
git clone <repo-url>
cd Talk_to_Your_Docs_RAG_System
# 2. Set up environment variables
cp .env.example .env
# Edit .env and add:
# - GROQ_API_KEY=gsk_...
# - LANGFUSE_PUBLIC_KEY=pk-lf-...
# - LANGFUSE_SECRET_KEY=sk-lf-...
# 3. Start all services
make up
# Or: docker compose up -d
# 4. Access services
# - Streamlit UI: http://localhost:8501
# - FastAPI docs: http://localhost:8000/docs
# - Prometheus: http://localhost:9090
# - Grafana: http://localhost:3000 (admin/admin)
# - Qdrant: http://localhost:6333
```

Local development (without Docker):

```bash
# 1. Install dependencies
make install
# Or: uv venv && uv pip install -r requirements.txt
# 2. Activate virtual environment
source venv/bin/activate
# 3. Start Qdrant (in separate terminal)
docker run -p 6333:6333 qdrant/qdrant
# 4A. Run Streamlit UI
make ui
# Or: streamlit run ui/streamlit_app.py
# 4B. Run FastAPI
make dev
# Or: uvicorn src.main:app --reload
```

Edit `src/config.py` or use environment variables in `.env`:

```bash
GROQ_API_KEY=gsk_your_key_here
LANGFUSE_PUBLIC_KEY=pk-lf-your_key
LANGFUSE_SECRET_KEY=sk-lf-your_secret
QDRANT_URL=http://localhost:6333
LANGFUSE_HOST=https://cloud.langfuse.com
COLLECTION_NAME=rag_documents
LLM_MODEL=openai/gpt-oss-20b
EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
LOG_LEVEL=INFO
```

Ingestion pipeline (a minimal sketch follows the steps below):

- Upload PDF → Extract text per page
- Clean text → Remove hyphenation, null bytes, citations
- Split into chunks → RecursiveCharacterTextSplitter
- Generate hashes → MD5 for deduplication
- Index to Qdrant → Store embeddings with metadata
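To make these steps concrete, here is a minimal sketch of the ingestion flow. The names (`ingest_pages`, `clean`) and details are illustrative assumptions rather than the actual code in `src/ingestion.py`; collection creation and error handling are omitted.

```python
import hashlib
import os
import uuid

from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

client = QdrantClient(url=os.getenv("QDRANT_URL", "http://localhost:6333"))
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

def clean(text: str) -> str:
    # Undo hyphenation at line breaks and drop stray null bytes
    return text.replace("-\n", "").replace("\x00", " ")

def ingest_pages(pages: list[dict], collection: str = "rag_documents") -> int:
    """pages: [{"text": ..., "source": ..., "page": ...}] extracted per PDF page."""
    seen, points = set(), []
    for page in pages:
        for chunk in splitter.split_text(clean(page["text"])):
            digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()  # dedup key
            if digest in seen:
                continue
            seen.add(digest)
            points.append(PointStruct(
                id=str(uuid.UUID(hex=digest)),  # Qdrant ids must be UUIDs or ints
                vector=embedder.encode(chunk).tolist(),
                payload={"text": chunk, "source": page["source"], "page": page["page"]},
            ))
    client.upsert(collection_name=collection, points=points)
    return len(points)
```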
Retrieval & generation pipeline (a sketch follows these steps):

- Multi-query generation - Generate 3 variations of the user query
- Deep retrieval - Fetch top-50 chunks per query from Qdrant
- FlashRank reranking - Cross-encoder reranks to top-7
- LLM generation - Generate answer grounded in context
- Trace capture - Return answer + trace_id for feedback
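A sketch of the retrieval and generation steps, under the same caveat: function names, prompts, and wiring are assumptions, not the contents of `src/rag.py`. It assumes `GROQ_API_KEY` is set in the environment.

```python
from flashrank import Ranker, RerankRequest
from langchain_groq import ChatGroq
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="./opt")
llm = ChatGroq(model="openai/gpt-oss-20b")  # requires GROQ_API_KEY

def answer(query: str, collection: str = "rag_documents") -> str:
    # 1. Multi-query generation: ask the LLM for 3 rephrasings of the question
    variants = [query] + llm.invoke(
        f"Rewrite this question in 3 different ways, one per line: {query}"
    ).content.splitlines()[:3]

    # 2. Deep retrieval: top-50 chunks per query variant, deduplicated by point id
    candidates = {}
    for q in variants:
        for hit in client.search(collection_name=collection,
                                 query_vector=embedder.encode(q).tolist(), limit=50):
            candidates[hit.id] = hit.payload

    # 3. FlashRank reranking: cross-encoder keeps the top-7 passages
    passages = [{"id": pid, "text": p["text"], "meta": p} for pid, p in candidates.items()]
    top = ranker.rerank(RerankRequest(query=query, passages=passages))[:7]

    # 4. Grounded generation with a strict prompt
    context = "\n\n".join(p["text"] for p in top)
    prompt = ("Answer ONLY from the context below. If the answer is not there, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm.invoke(prompt).content
```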
Observability pipeline (a minimal tracing sketch follows this list):

- Automatic tracing via `@observe` decorators
- Token counting - Input/output tokens tracked
- Latency tracking - Each step measured
- Feedback loop - Thumbs up/down linked to traces
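A hedged sketch of how the `@observe` decorators and the feedback loop can fit together with the Langfuse v3 SDK (`get_client`, `get_current_trace_id`, `create_score`); the function names and wiring are assumptions, not the project's exact code.

```python
from langfuse import get_client, observe

langfuse = get_client()  # reads LANGFUSE_PUBLIC_KEY / SECRET_KEY / HOST from env

@observe()  # nested span inside the chat trace
def retrieve(query: str) -> list[str]:
    # Qdrant + FlashRank retrieval goes here (see the retrieval sketch above)
    return ["example context chunk"]

@observe()  # generation span with model I/O captured
def generate(query: str, context: list[str]) -> str:
    # Groq LLM call goes here
    return f"Answer to {query!r} grounded in {len(context)} chunks"

@observe()  # root trace for one /chat call
def chat(query: str) -> dict:
    context = retrieve(query)
    answer = generate(query, context)
    # trace_id is returned to the UI so thumbs up/down can be attached later
    return {"answer": answer, "trace_id": langfuse.get_current_trace_id()}

def record_feedback(trace_id: str, score: float, comment: str = "") -> None:
    # Called by POST /feedback: link the user's rating back to the trace
    langfuse.create_score(trace_id=trace_id, name="user_feedback",
                          value=score, comment=comment)
```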
API endpoints:

`POST /chat` - Query the RAG system.
Request:
```json
{
"query": "What is PDF?"
}
```

Response:

```json
{
"answer": "PDF stands for Portable Document Format...",
"trace_id": "trace-abc-123",
"sources": [
{
"text": "PDF was created by Adobe...",
"meta": {"source": "doc.pdf", "page": 1}
}
]
}
```

`POST /feedback` - Submit user feedback for a trace.
Request:
```json
{
"trace_id": "trace-abc-123",
"score": 1.0,
"comment": "Helpful answer"
}
```

`POST /ingest` - Upload PDF for background processing.
Request:
```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@document.pdf"
```

`GET /health` - Health check endpoint.
Response:
```json
{"status": "healthy"}
```

`GET /metrics` - Prometheus metrics endpoint.
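A small end-to-end client example tying the endpoints together; the key point is that the `trace_id` returned by `/chat` is what `/feedback` expects. The upload field name `file` and the local port are assumptions.

```python
import requests

BASE = "http://localhost:8000"

# Ask a question
resp = requests.post(f"{BASE}/chat", json={"query": "What is PDF?"}).json()
print(resp["answer"])
for src in resp["sources"]:
    print(f'- {src["meta"]["source"]} p.{src["meta"]["page"]}')

# Send thumbs-up feedback linked to the trace
requests.post(f"{BASE}/feedback", json={
    "trace_id": resp["trace_id"],
    "score": 1.0,
    "comment": "Helpful answer",
})

# Upload a PDF for background ingestion
with open("document.pdf", "rb") as f:
    requests.post(f"{BASE}/ingest", files={"file": f})
```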
The Streamlit UI is designed for production workloads with the following (a sketch of the chat loop follows the list):
- Custom boot sequence - Visual feedback during model loading
- Asynchronous ingestion - Non-blocking PDF processing
- Real-time feedback - Thumbs up/down integrated with Langfuse
- Source citations - Show page numbers and text snippets
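A minimal sketch of what that chat loop can look like; the widget layout, session-state keys, and API URL are assumptions rather than the actual `ui/streamlit_app.py`.

```python
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # assumed; likely configurable via env

st.title("Talk to Your Docs")
st.session_state.setdefault("history", [])

# Replay previous turns
for turn in st.session_state.history:
    with st.chat_message(turn["role"]):
        st.markdown(turn["content"])

if query := st.chat_input("Ask a question about your documents"):
    st.chat_message("user").markdown(query)
    with st.spinner("Thinking..."):
        resp = requests.post(f"{API_URL}/chat", json={"query": query}, timeout=120).json()
    st.chat_message("assistant").markdown(resp["answer"])
    st.session_state.history += [{"role": "user", "content": query},
                                 {"role": "assistant", "content": resp["answer"]}]
    st.session_state.last_trace_id = resp["trace_id"]

# Thumbs up/down for the most recent answer, linked to its Langfuse trace
if trace_id := st.session_state.get("last_trace_id"):
    up, down = st.columns(2)
    if up.button("👍", key=f"up-{trace_id}"):
        requests.post(f"{API_URL}/feedback", json={"trace_id": trace_id, "score": 1.0})
    if down.button("👎", key=f"down-{trace_id}"):
        requests.post(f"{API_URL}/feedback", json={"trace_id": trace_id, "score": 0.0})
```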
Screenshots in images/ show the boot sequence with lazy loading of heavy models and the interactive chat with source citations.
Langfuse captures:

- Traces - Every RAG pipeline execution
- Scores - User feedback (thumbs up/down)
- Prompts - Version-controlled system prompts
- Analytics - Token usage, costs, latency
Key metrics exposed at /metrics:
- `http_requests_total` - Total API calls
- `http_request_duration_seconds` - Latency histogram
- `http_requests_in_progress` - Concurrent requests
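One common way to expose such metrics from a FastAPI app is `prometheus-fastapi-instrumentator`; whether the project uses this library, and whether its default metric names match the ones above, is an assumption.

```python
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Registers default HTTP metrics and mounts a /metrics endpoint for Prometheus to scrape
Instrumentator().instrument(app).expose(app)
```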
Access Prometheus at http://localhost:9090
Pre-configured dashboards for:
- API latency (p50, p95, p99)
- Error rates
- Throughput (requests/sec)
- Qdrant performance
| Service | URL | Credentials |
|---|---|---|
| Streamlit UI | http://localhost:8501 | None |
| API Docs | http://localhost:8000/docs | None |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | None |
Note: All services are intended to run locally. Grafana uses default credentials on first start; change them in production.
Access Grafana at http://localhost:3000 (admin/admin)
We use Ragas for quality evaluation and Weights & Biases for experiment tracking.
Run evaluation pipeline:
```bash
make eval
# Or:
# 1) python evaluation/track_experiment.py
# 2) python evaluation/evaluate.py
```

Tracked experiment (with W&B):
| Metric | Score | Description |
|---|---|---|
| Faithfulness | 1.00 | Zero hallucinations |
| Context Precision | 1.00 | Perfect retrieval |
| Answer Relevancy | N/A | Rate limited on the free tier (0.83 without the rate limit) |
Latest Results (evaluate.py):
| Metric | Score | Description |
|---|---|---|
| Faithfulness | 1.00 | Zero hallucinations |
| Context Precision | 1.00 | Perfect retrieval |
| Answer Relevancy | 0.67 | High alignment |
Impact of deep retrieval + reranking:

| Configuration | Recall | Precision | Hallucination Rate |
|---|---|---|---|
| Standard RAG | 68% | 72% | Low |
| Deep RAG + Rerank | 94% | 89% | Near Zero |
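For reference, a sketch of what the Ragas evaluation can look like; the dataset schema and metric set depend on the installed Ragas version, and the sample rows here are placeholders, not data from `report.csv`. Ragas calls an LLM judge internally, so the relevant API key must be configured.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Placeholder rows; evaluation/evaluate.py would build these from real queries,
# retrieved contexts, and generated answers.
samples = Dataset.from_dict({
    "question":     ["What is PDF?"],
    "answer":       ["PDF stands for Portable Document Format..."],
    "contexts":     [["PDF was created by Adobe..."]],
    "ground_truth": ["PDF is the Portable Document Format created by Adobe."],
})

result = evaluate(samples, metrics=[faithfulness, context_precision, answer_relevancy])
result.to_pandas().to_csv("evaluation/report.csv", index=False)
print(result)
```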
Makefile commands:

```bash
make install # Install dependencies
make dev # Run FastAPI with hot reload
make ui # Run Streamlit UI
make lint # Run ruff linter
make eval # Run evaluation pipeline

make build # Build Docker image
make up # Start all services
make down # Stop all services
make restart # Restart services
make rebuild # Rebuild from scratch
make logs # Tail all logs
make logs-api # Tail API logs
make logs-streamlit # Tail Streamlit logs
make ps # Show service status

make clean-db # Delete Qdrant collection

make k8s-deploy # Deploy to K8s
make k8s-delete # Remove from K8s
make k8s-logs # View K8s logs
make k8s-forward # Port forward service

make clean # Remove Python caches
make clean-volumes # Remove Docker volumes
make clean-all # Complete cleanup
```

The docker-compose.yml defines five services:

```yaml
services:
  qdrant:      # Vector database (port 6333)
  api:         # FastAPI backend (port 8000)
  streamlit:   # Streamlit UI (port 8501)
  prometheus:  # Metrics collector (port 9090)
  grafana:     # Dashboards (port 3000)
```

All services are networked together and auto-restart on failure.
Deploy to production cluster:
```bash
# 1. Apply manifests
make k8s-deploy
# 2. Check status
kubectl get pods
kubectl get services
# 3. Forward ports (local testing)
kubectl port-forward service/rag-service 8000:8000
# 4. View logs
kubectl logs -f deployment/rag-deployment
# 5. Cleanup
make k8s-delete
```

Manifests:
- `k8s/qdrant-statefulset.yaml` - Persistent Qdrant
- `k8s/qdrant-service.yaml` - Qdrant service
- `k8s/deployment.yaml` - API deployment
- `k8s/service.yaml` - LoadBalancer/NodePort
Troubleshooting:

1. Langfuse traces not appearing
```bash
# Check environment variables
echo $LANGFUSE_PUBLIC_KEY
echo $LANGFUSE_SECRET_KEY
# Verify network access
curl https://cloud.langfuse.com
```

2. Qdrant connection failed
```bash
# Check Qdrant is running
curl http://localhost:6333/
docker ps | grep qdrant
# Restart Qdrant
docker restart qdrant
```

3. Streamlit blank page
```bash
# Check logs for import errors
make logs-streamlit
# Verify dependencies
pip list | grep streamlit
```

4. FlashRank model download issues
```bash
# Pre-download model
python -c "from flashrank import Ranker; Ranker(model_name='ms-marco-MiniLM-L-12-v2', cache_dir='./opt')"
# Check cache directory
ls -lah opt/
```

5. Docker build errors
```bash
# Clean rebuild
make rebuild
# Check Docker resources
docker system df
docker builder prune
```

Enable detailed logging:

```bash
export LOG_LEVEL=DEBUG
export PYTHONPATH=.
# Run with debug output
uvicorn src.main:app --log-level debug
```

Unit tests:

```bash
# Run all tests
pytest tests/
# With coverage
pytest --cov=src tests/
```

Integration tests:

```bash
# Test API endpoints
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"query": "What is PDF?"}'
# Test health check
curl http://localhost:8000/health
```

Load testing:

```bash
# Install Apache Bench
sudo apt-get install apache2-utils
# Run load test
ab -n 1000 -c 10 http://localhost:8000/health
```

GitHub Actions automatically:
- Lints code with Ruff
- Starts a Qdrant service
- Runs component initialization tests
- Ingests test data
- Runs RAG evaluation
- Uploads evaluation reports
See .github/workflows/ci.yml
Resources:

- Langfuse v3 Docs
- Qdrant Documentation
- FlashRank GitHub
- Ragas Documentation
- FastAPI Docs
- Streamlit Docs
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Run linters (`make lint`)
- Submit a pull request
MIT License
Copyright (c) 2025 Andriy Vlonha
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
Built with:
- Langfuse - MLOps observability platform
- LangChain - LLM application framework
- Groq - Ultra-fast LLM inference
- Qdrant - Vector database
- FlashRank - Neural reranking
- Ragas - RAG evaluation
- Email: [email protected]