
📘 Talk to Your Docs — Enterprise RAG System


💡 TL;DR — What this is

Talk to Your Docs is a production-grade Retrieval-Augmented Generation (RAG) microservice built for MLOps practitioners.

It ingests PDFs, cleans and chunks text, indexes embeddings into Qdrant, performs deep retrieval with FlashRank reranking, and uses an LLM (Groq / GPT-OSS-20B) to answer queries grounded in source documents.

🆕 What's New in v3

  • Langfuse v3 Support - Full compatibility with latest Langfuse SDK
  • Prometheus + Grafana - Production monitoring stack
  • Improved Architecture - Separated UI and API concerns
  • Enhanced Docker Compose - Multi-service orchestration
  • Better Error Handling - Graceful fallbacks for observability

📂 Repository Layout

Talk_to_Your_Docs_RAG_System/
├── .github/
│   └── workflows/
│       └── ci.yml              # GitHub Actions CI/CD
├── evaluation/
│   ├── evaluate.py             # Ragas evaluation script
│   └── report.csv              # Latest evaluation results
├── images/                     # Screenshots for README
├── k8s/                        # Kubernetes manifests
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── qdrant-statefulset.yaml
│   └── qdrant-pvc.yaml
├── opt/                        # FlashRank model cache
├── qdrant_db/                  # Local Qdrant persistence
├── src/
│   ├── app.py                  # FastAPI application (UPDATED v3)
│   ├── config.py               # Configuration (UPDATED)
│   ├── ingestion.py            # PDF processing (UPDATED v3)
│   ├── main.py                 # FastAPI entry point
│   └── rag.py                  # RAG engine core (UPDATED v3)
├── ui/
│   └── streamlit_app.py        # Streamlit UI (UPDATED v3)
├── tests/                      # Unit tests
├── .dockerignore
├── .env                        # Environment variables
├── .env.example
├── .gitignore
├── docker-compose.yml          # Multi-service setup (UPDATED)
├── Dockerfile                  # Python 3.11 image (UPDATED)
├── Dockerfile.qdrant           # Custom Qdrant image
├── Makefile                    # Development commands (UPDATED)
├── prometheus.yml              # Prometheus config (NEW)
├── requirements.txt            # Dependencies (Langfuse v3)
├── requirements-dev.txt        # Dependencies (Local dev)
└── README.md                   # This file

💻 Tech Stack

Core Components

  • 🐍 Python 3.11 - Main runtime
  • ⚡ FastAPI - REST API (/chat, /ingest, /feedback, /health)
  • 👑 Streamlit - Interactive UI for demos
  • 💾 Qdrant - Vector database (port 6333)
  • ⚡ FlashRank - Cross-encoder reranker
  • 🤖 LLM for generation:
    • Groq — Ultra-fast inference platform
    • GPT-OSS — open-weight model family (default: openai/gpt-oss-20b)

MLOps Stack (NEW/UPDATED)

  • πŸ•΅οΈ Langfuse v3 - Tracing & observability with compatibility layer
  • πŸ“ˆ Prometheus - Metrics collection (port 9090)
  • πŸ“Š Grafana - Metrics visualization (port 3000)
  • πŸ“Š Ragas - Automated RAG evaluation
  • 🐳 Docker Compose - Multi-container orchestration
  • ☸️ Kubernetes - Production deployment

✨ Features

Core RAG

  • 📄 Page-aware PDF ingestion with metadata preservation
  • 🧹 Intelligent text cleaning (hyphenation, citations, null bytes)
  • 🔁 Chunk deduplication via MD5 hashing
  • 🧠 Multi-query generation for better recall
  • 🔍 Deep retrieval (k=50) + FlashRank reranking (top-7)
  • 🛡️ Strict prompt templates to reduce hallucinations
  • 💬 Chat history support for conversational context

MLOps & Observability (v3)

  • 🆔 Trace IDs - Every answer links to a Langfuse trace
  • 👍 Feedback loop - Thumbs up/down for continuous improvement
  • 📊 Prometheus metrics - Latency, throughput, errors
  • 📈 Grafana dashboards - Real-time monitoring
  • ⚙️ Background ingestion - Non-blocking PDF processing
  • 🔄 Graceful fallbacks - Robust error handling

⚡ Quickstart

Prerequisites

  • Docker & Docker Compose
  • Python 3.11+
  • Groq API key (create one in the Groq console)
  • Langfuse account (cloud.langfuse.com, or self-hosted)

Option 1: Docker Compose (Recommended)

# 1. Clone repository
git clone <repo-url>
cd Talk_to_Your_Docs_RAG_System

# 2. Set up environment variables
cp .env.example .env
# Edit .env and add:
# - GROQ_API_KEY=gsk_...
# - LANGFUSE_PUBLIC_KEY=pk-lf-...
# - LANGFUSE_SECRET_KEY=sk-lf-...

# 3. Start all services
make up
# Or: docker compose up -d

# 4. Access services
# - Streamlit UI: http://localhost:8501
# - FastAPI docs: http://localhost:8000/docs
# - Prometheus: http://localhost:9090
# - Grafana: http://localhost:3000 (admin/admin)
# - Qdrant: http://localhost:6333

Option 2: Local Development

# 1. Install dependencies
make install
# Or: uv venv && uv pip install -r requirements.txt

# 2. Activate virtual environment
source .venv/bin/activate  # uv creates .venv by default

# 3. Start Qdrant (in separate terminal)
docker run -p 6333:6333 qdrant/qdrant

# 4A. Run Streamlit UI
make ui
# Or: streamlit run ui/streamlit_app.py

# 4B. Run FastAPI
make dev
# Or: uvicorn src.main:app --reload

🔧 Configuration

Edit src/config.py or use environment variables in .env:

Required

GROQ_API_KEY=gsk_your_key_here
LANGFUSE_PUBLIC_KEY=pk-lf-your_key
LANGFUSE_SECRET_KEY=sk-lf-your_secret

Optional

QDRANT_URL=http://localhost:6333
LANGFUSE_HOST=https://cloud.langfuse.com
COLLECTION_NAME=rag_documents
LLM_MODEL=openai/gpt-oss-20b
EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
LOG_LEVEL=INFO
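
The exact contents of src/config.py aren't reproduced here; as a rough sketch, a pydantic-settings class could map the variables above to typed fields (an assumption for illustration, not the repository's actual code):

# Hypothetical sketch of a settings class; the real src/config.py may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")  # .env values win over defaults

    groq_api_key: str                       # GROQ_API_KEY (required)
    langfuse_public_key: str                # LANGFUSE_PUBLIC_KEY (required)
    langfuse_secret_key: str                # LANGFUSE_SECRET_KEY (required)
    qdrant_url: str = "http://localhost:6333"
    langfuse_host: str = "https://cloud.langfuse.com"
    collection_name: str = "rag_documents"
    llm_model: str = "openai/gpt-oss-20b"
    embedding_model: str = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    chunk_size: int = 1000
    chunk_overlap: int = 200
    log_level: str = "INFO"

settings = Settings()  # import this singleton wherever configuration is needed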

🧠 How It Works

1. Document Ingestion


  • Upload PDF → Extract text per page
  • Clean text → Remove hyphenation, null bytes, citations
  • Split into chunks → RecursiveCharacterTextSplitter
  • Generate hashes → MD5 for deduplication
  • Index to Qdrant → Store embeddings with metadata
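
A minimal sketch of the chunk-and-dedup step, with assumed function names (the real logic lives in src/ingestion.py):

# Illustrative chunk-and-dedup step; not the repository's exact code.
import hashlib
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

def unique_chunks(pages: list[str]) -> list[str]:
    seen: set[str] = set()
    chunks: list[str] = []
    for page_text in pages:
        for chunk in splitter.split_text(page_text):
            digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
            if digest not in seen:  # skip chunks that were already indexed
                seen.add(digest)
                chunks.append(chunk)
    return chunks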

2. Query Pipeline


  1. Multi-query generation - Generate 3 variations of the user query
  2. Deep retrieval - Fetch top-50 chunks per query from Qdrant
  3. FlashRank reranking - Cross-encoder reranks to top-7
  4. LLM generation - Generate answer grounded in context
  5. Trace capture - Return answer + trace_id for feedback
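
For the reranking step, FlashRank's Ranker and RerankRequest are the SDK's real entry points; the wrapper around them below is an illustrative sketch:

# Rerank the ~50 retrieved chunks down to the 7 most relevant ones.
from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", cache_dir="./opt")

def rerank(query: str, chunks: list[str], top_k: int = 7) -> list[str]:
    passages = [{"id": i, "text": text} for i, text in enumerate(chunks)]
    results = ranker.rerank(RerankRequest(query=query, passages=passages))
    return [r["text"] for r in results[:top_k]]  # results come back sorted by score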

3. Observability (Langfuse v3)


  • Automatic tracing via @observe decorators
  • Token counting - Input/output tokens tracked
  • Latency tracking - Each step measured
  • Feedback loop - Thumbs up/down linked to traces
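
A minimal sketch of the decorator pattern, using the Langfuse v3 SDK's observe and get_client (the function body here is illustrative):

# Any function wrapped with @observe shows up as a span in the trace.
from langfuse import observe, get_client

@observe()
def answer_query(query: str) -> str:
    # ... retrieval, reranking, and generation happen here ...
    return "answer grounded in retrieved context"

get_client().flush()  # ensure buffered traces are sent before shutdown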

📡 API Reference

POST /chat

Query the RAG system.

Request:

{
  "query": "What is PDF?"
}

Response:

{
  "answer": "PDF stands for Portable Document Format...",
  "trace_id": "trace-abc-123",
  "sources": [
    {
      "text": "PDF was created by Adobe...",
      "meta": {"source": "doc.pdf", "page": 1}
    }
  ]
}

POST /feedback

Submit user feedback for a trace.

Request:

{
  "trace_id": "trace-abc-123",
  "score": 1.0,
  "comment": "Helpful answer"
}
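
A small Python client that exercises /chat and /feedback together, using only the request and response shapes documented above:

# Ask a question, then attach feedback to the trace that produced the answer.
import requests

BASE_URL = "http://localhost:8000"

chat = requests.post(f"{BASE_URL}/chat", json={"query": "What is PDF?"}).json()
print(chat["answer"], chat["trace_id"])

requests.post(f"{BASE_URL}/feedback", json={
    "trace_id": chat["trace_id"],  # links the score to the Langfuse trace
    "score": 1.0,
    "comment": "Helpful answer",
})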

POST /ingest

Upload PDF for background processing.

Request:

# multipart upload; the form field is assumed to be named "file"
curl -X POST http://localhost:8000/ingest \
  -F "file=@document.pdf"

GET /health

Health check endpoint.

Response:

{"status": "healthy"}

GET /metrics

Prometheus metrics endpoint.


🖥️ Streamlit UI

The UI is designed for production workloads with:

  • Custom boot sequence - Visual feedback during model loading
  • Asynchronous ingestion - Non-blocking PDF processing
  • Real-time feedback - Thumbs up/down integrated with Langfuse
  • Source citations - Show page numbers and text snippets

Screenshots (in images/): the boot sequence with lazy loading of heavy models, real-time document ingestion progress, and the interactive chat with source citations.


📊 Monitoring & Observability

Langfuse Dashboard

  • Traces - Every RAG pipeline execution
  • Scores - User feedback (thumbs up/down)
  • Prompts - Version-controlled system prompts
  • Analytics - Token usage, costs, latency


Prometheus Metrics

Key metrics exposed at /metrics:

  • http_requests_total - Total API calls
  • http_request_duration_seconds - Latency histogram
  • http_requests_in_progress - Concurrent requests
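
How the app wires these up is not shown here; one common approach, assuming prometheus-fastapi-instrumentator (the repo may instrument differently), looks like:

# Sketch: expose default HTTP metrics at /metrics from a FastAPI app.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # adds the /metrics route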

Access Prometheus at http://localhost:9090

Screenshot: Prometheus targets page (in images/).

Grafana Dashboards

Pre-configured dashboards for:

  • API latency (p50, p95, p99)
  • Error rates
  • Throughput (requests/sec)
  • Qdrant performance

Quick Links

| Service      | URL                        | Credentials   |
| ------------ | -------------------------- | ------------- |
| Streamlit UI | http://localhost:8501      | None          |
| API Docs     | http://localhost:8000/docs | None          |
| Grafana      | http://localhost:3000      | admin / admin |
| Prometheus   | http://localhost:9090      | None          |

Note

All services are intended to run locally. Grafana uses default credentials on first start; change them in production.

Access Grafana at http://localhost:3000 (admin/admin)

Screenshots: full and simplified Grafana dashboard views (in images/).


🧪 Evaluation & Tracking

We use Ragas for quality evaluation and Weights & Biases (W&B) for experiment tracking.

Running Experiments

Run evaluation pipeline:

make eval
# Or run the scripts directly:
# 1) python evaluation/track_experiment.py
# 2) python evaluation/evaluate.py
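
At its core, a Ragas run boils down to the sketch below (column names follow Ragas conventions; evaluation/evaluate.py holds the project's actual pipeline, and Ragas needs an LLM judge configured, e.g. via an API key):

# Score a tiny synthetic sample with the three metrics reported below.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy

sample = Dataset.from_dict({
    "question":     ["What is PDF?"],
    "answer":       ["PDF stands for Portable Document Format..."],
    "contexts":     [["PDF was created by Adobe..."]],
    "ground_truth": ["PDF is the Portable Document Format created by Adobe."],
})

print(evaluate(sample, metrics=[faithfulness, context_precision, answer_relevancy]))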

Tracked Experiment (with W&B)

| Metric            | Score | Description                          |
| ----------------- | ----- | ------------------------------------ |
| Faithfulness      | 1.00  | Zero hallucinations                  |
| Context Precision | 1.00  | Perfect retrieval                    |
| Answer Relevancy  | 0.83  | N/A on the free tier (rate-limited)  |

Latest Results (evaluate.py):

| Metric            | Score | Description         |
| ----------------- | ----- | ------------------- |
| Faithfulness      | 1.00  | Zero hallucinations |
| Context Precision | 1.00  | Perfect retrieval   |
| Answer Relevancy  | 0.67  | High alignment      |


Performance Benchmarks

| Configuration     | Recall | Precision | Hallucination Rate |
| ----------------- | ------ | --------- | ------------------ |
| Standard RAG      | 68%    | 72%       | Low                |
| Deep RAG + Rerank | 94%    | 89%       | Near Zero          |

πŸ› οΈ Makefile Commands

Development

make install         # Install dependencies
make dev            # Run FastAPI with hot reload
make ui             # Run Streamlit UI
make lint           # Run ruff linter
make eval           # Run evaluation pipeline

Docker

make build          # Build Docker image
make up             # Start all services
make down           # Stop all services
make restart        # Restart services
make rebuild        # Rebuild from scratch
make logs           # Tail all logs
make logs-api       # Tail API logs
make logs-streamlit # Tail Streamlit logs
make ps             # Show service status

Database

make clean-db       # Delete Qdrant collection

Kubernetes

make k8s-deploy     # Deploy to K8s
make k8s-delete     # Remove from K8s
make k8s-logs       # View K8s logs
make k8s-forward    # Port forward service

Cleanup

make clean          # Remove Python caches
make clean-volumes  # Remove Docker volumes
make clean-all      # Complete cleanup

🐳 Docker Compose Services

services:
  qdrant:           # Vector database (port 6333)
  api:              # FastAPI backend (port 8000)
  streamlit:        # Streamlit UI (port 8501)
  prometheus:       # Metrics collector (port 9090)
  grafana:          # Dashboards (port 3000)

All services are networked and auto-restart on failure.


☸️ Kubernetes Deployment

Deploy to production cluster:

# 1. Apply manifests
make k8s-deploy

# 2. Check status
kubectl get pods
kubectl get services

# 3. Forward ports (local testing)
kubectl port-forward service/rag-service 8000:8000

# 4. View logs
kubectl logs -f deployment/rag-deployment

# 5. Cleanup
make k8s-delete

Manifests:

  • k8s/qdrant-statefulset.yaml - Persistent Qdrant
  • k8s/qdrant-service.yaml - Qdrant service
  • k8s/deployment.yaml - API deployment
  • k8s/service.yaml - LoadBalancer/NodePort

πŸ” Troubleshooting

Common Issues

1. Langfuse traces not appearing

# Check environment variables
echo $LANGFUSE_PUBLIC_KEY
echo $LANGFUSE_SECRET_KEY

# Verify network access
curl https://cloud.langfuse.com

2. Qdrant connection failed

# Check Qdrant is running
curl http://localhost:6333/
docker ps | grep qdrant

# Restart Qdrant
docker restart qdrant

3. Streamlit blank page

# Check logs for import errors
make logs-streamlit

# Verify dependencies
pip list | grep streamlit

4. FlashRank model download issues

# Pre-download model
python -c "from flashrank import Ranker; Ranker(model_name='ms-marco-MiniLM-L-12-v2', cache_dir='./opt')"

# Check cache directory
ls -lah opt/

5. Docker build errors

# Clean rebuild
make rebuild

# Check Docker resources
docker system df
docker builder prune

Debug Mode

Enable detailed logging:

export LOG_LEVEL=DEBUG
export PYTHONPATH=.

# Run with debug output
uvicorn src.main:app --log-level debug

🧪 Testing

Unit Tests

# Run all tests
pytest tests/

# With coverage
pytest --cov=src tests/
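
A representative unit test, assuming src.main exposes the FastAPI app object:

# tests/test_health.py — sketch of a unit test against the in-process app.
from fastapi.testclient import TestClient
from src.main import app

client = TestClient(app)

def test_health():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}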

Integration Tests

# Test API endpoints
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"query": "What is PDF?"}'

# Test health check
curl http://localhost:8000/health

Load Testing

# Install Apache Bench
sudo apt-get install apache2-utils

# Run load test
ab -n 1000 -c 10 http://localhost:8000/health

🚀 CI/CD Pipeline

GitHub Actions automatically:

  1. ✅ Lints code with Ruff
  2. ✅ Starts Qdrant service
  3. ✅ Runs component initialization tests
  4. ✅ Ingests test data
  5. ✅ Runs RAG evaluation
  6. 📦 Uploads evaluation reports

See .github/workflows/ci.yml



🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Run linters (make lint)
  5. Submit a pull request

🔓 License

MIT License

Copyright (c) 2025 Andriy Vlonha

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


πŸ™ Acknowledgments

Built with:

  • Langfuse - MLOps observability platform
  • LangChain - LLM application framework
  • Groq - Ultra-fast LLM inference
  • Qdrant - Vector database
  • FlashRank - Neural reranking
  • Ragas - RAG evaluation

📞 Support