A locally hosted RAG (Retrieval-Augmented Generation) system for intelligent document research. Upload your own private documents, ask questions in plain English, and get AI-powered answers with source citations—all running privately on your machine.
Obviously, no current laptop has the GPU power to run top-of-the-line models and techniques across a full RAG pipeline. This project aims to allow evaluating multiple small models and features together to find an optimal pipeline for a given set of documents that remains reasonably usable on a local computer.
- Question Answering: Local LLM for answering questions about your documents
- Document Formats: Supports txt, md, pdf, docx, pptx, xlsx, html
- Chat History: Session-based conversation memory persists across restarts
- Local Deployment: Can run entirely on your machine with no external API dependencies
- Web Interface: ChatGPT-like interface with document management
- Contextual Retrieval: Document context embedded with chunks for improved accuracy
- Structure Preservation: Maintains document hierarchy including headings, sections, and tables
- Hybrid Search: BM25 keyword matching combined with vector semantic search
- Reranking: Cross-encoder model for result refinement, using a small model loaded in memory (see the sketch after this list)
- Async Processing: Background document indexing via Celery task queue
- Progress Tracking: Real-time status updates for document uploads
- Source Attribution: Results include document excerpts and metadata
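The hybrid search and reranking steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes the `rank_bm25`, `sentence-transformers`, and `numpy` packages, uses a toy in-memory corpus, and fuses the two rankings with reciprocal rank fusion (RRF).

```python
# Minimal sketch of hybrid search + reranking; not the project's actual code.
# Assumes: pip install rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

chunks = [
    "Celery workers index uploaded documents in the background.",
    "ChromaDB stores one embedding per document chunk.",
    "Hybrid search combines BM25 keyword matching with vector search.",
]
query = "How are documents indexed?"

# Keyword side: BM25 over whitespace-tokenized chunks.
bm25_scores = BM25Okapi([c.lower().split() for c in chunks]).get_scores(query.lower().split())
bm25_ranking = list(np.argsort(-bm25_scores))

# Semantic side: cosine similarity of normalized dense embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
sims = encoder.encode(chunks, normalize_embeddings=True) @ encoder.encode(query, normalize_embeddings=True)
vec_ranking = list(np.argsort(-sims))

# Fuse the two rankings with reciprocal rank fusion (RRF).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

candidates = rrf([bm25_ranking, vec_ranking])

# Rerank the fused candidates with a small cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunks[i]) for i in candidates])
for i, s in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(f"{s:.3f}  {chunks[i]}")
```

RRF is used in this sketch because it fuses the two rankings without having to normalize BM25 and cosine scores onto a common scale.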
- Frontend: SvelteKit + Tailwind CSS
- Backend: Python + FastAPI
- Vector Database: ChromaDB
- AI Models: Ollama (local LLM and embeddings; see the sketch after this list)
- Document Processing: Docling + LlamaIndex
- Task Queue: Celery + Redis
- Package Management: uv
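For a rough idea of how the vector store and embedding model interact, the sketch below writes chunks to ChromaDB using embeddings from Ollama's nomic-embed-text. It is not the project's actual code; the collection name and storage path are made up for the example.

```python
# Sketch of the ChromaDB + Ollama embedding flow; names are illustrative only.
# Assumes: pip install chromadb ollama, with Ollama running locally.
import chromadb
import ollama

def embed(text: str) -> list[float]:
    # nomic-embed-text is the default embedding model in this stack.
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

client = chromadb.PersistentClient(path="./chroma_demo")  # hypothetical path
collection = client.get_or_create_collection("docs")      # hypothetical name

chunks = ["Reranking uses a small cross-encoder.", "Redis brokers Celery tasks."]
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
)

results = collection.query(query_embeddings=[embed("What brokers tasks?")], n_results=1)
print(results["documents"][0][0])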
┌────────────────────────────────────┐
│ External: Ollama (Host Machine) │
End User │ host.docker.internal:11434 │
(browser) │ - LLM: gemma3:4b │
│ │ - Embeddings: nomic-embed-text │
│ └────────────────────────────────────┘
│ │
│ Public Network (host) │
--------------│------------------------------│-------------------
│ Private Network (Docker) │
▼ │
┌─────────────────┐ ┌──────────────────────┐
│ WebApp │ HTTP │ RAG Server │
│ (SvelteKit) │◄─────────►│ (FastAPI) │
│ Port: 8000 │ │ Port: 8001 │
└─────────────────┘ │ │
│ ┌────────────────┐ │
│ │ Docling │ │
│ │ + LlamaIndex │ │
│ │ + Hybrid Search│ │
│ │ + Reranking │ │
│ └────────────────┘ │
└──────────┬───────────┘
│
│
┌─────────────┬───────────────────┼─────────┐
│ │ │ │
┌────▼─────┐ ┌────▼────┐ ┌──────▼──────┐ │
│ ChromaDB │ │ Redis │ │ Celery │ │
│ (Vector │ │(Message │ │ Worker │ │
│ DB) │ │ Broker) │ │ (Async │ │
└──────────┘ └─────────┘ │ Processing) │ │
└──────┬──────┘ │
│ │
┌───────────────────┐
│ Shared Volume │
│ /tmp/shared │
│ (File Transfer) │
└───────────────────┘
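The async indexing path in the diagram (Redis as broker, a Celery worker picking up files from the shared volume) could be wired roughly as follows. The task name, broker URLs, and progress payload are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the background indexing task; not the project's code.
# Assumes: pip install celery redis, with Redis reachable at the service name "redis".
from celery import Celery

app = Celery("rag_tasks", broker="redis://redis:6379/0", backend="redis://redis:6379/1")

@app.task(bind=True)
def index_document(self, shared_path: str) -> dict:
    """Parse a file dropped into the shared volume and index its chunks."""
    # 1. Parse with Docling / LlamaIndex (omitted here).
    # 2. Embed chunks via Ollama and write them to ChromaDB.
    # 3. Report progress so the WebApp can show real-time status updates.
    self.update_state(state="PROGRESS", meta={"file": shared_path, "pct": 50})
    return {"file": shared_path, "status": "indexed"}

# The RAG server would enqueue an upload roughly like this:
# index_document.delay("/tmp/shared/report.pdf")
```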
- Docker container host (Docker Desktop, OrbStack, or Podman)
- Ollama for running AI models locally
- Python 3.13 and uv package manager
# Install Docker alternative (faster than Docker Desktop)
brew install orbstack
# Install uv package manager
brew install uv
# Install Ollama (recommended method)
curl https://ollama.ai/install.sh | sh
# Pull required models
ollama pull gemma3:4b # Inference model
ollama pull nomic-embed-text # Embedding model

git clone [email protected]:gittycat/rag-docling.git
cd rag-docling

# Create secrets directory
mkdir -p secrets

# Add Ollama configuration if needed
echo "OLLAMA_HOST=http://host.docker.internal:11434" > secrets/ollama_config.env

docker compose up

Open the WebApp at http://localhost:8000
- Open http://localhost:8000
- Navigate to the Admin section
- Click "Upload Documents"
- Select files (PDF, DOCX, TXT, MD, etc.)
- Monitor progress in real-time
- Use the main page to query your documents (or query the API directly, as sketched below)
- View Documents: See all indexed documents in the Admin section
- Delete Documents: Remove documents you no longer need
- Clear Chat: Start a fresh conversation anytime
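If you prefer scripted access over the web UI, you can query the RAG server directly. The endpoint path and payload below are hypothetical placeholders, and the example assumes port 8001 is published to the host; see DEVELOPMENT.md for the actual API.

```python
# Hypothetical example of querying the RAG server over HTTP.
# The /query path and payload shape are assumptions; check DEVELOPMENT.md for the real API.
import requests

resp = requests.post(
    "http://localhost:8001/query",  # RAG server port from the diagram
    json={"question": "What does the Q3 report say about revenue?"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```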
Basic configuration is handled through environment variables in docker-compose.yml. For most users, the defaults work well.
- Models: Change LLM or embedding models (default: gemma3:4b, nomic-embed-text)
- Retrieval: Adjust the number of results returned (default: 10)
- Features: Enable/disable hybrid search, contextual retrieval, reranking
For detailed configuration options, see DEVELOPMENT.md.
Most of the extra documentation is intended for development with Claude Code. It includes more in-depth research on techniques to improve accuracy. I will probably migrate some of this info to Claude Skills in the future.
- DEVELOPMENT.md: API documentation, configuration details, troubleshooting
- CLAUDE.md: Project guide for Claude Code development
- docs/: Additional guides and implementation details
TODO: Detail the roadmap.
Overall, the immediate goal is to improve evaluation (making it easier to compare features and models), then to migrate the application online, which will require security and data privacy improvements.
MIT License