Ollama RAG Docling is a Retrieval-Augmented Generation (RAG) system for intelligent, context-aware question answering over your private document and media collections. It combines local Large Language Models (LLMs) served by Ollama, advanced document processing from Docling, and recent (2025) RAG research to achieve high accuracy and performance.
- Overview
- System Architecture
- Key Features
- Technology Stack
- Project Structure
- Installation Guide
- Configuration
- Component Documentation
- API Reference
- Usage Examples
- Deployment
- Troubleshooting
- Contributing
Retrieval-Augmented Generation (RAG) enhances LLM responses by:
- Retrieving relevant information from a knowledge base
- Augmenting the LLM prompt with retrieved context
- Generating accurate, context-aware responses
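
In code, this retrieve-augment-generate loop is only a few lines. A minimal sketch, using hypothetical `embed`, `vector_store`, and `ollama_generate` helpers rather than this project's actual classes:

```python
# Minimal RAG loop sketch (hypothetical helpers, not the project's real API)
def answer(question: str, vector_store, embed, ollama_generate, top_k: int = 5) -> str:
    # 1. Retrieve: embed the question and find the most similar chunks
    query_vector = embed(question)
    chunks = vector_store.search(query_vector, top_k=top_k)

    # 2. Augment: build a prompt that contains the retrieved context
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: let the local LLM produce a grounded answer
    return ollama_generate(prompt)
```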
- ✅ Privacy-First: Runs entirely locally using Ollama
- ✅ Multi-Format Support: PDFs, images, documents with OCR
- ✅ Advanced Chunking: Smart document segmentation strategies
- ✅ Vector Search: Fast similarity search with LanceDB
- ✅ Modern UI: Responsive Next.js interface with streaming
- ✅ Production-Ready: Docker support, batch processing, health checks
flowchart TB
title["Ollama RAG Docling System Architecture"]
subgraph Frontend["Frontend Layer"]
UI[Next.js UI]
Components[React Components]
API_Client[API Client]
end
subgraph Backend["Backend Layer"]
FastAPI[FastAPI Server]
RAG_System[RAG System]
Server[Backend Server]
end
subgraph Processing["Document Processing"]
Docling[Docling Integration]
Chunker[Enhanced Chunker]
VLM[Vision Language Model]
end
subgraph Indexing["Indexing Pipeline"]
Embedder[Embedders]
Contextualizer[Contextualizer]
GraphExtractor[Graph Extractor]
end
subgraph Retrieval["Retrieval Pipeline"]
QueryTransformer[Query Transformer]
Retrievers[Hybrid Retrievers]
Reranker[Reranker]
end
subgraph Storage["Storage Layer"]
LanceDB[(LanceDB Vector Store)]
SQLite[(SQLite Database)]
end
subgraph LLM["LLM Layer"]
Ollama[Ollama LLM]
Models[Local Models]
end
UI --> API_Client
API_Client --> FastAPI
FastAPI --> RAG_System
RAG_System --> Server
Server --> Processing
Processing --> Docling
Docling --> Chunker
Chunker --> VLM
VLM --> Indexing
Indexing --> Embedder
Embedder --> Contextualizer
Contextualizer --> GraphExtractor
GraphExtractor --> LanceDB
GraphExtractor --> SQLite
RAG_System --> Retrieval
Retrieval --> QueryTransformer
QueryTransformer --> Retrievers
Retrievers --> Reranker
Reranker --> LanceDB
Reranker --> Ollama
Ollama --> Models
Models --> RAG_System
RAG_System --> FastAPI
FastAPI --> API_Client
classDef frontend fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef backend fill:#e0f2f1,stroke:#00695c,stroke-width:2px
classDef processing fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef storage fill:#fce4ec,stroke:#880e4f,stroke-width:2px
classDef llm fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
class UI,Components,API_Client frontend
class FastAPI,RAG_System,Server backend
class Docling,Chunker,VLM,Indexing,Retrieval processing
class LanceDB,SQLite storage
class Ollama,Models llm
sequenceDiagram
participant User
participant Frontend
participant FastAPI
participant RAG
participant Docling
participant LanceDB
participant Ollama
User->>Frontend: Upload Document
Frontend->>FastAPI: POST /upload
FastAPI->>Docling: Process Document
Docling->>Docling: Extract Text & Images
Docling-->>FastAPI: Processed Content
FastAPI->>RAG: Index Document
RAG->>RAG: Chunk Document
RAG->>RAG: Generate Embeddings
RAG->>LanceDB: Store Vectors
LanceDB-->>RAG: Confirmation
RAG-->>FastAPI: Indexing Complete
FastAPI-->>Frontend: Success Response
User->>Frontend: Ask Question
Frontend->>FastAPI: POST /query
FastAPI->>RAG: Process Query
RAG->>LanceDB: Vector Search
LanceDB-->>RAG: Retrieved Chunks
RAG->>RAG: Rerank Results
RAG->>Ollama: Generate Response
Ollama-->>RAG: Stream Response
RAG-->>FastAPI: Stream Tokens
FastAPI-->>Frontend: SSE Stream
Frontend-->>User: Display Answer
flowchart LR
subgraph Input["Input"]
Docs[Documents]
PDFs[PDFs]
Images[Images]
end
subgraph Ingestion["Ingestion"]
DocConv[Document Converter]
OCR[OCR Engine]
AudioTrans[Audio Transcriber]
end
subgraph Processing["Processing"]
Chunk[Chunking]
Embed[Embedding]
Context[Contextualization]
end
subgraph Storage["Vector Store"]
Lance[(LanceDB)]
Meta[Metadata Store]
end
subgraph Retrieval["Retrieval"]
Query[Query Processing]
Search[Hybrid Search]
Rerank[Reranking]
end
subgraph Generation["Generation"]
LLM[Ollama LLM]
Stream[Response Stream]
end
subgraph Output["Output"]
Answer[Generated Answer]
Sources[Source Citations]
end
Docs --> DocConv
PDFs --> DocConv
Images --> OCR
DocConv --> Chunk
OCR --> Chunk
AudioTrans --> Chunk
Chunk --> Embed
Embed --> Context
Context --> Lance
Context --> Meta
Query --> Search
Search --> Lance
Lance --> Rerank
Rerank --> LLM
LLM --> Stream
Stream --> Answer
Meta --> Sources
classDef input fill:#e3f2fd,stroke:#1976d2
classDef processing fill:#f3e5f5,stroke:#7b1fa2
classDef storage fill:#fff3e0,stroke:#f57c00
classDef output fill:#e8f5e9,stroke:#388e3c
class Docs,PDFs,Images input
class Chunk,Embed,Context processing
class Lance,Meta storage
class Answer,Sources output
This system incorporates several techniques from recent (2025) RAG research:

Late-interaction reranking (Jina ColBERT v2):
- 8,192-token context (16x more than standard rerankers)
- Token-level matching for precise code/table queries
- +20-30% accuracy on table queries
- +15-25% accuracy on code queries
- 89 languages supported (vs. English-only alternatives)
- Automatic fallback to answerai-colbert-small-v1 if needed

Self-consistency checking (sketched in code after this list):
- Generates multiple answers and selects the most consistent one
- -40% hallucination rate on critical queries
- Flags low-consistency answers with warnings
- Based on ICLR 2025 research (arXiv:2505.09031)

Multi-modal embeddings:
- Specialized embeddings per content type:
  - Text → Qwen3-Embedding-0.6B
  - Code → CodeBERT
  - Tables → ColBERT (token-level)
- +10-15% accuracy on code queries
- +8-12% accuracy on table queries
- Automatic content type detection

Max-Min semantic chunking:
- 0.85-0.90 AMI scores (Discover Computing, 2025)
- Preserves semantic coherence within chunks
- +10-15% chunk quality
- -8% chunk count (fewer redundant splits)
- Alternative to traditional fixed-size chunking

Sentence-level context pruning:
- Uses the Provence model (ICLR 2025)
- -30% token usage with +8-12% precision
- Removes irrelevant sentences before generation
- Reduces noise in the context window

Real-time retrieval:
- Query live data sources (APIs, databases, feeds)
- Built-in sources: weather, stock prices, database queries
- Combines static documents with fresh data
- Extensible API source registration
- TTL-based caching (60s default)
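
As a rough illustration of the self-consistency idea above (a sketch only; the real implementation lives in `rag_system/agent/self_consistency.py` and its details may differ), one can sample several answers, score their mutual agreement, and flag low-consistency results:

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import List, Tuple

def most_consistent_answer(answers: List[str], threshold: float = 0.75) -> Tuple[str, float, bool]:
    """Pick the sampled answer that agrees most with the others (assumes >= 2 samples)."""
    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    # Mean agreement of each candidate with every other candidate
    scores = [
        mean(similarity(a, b) for j, b in enumerate(answers) if j != i)
        for i, a in enumerate(answers)
    ]
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best], scores[best], scores[best] < threshold  # (answer, score, warn_flag)

# Example: five sampled answers; the outlier phrasings lower their scores
answer, score, warn = most_consistent_answer(["42", "42", "43", "42", "forty-two"])
```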
Document Processing (Docling):
- Multi-format Support: PDF, DOCX, PPTX, images, HTML
- OCR Capabilities: Extracts text from images and scanned documents
- Table Extraction: Preserves tabular data structure
- Layout Analysis: Maintains document hierarchy
- Vision-Language Model: Processes images with VLM descriptions

Multiple Chunking Options:
- Semantic Chunking: Groups related content
- Max-Min Chunking: 0.85-0.90 AMI scores (new)
- Fixed-Size Chunking: Consistent chunk sizes
- Hierarchical Chunking: Preserves document structure
- LaTeX Chunking: Special handling for mathematical content
- Sliding Window: Overlapping chunks for context

Vector Search (LanceDB):
- Fast vector similarity search
- HNSW indexing for scalability
- Metadata filtering
- Multi-vector support

Retrieval Methods:
├── Dense Retrieval (Embeddings)
├── Sparse Retrieval (BM25)
├── Graph-based Retrieval
└── Hybrid Fusion (RRF)
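
Reciprocal Rank Fusion (RRF) merges the ranked lists produced by the dense, sparse, and graph retrievers into a single ranking. A minimal sketch of the standard RRF formula follows (not necessarily the exact implementation in `retrievers.py`; `k=60` is the commonly used default constant):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: dense and BM25 rankings disagree; RRF rewards chunks ranked well by both
fused = reciprocal_rank_fusion([["c1", "c2", "c3"], ["c2", "c3", "c1"]])
```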
LLM Generation (Ollama):
- Local model inference
- Multiple model support (Llama, Mistral, etc.)
- Streaming responses
- Context-aware generation

Agentic Capabilities:
- Query understanding
- Multi-step reasoning
- Answer verification
- Source attribution

Frontend (Next.js):
- Real-time chat interface
- Streaming responses
- Session management
- Index selection
- Model configuration
- Document upload progress
- Source citations
| Technology | Purpose | Version |
|---|---|---|
| Python | Core Backend | 3.8+ |
| FastAPI | Web Framework | Latest |
| Docling | Document Processing | Latest |
| LanceDB | Vector Database | Latest |
| Sentence Transformers | Embeddings | Latest |
| NetworkX | Graph Processing | Latest |
| PyMuPDF | PDF Processing | Latest |
| NLTK | NLP Operations | Latest |
| Technology | Purpose | Version |
|---|---|---|
| Next.js | React Framework | 14+ |
| TypeScript | Type Safety | Latest |
| Tailwind CSS | Styling | Latest |
| React Markdown | Markdown Rendering | Latest |
| Lucide React | Icons | Latest |
| Radix UI | UI Components | Latest |
| Technology | Purpose |
|---|---|
| Ollama | Local LLM Inference |
| Colpali-Engine | Multi-modal Embeddings |
| Rerankers | Result Reranking |
| Transformers | Model Management |
| PyTorch | Deep Learning |
ollama-rag-docling/
│
├── backend/                           # Backend server
│   ├── database.py                    # Database operations
│   ├── server.py                      # FastAPI server
│   ├── ollama_client.py               # Ollama integration
│   ├── simple_pdf_processor.py        # Basic PDF processing
│   └── requirements.txt               # Backend dependencies
│
├── rag_system/                        # Core RAG system
│   │
│   ├── agent/                         # Agentic components
│   │   ├── loop.py                    # Agentic RAG loop
│   │   └── verifier.py                # Answer verification
│   │
│   ├── indexing/                      # Indexing pipeline
│   │   ├── contextualizer.py          # Context enrichment
│   │   ├── embedders.py               # Embedding models
│   │   ├── graph_extractor.py         # Knowledge graph
│   │   ├── latechunk.py               # LaTeX chunking
│   │   ├── multimodal.py              # Multi-modal processing
│   │   ├── overview_builder.py        # Document summaries
│   │   └── representations.py         # Multi-representation
│   │
│   ├── ingestion/                     # Document ingestion
│   │   ├── audio_transcriber.py       # Audio processing
│   │   ├── chunking.py                # Document chunking
│   │   ├── docling_chunker.py         # Docling integration
│   │   ├── docling_integration.py     # Main Docling interface
│   │   ├── docling_vlm_converter.py   # Vision-Language Model
│   │   ├── document_converter.py      # Format conversion
│   │   └── enhanced_hybrid_chunker.py # Advanced chunking
│   │
│   ├── pipelines/                     # Processing pipelines
│   │   ├── indexing_pipeline.py       # Indexing workflow
│   │   └── retrieval_pipeline.py      # Retrieval workflow
│   │
│   ├── rerankers/                     # Result reranking
│   │   ├── reranker.py                # Reranking logic
│   │   └── sentence_pruner.py         # Context pruning
│   │
│   ├── retrieval/                     # Retrieval components
│   │   ├── query_transformer.py       # Query processing
│   │   └── retrievers.py              # Search methods
│   │
│   ├── utils/                         # Utilities
│   │   ├── batch_processor.py         # Batch operations
│   │   ├── logging_utils.py           # Logging setup
│   │   ├── ollama_client.py           # Ollama wrapper
│   │   └── validate_model_config.py   # Config validation
│   │
│   ├── api_server.py                  # RAG API server
│   ├── main.py                        # Main entry point
│   └── factory.py                     # Component factory
│
├── src/                               # Frontend source
│   │
│   ├── app/                           # Next.js app
│   │   ├── layout.tsx                 # Root layout
│   │   ├── page.tsx                   # Home page
│   │   └── globals.css                # Global styles
│   │
│   ├── components/                    # React components
│   │   ├── demo.tsx                   # Demo component
│   │   ├── IndexForm.tsx              # Index creation form
│   │   ├── IndexPicker.tsx            # Index selector
│   │   ├── IndexWizard.tsx            # Setup wizard
│   │   ├── LandingMenu.tsx            # Landing page
│   │   ├── Markdown.tsx               # Markdown renderer
│   │   ├── ModelSelect.tsx            # Model selector
│   │   ├── SessionIndexInfo.tsx       # Session info
│   │   │
│   │   └── ui/                        # UI components
│   │       ├── chat-bubble.tsx        # Chat messages
│   │       ├── chat-input.tsx         # Message input
│   │       ├── conversation-page.tsx  # Chat page
│   │       ├── session-chat.tsx       # Session chat
│   │       ├── session-sidebar.tsx    # Chat sidebar
│   │       └── [more UI components]
│   │
│   └── lib/                           # Utilities
│       ├── api.ts                     # API client
│       ├── types.ts                   # TypeScript types
│       └── utils.ts                   # Helper functions
│
├── Documentation/                     # Project docs
│   ├── api_reference.md               # API documentation
│   ├── architecture_overview.md       # Architecture guide
│   ├── deployment_guide.md            # Deployment instructions
│   ├── docker_usage.md                # Docker guide
│   ├── indexing_pipeline.md           # Indexing details
│   ├── installation_guide.md          # Setup instructions
│   ├── quick_start.md                 # Quick start guide
│   ├── retrieval_pipeline.md          # Retrieval details
│   └── system_overview.md             # System overview
│
├── docker-compose.yml                 # Docker orchestration
├── docker-compose.local-ollama.yml    # Local Ollama setup
├── Dockerfile                         # Container definition
├── requirements.txt                   # Python dependencies
├── package.json                       # Node.js dependencies
├── next.config.ts                     # Next.js config
├── tailwind.config.js                 # Tailwind config
│
├── create_index_script.py             # Index creation
├── demo_batch_indexing.py             # Batch indexing demo
├── run_system.py                      # System launcher
├── system_health_check.py             # Health check
│
├── setup_rag_system.sh                # Linux setup script
├── setup_rag_system.bat               # Windows setup script
├── start-docker.sh                    # Docker startup (Linux)
└── start-docker.bat                   # Docker startup (Windows)
- Python: 3.8 or higher
- Node.js: 16 or higher
- Ollama: Installed and running
- Git: For cloning repository
- Docker (Optional): For containerized deployment
# Clone the repository
git clone https://github.com/vedantparmar12/ollama-rag-docling.git
cd ollama-rag-docling
# Make setup script executable
chmod +x setup_rag_system.sh
# Run setup script
./setup_rag_system.sh

Or, on Windows:

# Clone the repository
git clone https://github.com/vedantparmar12/ollama-rag-docling.git
cd ollama-rag-docling
# Run setup script
setup_rag_system.bat

The setup script will:
- ✅ Check Python and Node.js installation
- ✅ Create virtual environment
- ✅ Install Python dependencies
- ✅ Install Node.js dependencies
- ✅ Verify Ollama installation
- ✅ Download required models
- ✅ Initialize database
git clone https://github.com/vedantparmar12/ollama-rag-docling.git
cd ollama-rag-docling

# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install backend-specific dependencies
cd backend
pip install -r requirements.txt
cd ..

# Install Node.js dependencies
npm install
# Build frontend
npm run build

# Install recommended models
ollama pull llama3.2
ollama pull nomic-embed-text

# Install dependencies for 2025 enhancements
pip install rerankers scikit-learn httpx
# Test new features
python test_advanced_features.py

# Run system health check
python system_health_check.py
# Start the system
python run_system.py

# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down

# Start with local Ollama
docker-compose -f docker-compose.local-ollama.yml up -d

# Linux/macOS
chmod +x start-docker.sh
./start-docker.sh
# Windows
start-docker.bat

Create a .env file in the root directory:
# Ollama Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3.2
EMBEDDING_MODEL=nomic-embed-text
# Backend Configuration
BACKEND_PORT=5328
BACKEND_HOST=0.0.0.0
# Frontend Configuration
NEXT_PUBLIC_API_URL=http://localhost:5328
FRONTEND_PORT=3000
# RAG Configuration
CHUNK_SIZE=512
CHUNK_OVERLAP=128
TOP_K=10
RERANK_TOP_K=5
# Vector Database
VECTOR_DB_PATH=./data/vector_store
METADATA_DB_PATH=./data/metadata.db
# Processing
MAX_WORKERS=4
BATCH_SIZE=10
# Logging
LOG_LEVEL=INFO
LOG_FILE=./logs/system.log

Edit batch_indexing_config.json:
{
"llm_model": "llama3.2",
"embedding_model": "nomic-embed-text",
"chunk_size": 512,
"chunk_overlap": 128,
"top_k": 10,
"rerank_top_k": 5,
"use_multimodal": true,
"use_contextual_embeddings": true,
"use_graph_extraction": false
}

# In enhanced_hybrid_chunker.py
CHUNKING_CONFIG = {
"strategy": "hybrid", # Options: semantic, fixed, hierarchical, hybrid
"chunk_size": 512,
"chunk_overlap": 128,
"min_chunk_size": 100,
"max_chunk_size": 1024,
"respect_sentence_boundaries": True,
"preserve_structure": True
}

class DoclingIntegration:
"""
Main interface for Docling document processing.
Features:
- Multi-format document conversion
- OCR for images and scanned documents
- Table extraction and preservation
- Layout analysis and hierarchy preservation
- Vision-Language Model integration
"""
def convert_document(self, file_path: str) -> ConversionResult:
"""
Convert document to processable format.
Args:
file_path: Path to document file
Returns:
ConversionResult with extracted text, tables, and images
"""
pass

Supported Formats:
- PDF (native and scanned)
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Images (PNG, JPG, TIFF)
- HTML documents
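
A minimal usage sketch for the converter interface described above (hedged: the constructor arguments and result fields are assumptions based on the docstring, not the verified signature):

```python
# Hypothetical usage of the DoclingIntegration interface shown above
from rag_system.ingestion.docling_integration import DoclingIntegration

integration = DoclingIntegration()
result = integration.convert_document("reports/q3_report.pdf")

# The docstring says the result carries extracted text, tables, and images;
# the attribute names below are illustrative assumptions.
print(result.text[:500])
print(f"tables: {len(result.tables)}, images: {len(result.images)}")
```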
class EnhancedHybridChunker:
"""
Advanced document chunking with multiple strategies.
Strategies:
1. Semantic: Groups related content using embeddings
2. Fixed-Size: Consistent chunk sizes with overlap
3. Hierarchical: Preserves document structure
4. LaTeX: Special handling for mathematical content
"""
def chunk_document(
self,
text: str,
strategy: str = "hybrid"
) -> List[Chunk]:
"""
Chunk document using specified strategy.
Args:
text: Document text
strategy: Chunking strategy
Returns:
List of Chunk objects with text and metadata
"""
pass

class SentenceTransformerEmbedder:
"""
Generate embeddings using Sentence Transformers.
Models:
- nomic-embed-text (default)
- all-MiniLM-L6-v2
- multi-qa-mpnet-base-dot-v1
"""
def embed_texts(
self,
texts: List[str],
batch_size: int = 32
) -> np.ndarray:
"""Generate embeddings for texts."""
pass

class ContextualEmbedder:
"""
Enrich chunks with contextual information.
Features:
- Document-level context
- Section headers
- Surrounding chunks
- Metadata enrichment
"""
def contextualize_chunk(
self,
chunk: Chunk,
document_context: str
) -> Chunk:
"""Add contextual information to chunk."""
pass

class GraphExtractor:
"""
Extract knowledge graph from documents.
Features:
- Entity extraction
- Relationship identification
- Graph construction
- NetworkX integration
"""
def extract_graph(
self,
chunks: List[Chunk]
) -> nx.Graph:
"""Extract knowledge graph from chunks."""
pass

class QueryTransformer:
"""
Transform and expand user queries.
Techniques:
- Query expansion
- Synonym generation
- Multi-query generation
- Query rewriting
"""
def transform_query(
self,
query: str
) -> List[str]:
"""Generate multiple query variations."""
pass

class HybridRetriever:
"""
Combine multiple retrieval methods.
Methods:
- Dense retrieval (vector similarity)
- Sparse retrieval (BM25)
- Graph-based retrieval
- Reciprocal Rank Fusion (RRF)
"""
def retrieve(
self,
query: str,
top_k: int = 10
) -> List[RetrievedChunk]:
"""Retrieve relevant chunks using hybrid approach."""
pass

class CrossEncoderReranker:
"""
Rerank retrieved chunks using cross-encoder.
Models:
- ms-marco-MiniLM-L-12-v2 (default)
- cross-encoder/ms-marco-TinyBERT-L-2-v2
"""
def rerank(
self,
query: str,
chunks: List[RetrievedChunk],
top_k: int = 5
) -> List[RetrievedChunk]:
"""Rerank chunks by relevance."""
pass

class AgenticRAGLoop:
"""
Multi-step reasoning for complex queries.
Steps:
1. Query understanding
2. Decomposition (if complex)
3. Iterative retrieval
4. Answer synthesis
5. Verification
"""
def process_query(
self,
query: str,
max_iterations: int = 3
) -> AgenticResponse:
"""Process query with multi-step reasoning."""
pass

class AnswerVerifier:
"""
Verify answer quality and accuracy.
Checks:
- Factual consistency
- Source attribution
- Completeness
- Relevance
"""
def verify_answer(
self,
query: str,
answer: str,
sources: List[str]
) -> VerificationResult:
"""Verify answer quality."""
pass

Base URL: http://localhost:5328
GET /health

Response:
{
"status": "healthy",
"ollama_connected": true,
"models_available": ["llama3.2", "nomic-embed-text"],
"vector_store_ready": true
}

POST /upload
Content-Type: multipart/form-data

Parameters:
- file: Document file (PDF, DOCX, image, etc.)
- index_name: Name of the index (optional)
- use_multimodal: Enable VLM processing (optional, default: true)
Response:
{
"success": true,
"document_id": "doc_123",
"chunks_created": 45,
"processing_time": 12.3
}

POST /create_index
Content-Type: application/json

Request Body:
{
"index_name": "my_documents",
"documents": [
{
"path": "/path/to/document.pdf",
"metadata": {
"author": "John Doe",
"category": "research"
}
}
],
"config": {
"chunk_size": 512,
"chunk_overlap": 128,
"use_contextual_embeddings": true
}
}

Response:
{
"success": true,
"index_name": "my_documents",
"total_chunks": 150,
"total_documents": 3,
"creation_time": 45.2
}

POST /query
Content-Type: application/json

Request Body:
{
"query": "What is the main conclusion of the research?",
"index_name": "my_documents",
"top_k": 10,
"rerank_top_k": 5,
"stream": true
}

Response (Streaming):
data: {"type": "retrieval", "chunks_found": 10}
data: {"type": "token", "content": "The"}
data: {"type": "token", "content": " main"}
data: {"type": "token", "content": " conclusion"}
data: {"type": "done", "sources": [...]}
GET /indices

Response:
{
"indices": [
{
"name": "my_documents",
"chunks": 150,
"documents": 3,
"created_at": "2024-01-15T10:30:00",
"last_updated": "2024-01-15T15:45:00"
},
{
"name": "research_papers",
"chunks": 287,
"documents": 8,
"created_at": "2024-01-10T09:15:00",
"last_updated": "2024-01-14T14:20:00"
}
]
}

DELETE /index/{index_name}

Response:
{
"success": true,
"message": "Index 'my_documents' deleted successfully"
}

GET /index/{index_name}/info

Response:
{
"name": "my_documents",
"chunks": 150,
"documents": 3,
"embedding_model": "nomic-embed-text",
"chunk_size": 512,
"created_at": "2024-01-15T10:30:00",
"last_updated": "2024-01-15T15:45:00",
"size_mb": 25.4
}

GET /models

Response:
{
"models": [
{
"name": "llama3.2",
"size": "3.2GB",
"modified": "2024-01-10T12:00:00"
},
{
"name": "mistral",
"size": "4.1GB",
"modified": "2024-01-08T09:30:00"
}
]
}

# Edit rag_system/main.py
EXTERNAL_MODELS = {
"reranker_model": "jinaai/jina-colbert-v2", # 8192 tokens, multilingual
# ... rest of config
}
# The reranker will auto-detect ColBERT models and use late-interaction

# In rag_system/main.py config
"advanced_features": {
"self_consistency": {
"enabled": True, # Enable for critical queries
"n_samples": 5,
"temperature": 0.7,
"consistency_threshold": 0.75
}
}
# Use in your code
from rag_system.agent.self_consistency import SelfConsistencyChecker
checker = SelfConsistencyChecker()
result = await checker.generate_with_consistency(generate_fn, query, context)
print(f"Consistency Score: {result['consistency_score']:.3f}")# In rag_system/main.py config
"advanced_features": {
"multimodal_embeddings": {
"enabled": True,
"enable_code": True,
"enable_table": True
}
}
# Use in indexing
from rag_system.indexing.multimodal_embedders import MultiModalEmbedder
embedder = MultiModalEmbedder(enable_code=True, enable_table=True)
enriched_chunks = embedder.embed_with_metadata(chunks)

# Alternative to traditional chunking
from rag_system.ingestion.maxmin_chunker import MaxMinSemanticChunker
chunker = MaxMinSemanticChunker(
min_chunk_size=100,
max_chunk_size=1500,
similarity_threshold=0.80
)
chunks = chunker.chunk_text(document_text, document_id="doc_001")

# Query live data + static documents
from rag_system.retrieval.realtime_retriever import RealtimeRetriever
retriever = RealtimeRetriever(static_retriever=your_rag_retriever)
# Queries like these will fetch real-time data:
results = await retriever.retrieve("What's the current weather in Paris?")
results = await retriever.retrieve("What is the price of AAPL stock?")
# Combines real-time + static document results
formatted = retriever.format_combined_context(results)

import requests
# API endpoint
BASE_URL = "http://localhost:5328"
# Upload and index a document
with open("document.pdf", "rb") as f:
files = {"file": f}
data = {"index_name": "my_index"}
response = requests.post(
f"{BASE_URL}/upload",
files=files,
data=data
)
print(response.json())
# Query the index
query_data = {
"query": "What are the main findings?",
"index_name": "my_index",
"top_k": 10,
"rerank_top_k": 5
}
response = requests.post(
f"{BASE_URL}/query",
json=query_data
)
print(response.json())

import json
# Load configuration
with open("batch_indexing_config.json", "r") as f:
config = json.load(f)
# Prepare documents
documents = [
{
"path": "/path/to/doc1.pdf",
"metadata": {"category": "research", "year": 2023}
},
{
"path": "/path/to/doc2.pdf",
"metadata": {"category": "technical", "year": 2024}
}
]
# Create index
index_data = {
"index_name": "batch_index",
"documents": documents,
"config": config
}
response = requests.post(
f"{BASE_URL}/create_index",
json=index_data
)
print(f"Indexed {response.json()['total_chunks']} chunks")# Create index using script
python create_index_script.py \
--index-name my_docs \
--input-dir ./documents \
--chunk-size 512 \
--use-multimodal
# Run batch indexing
python demo_batch_indexing.py \
--config batch_indexing_config.json \
--input-dir ./documents

// src/lib/api.ts
export async function uploadDocument(
file: File,
indexName: string
): Promise<UploadResponse> {
const formData = new FormData();
formData.append('file', file);
formData.append('index_name', indexName);
const response = await fetch(`${API_URL}/upload`, {
method: 'POST',
body: formData
});
return response.json();
}
export async function queryIndex(
query: string,
indexName: string,
topK: number = 10
): Promise<QueryResponse> {
const response = await fetch(`${API_URL}/query`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ query, index_name: indexName, top_k: topK })
});
return response.json();
}

// Handle streaming responses
async function streamQuery(query: string, indexName: string) {
const response = await fetch(`${API_URL}/query`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
query,
index_name: indexName,
stream: true
})
});
const reader = response.body?.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader!.read();
if (done) break;
const chunk = decoder.decode(value);
const lines = chunk.split('\n');
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = JSON.parse(line.slice(6));
if (data.type === 'token') {
// Append token to UI
appendToken(data.content);
} else if (data.type === 'done') {
// Show sources
displaySources(data.sources);
}
}
}
}
}

# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
restart: unless-stopped
backend:
build:
context: .
dockerfile: Dockerfile
container_name: rag_backend
ports:
- "5328:5328"
environment:
- OLLAMA_HOST=http://ollama:11434
- VECTOR_DB_PATH=/app/data/vector_store
volumes:
- ./data:/app/data
- ./logs:/app/logs
depends_on:
- ollama
restart: unless-stopped
frontend:
build:
context: .
dockerfile: Dockerfile.frontend
container_name: rag_frontend
ports:
- "3000:3000"
environment:
- NEXT_PUBLIC_API_URL=http://localhost:5328
depends_on:
- backend
restart: unless-stopped
volumes:
ollama_data:

FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
git \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt requirements-docker.txt ./
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir -r requirements-docker.txt
# Copy application code
COPY . .
# Create necessary directories
RUN mkdir -p data logs
# Expose port
EXPOSE 5328
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python system_health_check.py || exit 1
# Start application
CMD ["python", "run_system.py"]# Using AWS ECS
# 1. Build and push Docker image
docker build -t ollama-rag:latest .
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag ollama-rag:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/ollama-rag:latest
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/ollama-rag:latest
# 2. Create ECS task definition
# 3. Create ECS service
# 4. Configure load balancer

# Using Google Cloud Run
# 1. Build and push image
gcloud builds submit --tag gcr.io/PROJECT_ID/ollama-rag
# 2. Deploy to Cloud Run
gcloud run deploy ollama-rag \
--image gcr.io/PROJECT_ID/ollama-rag \
--platform managed \
--region us-central1 \
--allow-unauthenticated \
--memory 4Gi \
--cpu 2

# Using Azure Container Instances
# 1. Build and push image
az acr build --registry <registry-name> --image ollama-rag:latest .
# 2. Deploy container
az container create \
--resource-group myResourceGroup \
--name ollama-rag \
--image <registry-name>.azurecr.io/ollama-rag:latest \
--cpu 2 \
--memory 4 \
--port 5328

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama-rag
spec:
replicas: 3
selector:
matchLabels:
app: ollama-rag
template:
metadata:
labels:
app: ollama-rag
spec:
containers:
- name: backend
image: ollama-rag:latest
ports:
- containerPort: 5328
env:
- name: OLLAMA_HOST
value: "http://ollama-service:11434"
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
volumeMounts:
- name: data
mountPath: /app/data
volumes:
- name: data
persistentVolumeClaim:
claimName: rag-data-pvc
---
apiVersion: v1
kind: Service
metadata:
name: ollama-rag-service
spec:
selector:
app: ollama-rag
ports:
- protocol: TCP
port: 80
targetPort: 5328
type: LoadBalancer

Problem: Failed to connect to Ollama at http://localhost:11434
Solutions:
# Check if Ollama is running
ollama list
# Start Ollama service
# On Linux/macOS:
ollama serve
# On Windows:
# Start Ollama from Start Menu
# Check connectivity
curl http://localhost:11434/api/version
# If using Docker, check network
docker network inspect bridge

Problem: Model 'llama3.2' not found
Solutions:
# List available models
ollama list
# Pull required model
ollama pull llama3.2
ollama pull nomic-embed-text
# Verify model is available
ollama run llama3.2 "Hello"Problem: CUDA out of memory or System killed process
Solutions:
# Use smaller models
ollama pull llama3.2:3b # Instead of larger variants
# Reduce batch size in config
# Edit batch_indexing_config.json
{
"batch_size": 4, # Reduce from 10
"chunk_size": 256 # Reduce from 512
}
# Increase Docker memory limit
# Edit docker-compose.yml
services:
backend:
deploy:
resources:
limits:
memory: 8G  # Increase from 4G

Problem: Document indexing is very slow
Solutions:
# Increase parallel workers
# In create_index_script.py
MAX_WORKERS = 8 # Increase based on CPU cores
# Disable multimodal processing if not needed
python create_index_script.py --no-multimodal
# Use faster embedding model
# In config
"embedding_model": "all-MiniLM-L6-v2" # Faster than nomic-embed-textProblem: Frontend shows "Cannot connect to server"
Solutions:
# Check backend is running
curl http://localhost:5328/health
# Check CORS settings
# In backend/server.py
app.add_middleware(
CORSMiddleware,
allow_origins=["http://localhost:3000"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Check environment variables
# In frontend .env.local
NEXT_PUBLIC_API_URL=http://localhost:5328

Problem: database is locked
Solutions:
# Close all connections to database
pkill -f "python.*server.py"
# Remove lock file
rm ./data/metadata.db-wal
rm ./data/metadata.db-shm
# Use Write-Ahead Logging (WAL)
# In database.py
conn = sqlite3.connect("metadata.db", check_same_thread=False, timeout=30)
conn.execute("PRAGMA journal_mode=WAL")

Problem: Unable to open vector store or corruption errors
Solutions:
# Backup and recreate vector store
mv data/vector_store data/vector_store.backup
# Recreate index
python create_index_script.py --index-name my_index --input-dir ./documents
# If backup needed
python -c "import lancedb; db = lancedb.connect('data/vector_store'); print(db.table_names())"# Backend tests
cd backend
python -m pytest test_backend.py -v
# Ollama connectivity test
python test_ollama_connectivity.py
# System health check
python system_health_check.py
# Docker build test
./test_docker_build.sh # Linux/macOS
test_docker_build.bat   # Windows

# Install coverage
pip install pytest-cov
# Run with coverage
pytest --cov=backend --cov-report=html
# View coverage report
open htmlcov/index.html

We welcome contributions! Please follow these guidelines:
# Fork and clone repository
git clone https://github.com/YOUR_USERNAME/ollama-rag-docling.git
cd ollama-rag-docling
# Create feature branch
git checkout -b feature/your-feature-name
# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install

# Format code with black
black .
# Lint with flake8
flake8 .
# Type check with mypy
mypy .
# Sort imports
isort .

Follow conventional commits:
feat: Add new feature
fix: Bug fix
docs: Documentation changes
style: Code style changes
refactor: Code refactoring
test: Add tests
chore: Maintenance tasks
- Create Feature Branch
git checkout -b feature/amazing-feature

- Make Changes and Test
# Make your changes
# Run tests
pytest
# Check code style
black --check .
flake8 .

- Commit Changes
git add .
git commit -m "feat: Add amazing feature"- Push and Create PR
git push origin feature/amazing-feature
# Create pull request on GitHub

- PR Requirements
- ✅ All tests pass
- ✅ Code follows style guidelines
- ✅ Documentation updated
- ✅ No merge conflicts
- ✅ Descriptive PR title and description
# Use batch processing
BATCH_SIZE = 32 # Optimal for most systems
# Parallel processing
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=8) as executor:
results = executor.map(process_document, documents)
# Cache embeddings
from functools import lru_cache
@lru_cache(maxsize=1000)
def get_embedding(text):
return embedder.embed(text)

# Use approximate search
lance_table.create_index(
metric="cosine",
num_partitions=256,
num_sub_vectors=96
)
# Limit result size
results = retriever.retrieve(query, top_k=10) # Instead of 100
# Use reranking selectively
if len(results) > 20:
results = reranker.rerank(results, top_k=5)

# Use generators instead of lists
def process_documents():
for doc in documents:
yield process(doc)
# Clear cache periodically
import gc
gc.collect()
# Use memory-mapped files for large datasets
import mmap

# logging_config.py
import logging
from logging.handlers import RotatingFileHandler
def setup_logging():
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
RotatingFileHandler(
'logs/system.log',
maxBytes=10485760, # 10MB
backupCount=5
),
logging.StreamHandler()
]
)

# system_health_check.py provides:
- Ollama connectivity
- Model availability
- Vector store status
- Database health
- Memory usage
- Disk space
# Run health check
python system_health_check.py

# Add metrics tracking
from prometheus_client import Counter, Histogram
query_counter = Counter('queries_total', 'Total queries')
query_latency = Histogram('query_latency_seconds', 'Query latency')
@query_latency.time()
def process_query(query):
query_counter.inc()
# Process query

- API Security
# Add API authentication
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

@app.post("/query")
async def query(credentials: HTTPAuthorizationCredentials = Depends(security)):
    # Verify token
    if not verify_token(credentials.credentials):
        raise HTTPException(status_code=401)

- Input Validation
from pydantic import BaseModel, validator
class QueryRequest(BaseModel):
query: str
index_name: str
@validator('query')
def query_not_empty(cls, v):
if not v.strip():
raise ValueError('Query cannot be empty')
return v

- Rate Limiting
from slowapi import Limiter
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address)
@app.post("/query")
@limiter.limit("10/minute")
async def query(request: Request):
# Process query

- Data Sanitization
import bleach
def sanitize_input(text: str) -> str:
return bleach.clean(text, strip=True)

This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama - Local LLM inference
- Docling - Document processing
- LanceDB - Vector database
- Sentence Transformers - Embedding models
- FastAPI - Web framework
- Next.js - Frontend framework
- Documentation: Check the Documentation folder
- Issues: Report bugs on GitHub Issues
- Discussions: Join GitHub Discussions
Q: Can I use this with OpenAI instead of Ollama?
A: Yes, modify ollama_client.py to use OpenAI API.
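
For example, the generation call could be swapped for an OpenAI-compatible client roughly like this (a sketch only, assuming the official `openai` Python package v1+; the model name is an example and the surrounding code in ollama_client.py would need matching changes):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Drop-in style replacement for a local Ollama generate call
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```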
Q: What's the maximum document size?
A: Depends on available memory. Tested up to 500MB PDFs.

Q: Can I deploy this on ARM devices (Raspberry Pi)?
A: Yes, but performance will be limited. Use smaller models.
Q: How do I backup my indices?
A: Copy the data/ directory containing vector store and metadata DB.
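
For example, from Python (a tiny sketch; a plain `cp -r` or scheduled copy works just as well):

```python
import shutil
from datetime import datetime

# Copy the data/ directory (vector store + metadata DB) to a timestamped backup
backup_dir = f"backups/data-{datetime.now():%Y%m%d-%H%M%S}"
shutil.copytree("data", backup_dir)
print(f"Backed up indices to {backup_dir}")
```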
Q: Can I use GPU acceleration?
A: Yes, install PyTorch with CUDA support and Ollama will use the GPU automatically.
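
To confirm that PyTorch sees the GPU before indexing (a quick check, assuming PyTorch is installed with CUDA support):

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```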
- Multi-user support with authentication
- Advanced query analytics dashboard
- Support for more document formats
- Improved graph-based retrieval
- Real-time collaborative indexing
- Mobile application
- Cloud storage integration
- Advanced visualization tools
- Docling Integration Guide
- Docling Features
- Docling Recommendations
- Docker Usage Guide
- Docker Troubleshooting
- Contributing Guidelines
- Ollama Documentation
- LanceDB Documentation
- Sentence Transformers
- FastAPI Documentation
- Next.js Documentation
Built with ❤️ by the Ollama RAG Docling Team
Last Updated: January 2025


