Hybrid RAG Pipeline with Semantic Cache

A detailed overview for this application can be found here: Detailed Overview of the application

Overview

This project implements a Hybrid Retrieval-Augmented Generation (RAG) pipeline enhanced with Semantic Caching for efficient query handling.

The pipeline combines:

Query Rewriting to add contextual richness and preserve chat history.
Semantic Cache to avoid redundant retrieval when a similar query was already processed.
Hybrid Retrieval (dense + sparse) to fetch both semantic and keyword-based relevant chunks. This stage provides better recall - retrieves all the relevent docs (reduces FN)
Hierarchical Retrieval (child chunks → parent chunks → immediate parents).
Cross-Encoder Re-ranking to refine retrieved context. This stage is to provide high precision - keep the docs that are actuall relevant (ordered by relevancy score, reduces FP)
Context Relevance Checking to ensure high-quality responses.
LLM-based Response Generation to produce accurate and context-aware answers.

Workflow

The pipeline follows these steps:

User Message → Query Rewriting
- User query is rewritten to include context and chat history, making it more informative.
Semantic Cache Check
- The rewritten query is checked against the Redis Semantic Cache.
- If a cache hit is found → Extract context from cache.
- If no cache entry → Proceed to retrieval.
Hybrid Retrieval
- Compute dense + sparse embeddings for the query.
- Retrieve child chunks from Qdrant (vectorstore).
- Retrieve corresponding parent chunks from PostgreSQL (docstore).
- Append immediate parent/neighboring chunks for additional context.
Re-ranking
- Retrieved chunks are re-ranked with a Cross-Encoder model to prioritize the most relevant context.
Context Relevance Checking
- If the relevance score > 2 → Proceed to response generation.
- If relevance score ≤ 2 → Inform the user that no relevant context was found.
Response Generation
- Final response is generated using OpenAI with the user query + selected context.

Key Features

Semantic Cache with RedisVL
- Prevents duplicate retrieval for semantically similar queries.
- Reduces latency and computation cost.
Hybrid Retrieval
- Combines dense embeddings (semantic similarity via FastEmbed) and sparse retrieval (BM25) for robust performance.
Hierarchical Context Expansion
- Ensures retrieved passages maintain semantic meaning by including parent and neighboring chunks.
Cross-Encoder Re-ranking
- Refines retrieved chunks for maximum relevance.
Context Relevance Checker
- Ensures the model only answers when reliable context is available.
Docstore + Object Storage Integration
- Parent chunks stored in PostgreSQL.
- Raw files managed in MinIO.

Tech Stack

PyMuPDF → Document loading & parsing
LangChain → Parent-child chunking for hierarchical retrieval
FastEmbed → Dense embeddings (Sentence Transformers under the hood)
Qdrant → Vector store for storing/retrieving child chunks
PostgreSQL → Docstore for storing parent chunks
Cross-Encoder → Context re-ranking for relevance
OpenAI → Response generation (LLM backend)
Redis Semantic Cache → Semantic caching of query-response pairs
MinIO → Object storage for raw documents

Usage

git clone https://github.com/sanafayyaz315/hybrid-rag
cd hybrid-rag

1. Enviroment Setup

Create a .env file in the root directory using the following template:

# --------------------
# OpenAI / Model Config
# --------------------
MODEL="gpt-4o-mini"
API_KEY="your_api_key_here"

# --------------------
# Embedding / Retrieval Models
# --------------------
DENSE_EMBEDDING_MODEL="sentence-transformers/all-MiniLM-L6-v2"
SPARSE_EMBEDDING_MODEL="Qdrant/bm25"
CROSS_ENCODER_MODEL="cross-encoder/ms-marco-MiniLM-L6-v2"

MAX_SEQ_LENGTH_EMBEDDING=512

# --------------------
# Temp Directory
# --------------------
TEMP_FILE_DOWNLOAD_DIR="/tmp/upload_docs"

# --------------------
# Qdrant Config
# --------------------
QDRANT_HOST="localhost"
QDRANT_PORT="6333"
COLLECTION="prod_collection"
COLLECTION_RESOURCES=""
DISTANCE="Cosine"
SPARSE_MODIFIER="idf"
UPSERT_BATCH_SIZE=500
DENSE_VECTOR_NAME="dense"
SPARSE_VECTOR_NAME="sparse"

# --------------------
# PostgreSQL Docstore
# --------------------
DOCSTORE_USER="docuser"
DOCSTORE_PASSWORD="docpass"
DOCSTORE_HOST="localhost"
DOCSTORE_PORT=5431
DOCSTORE_NAME="docstore"

# --------------------
# Prompt Templates
# --------------------
SYSTEM_PROMPT_PATH="../template/rag.txt"
REWRITE_QUERY_PROMPT_PATH="../template/rewrite_query.txt"
CONTEXT_RELEVANCE_PROMPT_PATH="../template/context_relevance.txt"

# --------------------
# Chunking Config
# --------------------
PARENT_CHUNK_SIZE=2000
PARENT_CHUNK_OVERLAP=250
CHILD_CHUNK_SIZE=400
CHILD_CHUNK_OVERLAP=100
GET_NEIGHBORS="True"

# --------------------
# MinIO Storage
# --------------------
MINIO_ENDPOINT="localhost:9000"
MINIO_ROOT_USER="admin"
MINIO_ROOT_PASSWORD="admin12345"
MINIO_BUCKET="uploaded-files"

# --------------------
# Redis Cache
# --------------------
REDIS_HOST="localhost"
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=None
INDEX_NAME="semantic_cache"
CACHE_TOP_K=1
DISTANCE_THRESHOLD=0.25
CACHE_TTL=86400

# --------------------
# Chainlit / Logging
# --------------------
CHAINLIT_AUTH_SECRET="replace_with_strong_secret"
CHAINLIT_DB_NAME="chainlit_db"
LOG_LEVEL="DEBUG"

2. Install Dependencies

Install all required packages:

pip install -r requirements.txt

3. Spin-up Containers for Qdrant, Postgres(docstore), Minio, Redis

docker compose -f database/docker-compose.yml up  -d

⚠️ Make sure Docker is installed and running before starting the services.

4. Run the Chainlit UI

cd frontend

cd frontend
PYTHONPATH=.. chainlit run app.py

5. Run the API Endpoints (Optional)

Running the API is not required for the Chainlit UI, but it exposes endpoints that can be consumed by other applications or services.

cd src
PYTHONPATH=.. python api/main.py

Service	Purpose	Host:Port	Credentials (from compose)
Chainlit UI	Project frontend	http://localhost:8000	–
Qdrant Dashboard	Vector database	http://localhost:6333/dashboard	–
PostgreSQL	Docstore (relational DB)	`localhost:5431`	`docuser / docpass`, DB: `docstore`
MinIO API	Object storage (S3 API)	`http://localhost:9000`	`admin / admin12345`
MinIO Console	Web dashboard for MinIO	http://localhost:9090	`admin / admin12345`
Redis	In-memory data store	`localhost:6379`	–
Redis Insight	Redis visualization UI	http://localhost:8002	–

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
.chainlit		.chainlit
database		database
frontend		frontend
src		src
template		template
README.md		README.md
flowdiagram.png		flowdiagram.png
rag_demo.gif		rag_demo.gif
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hybrid RAG Pipeline with Semantic Cache

Overview

Workflow

Key Features

Tech Stack

Usage

1. Enviroment Setup

2. Install Dependencies

3. Spin-up Containers for Qdrant, Postgres(docstore), Minio, Redis

4. Run the Chainlit UI

5. Run the API Endpoints (Optional)

Demo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Hybrid RAG Pipeline with Semantic Cache

Overview

Workflow

Key Features

Tech Stack

Usage

1. Enviroment Setup

2. Install Dependencies

3. Spin-up Containers for Qdrant, Postgres(docstore), Minio, Redis

4. Run the Chainlit UI

5. Run the API Endpoints (Optional)

Demo

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages