This project focuses on developing and evaluating an advanced algorithm for optimizing text chunking in Retrieval Augmented Generation (RAG) pipelines. The core idea is to create semantically coherent chunks to improve context retrieval, question generation, and final answer quality, moving beyond standard fixed-size or recursive chunking methods.
```
C:/Users/shour/OneDrive - vit.ac.in/chunk-optimization/
├───.gitignore
├───chunking.py
├───process_scores.ipynb
├───README.md
├───requirements.txt
├───scoring.ipynb
├───.git/
├───Eval/
│   ├───answer_eval.py
│   ├───chunk_eval.py
│   ├───chunks.csv
│   ├───context_eval.py
│   ├───context_rich_chunks.csv
│   ├───create_chunks.py
│   ├───evals.csv
│   ├───evaluation.ipynb
│   ├───generate_questions.py
│   ├───generations.csv
│   ├───iou-scores.csv
│   ├───prompts.py
│   ├───question_eval.py
│   ├───question-evals.csv
│   ├───questions.csv
│   ├───ratings.csv
│   └───response.py
├───insurance-information/
│   ├───Draft IRDAI(Protection of Policyholders’ Interests and Allied Matters of Insurers) Regulations, 2024.pdf
│   ├───Insurance Act,1938 - incorporating all amendments till 20212021-08-12.pdf
│   └───Life Insurance Handbook (English).pdf
├───optimization/
│   └───src/
│       ├───optimizer.py
│       └───utils.py
└───rag_eval/
    └───collection/
```
The evaluation pipeline is a comprehensive, multi-step process designed to rigorously assess the performance of the optimized chunking strategy against standard methods.
- Chunk Creation (`Eval/create_chunks.py`): The process begins by generating chunks from the source documents using various standard methods (e.g., `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`).
- Context-Richness Evaluation (`Eval/chunk_eval.py`): Each chunk is evaluated to determine whether it contains sufficient context to form a meaningful question. A score is assigned, and only chunks that pass a certain threshold (i.e., are "context-rich") proceed to the next stage.
- Question Generation (`Eval/generate_questions.py`): For each context-rich chunk, a set of questions is generated using an LLM. These questions are designed to be answerable using the information contained within that specific chunk.
- Question Quality Evaluation (`Eval/question_eval.py`): The generated questions are then evaluated on their standalone quality, their relevance to the context, and how well-grounded they are in the provided text. This ensures that only high-quality questions are used for the downstream evaluation tasks.
- Base vs. Optimized Chunking (`chunking.py`): This script prepares the two main sets of chunks for the final evaluation:
  - Base Chunks: created using standard chunking methods.
  - Optimized Chunks: created by applying the custom optimization algorithm to the base chunks.
- Answer Generation (`Eval/response.py`): For each high-quality question, answers are generated using two different language models: `gemma-3-1b-it` and `llama3-8b-instruct`. This is done for both base and optimized chunking strategies, retrieving the 3 and 5 most relevant chunks as context.
- Answer Evaluation (`Eval/answer_eval.py`): The generated answers are scored by an LLM-as-a-judge model (`gemma-3-4b-it`) for quality, relevance, and accuracy against the original context.
- Intersection over Union (IoU) Score: An IoU score is calculated to measure the overlap between the source chunk (from which the question was generated) and the retrieved chunks. This evaluates the precision of the retrieval process for each chunking method.
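The repository does not show the exact IoU formulation, but a token-set IoU between the source chunk and the retrieved text is one plausible form. The sketch below (the function name `iou_score` is hypothetical) illustrates the idea:

```python
def iou_score(source_chunk: str, retrieved_chunks: list[str]) -> float:
    """Token-set Intersection-over-Union between the source chunk and
    the concatenated retrieved chunks. Ranges from 0 (no overlap) to 1
    (identical token sets). Illustrative only; the actual metric may
    work at the character-span or chunk-ID level instead."""
    source_tokens = set(source_chunk.lower().split())
    retrieved_tokens = set(" ".join(retrieved_chunks).lower().split())
    if not source_tokens and not retrieved_tokens:
        return 0.0
    overlap = source_tokens & retrieved_tokens
    union = source_tokens | retrieved_tokens
    return len(overlap) / len(union)
```

A retrieval run that returns the exact source chunk scores 1.0; unrelated chunks drag the score toward 0.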
- `src/utils.py`: Contains the core utility functions that power the optimization algorithm, including functions for calculating semantic similarity between chunks, merging adjacent chunks, and intelligently splitting large chunks at points of low semantic cohesion.
- `src/optimizer.py`: Implements the `ChunkOptimizer` class, which takes an initial set of chunks and applies an iterative process of merging and splitting based on semantic similarity thresholds to produce a final, optimized set of chunks.
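The merge step can be sketched in isolation. This is a minimal, hedged illustration, not the actual `ChunkOptimizer` code: it assumes precomputed embeddings, a greedy left-to-right pass, and an averaged embedding for merged chunks (the real implementation may re-embed merged text and also perform splitting).

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_adjacent(chunks: list[str], embeddings: list[list[float]],
                   threshold: float = 0.8) -> list[str]:
    """Greedily merge neighbouring chunks whose embeddings exceed the
    similarity threshold; dissimilar neighbours start a new chunk."""
    merged = [chunks[0]]
    merged_emb = [embeddings[0]]
    for text, emb in zip(chunks[1:], embeddings[1:]):
        if cosine(merged_emb[-1], emb) >= threshold:
            merged[-1] = merged[-1] + " " + text
            # naive averaged embedding for the merged chunk (illustrative only)
            merged_emb[-1] = [(x + y) / 2 for x, y in zip(merged_emb[-1], emb)]
        else:
            merged.append(text)
            merged_emb.append(emb)
    return merged
```

Splitting works in the opposite direction: a long chunk is cut at the sentence boundary where similarity between the two halves is lowest.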
- `create_chunks.py`: Reads the PDF documents, preprocesses the text, and uses various `langchain` text splitters to create the initial set of chunks for evaluation.
- `chunk_eval.py`: Uses an LLM to score each chunk on its "context richness", filtering out chunks that are not suitable for generating high-quality questions.
- `generate_questions.py`: Takes the context-rich chunks and uses an LLM to generate relevant, context-based questions for each one.
- `question_eval.py`: Employs an LLM to assess the generated questions on three criteria: standalone quality, relevance, and groundedness in the source chunk.
- `response.py`: For a given question, retrieves context from a vector store (for both base and optimized chunks) and generates an answer using the specified LLMs (`gemma-3-1b-it`, `llama3-8b-instruct`).
- `answer_eval.py`: Uses a stronger LLM (`gemma-3-4b-it`) as a judge to score the quality of the answers generated by `response.py`.
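The recursive splitting strategy used for the base chunks can be sketched without `langchain`. This is a simplified, dependency-free approximation of what `RecursiveCharacterTextSplitter` does (try coarse separators first, recurse on oversized pieces); the library version additionally supports overlap and length functions:

```python
def recursive_split(text: str, separators=("\n\n", "\n", " "),
                    chunk_size: int = 500) -> list[str]:
    """Split text hierarchically: try the coarsest separator first and
    recurse with finer separators on pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # no separator left: hard cut at chunk_size boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # pack pieces while they fit
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:  # piece alone is too big: recurse
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```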
- `chunking.py`: Orchestrates the creation of the base and optimized chunks that are stored in the vector database for the final RAG evaluation.
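The retrieval step that feeds these chunks to the answering models (top-3 and top-5 by similarity, as described above) can be sketched as follows. This is a toy in-memory version; the project uses a real vector store, and the function name `top_k` is a hypothetical stand-in:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_emb: list[float], chunk_embs: list[list[float]],
          chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(zip(chunks, chunk_embs),
                    key=lambda pair: cosine(query_emb, pair[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

Running the same queries with `k=3` and `k=5` against both the base and optimized collections yields the four configurations compared in the charts below.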
The following images summarize the results of the evaluation, comparing the performance of optimized chunking against standard methods.
This chart displays the evaluation scores for chunks based on their context richness and suitability for question generation. The comparison is made between different chunking strategies, retrieving the 3 and 5 closest chunks.
This image shows the quality of answers generated by the gemma-3-1b-it model. The scores compare the performance when using base chunking versus the optimized chunking strategy.
This image shows the quality of answers generated by the llama3-8b-instruct model, comparing the results from base and optimized chunks.
This chart presents the IoU scores, which measure how effectively the retrieval step fetches the correct source chunk. Higher scores indicate more precise context retrieval.



