This project focuses on developing and evaluating an advanced algorithm for optimizing text chunking in Retrieval Augmented Generation (RAG) pipelines. The core idea is to create semantically coherent chunks to improve context retrieval, question generation, and final answer quality, moving beyond standard fixed-size or recursive chunking methods.
```
C:/Users/shour/OneDrive - vit.ac.in/chunk-optimization/
├───.gitignore
├───chunking.py
├───process_scores.ipynb
├───README.md
├───requirements.txt
├───scoring.ipynb
├───.git/
├───Eval/
│   ├───answer_eval.py
│   ├───chunk_eval.py
│   ├───chunks.csv
│   ├───context_eval.py
│   ├───context_rich_chunks.csv
│   ├───create_chunks.py
│   ├───evals.csv
│   ├───evaluation.ipynb
│   ├───generate_questions.py
│   ├───generations.csv
│   ├───iou-scores.csv
│   ├───prompts.py
│   ├───question_eval.py
│   ├───question-evals.csv
│   ├───questions.csv
│   ├───ratings.csv
│   └───response.py
├───insurance-information/
│   ├───Draft IRDAI(Protection of Policyholders’ Interests and Allied Matters of Insurers) Regulations, 2024.pdf
│   ├───Insurance Act,1938 - incorporating all amendments till 20212021-08-12.pdf
│   └───Life Insurance Handbook (English).pdf
├───optimization/
│   └───src/
│       ├───optimizer.py
│       └───utils.py
└───rag_eval/
    └───collection/
```
The evaluation pipeline is a comprehensive, multi-step process designed to rigorously assess the performance of the optimized chunking strategy against standard methods.
- Chunk Creation (`Eval/create_chunks.py`): The process begins by generating chunks from the source documents using various standard methods (e.g., `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`).
- Context-Richness Evaluation (`Eval/chunk_eval.py`): Each chunk is evaluated to determine whether it contains sufficient context to form a meaningful question. A score is assigned, and only chunks that pass a certain threshold (i.e., are "context-rich") proceed to the next stage.
- Question Generation (`Eval/generate_questions.py`): For each context-rich chunk, a set of questions is generated using an LLM. These questions are designed to be answerable using the information contained within that specific chunk.
- Question Quality Evaluation (`Eval/question_eval.py`): The generated questions are then evaluated on their standalone quality, their relevance to the context, and how well-grounded they are in the provided text. This ensures that only high-quality questions are used for the downstream evaluation tasks.
- Base vs. Optimized Chunking (`chunking.py`): This script prepares the two main sets of chunks for the final evaluation:
  - Base Chunks: created using standard chunking methods.
  - Optimized Chunks: created by applying the custom optimization algorithm to the base chunks.
- Answer Generation (`Eval/response.py`): For each high-quality question, answers are generated using two different language models: `gemma-3-1b-it` and `llama3-8b-instruct`. This is done for both base and optimized chunking strategies, retrieving the 3 and 5 most relevant chunks as context.
- Answer Evaluation (`Eval/answer_eval.py`): The generated answers are scored by an LLM-as-a-judge model (`gemma-3-4b-it`) for quality, relevance, and accuracy against the original context.
- Intersection over Union (IoU) Score: An IoU score is calculated to measure the overlap between the source chunk (from which the question was generated) and the retrieved chunks. This evaluates the precision of the retrieval process for each chunking method.
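The repository does not show the exact IoU formulation, but a token-set IoU between the source chunk and the retrieved text is one plausible form. The sketch below (the function name `iou_score` is hypothetical) illustrates the idea:

```python
def iou_score(source_chunk: str, retrieved_chunks: list[str]) -> float:
    """Token-set Intersection-over-Union between the source chunk and
    the concatenated retrieved chunks. Ranges from 0 (no overlap) to 1
    (identical token sets). Illustrative only; the actual metric may
    work at the character-span or chunk-ID level instead."""
    source_tokens = set(source_chunk.lower().split())
    retrieved_tokens = set(" ".join(retrieved_chunks).lower().split())
    if not source_tokens and not retrieved_tokens:
        return 0.0
    overlap = source_tokens & retrieved_tokens
    union = source_tokens | retrieved_tokens
    return len(overlap) / len(union)
```

A retrieval run that returns the exact source chunk scores 1.0; unrelated chunks drag the score toward 0.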
- `src/utils.py`: Contains the core utility functions that power the optimization algorithm, including functions for calculating semantic similarity between chunks, merging adjacent chunks, and intelligently splitting large chunks at points of low semantic cohesion.
- `src/optimizer.py`: Implements the `ChunkOptimizer` class, which takes an initial set of chunks and applies an iterative process of merging and splitting based on semantic similarity thresholds to produce a final, optimized set of chunks.
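The merge step can be sketched in isolation. This is a minimal, hedged illustration, not the actual `ChunkOptimizer` code: it assumes precomputed embeddings, a greedy left-to-right pass, and an averaged embedding for merged chunks (the real implementation may re-embed merged text and also perform splitting).

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_adjacent(chunks: list[str], embeddings: list[list[float]],
                   threshold: float = 0.8) -> list[str]:
    """Greedily merge neighbouring chunks whose embeddings exceed the
    similarity threshold; dissimilar neighbours start a new chunk."""
    merged = [chunks[0]]
    merged_emb = [embeddings[0]]
    for text, emb in zip(chunks[1:], embeddings[1:]):
        if cosine(merged_emb[-1], emb) >= threshold:
            merged[-1] = merged[-1] + " " + text
            # naive averaged embedding for the merged chunk (illustrative only)
            merged_emb[-1] = [(x + y) / 2 for x, y in zip(merged_emb[-1], emb)]
        else:
            merged.append(text)
            merged_emb.append(emb)
    return merged
```

Splitting works in the opposite direction: a long chunk is cut at the sentence boundary where similarity between the two halves is lowest.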
- `create_chunks.py`: Reads the PDF documents, preprocesses the text, and uses various `langchain` text splitters to create the initial set of chunks for evaluation.
- `chunk_eval.py`: Uses an LLM to score each chunk on its "context richness", filtering out chunks that are not suitable for generating high-quality questions.
- `generate_questions.py`: Takes the context-rich chunks and uses an LLM to generate relevant, context-based questions for each one.
- `question_eval.py`: Employs an LLM to assess the generated questions on three criteria: standalone quality, relevance, and groundedness in the source chunk.
- `response.py`: For a given question, retrieves context from a vector store (for both base and optimized chunks) and generates an answer using the specified LLMs (`gemma-3-1b-it`, `llama3-8b-instruct`).
- `answer_eval.py`: Uses a stronger LLM (`gemma-3-4b-it`) as a judge to score the quality of the answers generated by `response.py`.
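The recursive splitting strategy used for the base chunks can be sketched without `langchain`. This is a simplified, dependency-free approximation of what `RecursiveCharacterTextSplitter` does (try coarse separators first, recurse on oversized pieces); the library version additionally supports overlap and length functions:

```python
def recursive_split(text: str, separators=("\n\n", "\n", " "),
                    chunk_size: int = 500) -> list[str]:
    """Split text hierarchically: try the coarsest separator first and
    recurse with finer separators on pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # no separator left: hard cut at chunk_size boundaries
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # pack pieces while they fit
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:  # piece alone is too big: recurse
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```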
- `chunking.py`: Orchestrates the creation of the base and optimized chunks that are stored in the vector database for the final RAG evaluation.
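The retrieval step that feeds these chunks to the answering models (top-3 and top-5 by similarity, as described above) can be sketched as follows. This is a toy in-memory version; the project uses a real vector store, and the function name `top_k` is a hypothetical stand-in:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_emb: list[float], chunk_embs: list[list[float]],
          chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(zip(chunks, chunk_embs),
                    key=lambda pair: cosine(query_emb, pair[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

Running the same queries with `k=3` and `k=5` against both the base and optimized collections yields the four configurations compared in the charts below.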
The following images summarize the results of the evaluation, comparing the performance of optimized chunking against standard methods.
This chart displays the evaluation scores for chunks based on their context richness and suitability for question generation. The comparison is made between different chunking strategies, retrieving the 3 and 5 closest chunks.
This image shows the quality of answers generated by the gemma-3-1b-it model. The scores compare the performance when using base chunking versus the optimized chunking strategy.
This image shows the quality of answers generated by the llama3-8b-instruct model, comparing the results from base and optimized chunks.
This chart presents the IoU scores, which measure how effectively the retrieval step fetches the correct source chunk. Higher scores indicate more precise context retrieval.



