VisionRAG: Acing the DMV Test with Multimodal Retrieval-Augmented Generation

Stanford NLP course group project repo. A VLM-based RAG pipeline built on ColPali, fine-tuned with synthetic data and evaluated on the ViDoRe benchmark and an out-of-domain test set.

Traditional text-based retrieval struggles with visually rich documents, where key information is embedded in layouts, tables, and figures. Text-based approaches often lose this visual context, leading to degraded retrieval accuracy. To address this, we propose a Vision-based Retrieval-Augmented Generation (Vision-RAG) framework that directly processes document images using Vision Language Models (VLMs), bypassing OCR and preserving both textual and visual context. In this project, we implement and evaluate a Vision-RAG pipeline in which a vision-based retriever selects relevant document images and a generative model answers user queries. We conduct experiments on the ViDoRe benchmark, a dataset specifically designed for multimodal document retrieval, as well as on our custom out-of-domain test set containing real multiple-choice practice questions for the driver's license test. Key findings include:

  • We implement and evaluate a Vision-RAG pipeline that directly processes document images without parsing, avoiding OCR-related errors and preserving visual context.
  • We compare our Vision-RAG approach against text-based retrieval baselines and demonstrate its robustness on text-based documents and its superiority in understanding documents' visual information.
  • We fine-tune a lightweight VLM retriever using contrastive learning with LoRA, further boosting retrieval and end-to-end performance (a minimal sketch of this training objective appears after the list).
  • We incorporate query expansion and Chain-of-Thought (CoT) reasoning to refine retrieval quality and improve response coherence.
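
The retriever's training objective can be summarized as follows. The sketch below is illustrative rather than the repository's actual code: it assumes each query and candidate page have already been encoded into multi-vector embeddings (one vector per query token or image patch, as in ColPali-style late interaction), scores query-page pairs with MaxSim, and applies an in-batch-negative contrastive loss. During fine-tuning, only the LoRA adapter weights attached to the VLM would receive gradients.

import torch
import torch.nn.functional as F

def maxsim_scores(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction scores between every query and every page in a batch.

    q_emb: (B, Nq, D) query-token embeddings
    d_emb: (B, Nd, D) page-patch embeddings
    returns: (B, B) score matrix
    """
    # Pairwise token-patch similarities: (B_query, B_page, Nq, Nd)
    sims = torch.einsum("qnd,pmd->qpnm", q_emb, d_emb)
    # MaxSim: take each query token's best-matching patch, then sum over tokens.
    return sims.max(dim=-1).values.sum(dim=-1)

def in_batch_contrastive_loss(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the score matrix: query i should rank its own page i
    (the diagonal) above every other page in the batch (the negatives)."""
    scores = maxsim_scores(q_emb, d_emb)                        # (B, B)
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

Using the other pages in the batch as negatives keeps the objective cheap while still teaching the retriever to separate the correct page from distractors.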

Pipeline:

Interpretable MaxSim Mapping:

Query: What is the hand-and-arm signal used for turning right while driving?

Token-level MaxSim Score

Attention map of token-level MaxSim scores for the query "What is the hand-and-arm signal used for turning right while driving?" The document page is encoded as a sequence of 21 × 34 patches, and each word in the query corresponds to a token. Highlighted regions denote higher MaxSim scores.
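
For reference, such a heatmap can be reproduced from the retriever's raw embeddings roughly as sketched below. The function and tensor names are illustrative, not the repository's actual API; the only assumptions are that the query is encoded as one embedding per token and the page as a 21 × 34 grid of patch embeddings.

import torch

def token_patch_heatmaps(query_emb: torch.Tensor,
                         patch_emb: torch.Tensor,
                         grid_hw: tuple = (21, 34)) -> torch.Tensor:
    """query_emb: (Nq, D) query-token embeddings
    patch_emb:   (Np, D) page-patch embeddings, with Np == grid_h * grid_w
    returns:     (Nq, grid_h, grid_w) similarity map for each query token."""
    h, w = grid_hw
    sims = query_emb @ patch_emb.T          # (Nq, Np) token-patch similarities
    return sims.reshape(-1, h, w)           # one 21 x 34 map per query token

def token_maxsim_scores(query_emb: torch.Tensor, patch_emb: torch.Tensor) -> torch.Tensor:
    """Token-level MaxSim: each query token's best-matching patch similarity."""
    return (query_emb @ patch_emb.T).max(dim=-1).values   # (Nq,)

Each token's map is overlaid on the page image to produce the highlighted regions above; summing the token-level MaxSim scores over the query tokens gives the page's overall retrieval score.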

Project Structure Tree:

VLM RAG/
│
├── benchmark_run_metrics/        # ranking metrics for benchmark
│   └── datasetName/
│       └── metrics.json
│
├── codes/
│   ├── finetune.py               # script for fine-tuning retriever using contrastive learning
│   ├── run_benchmark.py          # script to run model on benchmark
│   ├── interpretability.py       # script for attention visualization
│   └── utils                     # util functions
│
├── interpreted_output            # heatmaps visualizing visual attention
│
├── trained_models_checkpoint/
│   └── model files
│
├── main/                         # main RAG pipeline
│   ├── dbManager.py              # script for article vectorization
│   ├── gen.py                    # script for inference and synthetic question generation
│   ├── preprocessor.py           # script for doc preprocessing
│   ├── get_data.py               # scraper for evaluation set
│   └── pipeline.py               # script for RAG pipeline
│
├── dmv_example.png               # example image used for interpretable similarity mapping
│
└── requirements.txt              # Python dependencies
