Medical RAG System

This repository contains a comprehensive implementation of a Medical Retrieval-Augmented Generation (RAG) system. The system integrates multiple components for document retrieval, question answering, and evaluation, tailored specifically for the medical domain.

Overview

The Medical RAG System is designed to improve medical information retrieval and to answer medical queries accurately. It pairs several retrieval methods, including BM25, BioBERT, and hybrid models, with a question-answering component so that answers draw on relevant retrieved documents.

File structure

├── evaluation
│   ├── evaluation_data_storages
│   │   ├── documentation.md
│   │   ├── elasticsearch
│   │   │   ├── elastic.ipynb
│   │   │   └── eval_elastic.ipynb
│   │   ├── faiss
│   │   │   ├── conncatinatior.py
│   │   │   ├── embedding_extractor.py
│   │   │   └── request.ipynb
│   │   └── mongodb
│   │       ├── eval_mongo.ipynb
│   │       └── mongoDB.ipynb
│   └── evaluation_QA_system
│       ├── dataset_filter
│       │   └── filter_data.ipynb
│       ├── evaluation_pipeline.ipynb
│       ├── explore_questions.ipynb
│       ├── full_text_evaluation.py
│       └── RAG_evaluator.py
├── information_retrieval
│   ├── document_encoding
│   │   ├── bioBERT_encoder.py
│   │   ├── encode_documents.ipynb
│   │   └── medCPT_encoder.py
│   ├── elastic_container
│   │   ├── elastic.ipynb
│   │   ├── ingest_data.py
│   │   └── start_elasticsearch.sh
│   └── faiss_container
│       ├── docker-compose.yml
│       ├── Dockerfile
│       ├── faiss_insert_data.ipynb
│       └── server.py
├── rag_system
│   ├── bioBERT_encoder.py
│   ├── bioBERT_retriever.py
│   ├── bm25_retriever.py
│   ├── hybrid_retriever.py
│   ├── medCPT_encoder.py
│   ├── medCPT_retriever.py
│   ├── med_rag.py
│   ├── openAI_chat.py
│   └── pipeline.ipynb
├── README.md
├── requirements.txt
└── sys_requirements.txt

Installation

To set up the Medical RAG System, follow these steps:

Clone the Repository

git clone https://github.com/slinusc/medical_RAG_system.git
cd medical_RAG_system

Install Dependencies

Create a virtual environment and install the required packages:

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
pip install -r requirements.txt

Download Pre-trained Models

Ensure that you download and set up any necessary pre-trained models (e.g., BioBERT, MedCPT).
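
As a minimal sketch of this step, the snippet below loads a MedCPT query encoder through the Hugging Face transformers library and embeds a list of queries. It assumes that torch and transformers are installed and that the public ncbi/MedCPT-Query-Encoder checkpoint is the one you want. The function name embed_queries is hypothetical, and the repository's own medCPT_encoder.py may load and call the model differently.

```python
# Assumption: the public Hugging Face checkpoint for the MedCPT query encoder.
MEDCPT_QUERY_ENCODER = "ncbi/MedCPT-Query-Encoder"


def embed_queries(queries, model_name=MEDCPT_QUERY_ENCODER, max_length=64):
    """Return one embedding per query (hypothetical helper, not the repo's API).

    The embedding is the [CLS] vector of the last hidden layer.
    """
    # Imported lazily so that defining the helper does not require the
    # heavy dependencies until it is actually called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # downloads on first use
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    with torch.no_grad():
        encoded = tokenizer(
            queries,
            truncation=True,
            padding=True,
            return_tensors="pt",
            max_length=max_length,
        )
        # Take the first token ([CLS]) of every sequence as its embedding.
        return model(**encoded).last_hidden_state[:, 0, :]
```

Calling embed_queries(["What are the first-line treatments for hypertension?"]) downloads the weights on first use and caches them locally, so later calls do not need network access.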

Usage

The system can be used for different purposes, including document retrieval, question answering, and evaluation. Each component has its own set of instructions and example notebooks.

Retrieval System

BM25 Retriever: rag_system/bm25_retriever.py
BioBERT Retriever: rag_system/bioBERT_retriever.py
Hybrid Retriever: rag_system/hybrid_retriever.py
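
To illustrate how a hybrid retriever can merge a lexical ranking (such as BM25) with a dense ranking (such as BioBERT), here is a minimal sketch using reciprocal rank fusion. This is a common fusion technique chosen for illustration only. It is an assumption, not necessarily what rag_system/hybrid_retriever.py does, and the function name and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents ranked highly by several retrievers end up on top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from two retrievers for the same query.
bm25_hits = ["pmid_3", "pmid_1", "pmid_7"]
dense_hits = ["pmid_1", "pmid_9", "pmid_3"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Rank-based fusion avoids having to normalize BM25 scores and embedding similarities onto a common scale, which is why it is often used as a first baseline for hybrid retrieval.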

Question Answering System

Medical RAG: rag_system/med_rag.py
OpenAI Chat: rag_system/openAI_chat.py

Datasets

The 2.4M PubMed subset we used is available as slinusc/PubMedAbstractsSubset. The precomputed MedCPT embedding vectors used in our paper, "Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation", are available as a separate dataset: slinusc/PubMedAbstractsSubsetEmbedded.
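
A minimal sketch for inspecting the subset without downloading all of it is shown below. It assumes that the identifiers above refer to Hugging Face Hub datasets, that the datasets package is installed, and that a split named "train" exists. The helper name peek_abstracts is hypothetical, and the record fields are not assumed here.

```python
from itertools import islice

# Dataset identifier taken from this README; assumed to live on the Hugging Face Hub.
DATASET_ID = "slinusc/PubMedAbstractsSubset"


def peek_abstracts(n=3, dataset_id=DATASET_ID, split="train"):
    """Return the first n records of the dataset as a list of dicts.

    Hypothetical helper: the split name "train" is an assumption.
    """
    # Imported lazily so that defining the helper does not require the package.
    from datasets import load_dataset

    # streaming=True iterates over records lazily instead of downloading everything.
    stream = load_dataset(dataset_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Printing the keys of the first record returned by peek_abstracts(1) is a quick way to discover the available fields before writing an ingestion script.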

Evaluation

Evaluation scripts and notebooks are located in the evaluation/evaluation_QA_system/ directory. Example notebooks are provided to demonstrate the evaluation process.

Running an Evaluation

Filter the Data (Optional)

If you need to filter your dataset before evaluation, use the provided notebook:
```
evaluation/evaluation_QA_system/dataset_filter/filter_data.ipynb
```
Evaluate

To run the evaluation pipeline, use the evaluation_pipeline.ipynb notebook located in the evaluation/evaluation_QA_system/ directory. This notebook provides a guide and setup for evaluating the performance of the RAG system.
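
For orientation, the sketch below shows two metrics that are commonly reported when evaluating a RAG system: recall@k for the retriever and exact-match accuracy for the answers. This README does not state which metrics the notebook computes, so the function names, inputs, and example values are hypothetical illustrations rather than the notebook's implementation.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents found among the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def accuracy(predictions, gold):
    """Share of predictions equal to the gold answer (case-insensitive)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical example: one of three relevant documents is in the top 3,
# and two of three yes/no answers match the gold labels.
example_recall = recall_at_k(["a", "b", "c", "d"], ["b", "d", "x"], k=3)
example_accuracy = accuracy(["yes", "no", "Yes"], ["yes", "yes", "yes"])
print(example_recall, example_accuracy)
```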

Used Infrastructure

The experiments were conducted on the following system:

| Component | Specification |
| --- | --- |
| Architecture | x86_64 |
| CPU | 8 CPUs |
| CPU Model | Intel Core Processor (Broadwell) |
| Memory | 32 GiB total, 10 GiB used for buffers/cache |
| Storage | 240 GiB disk size |
| Operating System | Ubuntu 22.04.4 LTS (Jammy) |
| Kernel Version | 5.15.0-102-generic |
| GPU | NVIDIA A30 |

Contributing

We welcome contributions to enhance the Medical RAG System. Please follow these steps to contribute:

1. Fork the repository.
2. Create a new branch: git checkout -b feature-branch.
3. Make your changes and commit them: git commit -m 'Add new feature'.
4. Push to the branch: git push origin feature-branch.
5. Create a pull request.

📄 Publications

Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation
Linus Stuhlmann, Michael Saxer, Jonathan Fürst

Please cite our work:

@INPROCEEDINGS{11081505,
  author={Stuhlmann, Linus and Saxer, Michael Alexander and Fürst, Jonathan},
  booktitle={2025 IEEE Swiss Conference on Data Science (SDS)}, 
  title={Efficient and Reproducible Biomedical Question Answering Using Retrieval Augmented Generation}, 
  year={2025},
  volume={},
  number={},
  pages={154-157},
  keywords={Accuracy;Scalability;Large language models;Retrieval augmented generation;Data science;Information retrieval;Question answering (information retrieval);Hybrid power systems;Time factors;Indexing;Biomedical Information Retrieval;Retrieval Augmented Generation;Hybrid Retrieval;Large Language Models;PubMed;Information Retrieval Systems},
  doi={10.1109/SDS66131.2025.00029}}

Read the paper on arXiv

License

This project is licensed under the MIT License. See the LICENSE file for details.
