MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
📑 Paper | 📊 HuggingFace
This repository contains the evaluation benchmark for medical question-answering agents.
Please install the dependencies from the `requirements.txt` file:

```bash
pip install -r requirements.txt
```
Put all the environment variables in the `.env` file.
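The exact variables depend on which model providers your chosen baselines call. As a purely illustrative sketch (the key names below, e.g. `OPENAI_API_KEY`, are assumptions rather than values documented by this repository), a `.env` file typically looks like:

```
# Illustrative only -- use whichever keys your model providers actually require.
OPENAI_API_KEY=your-openai-key
```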
To run the baseline experiments:

- Navigate to the respective baseline directory: `baselines/MDAgents/`, `baselines/MedAgents/`, or `baselines/MedPrompt/`.
- Execute the experiment script: `./run_experiments_all.sh`
- For analyzing results and calculating error/success metrics, refer to `misc.ipynb`.
You can load the MedAgentsBench dataset directly from Hugging Face using the following code:

```python
from datasets import load_dataset

dataset = load_dataset("super-dainiu/medagents-benchmark", "MedQA")["test_hard"]  # or any other dataset
```
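Each record follows the standardized schema described further below (`question`, `options`, `answer_idx`, `realidx`), so a quick sanity check might look like this (a minimal sketch, assuming `options` loads as a letter-to-text mapping as in the example record shown later):

```python
from datasets import load_dataset

# Load the hard split of one subset and inspect a single record.
dataset = load_dataset("super-dainiu/medagents-benchmark", "MedQA")["test_hard"]
print(f"{len(dataset)} hard questions")

sample = dataset[0]
print(sample["question"])
for label, text in sample["options"].items():  # e.g. "A": "Oesophageal atresia no fistula"
    print(f"{label}. {text}")
print("gold answer:", sample["answer_idx"])
```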
The benchmark focuses on challenging medical questions, specifically selecting questions where models achieve less than 50% accuracy. The hard question distribution across tasks is:
| Task | Number of Hard Questions |
|---|---|
| medqa | 100 |
| pubmedqa | 100 |
| medmcqa | 100 |
| medbullets | 89 |
| mmlu | 73 |
| mmlu-pro | 100 |
| afrimedqa | 32 |
| medexqa | 100 |
| medxpertqa-r | 100 |
| medxpertqa-u | 100 |
All agent evaluations are conducted on this `test_hard` subset.
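For instance, a minimal accuracy computation over this split could look like the sketch below, where `predict` is a hypothetical stand-in for whichever model or agent framework you are evaluating (the actual metric calculations used in the paper live in the baseline scripts and `misc.ipynb`):

```python
from datasets import load_dataset

def predict(question: str, options: dict) -> str:
    """Hypothetical model/agent call; replace with a real baseline."""
    return "A"  # placeholder prediction

dataset = load_dataset("super-dainiu/medagents-benchmark", "MedQA")["test_hard"]

correct = sum(
    predict(ex["question"], ex["options"]) == ex["answer_idx"]
    for ex in dataset
)
print(f"accuracy: {correct / len(dataset):.3f}")
```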
This benchmark includes several medical question-answering datasets that have been preprocessed into a standardized format:
- Multiple choice questions from medical licensing exams
- Contains train and test splits
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
- Questions based on PubMed abstracts
- 3 answer options (yes/no/maybe)
- Questions combine context from abstracts with the original question
- Sampled 50 questions for evaluation
- Single choice medical questions (filtered from multi-choice)
- Uses dev set as test set
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
- Multiple choice medical questions
- Variable number of options (A-J)
- Filtered to keep only single-answer MCQs
- Sampled 50 questions for evaluation
- Filtered to include only medical/biology domains:
  - Clinical knowledge
  - Professional medicine
  - College medicine
  - Medical genetics
  - Anatomy
  - College biology
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
- Professional-level questions filtered to health category
- Includes domains like clinical knowledge, medicine, nutrition, anatomy
- Variable number of options (most common: 10 options)
- Sampled 50 questions for evaluation
- Categorized into difficulty levels (easy/good/hard/bad)
- Includes detailed explanations
- Multiple choice format
- Sampled 50 questions from hard set for evaluation
- Medical expert reasoning questions
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
- Medical expert understanding questions
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
- Multiple-choice questions across five additional medical specialties
- 4 answer options (A-D)
- Sampled 50 questions for evaluation
All datasets have been standardized to include:
- `question`: the question text
- `options`: the answer options
- `answer_idx`: the correct answer label
- `realidx`: a unique question ID
For example:
```json
{
  "question": "You are called to assess a term newborn... What is the most likely diagnosis?",
  "options": {
    "A": "Oesophageal atresia no fistula",
    "B": "Iatrogenic oesophageal perforation",
    "C": "Oesophageal stenosis",
    "D": "Common type oesophageal atresia with mucus plugging of the distal tracheoesophageal fistula",
    "E": "N/A"
  },
  "answer_idx": "A",
  "realidx": "0fd14a5dcafa4c3054ea03245a10aa1262fb88bf4906cfcec09f73bee06b163c"
}
```
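A simple (hypothetical) helper that turns such a record into a multiple-choice prompt, skipping padded options like `"E": "N/A"`, might look like:

```python
def format_prompt(example: dict) -> str:
    """Hypothetical prompt builder for a standardized benchmark record."""
    lines = [example["question"], ""]
    for label in sorted(example["options"]):
        text = example["options"][label]
        if text and text != "N/A":  # skip padded/unused options
            lines.append(f"{label}. {text}")
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)
```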
If you find MedAgentsBench useful, please cite:

```bibtex
@article{tang2025medagentsbench,
  title   = {MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning},
  author  = {Tang, Xiangru and Shao, Daniel and Sohn, Jiwoong and Chen, Jiapeng and Zhang, Jiayi and Xiang, Jinyu and Wu, Fang and Zhao, Yilun and Wu, Chenglin and Shi, Wenqi and Cohan, Arman and Gerstein, Mark},
  journal = {arXiv preprint arXiv:2503.07459},
  year    = {2025},
}
```