Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
238 changes: 238 additions & 0 deletions .ipynb_checkpoints/README-checkpoint.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,238 @@
<div align="center">

# MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

[![Static Badge](https://img.shields.io/badge/arxiv-2501.18362-ff0000?style=for-the-badge&labelColor=000)](https://arxiv.org/abs/2501.18362) [![Static Badge](https://img.shields.io/badge/huggingface-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA) [![Static Badge](https://img.shields.io/badge/leaderboard-steelblue?style=for-the-badge&logo=googlechrome&logoColor=ffffff)](https://medxpertqa.github.io) [![Static Badge](https://img.shields.io/badge/license-mit-teal?style=for-the-badge&labelColor=000)](https://github.com/TsinghuaC3I/MedXpertQA/blob/main/LICENSE)

</div>

<div align="center">
<p>
<a href="#news" style="text-decoration: none; font-weight: bold;">🔥 News</a> •
<a href="#overview" style="text-decoration: none; font-weight: bold;">📖 Overview</a> •
<a href="#features" style="text-decoration: none; font-weight: bold;">✨ Features</a> •
<a href="#leaderboard" style="text-decoration: none; font-weight: bold;">📊 Leaderboard</a>
</p>
<p>
<a href="#usage" style="text-decoration: none; font-weight: bold;">🔧 Usage</a> •
<a href="#contact" style="text-decoration: none; font-weight: bold;">📨 Contact</a> •
<a href="#citation" style="text-decoration: none; font-weight: bold;">🎈 Citation</a>
</p>
</div>

## 🔥News

- **🎉 [2025-05-06] MedXpertQA paper is accepted to [ICML 2025](https://icml.cc/Conferences/2025)!**
- **🛠️ [2025-04-08] MedXpertQA has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass)! Check out the [PR](https://github.com/open-compass/opencompass/pull/2002)!**
- **💻 [2025-02-28] We release the evaluation code! Check out the [Usage](#usage).**
- **🌟 [2025-02-20] [Leaderboard](https://medxpertqa.github.io) is on! Check out the results of o3-mini, DeepSeek-R1, and o1!**
- **🤗 [2025-02-09] We release the MedXpertQA [dataset](https://huggingface.co/datasets/TsinghuaC3I/MedXpertQA).**
- **🔥 [2025-01-31] We introduce [MedXpertQA](https://arxiv.org/abs/2501.18362), a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning!**

## 📖Overview

**MedXpertQA** includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, **MedXpertQA Text** for text medical evaluation and **MedXpertQA MM** for multimodal medical evaluation. The following figure presents an overview.
<details>
<summary>
More Details
</summary>
The left side illustrates the diverse data sources, image types, and question attributes.
The right side compares typical examples from MedXpertQA MM and a traditional benchmark (VQA-RAD).
</details>

<p align="center">
<img src="figs/overview.png" alt="Overview of MedXpertQA." width="90%">
</p>


## ✨Features

- **Next-Generation Multimodal Medical Evaluation:** MedXpertQA MM introduces expert-level medical exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions.
- **Highly Challenging:** MedXpertQA introduces high-difficulty medical exam questions and applies rigorous filtering and augmentation, effectively addressing the insufficient difficulty of existing benchmarks like MedQA. The Text and MM subsets are currently the most challenging benchmarks in their respective fields.
- **Clinical Relevance:** MedXpertQA incorporates specialty board questions to improve clinical relevance and comprehensiveness by collecting questions corresponding to 17/25 member board exams (specialties) of the American Board of Medical Specialties. It showcases remarkable diversity across multiple dimensions.

<p align="center">
<img src="figs/diversity.png" alt="MedXpertQA spans diverse human body systems, medical tasks, and question topics." width="90%">
</p>

- **Mitigating Data Leakage:** We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability.
- **Reasoning-Oriented Evaluation:** Medicine provides a rich and representative setting for assessing reasoning abilities beyond mathematics and code. We develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

## 📊Leaderboard

We evaluate 17 leading proprietary and open-source LMMs and LLMs including advanced inference-time scaled models with a focus on the latest progress in medical reasoning capabilities.
**Further details are available in the [leaderboard](https://medxpertqa.github.io) and the [paper](https://arxiv.org/abs/2501.18362).**

<p align="center">
<img src="figs/leaderboard1.png" width="60%">
<img src="figs/leaderboard2.png" width="32.5%">
</p>


## 🔧Usage

1. Clone the Repository:

```
git clone https://github.com/TsinghuaC3I/MedXpertQA
cd MedXpertQA/eval
```

2. Install Dependencies:

```
pip3 install -r requirements.txt
```

3. Inference:

```
bash scripts/run.sh
```

> The *run.sh* script performs inference by calling *main.py*, which offers additional features such as multithreading. Additionally, you can modify *model/api_agent.py* to support more models.


## Set up for Custom Models

### Adding Your Own Model

To evaluate your custom model on MedXpertQA, follow these steps:

#### Step 1: Setup Dataset
```bash
cd eval/data
python setup_datasets.py
```

#### Step 2: Add Model to Configuration
Edit `config/model_info.json` and add your model name to the `API_MODEL` list:

```json
{
"API_MODEL": [
"existing-models...",
"your-custom-model-name",
"mistralai/Mistral-Small-3.2-24B-Instruct-2506",
"your-organization/your-model-name"
]
}
```

#### Step 3: Implement Model Interface
Add your model implementation in `model/api_agent.py`. Add a new condition in the `__init__` method:

```python
elif model_name in [
"your-custom-model-name",
"your-organization/your-model-name"
]:
print("Custom Model")
# For vLLM-based models:
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_USE_V1'] = '0'

self.client = LLM(
model=model_name,
tensor_parallel_size=4,
worker_use_ray=False,
gpu_memory_utilization=0.75,
trust_remote_code=True
)
self.sampling_params = SamplingParams(
max_tokens=self.max_tokens,
temperature=self.temperature if self.temperature > 0 else 0.15
)
self.is_vllm = True

# For API-based models:
# self.client = OpenAI(
# api_key="your-api-key",
# base_url="your-api-endpoint",
# )
```

#### Step 4: Create Custom Evaluation Script
Create `scripts/run_custom.sh` for your model:

```bash
#!/bin/bash
set -e
models=${1:-"your-custom-model-name"}
datasets=${2:-"medxpertqa"}
tasks=${3:-"text,mm"}
output_dir=${4:-"dev"}
method=${5:-"zero_shot"}
prompting_type=${6:-"cot"}
temperature=${7:-0}

# Sequential execution to avoid memory issues
IFS=","
for model in $models; do
for dataset in $datasets; do
for task in $tasks; do
echo "Running: Model=${model}, Dataset=${dataset}, Task=${task}"

python main.py \
--model "${model}" \
--dataset "${dataset}" \
--task "${task}" \
--output-dir "${output_dir}" \
--method "${method}" \
--prompting-type "${prompting_type}" \
--temperature "${temperature}" \
--num-threads 1

# Clean GPU memory between tasks
python -c "import torch; torch.cuda.empty_cache()"
sleep 2
done
done
done
```

#### Step 5: Run Evaluation
```bash
# Make script executable
chmod +x scripts/run_custom.sh

# Run evaluation
bash scripts/run_custom.sh "your-custom-model-name"

```



4. Evaluation:

We provide a script *eval.ipynb* to calculate accuracy on each subset.

> [!NOTE]
> Please use this script when evaluating the **QVQ** and **DeepSeek-R1**. Through case studies, we found that the answer cleaning function in the *utils.py* is unsuitable for these two models.



## 📨Contact

- Shang Qu: [lindsay2864tt@gmail.com](mailto:lindsay2864tt@gmail.com)

- Ning Ding: [dn97@mail.tsinghua.edu.cn](mailto:dn97@mail.tsinghua.edu.cn)

## ⚖️License

This project is licensed under the [MIT License](https://github.com/TsinghuaC3I/MedXpertQA/blob/main/LICENSE).

## 🎈Citation

If you find our work helpful, please use the following citation.

```bibtex
@article{zuo2025medxpertqa,
title={Medxpertqa: Benchmarking expert-level medical reasoning and understanding},
author={Zuo, Yuxin and Qu, Shang and Li, Yifei and Chen, Zhangren and Zhu, Xuekai and Hua, Ermo and Zhang, Kaiyan and Ding, Ning and Zhou, Bowen},
journal={arXiv preprint arXiv:2501.18362},
year={2025}
}
```
115 changes: 115 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,13 +92,128 @@ bash scripts/run.sh

> The *run.sh* script performs inference by calling *main.py*, which offers additional features such as multithreading. Additionally, you can modify *model/api_agent.py* to support more models.


## Set up for Custom Models

### Adding Your Own Model

To evaluate your custom model on MedXpertQA, follow these steps:

#### Step 1: Setup Dataset
```bash
cd eval/data
python setup_datasets.py
```

#### Step 2: Add Model to Configuration
Edit `config/model_info.json` and add your model name to the `API_MODEL` list:

```json
{
"API_MODEL": [
"existing-models...",
"your-custom-model-name",
"mistralai/Mistral-Small-3.2-24B-Instruct-2506",
"your-organization/your-model-name"
]
}
```

#### Step 3: Implement Model Interface
Add your model implementation in `model/api_agent.py`. Add a new condition in the `__init__` method:

```python
elif model_name in [
"your-custom-model-name",
"your-organization/your-model-name"
]:
print("Custom Model")
# For vLLM-based models:
from vllm import LLM, SamplingParams
import os

os.environ['VLLM_USE_V1'] = '0'

self.client = LLM(
model=model_name,
tensor_parallel_size=4,
worker_use_ray=False,
gpu_memory_utilization=0.75,
trust_remote_code=True
)
self.sampling_params = SamplingParams(
max_tokens=self.max_tokens,
temperature=self.temperature if self.temperature > 0 else 0.15
)
self.is_vllm = True

# For API-based models:
# self.client = OpenAI(
# api_key="your-api-key",
# base_url="your-api-endpoint",
# )
```

#### Step 4: Create Custom Evaluation Script
Create `scripts/run_custom.sh` for your model:

```bash
#!/bin/bash
set -e
models=${1:-"your-custom-model-name"}
datasets=${2:-"medxpertqa"}
tasks=${3:-"text,mm"}
output_dir=${4:-"dev"}
method=${5:-"zero_shot"}
prompting_type=${6:-"cot"}
temperature=${7:-0}

# Sequential execution to avoid memory issues
IFS=","
for model in $models; do
for dataset in $datasets; do
for task in $tasks; do
echo "Running: Model=${model}, Dataset=${dataset}, Task=${task}"

python main.py \
--model "${model}" \
--dataset "${dataset}" \
--task "${task}" \
--output-dir "${output_dir}" \
--method "${method}" \
--prompting-type "${prompting_type}" \
--temperature "${temperature}" \
--num-threads 1

# Clean GPU memory between tasks
python -c "import torch; torch.cuda.empty_cache()"
sleep 2
done
done
done
```

#### Step 5: Run Evaluation
```bash
# Make script executable
chmod +x scripts/run_custom.sh

# Run evaluation
bash scripts/run_custom.sh "your-custom-model-name"

```



4. Evaluation:

We provide a script *eval.ipynb* to calculate accuracy on each subset.

> [!NOTE]
> Please use this script when evaluating the **QVQ** and **DeepSeek-R1**. Through case studies, we found that the answer cleaning function in the *utils.py* is unsuitable for these two models.



## 📨Contact

- Shang Qu: [lindsay2864tt@gmail.com](mailto:lindsay2864tt@gmail.com)
Expand Down
Loading