A new medical visual question answering dataset built on the MIMIC-CXR database
The MIMIC-CXR-VQA dataset is a complex (involving set and logical operations), diverse (with 48 templates), and large-scale (approximately 377K QA samples) resource designed specifically for Visual Question Answering (VQA) in the medical domain. Focusing on chest radiographs, it is derived primarily from the MIMIC-CXR-JPG and Chest ImaGenome datasets, both hosted on PhysioNet.
MIMIC-CXR-VQA is intended as a benchmark for evaluating current medical VQA approaches. Beyond serving traditional medical VQA tasks, it also acts as an image-based question answering resource over Electronic Health Records (EHRs). Accordingly, we use its question templates as seed templates for the image modality when constructing EHRXQA, a multi-modal EHR QA dataset.
- [07/20/2024] We released the MIMIC-CXR-VQA dataset on PhysioNet.
- [12/12/2023] We presented our work as a poster at the NeurIPS 2023 Datasets and Benchmarks Track.
- [10/28/2023] We released our research paper on arXiv.
- Python 3.12+
- PhysioNet credentialed account with signed DUAs for:
  - MIMIC-CXR-JPG
  - Chest ImaGenome
  - MIMIC-IV
New to PhysioNet? Credentialing steps:
- Register for a PhysioNet account
- Follow the credentialing instructions
- Complete the CITI Data or Specimens Only Research training course
- Sign the DUA for each required dataset (listed above)
Using UV (Recommended)
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa
# Create environment and install dependencies
uv venv --python 3.12
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install pandas tqdm scikit-learn
Using Conda
# Clone repository
git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa
# Create environment and install dependencies
conda create -n mimiccxrvqa python=3.12
conda activate mimiccxrvqa
pip install pandas tqdm scikit-learn
Option 1: Build from pre-downloaded datasets
If you have already downloaded MIMIC-CXR, MIMIC-IV, and Chest ImaGenome, ensure the directory paths in the script match your local setup, then run:
bash build_dataset.sh
Option 2: Download and build
To download source datasets from PhysioNet and generate the dataset:
bash download_and_build_dataset.sh
When prompted, enter your PhysioNet credentials (password will not be displayed).
What these scripts do:
- Download source datasets from PhysioNet (MIMIC-CXR-JPG, Chest ImaGenome, MIMIC-IV)
- Preprocess the datasets
- Generate the complete MIMIC-CXR-VQA dataset with ground-truth answers and metadata (see the sanity-check sketch below)
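Once the scripts finish, you can quickly sanity-check the generated splits. A minimal sketch, assuming the default mimiccxrvqa/dataset/ layout described under Dataset Structure below and that each split file is a JSON array of QA objects:

```python
# Sanity-check the generated splits (a sketch; paths assume the default
# mimiccxrvqa/dataset/ layout and JSON-array split files).
import json
from pathlib import Path

dataset_dir = Path("mimiccxrvqa/dataset")
total = 0
for split in ("train", "valid", "test"):
    with open(dataset_dir / f"{split}.json") as f:
        samples = json.load(f)
    # Every generated sample should carry an answer and image metadata.
    assert all("answer" in s and "image_path" in s for s in samples)
    print(f"{split}: {len(samples)} samples")
    total += len(samples)
print(f"total: {total} samples")  # roughly 377K across all splits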
Downloading Images Only
To download only the CXR images relevant to MIMIC-CXR-VQA (rather than all MIMIC-CXR-JPG images):
bash download_images.sh
This script reads image paths from the dataset JSON files and downloads only the required images from PhysioNet.
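If you want to inspect which images will be fetched, the same information can be pulled from the JSON files directly. A minimal sketch, assuming the generated split files carry the image_path field described under Dataset Schema:

```python
# Collect the unique CXR image paths referenced by MIMIC-CXR-VQA
# (a sketch; assumes generated split files under mimiccxrvqa/dataset/).
import json
from pathlib import Path

dataset_dir = Path("mimiccxrvqa/dataset")
image_paths = set()
for split in ("train", "valid", "test"):
    with open(dataset_dir / f"{split}.json") as f:
        image_paths.update(sample["image_path"] for sample in json.load(f))
print(f"{len(image_paths)} unique images referenced")
```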
Dataset Structure
mimiccxrvqa/
└── dataset/
├── ans2idx.json # Answer to index mapping
├── _train_part1.json # Pre-release (without answers)
├── _train_part2.json # Pre-release (without answers)
├── _valid.json # Pre-release (without answers)
├── _test.json # Pre-release (without answers)
├── train.json # Generated after running script
├── valid.json # Generated after running script
└── test.json # Generated after running script
Pre-release files (_*.json) are intentionally incomplete to safeguard privacy. Complete files with answers and metadata are generated after running the reproduction script with valid PhysioNet credentials.
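Because the answer vocabulary is fixed, ans2idx.json can be used to turn answer strings into class labels for classification-style VQA models. A minimal sketch, assuming the file is a flat {answer_string: index} mapping (as its name suggests):

```python
# Encode/decode answers with ans2idx.json
# (a sketch; assumes a flat {answer_string: index} mapping).
import json

with open("mimiccxrvqa/dataset/ans2idx.json") as f:
    ans2idx = json.load(f)
idx2ans = {idx: ans for ans, idx in ans2idx.items()}

label = ans2idx["no"]               # encode an answer string as a class index
print(label, "->", idx2ans[label])  # decode it back
```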
Dataset Schema
Each QA sample is a JSON object with the following fields:
Core Fields:
- split: Dataset split (train/valid/test)
- idx: Instance index
- image_id: Associated image ID
- question: Natural language question
- answer: Answer string (generated by script)
Template Fields:
- content_type: Content category (anatomy, attribute, presence, abnormality, plane, gender, size)
- semantic_type: Question type (verify, choose, query)
- template: Question template
- template_program: Program to generate answer from database
- template_arguments: Template argument values (object, attribute, category, viewpos, gender)
Metadata (generated by script):
- subject_id: Patient ID
- study_id: Study ID
- image_path: Image file path
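A minimal sketch of loading a generated split and tallying questions by these fields, assuming train.json is a JSON array of the sample objects described above:

```python
# Load a generated split and tally questions by content/semantic type
# (a sketch; assumes train.json is a JSON array of sample objects).
import json
from collections import Counter

with open("mimiccxrvqa/dataset/train.json") as f:
    samples = json.load(f)

print("content types:", Counter(s["content_type"] for s in samples))
print("semantic types:", Counter(s["semantic_type"] for s in samples))

first = samples[0]
print(first["question"], "->", first["answer"])
```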
Example:
{
"split": "train",
"idx": 13280,
"image_id": "34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0",
"question": "Are there signs of both pleural effusion and lung cancer in the left lower lung zone?",
"content_type": "attribute",
"semantic_type": "verify",
"template": "Are there signs of both ${attribute_1} and ${attribute_2} in the ${object}?",
"template_program": "program_5",
"template_arguments": {
"object": {"0": "left lower lung zone"},
"attribute": {"0": "pleural effusion", "1": "lung cancer"}
},
"answer": "no",
"subject_id": "10000032",
"study_id": "50414267",
"image_path": "files/p10/p10000032/s50414267/34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0.jpg"
}
Current: v1.0.0
This project uses semantic versioning. For detailed changes, see CHANGELOG.
When you use the MIMIC-CXR-VQA dataset, we would appreciate it if you cite the following:
@article{bae2024ehrxqa,
title={EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images},
author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric and Kim, Tackeun and others},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2024}
}
The code in this repository is provided under the terms of the MIT License. The final output dataset (MIMIC-CXR-VQA) is subject to the terms and conditions of the original datasets from PhysioNet: the MIMIC-CXR-JPG License, Chest ImaGenome License, and MIMIC-IV License.
For questions or concerns regarding this dataset, please contact:
- Seongsu Bae ([email protected])
- Daeun Kyung ([email protected])