MIMIC-CXR-VQA

A new medical visual question answering (VQA) dataset built on the MIMIC-CXR database

Overview

The MIMIC-CXR-VQA dataset is a complex (involving set and logical operations), diverse (with 48 templates), and large-scale (approximately 377K QA pairs) resource designed specifically for Visual Question Answering (VQA) tasks in the medical domain. Focusing primarily on chest radiographs, the dataset is derived mainly from the MIMIC-CXR-JPG and Chest ImaGenome datasets, both sourced from PhysioNet.

MIMIC-CXR-VQA is intended as a benchmark for evaluating the effectiveness of current medical VQA approaches. Beyond serving traditional medical VQA tasks, it is also unique in functioning as an image-based question answering resource for Electronic Health Records (EHRs). We therefore use its question templates as seed templates for the image modality when constructing EHRXQA, a multi-modal EHR QA dataset.

Updates

  • [07/20/2024] We released the MIMIC-CXR-VQA dataset on PhysioNet.
  • [12/12/2023] We presented this work as a poster at the NeurIPS 2023 Datasets and Benchmarks Track.
  • [10/28/2023] We released our research paper on arXiv.

Reproducing the Dataset

Prerequisites

New to PhysioNet? Follow these credentialing steps:
  1. Register for a PhysioNet account
  2. Follow the credentialing instructions
  3. Complete the CITI Data or Specimens Only Research training course
  4. Sign the DUA for each required dataset (links above)

Environment Setup

Using uv (Recommended)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa

# Create environment and install dependencies
uv venv --python 3.12
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install pandas tqdm scikit-learn

Using Conda
# Clone repository
git clone https://github.com/baeseongsu/mimic-cxr-vqa.git
cd mimic-cxr-vqa

# Create environment and install dependencies
conda create -n mimiccxrvqa python=3.12
conda activate mimiccxrvqa
pip install pandas tqdm scikit-learn

Running the Reproduction Script

Option 1: Build from pre-downloaded datasets

If you have already downloaded MIMIC-CXR, MIMIC-IV, and Chest ImaGenome, ensure the directory paths in the script match your local setup, then run:

bash build_dataset.sh

Option 2: Download and build

To download source datasets from PhysioNet and generate the dataset:

bash download_and_build_dataset.sh

When prompted, enter your PhysioNet credentials (password will not be displayed).

What these scripts do:

  1. Download source datasets from PhysioNet (MIMIC-CXR-JPG, Chest ImaGenome, MIMIC-IV)
  2. Preprocess the datasets
  3. Generate the complete MIMIC-CXR-VQA dataset with ground-truth answers and metadata (a quick sanity check is sketched below)
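
After the scripts complete, a quick sanity check is to load the generated split files and count the QA pairs. This is only a minimal sketch (assuming each generated split file is a JSON list of QA objects, following the schema below):

# sanity_check.py: minimal sketch; assumes each generated split file is a JSON list of QA objects
import json

for split in ["train", "valid", "test"]:
    with open(f"mimiccxrvqa/dataset/{split}.json") as f:
        samples = json.load(f)
    answered = sum(1 for s in samples if s.get("answer") is not None)
    print(f"{split}: {len(samples)} QA pairs, {answered} with answers")
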
Downloading Images Only

To download only the CXR images relevant to MIMIC-CXR-VQA (rather than all MIMIC-CXR-JPG images):

bash download_images.sh

This script reads image paths from the dataset JSON files and downloads only the required images from PhysioNet.
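
For reference, the set of required images corresponds to the unique image_path values in the dataset JSON files. The snippet below is an illustrative sketch only (not the logic of download_images.sh itself) and assumes the generated split files, which include the image_path field, are present:

# collect_image_paths.py: illustrative sketch, not the actual download_images.sh logic
# Assumes the generated split files contain the image_path field added by the build script.
import json

paths = set()
for split in ["train", "valid", "test"]:
    with open(f"mimiccxrvqa/dataset/{split}.json") as f:
        for sample in json.load(f):
            paths.add(sample["image_path"])

print(f"{len(paths)} unique MIMIC-CXR-JPG images are referenced by the dataset")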

Dataset Structure
mimiccxrvqa/
└── dataset/
    ├── ans2idx.json          # Answer to index mapping
    ├── _train_part1.json     # Pre-release (without answers)
    ├── _train_part2.json     # Pre-release (without answers)
    ├── _valid.json           # Pre-release (without answers)
    ├── _test.json            # Pre-release (without answers)
    ├── train.json            # Generated after running script
    ├── valid.json            # Generated after running script
    └── test.json             # Generated after running script

Pre-release files (_*.json) are intentionally incomplete to safeguard privacy. Complete files with answers and metadata are generated after running the reproduction script with valid PhysioNet credentials.
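
The answer vocabulary is stored in ans2idx.json. A minimal inspection sketch, assuming the file is a flat mapping from answer string to integer index (as its name and the comment above suggest):

# inspect_vocab.py: minimal sketch; assumes ans2idx.json is a dict of {answer_string: index}
import json

with open("mimiccxrvqa/dataset/ans2idx.json") as f:
    ans2idx = json.load(f)

print(f"answer vocabulary size: {len(ans2idx)}")
print("first entries:", list(ans2idx.items())[:5])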

Dataset Schema

Each QA sample is a JSON object with the following fields:

Core Fields:

  • split: Dataset split (train/valid/test)
  • idx: Instance index
  • image_id: Associated image ID
  • question: Natural language question
  • answer: Answer string (generated by script)

Template Fields:

  • content_type: Content category (anatomy, attribute, presence, abnormality, plane, gender, size)
  • semantic_type: Question type (verify, choose, query)
  • template: Question template
  • template_program: Program to generate answer from database
  • template_arguments: Template argument values (object, attribute, category, viewpos, gender)

Metadata (generated by script):

  • subject_id: Patient ID
  • study_id: Study ID
  • image_path: Image file path

Example:

{
    "split": "train",
    "idx": 13280,
    "image_id": "34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0",
    "question": "Are there signs of both pleural effusion and lung cancer in the left lower lung zone?",
    "content_type": "attribute",
    "semantic_type": "verify",
    "template": "Are there signs of both ${attribute_1} and ${attribute_2} in the ${object}?",
    "template_program": "program_5",
    "template_arguments": {
        "object": {"0": "left lower lung zone"},
        "attribute": {"0": "pleural effusion", "1": "lung cancer"}
    },
    "answer": "no",
    "subject_id": "10000032",
    "study_id": "50414267",
    "image_path": "files/p10/p10000032/s50414267/34c81443-5a19ccad-7b5e431c-4e1dbb28-42a325c0.jpg"
}
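
For model training, the answer string can be converted to a class index through ans2idx.json. A minimal sketch, assuming single-string answers as in the example above and a flat string-to-index mapping:

# encode_answer.py: minimal sketch; assumes answers are single strings and ans2idx maps answer -> index
import json

with open("mimiccxrvqa/dataset/ans2idx.json") as f:
    ans2idx = json.load(f)
with open("mimiccxrvqa/dataset/train.json") as f:
    train = json.load(f)

sample = train[0]
label = ans2idx[sample["answer"]]  # e.g., "no" maps to its integer class index
print(sample["question"], "->", sample["answer"], f"(class {label})")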

Version

Current: v1.0.0

This project uses semantic versioning. For detailed changes, see CHANGELOG.

Citation

If you use the MIMIC-CXR-VQA dataset, please cite the following:

@article{bae2024ehrxqa,
  title={EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric and Kim, Tackeun and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

License

The code in this repository is provided under the terms of the MIT License. The final output dataset (MIMIC-CXR-VQA) is subject to the terms and conditions of the original datasets from PhysioNet: MIMIC-CXR-JPG License, Chest ImaGenome License, and MIMIC-IV License.

Contact

For questions or concerns regarding this dataset, please contact:
