
Commit b5feaa8

Update GitHub Actions and add support for more data modalities
1 parent 50e9ef6 commit b5feaa8

File tree

7 files changed: +94 −41 lines


.github/workflows/update_dbs.yml

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# name: Weekly Build Radiology DB
+
+# on:
+#   schedule:
+#     - cron: "0 9 * * 1" # every Monday at 09:00 UTC
+#   workflow_dispatch: # allows manual trigger
+
+# jobs:
+#   build:
+#     runs-on: ubuntu-latest
+
+#     strategy:
+#       matrix:
+#         modality: ["radiology"] #* add additional modalities here, e.g. genomics, pathology, etc
+
+#     steps:
+#       - name: Checkout repo
+#         uses: actions/checkout@v4
+
+#       - name: Set up Python
+#         uses: actions/setup-python@v5
+#         with:
+#           python-version: "3.10"
+
+#       - name: Install dependencies
+#         run: |
+#           pip install -e .
+
+#       - name: Run build script
+#         run: |
+#           python scripts/build_db.py --database-modality ${{ matrix.modality }}

README.md

Lines changed: 7 additions & 0 deletions
@@ -41,6 +41,13 @@ pip install -e .
 ## 🚀 Usage
     python scripts/build_db.py

+## To add more modalities (e.g., genomics, pathology):
+1. Define new dataset schema and extraction instructions in `src/config.py`
+2. Implement a new class and extraction function in `src/extract_MODALITY_dataset_information_llm.py`
+3. Import and call the new extraction function in `scripts/build_db.py` and add a conditional to check the modality type
+4. Optionally, update `.github/workflows/update_dbs.yml` to run the pipeline for the new modality on a schedule
+
+All insertion points are notated in the code with comments like `#* add additional extraction instructions and functions for other modalities here, e.g. genomics, pathology, etc`

 ## Testing
 ### Just unit tests:
     pytest
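The four numbered steps above can be sketched end to end. This is a minimal, illustrative sketch only: the genomics modality does not exist in this commit, every name below is hypothetical, and a plain dataclass stands in for the pydantic model the real code uses.

```python
import asyncio
from dataclasses import dataclass

# Step 1 (src/config.py): instructions/schema for the new modality (hypothetical).
GENOMICS_EXTRACTION_INSTRUCTIONS = "Extract genomics dataset information"

# Step 2 (src/extract_genomics_dataset_information_llm.py): new class + function.
@dataclass
class GenomicsDataset:  # stand-in for a pydantic BaseModel subclass
    name: str

async def extract_genomics_dataset_info_with_agent(title, abstract, publication_metadata=None):
    # A real implementation would call the LLM agent with
    # GENOMICS_EXTRACTION_INSTRUCTIONS; here we just echo the title.
    return GenomicsDataset(name=title)

# Step 3 (scripts/build_db.py): dispatch on the configured modality.
async def run(modality, title, abstract):
    if modality == "genomics":
        return await extract_genomics_dataset_info_with_agent(title, abstract)
    raise ValueError(f"Unsupported database modality: {modality}")

ds = asyncio.run(run("genomics", "GenomeBank", "..."))
print(ds.name)  # GenomeBank
```

Step 4 is then a one-line change to the (currently commented-out) workflow matrix: `modality: ["radiology", "genomics"]`.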

notebooks/llm_extraction.ipynb

Lines changed: 4 additions & 4 deletions
@@ -11,12 +11,12 @@
 "base_directory = os.path.dirname(os.path.abspath(\"\"))\n",
 "\n",
 "import pandas as pd\n",
-"from src.extract_radiology_dataset_information_llm import extract_with_agent\n",
+"from src.extract_radiology_dataset_information_llm import extract_radiology_dataset_info_with_agent\n",
 "\n",
 "# import importlib\n",
 "# import src.extract_radiology_dataset_information_llm as erdil\n",
 "# importlib.reload(erdil)\n",
-"# from src.extract_radiology_dataset_information_llm import extract_with_agent\n",
+"# from src.extract_radiology_dataset_information_llm import extract_radiology_dataset_info_with_agent\n",
 "\n",
 "gpu_id = 0\n",
 "vllm_port = 8001\n",
@@ -58,7 +58,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "id": "15b3d6f8",
 "metadata": {},
 "outputs": [
@@ -83,7 +83,7 @@
 "abstract = \"The large volume of abdominal computed tomography (CT) scans1,2 coupled with the shortage of radiologists3,4,5,6 have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision–language models (VLMs) that jointly model images and radiology reports7,8,9,10,11,12. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also computed scaling laws and conducted ablation studies to identify optimal training strategies. We release our trained models, code and dataset for 25,494 pairs of abdominal CT scans and radiology reports. Our results demonstrate how Merlin may assist in the interpretation of abdominal CT scans and mitigate the burden on radiologists while simultaneously adding value for future biomarker discovery and disease risk stratification.\"\n",
 "link = \"https://doi.org/10.1038/s41586-026-10181-8\"\n",
 "\n",
-"dataset = await extract_with_agent(title, abstract, publication_metadata={\"link\": link})\n",
+"dataset = await extract_radiology_dataset_info_with_agent(title, abstract, publication_metadata={\"link\": link})\n",
 "print(dataset)"
 ]
 },

scripts/build_db.py

Lines changed: 12 additions & 6 deletions
@@ -7,11 +7,12 @@
 import pandas as pd
 from Bio import Entrez
 from dotenv import load_dotenv
+from pydantic import BaseModel
 from tqdm import tqdm

 from src.config import CONFIG, IDS_TO_KEEP, LOG_LEVEL, MODEL, PUBMED_QUERY
-from src.extract_radiology_dataset_information_llm import (RadiologyDataset,
-                                                           extract_with_agent)
+#* add additional extraction instructions and functions for other modalities here, e.g. genomics, pathology, etc
+from src.extract_radiology_dataset_information_llm import extract_radiology_dataset_info_with_agent
 from src.pubmed_utils import (
     add_column_to_isolate_mesh_terms_from_pubmed_matches,
     extract_pubmed_metadata, fetch_pubmed_citation_counts,
@@ -33,6 +34,7 @@
 def parse_args():
     parser = argparse.ArgumentParser(description="Build radiology dataset table")

+    parser.add_argument("--database-modality", type=str)
     parser.add_argument("--output-path", type=str)
     parser.add_argument("--output-path-failed", type=str)
     parser.add_argument("--max-papers", type=int)
@@ -120,7 +122,7 @@ async def main():
         logger.warning("No articles found.")
         return

-    extracted_datasets: List[RadiologyDataset] = []
+    extracted_datasets: List[BaseModel] = []
     failed_metadata = []
     for article in tqdm(articles):
         try:
@@ -134,11 +136,15 @@ async def main():

         dataset = None
         for _ in range(CONFIG.num_tries_agent):
-            dataset = await extract_with_agent(title, abstract, publication_metadata)
+            if CONFIG.database_modality == "radiology":
+                dataset = await extract_radiology_dataset_info_with_agent(title, abstract, publication_metadata)
+            #* add additional modalities here with corresponding extraction functions, e.g. genomics, pathology, etc
+            else:
+                raise ValueError(f"Unsupported database modality: {CONFIG.database_modality}. Supported modalities: radiology.")
             if dataset is not None:
                 break

-        if isinstance(dataset, RadiologyDataset):
+        if isinstance(dataset, BaseModel):
             extracted_datasets.append(dataset)
         else:
             logger.debug(f"Extraction failed for article: {title}")
@@ -175,7 +181,7 @@ async def main():
     # Save to CSV
     df.to_csv(CONFIG.output_path, index=False)

-    if CONFIG.output_path_failed:
+    if CONFIG.output_path_failed and CONFIG.output_path_failed != "None": # catch "None" string from env var
         if len(failed_metadata) == 0:
             logger.info("No failed extractions to save.")
         else:
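The modality check in `main()` is an if/else chain, which the `#*` comment says will grow one branch per modality. An alternative worth noting is a dispatch table mapping modality name to extraction coroutine; this is a hedged sketch with names of my own choosing (the placeholder extractor stands in for `extract_radiology_dataset_info_with_agent`), not code from the repo:

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

async def extract_radiology(title: str, abstract: str) -> Optional[dict]:
    # Placeholder for the real agent call; returns a dict instead of a pydantic model.
    return {"modality": "radiology", "title": title}

# Dispatch table: modality name -> extraction coroutine.
# A new modality registers one entry instead of extending an if/elif chain.
EXTRACTORS: Dict[str, Callable[[str, str], Awaitable[Optional[dict]]]] = {
    "radiology": extract_radiology,
}

async def extract(modality: str, title: str, abstract: str) -> Optional[dict]:
    try:
        extractor = EXTRACTORS[modality]
    except KeyError:
        raise ValueError(
            f"Unsupported database modality: {modality}. "
            f"Supported modalities: {', '.join(EXTRACTORS)}."
        )
    return await extractor(title, abstract)

result = asyncio.run(extract("radiology", "RadImageNet", "..."))
print(result["modality"])  # radiology
```

The error message can then enumerate `EXTRACTORS` automatically instead of hard-coding "radiology" in the string.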

src/config.py

Lines changed: 37 additions & 28 deletions
@@ -2,24 +2,29 @@
 import logging
 import os
 import subprocess
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from typing import Optional

 from dotenv import load_dotenv

 load_dotenv()

-LOG_LEVEL = logging.DEBUG #* DEBUG, INFO, WARNING, ERROR, CRITICAL
+LOG_LEVEL = logging.DEBUG # DEBUG, INFO, WARNING, ERROR, CRITICAL

 @dataclass
 class Config:
-    output_path: str = "data/radiology_db.csv"
-    output_path_failed: str = "data/radiology_db_failed.csv"
+    database_modality: str = "radiology" # e.g. radiology, genomics, pathology, etc
     max_papers: Optional[int] = 9999 # None for all papers; set to small number for debugging
     min_citations: int = 25 # filter out papers with fewer than this many citations (set to 0 to disable)
     num_tries_agent: int = 5
     overwrite: bool = False

+    output_path: str = field(init=False)
+    output_path_failed: str = field(init=False)
+    def __post_init__(self):
+        self.output_path = f"data/{self.database_modality}_db.csv"
+        self.output_path_failed = None # f"data/{self.database_modality}_db_failed.csv"
+
 def get_model() -> str:
     model = os.getenv("MODEL", "openai:Qwen/Qwen2.5-7B-Instruct")
     vllm_port = os.getenv("VLLM_PORT")
@@ -40,13 +45,30 @@ def get_model() -> str:
 CONFIG = Config()
 MODEL = get_model()

+#* PubMed
 # MeSH terms: https://www.ncbi.nlm.nih.gov/mesh/?term=%22radiology%22%5BMeSH%20Terms%5D%20OR%20%22radiographic%22%5BMeSH%20Terms%5D%20OR%20%22radiography%22%5BMeSH%20Terms%5D%20OR%20radiology%5BText%20Word%5D&cmd=DetailsSearch
 PUBMED_QUERY = """
 ("Database Management Systems"[MeSH] OR dataset[ti] OR database[ti] OR "data collection"[ti] OR "information repository"[ti] OR benchmark[ti] OR "challenge data"[ti] OR "data commons"[ti] OR "data repository"[ti] OR "data sharing"[ti])
 AND ("Radiology"[MeSH] OR "Radiography"[MeSH] OR "Radiology Information Systems"[MeSH] OR radiology[tiab] OR radiograph[tiab] OR "Diagnostic Imaging"[tiab] OR "Medical Image"[tiab] OR "Medical Imaging"[tiab] OR "Biomedical Image"[tiab] OR "Biomedical Imaging"[tiab] OR XR[tiab] OR CT[tiab] OR MRI[tiab] OR PET[tiab] OR SPECT[tiab] OR "X-ray"[tiab] OR "Computed Tomography"[tiab] OR "Magnetic Resonance"[tiab] OR Ultrasound[tiab] OR "Positron Emission Tomography"[tiab] OR "Single Photon Emission Computed Tomography"[tiab])
 """ # removed "Databases, Factual"[MeSH] because it dropped search space from 12319 to 3877 while keeping all of my test cases
 PUBMED_QUERY = " ".join(PUBMED_QUERY.split()) # strip new lines

+#* is_database_paper_classifier_llm.py
+CLASSIFICATION_INSTRUCTIONS = (
+    "Determine whether the paper INTRODUCES or CREATES a dataset.\n"
+    "Return is_dataset_creation = true if:\n"
+    "- The paper develops, constructs, introduces, or presents a dataset/database/benchmark\n"
+    "- Even if the dataset has no explicit name\n\n"
+    "Return false if:\n"
+    "- The paper only uses existing datasets\n"
+    "- It is a methods/model paper\n"
+    "- It analyzes data without creating a dataset\n\n"
+    "Be conservative: if unsure, return true."
+)
+
+CLASSIFICATION_AGENT_INSTRUCTIONS = "Classify whether this paper creates a dataset"
+
+#* extract_radiology_dataset_information_llm.py
 EXTRACTION_INSTRUCTIONS = (
     "You MUST extract a dataset name.\n"
     "Never return null for name.\n"
@@ -70,27 +92,14 @@ def get_model() -> str:

 EXTRACTION_AGENT_INSTRUCTIONS = "Extract dataset information"

-#* for real time, set to None
-# IDS_TO_KEEP = None
-IDS_TO_KEEP = {
-    "36204533", # RadImageNet
-    "31831740", # MIMIC-CXR
-    "32457287", # UK Biobank
-    "23884657", # TCIA
-    "41781626", # Merlin
-}
-
-
-CLASSIFICATION_INSTRUCTIONS = (
-    "Determine whether the paper INTRODUCES or CREATES a dataset.\n"
-    "Return is_dataset_creation = true if:\n"
-    "- The paper develops, constructs, introduces, or presents a dataset/database/benchmark\n"
-    "- Even if the dataset has no explicit name\n\n"
-    "Return false if:\n"
-    "- The paper only uses existing datasets\n"
-    "- It is a methods/model paper\n"
-    "- It analyzes data without creating a dataset\n\n"
-    "Be conservative: if unsure, return true."
-)
-
-CLASSIFICATION_AGENT_INSTRUCTIONS = "Classify whether this paper creates a dataset"
+#$ for real time, set to None
+IDS_TO_KEEP = None
+# IDS_TO_KEEP = {
+#     "36204533", # RadImageNet
+#     "31831740", # MIMIC-CXR
+#     "32457287", # UK Biobank
+#     "23884657", # TCIA
+#     "41781626", # Merlin
+# }
+
+#* add additional instructions and config variables for other modalities here, e.g. genomics, pathology, etc
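The `Config` change above relies on the dataclasses pattern of declaring derived attributes with `field(init=False)` and computing them in `__post_init__`, so the output paths always track `database_modality`. A minimal standalone illustration of that pattern (the class name here is illustrative, not from the repo):

```python
from dataclasses import dataclass, field

@dataclass
class ModalityConfig:
    # The only value callers set; everything else is derived from it.
    database_modality: str = "radiology"
    # init=False: excluded from __init__, filled in by __post_init__ below.
    output_path: str = field(init=False)

    def __post_init__(self):
        # Derived path stays consistent with whatever modality was chosen.
        self.output_path = f"data/{self.database_modality}_db.csv"

cfg = ModalityConfig(database_modality="genomics")
print(cfg.output_path)  # data/genomics_db.csv
```

Passing `output_path` to the constructor is now a `TypeError`, which is the point: the path can no longer drift out of sync with the modality.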

src/extract_radiology_dataset_information_llm.py

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ def name_matches_title(dataset_name: str, title: str) -> bool:
 # -----------------------------
 # LLM EXTRACTION (ASYNC)
 # -----------------------------
-async def extract_with_agent(
+async def extract_radiology_dataset_info_with_agent(
     title: str,
     abstract: str,
     publication_metadata: Optional[dict] = None,

tests/test_llm_output.py

Lines changed: 2 additions & 2 deletions
@@ -82,7 +82,7 @@ def test_serialize_dataset_against_ground_truth(monkeypatch, paper_key):
 @pytest.mark.integration
 @pytest.mark.slow
 @pytest.mark.parametrize("paper_key", PAPER_KEYS)
-def test_extract_with_agent_integration(monkeypatch, paper_key):
+def test_extract_radiology_dataset_info_with_agent_integration(monkeypatch, paper_key):
     if not (os.getenv("VLLM_PORT") or os.getenv("OPENAI_API_KEY")):
         pytest.skip("Integration test requires VLLM_PORT or OPENAI_API_KEY")
     if not _has_integration_dependencies():
@@ -104,7 +104,7 @@ def test_extract_with_agent_integration(monkeypatch, paper_key):

     dataset = None
     for _ in range(NUM_TRIES_AGENT_TEST):
-        dataset = asyncio.run(module.extract_with_agent(title=title, abstract=abstract, publication_metadata=publication_metadata))
+        dataset = asyncio.run(module.extract_radiology_dataset_info_with_agent(title=title, abstract=abstract, publication_metadata=publication_metadata))
         if dataset is not None:
             break
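Both `scripts/build_db.py` and this integration test repeat the same retry idiom: call the agent up to N times and stop at the first non-None result. That idiom can be factored into a small helper; a sketch under my own naming (not a helper that exists in the repo), with a deliberately flaky stand-in for the agent call:

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")

async def retry_agent(call: Callable[[], Awaitable[Optional[T]]], num_tries: int) -> Optional[T]:
    """Re-run a flaky agent call until it returns a non-None result, or give up."""
    for _ in range(num_tries):
        result = await call()
        if result is not None:
            return result
    return None

# A stand-in call that fails (returns None) twice before succeeding.
attempts = {"n": 0}

async def flaky():
    attempts["n"] += 1
    return "dataset" if attempts["n"] >= 3 else None

print(asyncio.run(retry_agent(flaky, 5)))  # dataset
```

The loop in `main()` would then collapse to one `await retry_agent(...)` call per article, with the modality dispatch living inside the callable.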
