README.md: 7 additions & 0 deletions
@@ -41,6 +41,13 @@ pip install -e .
 ## 🚀 Usage
 python scripts/build_db.py
+
+## To add more modalities (e.g., genomics, pathology):
+1. Define new dataset schema and extraction instructions in `src/config.py`
+2. Implement new class and extraction function in `src/extract_MODALITY_dataset_information_llm.py`
+3. Import and call the new extraction function in `scripts/build_db.py` and add a conditional to check the modality type
+4. Optionally, update `.github/workflows/update_dbs.yml` to run the pipeline for the new modality on a schedule
+
+All instructions are notated in the code with comments like `#* add additional extraction instructions and functions for other modalities here, e.g. genomics, pathology, etc`
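The dispatch described in step 3 could be sketched roughly as follows. This is an illustrative outline only, assuming a registry-style conditional; the placeholder extractor body and the `EXTRACTORS` dict are not the repo's actual code, which lives in `scripts/build_db.py` and the per-modality `src/extract_MODALITY_dataset_information_llm.py` files.

```python
# Hypothetical sketch of the modality dispatch from step 3 above.
# The extractor function here is a placeholder standing in for the real
# extraction functions in src/extract_MODALITY_dataset_information_llm.py.

def extract_radiology_dataset_info(paper: dict) -> dict:
    """Placeholder for the radiology extraction function (step 2)."""
    return {"modality": "radiology", "title": paper.get("title")}

# Step 3: register each modality's extraction function, then dispatch on type.
EXTRACTORS = {
    "radiology": extract_radiology_dataset_info,
    # "genomics": extract_genomics_dataset_info,  # add new modalities here
}

def extract_for_modality(modality: str, paper: dict) -> dict:
    if modality not in EXTRACTORS:
        raise ValueError(f"No extractor registered for modality: {modality}")
    return EXTRACTORS[modality](paper)
```

A registry dict keeps step 3 to a one-line change per new modality instead of a growing if/elif chain.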
"# import src.extract_radiology_dataset_information_llm as erdil\n",
18
18
"# importlib.reload(erdil)\n",
19
-
"# from src.extract_radiology_dataset_information_llm import extract_with_agent\n",
19
+
"# from src.extract_radiology_dataset_information_llm import extract_radiology_dataset_info_with_agent\n",
20
20
"\n",
21
21
"gpu_id = 0\n",
22
22
"vllm_port = 8001\n",
@@ -58,7 +58,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "id": "15b3d6f8",
 "metadata": {},
 "outputs": [
@@ -83,7 +83,7 @@
 "abstract = \"The large volume of abdominal computed tomography (CT) scans1,2 coupled with the shortage of radiologists3,4,5,6 have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision–language models (VLMs) that jointly model images and radiology reports7,8,9,10,11,12. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also computed scaling laws and conducted ablation studies to identify optimal training strategies. We release our trained models, code and dataset for 25,494 pairs of abdominal CT scans and radiology reports. Our results demonstrate how Merlin may assist in the interpretation of abdominal CT scans and mitigate the burden on radiologists while simultaneously adding value for future biomarker discovery and disease risk stratification.\"\n",
("Database Management Systems"[MeSH] OR dataset[ti] OR database[ti] OR "data collection"[ti] OR "information repository"[ti] OR benchmark[ti] OR "challenge data"[ti] OR "data commons"[ti] OR "data repository"[ti] OR "data sharing"[ti])
46
52
AND ("Radiology"[MeSH] OR "Radiography"[MeSH] OR "Radiology Information Systems"[MeSH] OR radiology[tiab] OR radiograph[tiab] OR "Diagnostic Imaging"[tiab] OR "Medical Image"[tiab] OR "Medical Imaging"[tiab] OR "Biomedical Image"[tiab] OR "Biomedical Imaging"[tiab] OR XR[tiab] OR CT[tiab] OR MRI[tiab] OR PET[tiab] OR SPECT[tiab] OR "X-ray"[tiab] OR "Computed Tomography"[tiab] OR "Magnetic Resonance"[tiab] OR Ultrasound[tiab] OR "Positron Emission Tomography"[tiab] OR "Single Photon Emission Computed Tomography"[tiab])
47
53
"""# removed "Databases, Factual"[MeSH] because it dropped search space from 12319 to 3877 while keeping all of my test cases
48
54
PUBMED_QUERY=" ".join(PUBMED_QUERY.split()) # strip new lines
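The `" ".join(PUBMED_QUERY.split())` idiom above flattens the triple-quoted query into a single line suitable for a URL parameter. A minimal standalone illustration (the short query string and the `retmax` value here are toy examples, not the full query above):

```python
from urllib.parse import urlencode

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces collapses newlines and indentation.
raw_query = """
    (dataset[ti] OR database[ti])
    AND (radiology[tiab] OR CT[tiab])
"""
flat_query = " ".join(raw_query.split())
print(flat_query)
# (dataset[ti] OR database[ti]) AND (radiology[tiab] OR CT[tiab])

# The flattened string can then be URL-encoded as the `term` parameter
# of NCBI E-utilities ESearch against the pubmed database.
params = {"db": "pubmed", "term": flat_query, "retmode": "json", "retmax": 20}
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(params)
```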
+
+#* is_database_paper_classifier_llm.py
+CLASSIFICATION_INSTRUCTIONS = (
+    "Determine whether the paper INTRODUCES or CREATES a dataset.\n"
+    "Return is_dataset_creation = true if:\n"
+    "- The paper develops, constructs, introduces, or presents a dataset/database/benchmark\n"
+    "- Even if the dataset has no explicit name\n\n"
+    "Return false if:\n"
+    "- The paper only uses existing datasets\n"
+    "- It is a methods/model paper\n"
+    "- It analyzes data without creating a dataset\n\n"
+    "Be conservative: if unsure, return true."
+)
+
+CLASSIFICATION_AGENT_INSTRUCTIONS = "Classify whether this paper creates a dataset"
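The instructions above ask the model for a boolean `is_dataset_creation` field. One way to consume such a reply is to parse it into a small typed result, defaulting to `true` on malformed output to mirror the "if unsure, return true" rule. This is an illustrative sketch only; the `ClassificationResult` dataclass and `parse_classifier_output` helper are hypothetical, not the repo's actual classifier code.

```python
import json
from dataclasses import dataclass

# Hypothetical structured result for the classifier prompts above.
# The field name is_dataset_creation comes from CLASSIFICATION_INSTRUCTIONS;
# the dataclass and parser themselves are illustrative.

@dataclass
class ClassificationResult:
    is_dataset_creation: bool

def parse_classifier_output(raw: str) -> ClassificationResult:
    """Parse an LLM's JSON reply; fall back to True on malformed output,
    mirroring the 'if unsure, return true' instruction."""
    try:
        payload = json.loads(raw)
        return ClassificationResult(bool(payload["is_dataset_creation"]))
    except (json.JSONDecodeError, KeyError, TypeError):
        return ClassificationResult(True)
```

Defaulting to `True` keeps the pipeline recall-oriented: a borderline paper is passed along for extraction rather than silently dropped.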