BSc thesis project. The question: does adding structured ED triage data (vitals, acuity, missingness flags) to a chest X-ray model actually improve pneumonia detection, or does the image already capture everything useful?
Short answer: calibration improves, discrimination does not.
Data: MIMIC-CXR-JPG v2.1.0 + MIMIC-IV v2.2 + MIMIC-IV-ED v2.2 (PhysioNet credentialed access required)
Cohort: 9,137 ED-anchored studies, 80/10/10 patient-level temporal split, test set n = 1,075
Backbone: DenseNet-121 pretrained on the multilabel CheXpert-style task using 41,214 non-ED MIMIC-CXR-JPG studies, then fine-tuned for binary pneumonia
Test set metrics (u_ignore label policy, 2,000-replicate patient-level paired bootstrap).
| Model | AUROC | AUPRC | ECE |
|---|---|---|---|
| Logistic Regression (triage only) | 0.606 | 0.548 | 0.037 |
| XGBoost (triage only) | 0.611 | 0.567 | 0.046 |
| Image-only DenseNet-121 | 0.746 | 0.724 | 0.067 |
| Multimodal — Concat MLP | 0.736 | 0.714 | 0.040 |
| Multimodal — Attention Fusion | 0.737 | 0.710 | 0.132 |
The multimodal concat model does not significantly outperform image-only on discrimination (ΔAUROC = −0.009, 95% CI [−0.023, +0.005], P(Δ>0) = 0.10). The meaningful difference is calibration: ECE drops from 0.067 to 0.040, a 40% reduction. Paired bootstrap on ΔECE gives 95% CI [−0.041, +0.003] with P(ΔECE<0) = 0.961 — strong directional evidence, borderline by a zero-exclusion CI criterion. Post-hoc temperature scaling on the image model (T = 1.21) reduces image ECE only to 0.065, so the calibration gap is not recoverable via a single-scalar correction. Both deep-learning models beat the triage-only baselines by a large margin (P(Δ>0) = 1.0).
Attention fusion matched image-only on AUROC but was severely miscalibrated (ECE 0.132), making it unsuitable at this feature-set size.
Internal note on non-ED generalization: running the image model on non-ED MIMIC-CXR (n = 9,589) gives AUROC 0.534 [0.520, 0.548]. The backbone was pretrained on this population so this is not an independent test.
All numbers, CIs, and pairwise comparisons: artifacts/evaluation/final_publication_report.json
src/
data/ cohort construction and feature engineering (19 scripts)
datasets/ PyTorch Dataset classes for binary, multilabel, multimodal
models/ DenseNet-121 backbone, TabularMLP, concat and attention fusion
training/ training scripts for all model types
evaluation/ bootstrap, calibration analysis, decision curve analysis
interpretability/ Grad-CAM
qc/ data quality checks
scripts/ figure generation, SHAP, temperature scaling, delta-ECE bootstrap, publication report aggregation
configs/
experiments/ YAML configs for the main training runs
paths.local.example.yaml
artifacts/
evaluation/ bootstrap JSONs, calibration plots, ablation CSV, SHAP figures
interpretability/ Grad-CAM overlays (TP, FP, FN cases)
models/ config.json and metrics.json per run (weights not committed)
tests/ 77 unit tests covering models, feature engineering, bootstrap eval
Dockerfile
Makefile
streamlit_app.py dashboard for browsing run metrics and bootstrap results
t₀ anchor: CXR acquisition timestamp (DICOM StudyDate + StudyTime). Every clinical feature comes from the ED triage intake, which structurally precedes imaging. There are no post-t₀ features.
Label policy: CheXpert uncertain labels are excluded from training (u_ignore). Results under u_zero and u_one policies are included in the ablation.
Split: Patients ranked by their first t₀, then assigned 80/10/10. No patient appears in more than one split.
pip install -r requirements_dev.txt
pip install -e . --no-depsMIMIC data requires PhysioNet credentialed access. After downloading, set your local paths:
cp configs/paths.local.example.yaml configs/paths.local.yaml
# fill in MIMIC roots in paths.local.yamlBuild the cohort manifests from raw MIMIC data first (required before any training step):
bash scripts/run_data_pipeline.shThen run the training and evaluation pipeline:
make pretrain # multilabel CheXpert pretraining
make finetune_image # image-only binary pneumonia
make finetune_multimodal # image + triage fusion
make train_clinical # logistic regression and XGBoost baselines
make evaluate # bootstrap, calibration, ablation
make shap # SHAP for XGBoost
make report # write final_publication_report.json
make all # full pipeline (after run_data_pipeline.sh)Docker (requires GPU):
docker build -t multimodal-pneumonia .
docker run --gpus all \
-v /path/to/mimic_cxr_jpg:/workspace/mimic_cxr_jpg:ro \
-v /path/to/mimic_iv_ed:/workspace/mimic_iv_ed:ro \
-v /path/to/artifacts:/workspace/artifacts \
multimodal-pneumonia make allTrained weights (~800 MB–1.2 GB per checkpoint) are not redistributed under the PhysioNet Data Use Agreement. To obtain weights, complete PhysioNet credentialing and re-run make all after configuring configs/paths.local.yaml.
python -m pytest tests/ -v77 tests, 1 skipped (requires the temporal constraint columns). Tests cover model forward passes, feature clipping bounds, bootstrap metric computation, and pipeline QC invariants. No MIMIC data required.
streamlit run streamlit_app.pyThe dashboard reads pre-computed metrics from artifacts/evaluation/ and runs without MIMIC access for browsing results, calibration plots, bootstrap comparisons, and Grad-CAM overlays.
Uses MIMIC-CXR-JPG v2.1.0, MIMIC-IV v2.2, and MIMIC-IV-ED v2.2 under the PhysioNet Credentialed Health Data License. No patient-level data is committed to this repository. Reproducing results requires completing PhysioNet credentialing at physionet.org.