Attention-based deep learning for analysis of pathology images and gene expression data in lung squamous premalignant lesions
We propose a missing-data-friendly framework that integrates histologic and transcriptomic data to discriminate dysplastic from non-dysplastic lesions and to enable stratification of early lung lesions. The study uses H&E whole slide images (WSIs) of endobronchial biopsies and bulk gene expression (GE) data derived from endobronchial biopsies and brushings. Samples were obtained from patients at high risk for lung cancer and come from six previously published studies and ongoing lung precancer atlas efforts.
WSI features are obtained through binary mask-based tessellation followed by patch feature extraction using UNI2-h [1, 2]. Related scripts can be found in ./preprocessing/WSI/. Example executions are available in ./scripts. Example workflow:
cd scripts
bash compute_mask.sh
bash tile_WSI.sh
bash extract_features.sh

Gene features associated with bronchial dysplasia are selected using a linear mixed-effects model applied to normalized, batch-corrected bulk RNA-seq data. Related scripts can be found in ./preprocessing/gene/.
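To make the mask-based tessellation step of the WSI preprocessing more concrete, here is a minimal sketch in plain Python. It is not the code in ./preprocessing/WSI/. The function name `tile_coordinates`, the `min_tissue_fraction` threshold, and the toy mask are all assumptions made for illustration. The idea is to lay a regular grid over a binary tissue mask and keep only the tiles that contain enough tissue.

```python
def tile_coordinates(mask, tile_size, min_tissue_fraction=0.5):
    """Return (row, col) top-left corners of tiles that contain enough tissue.

    mask: 2D list of 0/1 values (1 = tissue); a hypothetical low-resolution mask.
    tile_size: tile edge length in mask pixels.
    min_tissue_fraction: assumed threshold, not taken from the repository.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    kept = []
    # Walk a non-overlapping grid; partial tiles at the border are skipped.
    for top in range(0, n_rows - tile_size + 1, tile_size):
        for left in range(0, n_cols - tile_size + 1, tile_size):
            tissue = sum(
                mask[r][c]
                for r in range(top, top + tile_size)
                for c in range(left, left + tile_size)
            )
            if tissue / (tile_size * tile_size) >= min_tissue_fraction:
                kept.append((top, left))
    return kept


# Toy 4x6 mask: tissue occupies the left part of the slide.
toy_mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
tiles = tile_coordinates(toy_mask, tile_size=2, min_tissue_fraction=0.5)
print(tiles)  # [(0, 0), (0, 2), (2, 0)]
```

In a real pipeline the kept coordinates would be scaled back to full-resolution slide coordinates before the patches are read and passed to the feature extractor.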
A single-modality model is trained and validated on samples with WSI only (Model a) or GE only (Model c). A dual-modality model is trained on samples with both WSI and GE and validated on samples with WSI only (Model b) or GE only (Model d).
- Implementation details: ./src/model/training1.py.
- The presence or absence of each modality is controlled by the --mask-gene and --mask-wsi flags.
- Example: ./scripts/training1.sh.
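The following sketch shows one way a pair of modality-masking flags could be parsed and interpreted. It assumes that --mask-gene and --mask-wsi are simple boolean switches. The actual argument definitions in training1.py may differ, and the helper names `build_parser` and `visible_modalities` are hypothetical.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical modality-masking options")
    # Assumed to be boolean switches; the real script may define them differently.
    parser.add_argument("--mask-gene", action="store_true", help="hide the GE input")
    parser.add_argument("--mask-wsi", action="store_true", help="hide the WSI input")
    return parser


def visible_modalities(args):
    """Return the modalities the model would see under the given flags."""
    visible = []
    if not args.mask_wsi:
        visible.append("WSI")
    if not args.mask_gene:
        visible.append("GE")
    if not visible:
        raise ValueError("At least one modality must remain unmasked")
    return visible


# Masking GE leaves a WSI-only input.
args = build_parser().parse_args(["--mask-gene"])
print(visible_modalities(args))  # ['WSI']
```

Under this reading, passing neither flag corresponds to the paired WSI and GE input used to train the dual-modality model, and passing exactly one flag corresponds to a single-modality input.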
The flexible fusion model relaxes the requirement of paired WSI and GE data and allows the training data to include either WSI or GE or both.
- Implementation details: ./src/model/training2.py.
- Example: ./scripts/training2.sh.
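The fusion mechanism itself is not described here, so the sketch below shows only one common way to combine modalities when either may be missing: a softmax attention restricted to the modalities that are present. The function `fuse`, the embedding vectors, and the attention scores are illustrative assumptions and are not taken from training2.py.

```python
import math


def fuse(embeddings, scores):
    """Fuse the available modality embeddings with softmax attention weights.

    embeddings: dict mapping modality name to a vector (list of floats),
                or to None when that modality is missing for the sample.
    scores: dict mapping modality name to an unnormalized attention score.
    """
    present = [m for m, e in embeddings.items() if e is not None]
    if not present:
        raise ValueError("sample has no modality")
    # Softmax over present modalities only, so a missing one gets no weight.
    max_score = max(scores[m] for m in present)
    exp = {m: math.exp(scores[m] - max_score) for m in present}
    total = sum(exp.values())
    weights = {m: exp[m] / total for m in present}
    dim = len(embeddings[present[0]])
    fused = [sum(weights[m] * embeddings[m][i] for m in present) for i in range(dim)]
    return fused, weights


toy_scores = {"WSI": 0.0, "GE": 0.0}
paired, paired_w = fuse({"WSI": [1.0, 0.0], "GE": [0.0, 1.0]}, toy_scores)
wsi_only, wsi_w = fuse({"WSI": [1.0, 0.0], "GE": None}, toy_scores)
print(paired, paired_w)
print(wsi_only, wsi_w)
```

Because the weights are renormalized per sample, the same function handles paired samples and samples with only WSI or only GE, which is the property the flexible fusion setting needs.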
- To evaluate a single- or dual-modality model, run ./src/model/inference1.py. Example: ./scripts/inference1.sh.
- To evaluate the flexible fusion model, run ./src/model/inference2.py. Example: ./scripts/inference2.sh.
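The metrics reported by the inference scripts are not listed here. For a binary task such as dysplastic versus non-dysplastic classification, the area under the ROC curve is a common choice, and the sketch below computes it from predicted scores by pairwise comparison. The function `auroc` and the toy labels and scores are assumptions for illustration only.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Fraction of (positive, negative) pairs in which the positive scores higher.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [1, 1, 0, 0]  # 1 = dysplastic, 0 = non-dysplastic (toy data)
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(auroc(toy_labels, toy_scores))  # 0.75
```

The pairwise form is quadratic in the number of samples. That is fine for a sketch, but a rank-based implementation would be preferable for large cohorts.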
[1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M. and Williams, M., 2024. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3), pp.850-862.
[2] MahmoodLab. UNI2-h. Hugging Face model repository. https://huggingface.co/MahmoodLab/UNI2-h (CC-BY-NC-ND-4.0), 2025.
