
Attention-based deep learning for analysis of pathology images and gene expression data in lung squamous premalignant lesions

Introduction

We propose a missing-data-friendly framework that integrates histologic and transcriptomic data to discriminate dysplastic from non-dysplastic lesions and enable stratification of early lung lesions. The study leverages H&E whole slide images (WSIs) of endobronchial biopsies and bulk gene expression (GE) data derived from endobronchial biopsies and brushings, drawn from six previously published studies and ongoing lung precancer atlas efforts in patients at high risk for lung cancer.

Training

1. WSI feature extraction

WSI features are obtained through binary mask-based tessellation followed by patch feature extraction using UNI2-h. Related scripts can be found in ./preprocessing/WSI/. Example executions are available in ./scripts. Example workflow:

cd scripts
bash compute_mask.sh
bash tile_WSI.sh
bash extract_features.sh
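The tessellation step above can be sketched as follows. This is an illustrative, dependency-free stand-in, not the repository's implementation: the mask is a toy 2D list, and the patch size and tissue threshold are made-up parameters (the actual values are set by compute_mask.sh and tile_WSI.sh).

```python
# Hedged sketch of mask-based tessellation: keep only patches whose
# tissue-mask coverage exceeds a threshold. Mask, patch size, and
# threshold are illustrative.

def tessellate(mask, patch=2, min_tissue=0.5):
    """Return (row, col) origins of patches covering enough tissue.

    mask: 2D list of 0/1 tissue indicators (1 = tissue).
    patch: patch side length, in mask pixels.
    min_tissue: minimum fraction of tissue pixels to keep a patch.
    """
    rows, cols = len(mask), len(mask[0])
    keep = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            tissue = sum(mask[r + i][c + j]
                         for i in range(patch) for j in range(patch))
            if tissue / (patch * patch) >= min_tissue:
                keep.append((r, c))
    return keep

# Toy 4x4 mask: tissue in the upper-left quadrant only.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(tessellate(mask))  # [(0, 0)]
```

The surviving patches are then passed to the UNI2-h encoder to obtain per-patch feature vectors.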

2. Transcriptomic feature harmonization

Gene features associated with bronchial dysplasia are selected using a linear mixed-effects model applied to normalized, batch-corrected bulk RNA-seq data. Related scripts can be found in ./preprocessing/gene/.
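The pipeline fits a linear mixed-effects model per gene on normalized, batch-corrected data; as a dependency-free stand-in for that fit, the sketch below ranks genes by a simple two-group mean difference. The gene names, values, and effect-size threshold are all illustrative, not from the repository.

```python
# Hedged sketch of per-gene association screening. The repository uses
# a linear mixed-effects model; this substitutes a plain two-group mean
# difference purely for illustration.
from statistics import mean

def screen_genes(expr, labels, min_effect=1.0):
    """expr: {gene: [values per sample]}; labels: 0/1 dysplasia per sample."""
    selected = []
    for gene, values in expr.items():
        dys = [v for v, y in zip(values, labels) if y == 1]
        non = [v for v, y in zip(values, labels) if y == 0]
        if abs(mean(dys) - mean(non)) >= min_effect:
            selected.append(gene)
    return selected

expr = {"KRT6A": [5.0, 5.2, 1.0, 1.1],   # up in dysplastic samples
        "ACTB":  [3.0, 3.1, 3.0, 2.9]}   # flat housekeeping gene
labels = [1, 1, 0, 0]
print(screen_genes(expr, labels))  # ['KRT6A']
```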

3. Training single- and dual-modality models (Models a, b, c, d)

A single-modality model is trained and validated on samples with WSI only (Model a) or GE only (Model c). A dual-modality model is trained on samples with both WSI and GE and validated on samples with WSI only (Model b) or GE only (Model d).

  • Implementation details: ./src/model/training1.py.
    • The presence or absence of each modality is controlled by the --mask-gene and --mask-wsi flags.
  • Example: ./scripts/training1.sh.
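The masking idea behind the --mask-gene and --mask-wsi flags can be sketched as below. This is an assumption-laden illustration, not the code in training1.py: here a masked modality is replaced by zeros before fusion, and the fusion itself is plain concatenation rather than the attention-based design used in the paper.

```python
# Hedged sketch of modality masking: zero out a masked modality's
# features so one fusion head can serve WSI-only or GE-only samples.
# Concatenation here is illustrative, not the repository's fusion.

def fuse(wsi_feat, gene_feat, mask_wsi=False, mask_gene=False):
    w = [0.0] * len(wsi_feat) if mask_wsi else wsi_feat
    g = [0.0] * len(gene_feat) if mask_gene else gene_feat
    return w + g  # joint representation

print(fuse([0.5, 0.2], [1.0], mask_gene=True))  # [0.5, 0.2, 0.0]
```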

4. Training the flexible fusion model

The flexible fusion model relaxes the requirement of paired WSI and GE data, allowing training samples to carry WSI only, GE only, or both modalities.

  • Implementation details: ./src/model/training2.py.
  • Example: ./scripts/training2.sh.
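One way to handle such mixed batches (illustrative, and not necessarily how training2.py does it) is to derive per-sample modality masks from the data itself, so unpaired samples need no special batching:

```python
# Hedged sketch: infer per-sample modality availability from the data.
# A sample is a (wsi_feat, gene_feat) pair where a missing modality
# is None; the resulting boolean masks drive the fusion step.

def batch_masks(samples):
    """Return (has_wsi, has_gene) for each (wsi, gene) sample."""
    return [(wsi is not None, gene is not None) for wsi, gene in samples]

samples = [([0.1, 0.2], [0.9]),   # paired
           ([0.3, 0.4], None),    # WSI only
           (None, [0.7])]         # GE only
print(batch_masks(samples))
# [(True, True), (True, False), (False, True)]
```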

Evaluation

  • To evaluate a single- or dual-modality model, run ./src/model/inference1.py. Example: ./scripts/inference1.sh.
  • To evaluate the flexible fusion model, run ./src/model/inference2.py. Example: ./scripts/inference2.sh.

References

[1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M. and Williams, M., 2024. Towards a general-purpose foundation model for computational pathology. Nature medicine, 30(3), pp.850-862.

[2] MahmoodLab. UNI2-h. Hugging Face model repository. https://huggingface.co/MahmoodLab/UNI2-h (CC-BY-NC-ND-4.0), 2025.