Attention-based deep learning for analysis of pathology images and gene expression data in lung squamous premalignant lesions

Introduction

We propose a missing-data-friendly framework that integrates histologic and transcriptomic data to discriminate dysplastic from non-dysplastic lesions and to enable stratification of early lung lesions. The study uses H&E whole slide images (WSIs) of endobronchial biopsies and bulk gene expression (GE) data derived from endobronchial biopsies and brushings. Samples were obtained from patients at high risk for lung cancer and come from six previously published studies and ongoing lung precancer atlas efforts.

Training

1. WSI feature extraction

WSI features are obtained in two stages: binary mask-based tessellation of each slide into patches, followed by patch feature extraction with UNI2-h [1, 2]. Related scripts are in ./preprocessing/WSI/, and example invocations are in ./scripts. Example workflow:

cd scripts
bash compute_mask.sh
bash tile_WSI.sh
bash extract_features.sh
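
The tessellation step can be illustrated with a minimal Python sketch. It is not the repository's implementation: the function name tile_coordinates, the min_tissue_fraction threshold, and the convention that a mask value of 1 marks tissue are all assumptions made for illustration. It shows only the general idea of keeping the tiles of a regular grid that overlap the binary mask sufficiently.

```python
from typing import List, Sequence, Tuple


def tile_coordinates(
    mask: Sequence[Sequence[int]],
    tile_size: int,
    min_tissue_fraction: float = 0.5,  # hypothetical threshold, not from the repo
) -> List[Tuple[int, int]]:
    """Return (row, col) top-left corners of grid tiles covered by the mask.

    mask is a 2D grid of 0/1 values; 1 is assumed to mark tissue.
    Tiles are non-overlapping and partial tiles at the border are skipped.
    """
    n_rows = len(mask)
    n_cols = len(mask[0]) if n_rows else 0
    kept = []
    for r in range(0, n_rows - tile_size + 1, tile_size):
        for c in range(0, n_cols - tile_size + 1, tile_size):
            # Count mask pixels inside the current tile.
            tissue = sum(
                mask[i][j]
                for i in range(r, r + tile_size)
                for j in range(c, c + tile_size)
            )
            if tissue / (tile_size * tile_size) >= min_tissue_fraction:
                kept.append((r, c))
    return kept


if __name__ == "__main__":
    toy_mask = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
    ]
    print(tile_coordinates(toy_mask, tile_size=2))
```

In practice the mask is usually computed at a lower magnification than the one used for patch extraction, so the kept coordinates would need to be rescaled before reading patches from the slide.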

2. Transcriptomic feature harmonization

Gene features associated with bronchial dysplasia are selected using a linear mixed-effects model applied to normalized, batch-corrected bulk RNA-seq data. Related scripts can be found in ./preprocessing/gene/.
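
For orientation, a per-gene linear mixed-effects model of the kind described above can be written as follows. This is a generic textbook form, not the specification used in ./preprocessing/gene/: the choice of a patient-level random intercept, the single dysplasia indicator, and all symbols are assumptions made for illustration.

```latex
% Assumed notation (not from the repository):
%   y_{g,ij} : normalized, batch-corrected expression of gene g
%              in sample j from patient i
%   d_{ij}   : 1 if sample j from patient i is dysplastic, 0 otherwise
%   u_{g,i}  : random intercept for patient i
\begin{equation}
  y_{g,ij} = \beta_{g,0} + \beta_{g,1}\, d_{ij} + u_{g,i} + \varepsilon_{g,ij},
  \qquad
  u_{g,i} \sim \mathcal{N}\!\left(0, \sigma_{u,g}^{2}\right),
  \quad
  \varepsilon_{g,ij} \sim \mathcal{N}\!\left(0, \sigma_{g}^{2}\right)
\end{equation}
```

Under this form, a gene would be considered associated with dysplasia when the fixed effect \beta_{g,1} differs significantly from zero. The actual covariates and selection criterion are defined by the scripts.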

3. Training single- and dual-modality models (Models a, b, c, d)

A single-modality model is trained and validated on samples with WSI only (Model a) or GE only (Model c). A dual-modality model is trained on samples with both WSI and GE and validated on samples with WSI only (Model b) or GE only (Model d).

Implementation details: ./src/model/training1.py.
- The presence or absence of each modality is controlled by the --mask-gene and --mask-wsi flags.
Example: ./scripts/training1.sh.
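
As a rough sketch of how such flags might be wired up, the following Python snippet parses --mask-gene and --mask-wsi and reports which modalities remain active. Only the two flag names come from this README. Treating them as boolean switches, the helper names build_parser and active_modalities, and the rule that at least one modality must stay unmasked are assumptions; the actual behavior is defined in ./src/model/training1.py.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing the two masking flags as on/off switches."""
    parser = argparse.ArgumentParser(description="Modality masking sketch")
    # Assumption: each flag is a boolean switch that hides one modality.
    parser.add_argument("--mask-gene", action="store_true",
                        help="hide gene expression input")
    parser.add_argument("--mask-wsi", action="store_true",
                        help="hide WSI input")
    return parser


def active_modalities(args: argparse.Namespace) -> List[str]:
    """Return the modalities left visible to the model."""
    if args.mask_gene and args.mask_wsi:
        # Assumption: masking both inputs leaves nothing to train on.
        raise ValueError("At least one modality must remain unmasked")
    modalities = []
    if not args.mask_wsi:
        modalities.append("wsi")
    if not args.mask_gene:
        modalities.append("gene")
    return modalities


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(active_modalities(parsed))
```

Under these assumptions, passing --mask-gene alone would correspond to a WSI-only setting and passing --mask-wsi alone to a GE-only setting.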

4. Training the flexible fusion model

The flexible fusion model relaxes the requirement of paired WSI and GE data and allows the training data to include either WSI or GE or both.

Implementation details: ./src/model/training2.py.
Example: ./scripts/training2.sh.
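
One common way to tolerate a missing modality is to compute attention weights only over the modalities that are present. The Python sketch below illustrates that idea with plain lists. It is an assumption-laden illustration, not the architecture in ./src/model/training2.py: the function name fuse, the use of one scalar score per modality, and the softmax-weighted average are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def fuse(
    embeddings: Dict[str, Optional[List[float]]],
    scores: Dict[str, float],
) -> Tuple[List[float], Dict[str, float]]:
    """Softmax-weighted average over the modalities that are present.

    embeddings maps a modality name to its vector, or None if missing.
    scores maps a modality name to an unnormalized attention score.
    """
    present = [name for name, emb in embeddings.items() if emb is not None]
    if not present:
        raise ValueError("At least one modality is required")
    # Softmax restricted to present modalities (max subtracted for stability).
    max_score = max(scores[name] for name in present)
    exps = {name: math.exp(scores[name] - max_score) for name in present}
    total = sum(exps.values())
    weights = {name: exps[name] / total for name in present}
    dim = len(embeddings[present[0]])
    fused = [
        sum(weights[name] * embeddings[name][i] for name in present)
        for i in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # GE is missing, so all attention weight falls on the WSI embedding.
    print(fuse({"wsi": [1.0, 2.0], "gene": None}, {"wsi": 0.3, "gene": 1.2}))
```

Because the weights are renormalized over whatever is available, the same function handles WSI-only, GE-only, and paired samples without a separate code path.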

Evaluation

To evaluate a single- or dual-modality model, run ./src/model/inference1.py. Example: ./scripts/inference1.sh.
To evaluate the flexible fusion model, run ./src/model/inference2.py. Example: ./scripts/inference2.sh.

References

[1] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M. and Williams, M., 2024. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3), pp.850-862.

[2] MahmoodLab. UNI2-h. Hugging Face model repository. https://huggingface.co/MahmoodLab/UNI2-h (CC-BY-NC-ND-4.0), 2025.
