Transformer + GNN model to predict spatial gene expression from H&E histology. Original project authors: Yuansong Zeng, Zhuoyi Wei, Weijiang Yu, Rui Yin, Bingling Li, Zhonghui Tang, Yutong Lu, Yuedong Yang.
This repository adds a robust, universal pipeline to run zero-shot inference on external HEST data and a complete downstream analysis workflow.
Hist2ST/
├── Core Analysis Scripts (R)
│   ├── streamlined_spatial_analysis.R          # Primary spatial visualization
│   ├── spot_based_gene_correlation.R           # Spot-level correlation analysis
│   ├── spatial_gene_expression_comparison.R    # Gene expression comparison
│   ├── generate_qc_plots.R                     # QC plots generation
│   ├── analyze_ncount_differences.R            # nCount analysis
│   └── seurat_compare_true_vs_pred.R           # Optional UMAP comparison
├── Core Pipeline Scripts (Python)
│   ├── predict_hest_universal.py               # Main prediction script
│   ├── analyze_hest_universal.py               # Analysis pipeline
│   ├── align_data_for_analysis.py              # Data alignment
│   ├── utils.py                                # Utility functions
│   └── HIST2ST.py                              # Model definition
├── Shell Wrappers
│   ├── run_prediction.sh                       # Prediction wrapper
│   └── run_analysis.sh                         # Analysis wrapper
├── Directories
│   ├── data/hest_data/                         # Input data
│   ├── model/                                  # Pre-trained weights
│   └── output/                                 # Results output
└── Documentation
    ├── README.md                               # This file
    └── Workflow.png                            # Workflow diagram
- Model: Hist2ST combines CNN/Transformer for global context and GNN for local spatial structure; predicts gene expression (ZINB/NB heads).
- Our contribution: A universal, sample-agnostic inference + analysis pipeline for HEST data with clean outputs and shell wrappers.
- Core focus: spatial gene expression visualization, comparing TRUE vs PREDICTED spatial patterns across tissue sections.
⚠️ CRITICAL LIMITATIONS:
- Species-specific: Only works with human samples (trained on human HER2+ breast cancer)
- Gene set dependency: Requires specific human gene identifiers (18,085 genes, first 785 used)
- Cross-species failure: Cannot process mouse, rat, or other mammalian samples
- Limited generalization: Performance varies significantly across different human datasets
- Upstream usage: Training/tutorial notebooks from the original repo are kept for reference.
# 1) Ensure data + model are present
# data/hest_data/st/{SAMPLE_ID}.h5ad
# data/hest_data/wsis/{SAMPLE_ID}.tif
# model/5-Hist2ST.ckpt
# 2) Make scripts executable (first time only)
chmod +x run_prediction.sh run_analysis.sh
# 3) Run prediction and analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159

data/hest_data/
├── st/{SAMPLE_ID}.h5ad     # counts + spatial
└── wsis/{SAMPLE_ID}.tif    # H&E fallback image
model/
└── 5-Hist2ST.ckpt          # pretrained weights
Optional gene list: data/her_hvg_cut_1000.npy (first 785 used if present).
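A minimal sketch of how the optional gene list might be consumed, assuming the `.npy` file stores a 1-D array of gene symbols and that the model's 785 output genes are simply its first 785 entries, as stated above. The helper name `load_model_genes` is illustrative and is not part of this repository.

```python
import numpy as np

def load_model_genes(path, n_genes=785):
    """Return the first n_genes gene names stored in a .npy file."""
    # allow_pickle=True is only needed if the names were saved as Python objects.
    genes = np.load(path, allow_pickle=True)
    genes = np.asarray(genes).astype(str).ravel()
    if len(genes) < n_genes:
        raise ValueError(f"expected at least {n_genes} genes, found {len(genes)}")
    return list(genes[:n_genes])
```

For example, `load_model_genes("data/her_hvg_cut_1000.npy")` would return the gene names that the prediction columns are assumed to follow.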
output/{SAMPLE_ID}/
├── predictions/
│   ├── {SAMPLE_ID}_pred.h5ad                           # 785-gene predictions
│   ├── {SAMPLE_ID}_expr_true_aligned.csv               # aligned true expression
│   ├── {SAMPLE_ID}_expr_pred_aligned.csv               # aligned predicted expression
│   ├── {SAMPLE_ID}_coords_aligned.csv                  # aligned coordinates
│   └── correlation_results.npy                         # Pearson/Spearman + overlap genes
├── analysis/
│   ├── {SAMPLE_ID}_analyzed.h5ad                       # processed AnnData
│   ├── {SAMPLE_ID}_common_genes_used.csv               # genes used for analysis
│   ├── {SAMPLE_ID}_all_spots_correlations.csv          # spot-level correlations
│   ├── {SAMPLE_ID}_selected_spots_correlations.csv     # selected spots
│   ├── {SAMPLE_ID}_top_genes_spot_correlations.csv     # top genes correlations
│   ├── clustering_results.csv
│   └── marker_genes.csv
├── visualizations/                                     # **Core: Spatial gene expression plots**
│   ├── {SAMPLE_ID}_spatial_*_comparison.png            # **TRUE vs PRED spatial patterns**
│   ├── {SAMPLE_ID}_spatial_true_summary.png            # **TRUE spatial summary**
│   ├── {SAMPLE_ID}_spatial_pred_summary.png            # **PRED spatial summary**
│   ├── {SAMPLE_ID}_spot_scatter_*.png                  # individual spot scatter plots
│   ├── {SAMPLE_ID}_multi_spot_scatter_plots.png        # multi-spot summary
│   ├── {SAMPLE_ID}_spatial_correlation_map.png         # spatial correlation map
│   ├── {SAMPLE_ID}_spot_correlation_distributions.png  # correlation distributions
│   ├── {SAMPLE_ID}_true_umap_shared.png                # (Optional) TRUE UMAP
│   └── {SAMPLE_ID}_pred_umap_shared.png                # (Optional) PRED UMAP
└── logs/                                               # pipeline logs
- `run_prediction.sh SAMPLE_ID`: shell wrapper for inference
- `run_analysis.sh SAMPLE_ID`: shell wrapper for downstream analysis
- `predict_hest_universal.py`: universal prediction (loads .h5ad + .tif, builds KNN graph, runs Hist2ST)
- `analyze_hest_universal.py`: QC, HVG, PCA/UMAP/t-SNE, clustering, DE, spatial plots
- `spatial_gene_expression_comparison.R`: PRIMARY, spatial gene expression visualization
  - Compares TRUE vs PREDICTED spatial patterns for selected genes
  - Generates spatial feature plots for top variable genes
  - No UMAP; focuses purely on spatial visualization
  - Usage: Rscript spatial_gene_expression_comparison.R SAMPLE_ID output NUM_GENES
- `spot_based_gene_correlation.R`: PRIMARY, spot-level gene correlation analysis
  - X-axis: predicted gene expression; Y-axis: true gene expression
  - Each point represents a gene within a specific spot
  - Analyzes correlation patterns across spatial locations
  - Generates spatial correlation maps
  - Usage: Rscript spot_based_gene_correlation.R SAMPLE_ID output NUM_SPOTS
- `seurat_compare_true_vs_pred.R`: Optional, UMAP comparison using identical gene sets
  - Uses exactly the same genes for both datasets
  - Generates UMAP plots and correlation histograms
  - Ensures fair comparison across samples
  - Usage: Rscript seurat_compare_true_vs_pred.R SAMPLE_ID output
Advanced (Python flags):
python predict_hest_universal.py SAMPLE_ID \
  --device auto --data_dir data/hest_data --output_dir output

# Run prediction and basic analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159
# **Primary: Spatial gene expression visualization**
Rscript spatial_gene_expression_comparison.R MEND159 output 10
# **Primary: Spot-based gene correlation analysis**
Rscript spot_based_gene_correlation.R MEND159 output 20

# UMAP comparison with identical gene sets (optional)
Rscript seurat_compare_true_vs_pred.R MEND159 output

# Analyze different number of genes/spots
Rscript spatial_gene_expression_comparison.R MEND159 output 15
Rscript spot_based_gene_correlation.R MEND159 output 50

# INT22 (665 common genes) - Complete workflow
./run_prediction.sh INT22
./run_analysis.sh INT22
Rscript streamlined_spatial_analysis.R INT22 output 6
Rscript generate_qc_plots.R INT22
Rscript analyze_ncount_differences.R INT22
# MEND159 (779 common genes) - Complete workflow
./run_prediction.sh MEND159
./run_analysis.sh MEND159
Rscript streamlined_spatial_analysis.R MEND159 output 6
Rscript generate_qc_plots.R MEND159
Rscript analyze_ncount_differences.R MEND159

- Spatial Gene Expression: TRUE vs PREDICTED patterns across tissue
- Spot-level Analysis: Individual spatial point correlation analysis
- Spatial Correlation Maps: Geographic distribution of prediction quality
- No UMAP Required: Focus on spatial patterns, not dimensionality reduction
- Unified Analysis: All scripts ensure TRUE and PREDICTED data use identical gene sets
- Sample Agnostic: Works with any sample regardless of gene count
- Consistent Comparison: Fair evaluation across different datasets
- Spatial Visualization: Gene expression patterns across tissue (Primary)
- Spot-level Correlation: Individual spatial point analysis (Primary)
- Spatial Correlation Maps: Geographic distribution of prediction quality (Primary)
- UMAP Comparison: Clustering analysis with shared features (Optional)
- Gene-wise correlations: Pearson/Spearman across all genes
- Spot-wise correlations: Individual spatial point performance
- Spatial patterns: Geographic distribution of prediction accuracy
- Distribution analysis: Overall model performance statistics
- Config: `5-7-2-8-4-16-32`, n_genes=785, dropout=0.2
- Weights: loaded with `strict=False` to allow partial compatibility
- Graph: k=6; dynamically switches `pruneTag` (Grid/NA) by coordinate range
- Coordinates: normalized to integer indices (0-63) for embeddings
- Seeds fixed (12000) for reproducibility
- Gene Set Consistency: all analyses use `intersect()` to ensure identical genes
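The coordinate normalization and k-nearest-neighbour graph listed above can be sketched with NumPy alone. This is an illustration under assumptions, not the code in `predict_hest_universal.py`. It assumes min-max scaling onto a 64-bin grid, Euclidean distance, and a symmetrized binary adjacency. Both function names are hypothetical, and the `pruneTag` switching is not reproduced.

```python
import numpy as np

def normalize_coords(coords, n_bins=64):
    """Min-max scale (N, 2) spot coordinates to integer indices in [0, n_bins - 1]."""
    coords = np.asarray(coords, dtype=float)
    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on a degenerate axis
    scaled = (coords - lo) / span * (n_bins - 1)
    return np.rint(scaled).astype(np.int64)

def knn_adjacency(coords, k=6):
    """Binary symmetric (N, N) adjacency linking each spot to its k nearest spots."""
    coords = np.asarray(coords, dtype=float)
    # Dense pairwise distances: simple, but O(N^2) memory.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    k = min(k, len(coords) - 1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=np.float32)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # keep an edge if either endpoint selected it
```

The integer indices would feed the positional embeddings (shape `[N, 2]`), and the adjacency matches the `[N, N]` input described in the model snippet.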
import torch
from HIST2ST import Hist2ST

model = Hist2ST(depth1=2, depth2=8, depth3=4,
                n_genes=785, kernel_size=5, patch_size=7,
                heads=16, channel=32, dropout=0.2,
                zinb=0.25, nb=False, bake=5, lamb=0.5)
# patches: [B, N, 3, H, W]
# coords: [B, N, 2] (long indices 0..63)
# adj: [N, N]
# out: [B, N, n_genes]

- Python >= 3.7, PyTorch >= 1.10, pytorch-lightning >= 1.4, scanpy >= 1.8, scipy, PIL, tqdm
- R >= 4.0, Seurat >= 5.0, ggplot2, dplyr, gridExtra, corrplot
- "Pre-trained model not found": put `5-Hist2ST.ckpt` under `model/`
- "No overlapping genes":
  - Most common cause: non-human sample (mouse, rat, etc.)
  - Check species: verify the sample is human-derived
  - Check gene overlap: run the compatibility check from Quick Start
  - Solution: use only human samples with >200 overlapping genes
- "Species incompatibility":
  - Error: "No overlapping genes between model and {SAMPLE_ID}"
  - Cause: the model was trained on human genes and the sample uses another species' gene identifiers
  - Example: MEND73 (mouse, 32,285 genes) failed with only 18 overlapping genes
  - Solution: use human samples only
- Very low correlations: expected in zero-shot cross-dataset inference; predictions can still be useful
- PIL DecompressionBombWarning: expected for large WSIs and safe to ignore
- R package errors: install the required R packages:
  install.packages(c("Seurat", "ggplot2", "dplyr", "gridExtra", "corrplot"))
- HER2+ breast tumor ST: https://github.com/almaan/her2st
- cSCC 10x Visium (GSE144240)
- Synapse mirror of trained models and data indices (see upstream paper)
Please cite the original authors:
@article{zengys,
title={Spatial Transcriptomics Prediction from Histology jointly through Transformer and Graph Neural Networks},
author={Yuansong Zeng and Zhuoyi Wei and Weijiang Yu and Rui Yin and Bingling Li and Zhonghui Tang and Yutong Lu and Yuedong Yang},
journal={bioRxiv},
year={2021},
publisher={Cold Spring Harbor Laboratory}
}
- ✅ SUPPORTED: Human samples only
- ❌ NOT SUPPORTED: Mouse, rat, non-human primate, or any non-human species
- Reason: Model trained exclusively on human HER2+ breast cancer samples
- Required genes: 18,085 human genes from `her_hvg_cut_1000.npy`
- Prediction output: Fixed 785 genes (first 785 from required gene set)
- Minimum overlap: >200 overlapping genes for meaningful prediction
- Gene naming: Must use human gene identifiers (e.g., `SAMD11`, `NOC2L`)
Before running prediction, verify sample compatibility:
# Check if sample is human and has sufficient gene overlap
python -c "
import scanpy as sc
import numpy as np
ad = sc.read_h5ad('data/hest_data/st/{SAMPLE_ID}.h5ad')
expected = np.load('data/her_hvg_cut_1000.npy')
overlap = set(ad.var_names) & set(expected)
print(f'Sample genes: {len(ad.var_names)}')
print(f'Expected genes: {len(expected)}')
print(f'Overlap: {len(overlap)} genes')
print(f'Compatible: {len(overlap) > 200}')
"

- MEND73: Mouse sample, 32,285 genes, only 18 overlapping → ❌ FAILED
- Other mouse/rat samples: Expected to fail due to species mismatch
- INT22: Human sample → ✅ SUCCESS (665 common genes, 3,382 spots)
- MEND159: Human sample → ✅ SUCCESS (779 common genes, 3,382 spots)
- Human samples: 2/2 (100% success)
- Non-human samples: 0/1 (0% success)
- Overall: 67% (2/3 samples)
- Always check compatibility first using the command above
- Use only human samples for prediction
- Verify gene overlap >200 genes before proceeding
- Have alternative methods ready for non-human samples
- Species-specific models: Train new models on target species
- Gene orthology mapping: Map genes between species
- Custom training: Retrain Hist2ST on target datasets
- Other tools: Use species-agnostic spatial analysis methods
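Of the alternatives above, gene orthology mapping is the easiest to prototype. The following is a minimal sketch, assuming a one-to-one (or many-to-one) ortholog table is already available as a Python dict; in practice it would come from an external ortholog resource. The function name is hypothetical. Whether remapped non-human data yields useful predictions from a human-trained model has not been tested here.

```python
import numpy as np

def map_to_human_symbols(counts, gene_names, ortholog_map):
    """Rename the columns of a (spots x genes) matrix via an ortholog table.

    Genes without an ortholog are dropped; columns that map to the same
    human symbol are summed. Returns (matrix, sorted human gene names).
    """
    counts = np.asarray(counts)
    human = sorted({ortholog_map[g] for g in gene_names if g in ortholog_map})
    col = {h: j for j, h in enumerate(human)}
    out = np.zeros((counts.shape[0], len(human)), dtype=counts.dtype)
    for i, g in enumerate(gene_names):
        h = ortholog_map.get(g)
        if h is not None:
            out[:, col[h]] += counts[:, i]
    return out, human

# Illustrative usage with two mouse/human symbol pairs and one unmapped gene.
example_counts = np.array([[1, 2, 3], [4, 5, 6]])
mapped, names = map_to_human_symbols(
    example_counts,
    ["Erbb2", "Xyz1", "Actb"],           # "Xyz1" is a made-up, unmapped symbol
    {"Erbb2": "ERBB2", "Actb": "ACTB"},
)
```

After remapping, the gene-overlap compatibility check should be rerun on the new gene names before attempting prediction.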
- TRUE: library-size normalization + log1p
- PRED: used as produced by Hist2ST (continuous; may include small negatives)
- Metrics:
  - Gene-wise Pearson (across spots) as primary (paper-consistent)
  - Spot-wise Pearson (across genes) as complementary
- Strict intersection of genes and spots; identical ordering ensured
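The preprocessing and metrics above can be sketched as follows. This is a hedged illustration rather than the repository's evaluation code. The `target_sum=1e4` scaling factor, the dropping of undefined (constant-vector) correlations, and all function names are assumptions. Inputs are expected to be already restricted to the shared spots and genes, in identical order.

```python
import numpy as np

def log_normalize(counts, target_sum=1e4):
    """Library-size normalize each spot to target_sum, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # leave all-zero spots at zero
    return np.log1p(counts / lib * target_sum)

def rowwise_pearson(a, b):
    """Pearson r between matching rows of a and b; NaN where a row is constant."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    with np.errstate(divide="ignore", invalid="ignore"):
        return num / den

def summarize(r):
    """Summary statistics matching the columns of the results table."""
    r = r[~np.isnan(r)]
    return {"Mean": r.mean(), "Median": np.median(r),
            "Min": r.min(), "Max": r.max(), "N": r.size}

def evaluate(true_counts, pred):
    """Return (gene-wise summary, spot-wise summary) for spots x genes inputs."""
    true_norm = log_normalize(true_counts)
    pred = np.asarray(pred, dtype=float)  # PRED is used as produced
    gene_wise = rowwise_pearson(true_norm.T, pred.T)  # one r per gene, across spots
    spot_wise = rowwise_pearson(true_norm, pred)      # one r per spot, across genes
    return summarize(gene_wise), summarize(spot_wise)
```

Gene-wise correlation asks whether the spatial pattern of each gene is recovered, while spot-wise correlation mostly reflects whether the overall expression profile of a spot is recovered, which is why the two can differ substantially for the same sample.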
Sample,Type,Mean,Median,Min,Max,N
A2,Gene-wise (Pearson),0.1500535903892705,0.1458614533150753,-0.1009585743451398,0.5605070053736968,785
A2,Spot-wise (Pearson),0.5211877134030255,0.527795645381485,-0.004018497100146,0.591066113056179,325
INT22,Gene-wise (Pearson),-0.013081627379449,-0.0125343126866175,-0.1120029450309284,0.0938526612949563,737
INT22,Spot-wise (Pearson),-0.0064532180821096,-0.0063908200822004,-0.107613361046767,0.115441141597783,3829
MEND159,Gene-wise (Pearson),-0.0005007817283359,0.0004339587172078,-0.0654807693105191,0.061827421947797,665
MEND159,Spot-wise (Pearson),-0.0089678414943827,-0.0115978723766035,-0.100886699853229,0.152289124472616,3382
TENX13,Gene-wise (Pearson),-0.0392731324528164,-0.0381869765576874,-0.1563039807115384,0.0736455655134945,713
TENX13,Spot-wise (Pearson),-0.0228815994741047,-0.0238930556419763,-0.106568189364158,0.10093894225413,3813
- Prediction: `python predict_hest_universal.py SAMPLE_ID --data_dir data/hest_data/SAMPLE_ID --output_dir output --device auto`
- Align for analysis: `python align_data_for_analysis.py SAMPLE_ID`
- Spot-wise analysis: `Rscript spot_based_gene_correlation.R SAMPLE_ID output 8 hist2st`
- QC plots: `Rscript generate_qc_plots.R SAMPLE_ID`
- Large raw data (h5ad/WSI/ckpt) are excluded by .gitignore.
- Analysis outputs (CSVs, PNGs) are included when size permits.