Hist2ST: Spatial Transcriptomics Prediction from Histology

Transformer + GNN model to predict spatial gene expression from H&E histology. Original project authors: Yuansong Zeng, Zhuoyi Wei, Weijiang Yu, Rui Yin, Bingling Li, Zhonghui Tang, Yutong Lu, Yuedong Yang.

This repository adds a sample-agnostic pipeline for zero-shot inference on external HEST data (human samples only; see the limitations below) and a complete downstream analysis workflow.

📁 Current File Structure

Hist2ST/
├── 📊 Core Analysis Scripts (R)
│   ├── streamlined_spatial_analysis.R      # Primary spatial visualization
│   ├── spot_based_gene_correlation.R       # Spot-level correlation analysis
│   ├── spatial_gene_expression_comparison.R # Gene expression comparison
│   ├── generate_qc_plots.R                 # QC plots generation
│   ├── analyze_ncount_differences.R        # nCount analysis
│   └── seurat_compare_true_vs_pred.R       # Optional UMAP comparison
├── 🐍 Core Pipeline Scripts (Python)
│   ├── predict_hest_universal.py           # Main prediction script
│   ├── analyze_hest_universal.py           # Analysis pipeline
│   ├── align_data_for_analysis.py          # Data alignment
│   ├── utils.py                            # Utility functions
│   └── HIST2ST.py                          # Model definition
├── 🚀 Shell Wrappers
│   ├── run_prediction.sh                   # Prediction wrapper
│   └── run_analysis.sh                     # Analysis wrapper
├── 📁 Directories
│   ├── data/hest_data/                     # Input data
│   ├── model/                              # Pre-trained weights
│   └── output/                             # Results output
└── 📋 Documentation
    ├── README.md                           # This file
    └── Workflow.png                        # Workflow diagram

Overview

Model: Hist2ST combines a CNN and a Transformer for global context with a GNN for local spatial structure, and predicts gene expression through ZINB/NB heads.
Our contribution: A universal, sample-agnostic inference + analysis pipeline for HEST data with clean outputs and shell wrappers.
Core focus: Spatial gene expression visualization - comparing TRUE vs PREDICTED spatial patterns across tissue sections.
⚠️ CRITICAL LIMITATIONS:
- Species-specific: Only works with human samples (trained on human HER2+ breast cancer)
- Gene set dependency: Requires specific human gene identifiers (18,085 genes, first 785 used)
- Cross-species failure: Cannot process mouse, rat, or other non-human samples
- Limited generalization: Performance varies significantly across different human datasets
Upstream usage: Training/tutorial notebooks from the original repo are kept for reference.

Quick Start (External HEST Inference)

# 1) Ensure data + model are present
#    data/hest_data/st/{SAMPLE_ID}.h5ad
#    data/hest_data/wsis/{SAMPLE_ID}.tif
#    model/5-Hist2ST.ckpt

# 2) Make scripts executable (first time only)
chmod +x run_prediction.sh run_analysis.sh

# 3) Run prediction and analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159

Input layout

data/hest_data/
├── st/{SAMPLE_ID}.h5ad          # counts + spatial
└── wsis/{SAMPLE_ID}.tif         # H&E fallback image

model/
└── 5-Hist2ST.ckpt               # pretrained weights

Optional gene list: data/her_hvg_cut_1000.npy (first 785 used if present).
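
Before launching the wrappers, it can help to confirm that the three expected inputs exist. The sketch below is not part of the repository: the helper name `missing_inputs` is hypothetical, and the paths simply mirror the layout shown above.

```python
from pathlib import Path


def missing_inputs(sample_id, root="."):
    """Return the expected input paths that do not exist under `root`.

    Hypothetical helper: the paths mirror the documented input layout.
    """
    root = Path(root)
    expected = [
        root / "data" / "hest_data" / "st" / f"{sample_id}.h5ad",
        root / "data" / "hest_data" / "wsis" / f"{sample_id}.tif",
        root / "model" / "5-Hist2ST.ckpt",
    ]
    return [p for p in expected if not p.is_file()]


if __name__ == "__main__":
    for path in missing_inputs("MEND159"):
        print(f"missing: {path}")
```

An empty list means all three files are in place for that sample.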

Output layout

output/{SAMPLE_ID}/
├── predictions/
│   ├── {SAMPLE_ID}_pred.h5ad         # 785-gene predictions
│   ├── {SAMPLE_ID}_expr_true_aligned.csv  # aligned true expression
│   ├── {SAMPLE_ID}_expr_pred_aligned.csv  # aligned predicted expression
│   ├── {SAMPLE_ID}_coords_aligned.csv     # aligned coordinates
│   └── correlation_results.npy       # Pearson/Spearman + overlap genes
├── analysis/
│   ├── {SAMPLE_ID}_analyzed.h5ad     # processed AnnData
│   ├── {SAMPLE_ID}_common_genes_used.csv  # genes used for analysis
│   ├── {SAMPLE_ID}_all_spots_correlations.csv  # spot-level correlations
│   ├── {SAMPLE_ID}_selected_spots_correlations.csv  # selected spots
│   ├── {SAMPLE_ID}_top_genes_spot_correlations.csv  # top genes correlations
│   ├── clustering_results.csv
│   └── marker_genes.csv
├── visualizations/                   # **Core: Spatial gene expression plots**
│   ├── {SAMPLE_ID}_spatial_*_comparison.png # **TRUE vs PRED spatial patterns**
│   ├── {SAMPLE_ID}_spatial_true_summary.png # **TRUE spatial summary**
│   ├── {SAMPLE_ID}_spatial_pred_summary.png # **PRED spatial summary**
│   ├── {SAMPLE_ID}_spot_scatter_*.png       # individual spot scatter plots
│   ├── {SAMPLE_ID}_multi_spot_scatter_plots.png  # multi-spot summary
│   ├── {SAMPLE_ID}_spatial_correlation_map.png   # spatial correlation map
│   ├── {SAMPLE_ID}_spot_correlation_distributions.png  # correlation distributions
│   ├── {SAMPLE_ID}_true_umap_shared.png     # (Optional) TRUE UMAP
│   └── {SAMPLE_ID}_pred_umap_shared.png     # (Optional) PRED UMAP
└── logs/                             # pipeline logs

Commands and scripts

Core Pipeline

run_prediction.sh SAMPLE_ID — shell wrapper for inference
run_analysis.sh SAMPLE_ID — shell wrapper for downstream analysis
predict_hest_universal.py — universal prediction (loads .h5ad + .tif, builds KNN graph, runs Hist2ST)
analyze_hest_universal.py — QC, HVG, PCA/UMAP/t-SNE, clustering, DE, spatial plots

Core Analysis Scripts (Spatial Visualization)

spatial_gene_expression_comparison.R — 🎯 PRIMARY: Spatial gene expression visualization
- Compares TRUE vs PREDICTED spatial patterns for selected genes
- Generates spatial feature plots for top variable genes
- No UMAP - focuses purely on spatial visualization
- Usage: Rscript spatial_gene_expression_comparison.R SAMPLE_ID output NUM_GENES
spot_based_gene_correlation.R — 🎯 PRIMARY: Spot-level gene correlation analysis
- X-axis: predicted gene expression, Y-axis: true gene expression
- Each point represents a gene within a specific spot
- Analyzes correlation patterns across spatial locations
- Generates spatial correlation maps
- Usage: Rscript spot_based_gene_correlation.R SAMPLE_ID output NUM_SPOTS

Optional Analysis Scripts

seurat_compare_true_vs_pred.R — Optional: UMAP comparison using identical gene sets
- Uses exactly the same genes for both datasets
- Generates UMAP plots, correlation histograms
- Ensures fair comparison across samples
- Usage: Rscript seurat_compare_true_vs_pred.R SAMPLE_ID output

Advanced (Python flags):

python predict_hest_universal.py SAMPLE_ID \
  --device auto --data_dir data/hest_data --output_dir output

Analysis Workflow

1. Core Analysis (Recommended)

# Run prediction and basic analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159

# **Primary: Spatial gene expression visualization**
Rscript spatial_gene_expression_comparison.R MEND159 output 10

# **Primary: Spot-based gene correlation analysis**
Rscript spot_based_gene_correlation.R MEND159 output 20

2. Optional UMAP Analysis

# UMAP comparison with identical gene sets (optional)
Rscript seurat_compare_true_vs_pred.R MEND159 output

3. Custom Analysis

# Analyze different number of genes/spots
Rscript spatial_gene_expression_comparison.R MEND159 output 15
Rscript spot_based_gene_correlation.R MEND159 output 50

4. Successfully Tested Examples

# INT22 (665 common genes) - Complete workflow
./run_prediction.sh INT22
./run_analysis.sh INT22
Rscript streamlined_spatial_analysis.R INT22 output 6
Rscript generate_qc_plots.R INT22
Rscript analyze_ncount_differences.R INT22

# MEND159 (779 common genes) - Complete workflow  
./run_prediction.sh MEND159
./run_analysis.sh MEND159
Rscript streamlined_spatial_analysis.R MEND159 output 6
Rscript generate_qc_plots.R MEND159
Rscript analyze_ncount_differences.R MEND159

Key Features

🎯 Core Focus: Spatial Visualization

Spatial Gene Expression: TRUE vs PREDICTED patterns across tissue
Spot-level Analysis: Individual spatial point correlation analysis
Spatial Correlation Maps: Geographic distribution of prediction quality
No UMAP Required: Focus on spatial patterns, not dimensionality reduction

🔧 Robust Gene Set Management

Unified Analysis: All scripts ensure TRUE and PREDICTED data use identical gene sets
Sample Agnostic: Works with any human sample that shares enough genes with the model (>200 overlapping genes)
Consistent Comparison: Fair evaluation across different datasets

📊 Comprehensive Analysis Types

Spatial Visualization: Gene expression patterns across tissue (Primary)
Spot-level Correlation: Individual spatial point analysis (Primary)
Spatial Correlation Maps: Geographic distribution of prediction quality (Primary)
UMAP Comparison: Clustering analysis with shared features (Optional)

📈 Quality Metrics

Gene-wise correlations: Pearson/Spearman across all genes
Spot-wise correlations: Individual spatial point performance
Spatial patterns: Geographic distribution of prediction accuracy
Distribution analysis: Overall model performance statistics

Technical notes

Config: 5-7-2-8-4-16-32 (kernel_size-patch_size-depth1-depth2-depth3-heads-channel), n_genes=785, dropout=0.2
Weights: loaded with strict=False to allow partial compatibility
Graph: k=6; dynamically switches pruneTag (Grid/NA) by coordinate range
Coordinates: normalized to integer indices (0–63) for embeddings
Seeds fixed (12000) for reproducibility
Gene Set Consistency: All analyses use intersect() to ensure identical genes
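
The graph and coordinate notes above can be illustrated in NumPy. This is a sketch under assumptions, not the code in predict_hest_universal.py: the function names are hypothetical, Euclidean distance is assumed for the neighbour search, min-max scaling is assumed for the 0..63 binning, and the pruneTag switching is not reproduced.

```python
import numpy as np


def knn_adjacency(coords, k=6):
    """Symmetric 0/1 adjacency linking each spot to its k nearest spots.

    Assumes Euclidean distance between spot coordinates.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise squared distances, shape [N, N].
    diff = coords[:, None, :] - coords[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude self-matches
    k = min(k, n - 1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=np.float32)
    rows = np.repeat(np.arange(n), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # symmetrise


def bin_coords(coords, n_bins=64):
    """Min-max scale each axis and round to integer indices 0..n_bins-1."""
    coords = np.asarray(coords, dtype=float)
    lo = coords.min(axis=0)
    span = coords.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on a flat axis
    scaled = (coords - lo) / span * (n_bins - 1)
    return np.rint(scaled).astype(np.int64)
```

The dense distance matrix keeps the example short; for samples with many thousands of spots a tree-based neighbour search would use far less memory.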

Minimal model usage (reference)

import torch
from HIST2ST import Hist2ST

model = Hist2ST(depth1=2, depth2=8, depth3=4,
                n_genes=785, kernel_size=5, patch_size=7,
                heads=16, channel=32, dropout=0.2,
                zinb=0.25, nb=False, bake=5, lamb=0.5)
# patches: [B, N, 3, H, W]
# coords:  [B, N, 2] (long indices 0..63)
# adj:     [N, N]
# out:     [B, N, n_genes]
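
The configuration tag in the technical notes lines up with the constructor arguments in the call above. A small sketch of that mapping follows. `parse_config_tag` is a hypothetical helper, and the field order (kernel_size, patch_size, depth1, depth2, depth3, heads, channel) is inferred by comparing the tag 5-7-2-8-4-16-32 with that call, so treat it as an assumption.

```python
# Assumed field order, inferred from the documented tag and constructor call.
FIELDS = ("kernel_size", "patch_size", "depth1", "depth2", "depth3", "heads", "channel")


def parse_config_tag(tag):
    """Split a tag such as '5-7-2-8-4-16-32' into constructor keyword arguments."""
    values = [int(v) for v in tag.split("-")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


kwargs = parse_config_tag("5-7-2-8-4-16-32")
print(kwargs)
```

The resulting dictionary could then be unpacked into the constructor alongside n_genes and dropout, for example `Hist2ST(n_genes=785, dropout=0.2, **kwargs)`.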

Requirements

Python >= 3.7, PyTorch >= 1.10, pytorch-lightning >= 1.4, scanpy >= 1.8, scipy, Pillow (PIL), tqdm
R >= 4.0, Seurat >= 5.0, ggplot2, dplyr, gridExtra, corrplot

Troubleshooting

"Pre-trained model not found": put 5-Hist2ST.ckpt under model/
"No overlapping genes":
- Most common cause: Non-human sample (mouse, rat, etc.)
- Check species: Verify sample is human-derived
- Check gene overlap: Run the command under Compatibility Check
- Solution: Use only human samples with >200 overlapping genes
"Species incompatibility":
- Error: "No overlapping genes between model and {SAMPLE_ID}"
- Cause: Model trained on human genes, sample has different species genes
- Example: MEND73 (mouse, 32,285 genes) failed with only 18 overlapping genes
- Solution: Use human samples only
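Very low correlations: expected for zero-shot, cross-dataset inference (the HEST samples under Current Results Summary have mean Pearson values near zero); interpret such predictions with caution
PIL DecompressionBombWarning: expected for very large WSIs; safe to ignore when the image comes from a trusted source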
R package errors: Install required R packages: install.packages(c("Seurat", "ggplot2", "dplyr", "gridExtra", "corrplot"))
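
When the "No overlapping genes" error appears, one quick diagnostic is to compare the exact overlap with a case-insensitive one. By convention human gene symbols are upper-case (SAMD11) while mouse symbols capitalise only the first letter (Samd11), so a large gap between the two counts hints at a species mismatch. The helper below is a hypothetical sketch, not part of the pipeline; the 2x threshold is an arbitrary heuristic, and the check only applies when both lists hold gene symbols rather than Ensembl IDs.

```python
def overlap_report(sample_genes, model_genes):
    """Count exact and case-insensitive overlaps between two gene lists."""
    sample = {str(g) for g in sample_genes}
    model = {str(g) for g in model_genes}
    exact = sample & model
    folded = {g.upper() for g in sample} & {g.upper() for g in model}
    return {
        "exact": len(exact),
        "case_insensitive": len(folded),
        # Heuristic: many matches only after case folding suggests
        # non-human symbols (assumed threshold, not from the pipeline).
        "likely_species_mismatch": len(folded) > 2 * max(len(exact), 1),
    }


print(overlap_report(["Samd11", "Noc2l", "ACTB"], ["SAMD11", "NOC2L", "ACTB"]))
```

This is diagnostic only: matching names after case folding does not make a non-human sample valid input for a model trained on human data.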

Datasets (upstream)

HER2+ breast tumor ST: https://github.com/almaan/her2st
cSCC 10x Visium (GSE144240)
Synapse mirror of trained models and data indices (see upstream paper)

Citation (upstream)

Please cite the original authors:

@article{zengys,
  title={Spatial Transcriptomics Prediction from Histology jointly through Transformer and Graph Neural Networks},
  author={Yuansong Zeng and Zhuoyi Wei and Weijiang Yu and Rui Yin and Bingling Li and Zhonghui Tang and Yutong Lu and Yuedong Yang},
  journal={bioRxiv},
  year={2021},
  publisher={Cold Spring Harbor Laboratory}
}

⚠️ CRITICAL MODEL LIMITATIONS & COMPATIBILITY

Species Compatibility

✅ SUPPORTED: Human samples only
❌ NOT SUPPORTED: Mouse, rat, non-human primate, or any other non-human species
Reason: Model trained exclusively on human HER2+ breast cancer samples

Gene Set Requirements

Required genes: 18,085 human genes from her_hvg_cut_1000.npy
Prediction output: Fixed 785 genes (first 785 from required gene set)
Minimum overlap: >200 overlapping genes for meaningful prediction
Gene naming: Must use human gene identifiers (e.g., SAMD11, NOC2L)
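
To compare predictions with measured expression, the model's output columns and the sample's genes have to be restricted to their shared genes in one fixed order. A minimal sketch of that alignment follows; `common_gene_indices` is a hypothetical name, and keeping the first occurrence of a duplicated gene name is an assumption rather than documented pipeline behaviour.

```python
import numpy as np


def common_gene_indices(model_genes, sample_genes, n_model=785):
    """Index arrays selecting shared genes, ordered as in the model gene list.

    Only the first `n_model` model genes are considered.
    """
    model = [str(g) for g in list(model_genes)[:n_model]]
    pos = {}
    for i, g in enumerate(sample_genes):
        pos.setdefault(str(g), i)  # keep the first occurrence of a duplicated name
    keep = [(i, pos[g]) for i, g in enumerate(model) if g in pos]
    model_idx = np.array([m for m, _ in keep], dtype=int)
    sample_idx = np.array([s for _, s in keep], dtype=int)
    return model_idx, sample_idx
```

With these indices, `pred[:, model_idx]` and `true[:, sample_idx]` have the same genes in the same column order, and `len(model_idx)` is the overlap count to compare against the 200-gene minimum.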

Compatibility Check

Before running prediction, verify sample compatibility:

# Check if sample is human and has sufficient gene overlap
python -c "
import scanpy as sc
import numpy as np
ad = sc.read_h5ad('data/hest_data/st/{SAMPLE_ID}.h5ad')
expected = np.load('data/her_hvg_cut_1000.npy', allow_pickle=True)  # gene names may be stored as an object array
overlap = set(ad.var_names) & set(expected)
print(f'Sample genes: {len(ad.var_names)}')
print(f'Expected genes: {len(expected)}')
print(f'Overlap: {len(overlap)} genes')
print(f'Compatible: {len(overlap) > 200}')
"

Known Incompatible Samples

MEND73: Mouse sample, 32,285 genes, only 18 overlapping → ❌ FAILED
Other mouse/rat samples: Expected to fail due to species mismatch

Known Compatible Samples (Successfully Tested)

INT22: Human sample → ✅ SUCCESS (665 common genes, 3,382 spots)
MEND159: Human sample → ✅ SUCCESS (779 common genes, 3,382 spots)

Success Rate Summary

Human samples: 2/2 (100% success)
Non-human samples: 0/1 (0% success)
Overall: 67% (2/3 samples)

Recommended Workflow

Always check compatibility first using the command above
Use only human samples for prediction
Verify gene overlap >200 genes before proceeding
Have alternative methods ready for non-human samples

Alternative Solutions for Non-Human Samples

Species-specific models: Train new models on target species
Gene orthology mapping: Map genes between species
Custom training: Retrain Hist2ST on target datasets
Other tools: Use species-agnostic spatial analysis methods

Evaluation Protocol (Unified)

TRUE: library-size normalization + log1p
PRED: used as produced by Hist2ST (continuous; may include small negatives)
Metrics:
- Gene-wise Pearson (across spots) as primary (paper-consistent)
- Spot-wise Pearson (across genes) as complementary
Strict intersection of genes and spots; identical ordering ensured
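
The protocol above can be sketched in NumPy as follows. The function names are hypothetical and the scale factor target_sum=1e4 is an assumption (the pipeline may use a different one); both matrices are assumed to be already restricted to the same spots and genes in the same order.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Library-size normalise each spot, then apply log1p.

    target_sum=1e4 is an assumption; the pipeline's scale factor may differ.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # leave empty spots at zero
    return np.log1p(counts / lib * target_sum)


def pearson_along(a, b, axis):
    """Pearson r between matching slices of a and b, reduced along `axis`.

    Slices with zero variance yield NaN.
    """
    a = a - a.mean(axis=axis, keepdims=True)
    b = b - b.mean(axis=axis, keepdims=True)
    num = (a * b).sum(axis=axis)
    den = np.sqrt((a ** 2).sum(axis=axis) * (b ** 2).sum(axis=axis))
    with np.errstate(divide="ignore", invalid="ignore"):
        return num / den


def evaluate(true_counts, pred):
    """true_counts, pred: [n_spots, n_genes], aligned on both axes."""
    true = normalize_log1p(true_counts)
    pred = np.asarray(pred, dtype=float)  # predictions are used as produced
    gene_r = pearson_along(true, pred, axis=0)  # one r per gene, across spots
    spot_r = pearson_along(true, pred, axis=1)  # one r per spot, across genes
    return gene_r, spot_r
```

Summary statistics such as the mean and median can then be taken with `np.nanmean(gene_r)` and `np.nanmedian(gene_r)`, which skip genes or spots with zero variance.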

Current Results Summary

Pearson correlations, rounded to four decimals. N is the number of genes (gene-wise) or spots (spot-wise).

Sample    Type        Mean      Median    Min       Max      N
A2        Gene-wise    0.1501    0.1459   -0.1010   0.5605    785
A2        Spot-wise    0.5212    0.5278   -0.0040   0.5911    325
INT22     Gene-wise   -0.0131   -0.0125   -0.1120   0.0939    737
INT22     Spot-wise   -0.0065   -0.0064   -0.1076   0.1154   3829
MEND159   Gene-wise   -0.0005    0.0004   -0.0655   0.0618    665
MEND159   Spot-wise   -0.0090   -0.0116   -0.1009   0.1523   3382
TENX13    Gene-wise   -0.0393   -0.0382   -0.1563   0.0736    713
TENX13    Spot-wise   -0.0229   -0.0239   -0.1066   0.1009   3813

Reproduce on your machine

Prediction: python predict_hest_universal.py SAMPLE_ID --data_dir data/hest_data --output_dir output --device auto
Align for analysis: python align_data_for_analysis.py SAMPLE_ID
Spot-wise analysis: Rscript spot_based_gene_correlation.R SAMPLE_ID output 8 hist2st
QC plots: Rscript generate_qc_plots.R SAMPLE_ID

Notes

Large raw data (h5ad/WSI/ckpt) are excluded by .gitignore.
Analysis outputs (CSVs, PNGs) are included when size permits.
