TrigosTeam/hist2st-inference-pipeline

Hist2ST: Spatial Transcriptomics Prediction from Histology

Transformer + GNN model to predict spatial gene expression from H&E histology. Original project authors: Yuansong Zeng, Zhuoyi Wei, Weijiang Yu, Rui Yin, Bingling Li, Zhonghui Tang, Yutong Lu, Yuedong Yang.

This repository adds a robust, universal pipeline to run zero-shot inference on external HEST data and a complete downstream analysis workflow.

📁 Current File Structure

Hist2ST/
├── 📊 Core Analysis Scripts (R)
│   ├── streamlined_spatial_analysis.R      # Primary spatial visualization
│   ├── spot_based_gene_correlation.R       # Spot-level correlation analysis
│   ├── spatial_gene_expression_comparison.R # Gene expression comparison
│   ├── generate_qc_plots.R                 # QC plots generation
│   ├── analyze_ncount_differences.R        # nCount analysis
│   └── seurat_compare_true_vs_pred.R       # Optional UMAP comparison
├── 🐍 Core Pipeline Scripts (Python)
│   ├── predict_hest_universal.py           # Main prediction script
│   ├── analyze_hest_universal.py           # Analysis pipeline
│   ├── align_data_for_analysis.py          # Data alignment
│   ├── utils.py                            # Utility functions
│   └── HIST2ST.py                          # Model definition
├── 🚀 Shell Wrappers
│   ├── run_prediction.sh                   # Prediction wrapper
│   └── run_analysis.sh                     # Analysis wrapper
├── 📁 Directories
│   ├── data/hest_data/                     # Input data
│   ├── model/                              # Pre-trained weights
│   └── output/                             # Results output
└── 📋 Documentation
    ├── README.md                           # This file
    └── Workflow.png                        # Workflow diagram

Overview

  • Model: Hist2ST combines CNN/Transformer for global context and GNN for local spatial structure; predicts gene expression (ZINB/NB heads).
  • Our contribution: A universal, sample-agnostic inference + analysis pipeline for HEST data with clean outputs and shell wrappers.
  • Core focus: Spatial gene expression visualization - comparing TRUE vs PREDICTED spatial patterns across tissue sections.
  • ⚠️ CRITICAL LIMITATIONS:
    • Species-specific: Only works with human samples (trained on human HER2+ breast cancer)
    • Gene set dependency: Requires specific human gene identifiers (18,085 genes, first 785 used)
    • Cross-species failure: Cannot process mouse, rat, or other mammalian samples
    • Limited generalization: Performance varies significantly across different human datasets
  • Upstream usage: Training/tutorial notebooks from the original repo are kept for reference.

Quick Start (External HEST Inference)

# 1) Ensure data + model are present
#    data/hest_data/st/{SAMPLE_ID}.h5ad
#    data/hest_data/wsis/{SAMPLE_ID}.tif
#    model/5-Hist2ST.ckpt

# 2) Make scripts executable (first time only)
chmod +x run_prediction.sh run_analysis.sh

# 3) Run prediction and analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159

Input layout

data/hest_data/
├── st/{SAMPLE_ID}.h5ad          # counts + spatial
└── wsis/{SAMPLE_ID}.tif         # H&E fallback image

model/
└── 5-Hist2ST.ckpt               # pretrained weights

Optional gene list: data/her_hvg_cut_1000.npy (first 785 used if present).
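The layout above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` and its default directories are assumptions that simply mirror the paths listed in this section.

```python
from pathlib import Path


def missing_inputs(sample_id, data_dir="data/hest_data", model_dir="model"):
    """Return the required input files that are absent for one sample.

    Hypothetical helper: the paths mirror the README layout, nothing more.
    """
    required = [
        Path(data_dir) / "st" / f"{sample_id}.h5ad",   # counts + spatial coordinates
        Path(data_dir) / "wsis" / f"{sample_id}.tif",  # H&E whole-slide image
        Path(model_dir) / "5-Hist2ST.ckpt",            # pretrained weights
    ]
    return [path for path in required if not path.is_file()]
```

For example, `missing_inputs("MEND159")` returns an empty list when all three files are in place, and otherwise lists the paths that still need to be provided.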

Output layout

output/{SAMPLE_ID}/
├── predictions/
│   ├── {SAMPLE_ID}_pred.h5ad         # 785-gene predictions
│   ├── {SAMPLE_ID}_expr_true_aligned.csv  # aligned true expression
│   ├── {SAMPLE_ID}_expr_pred_aligned.csv  # aligned predicted expression
│   ├── {SAMPLE_ID}_coords_aligned.csv     # aligned coordinates
│   └── correlation_results.npy       # Pearson/Spearman + overlap genes
├── analysis/
│   ├── {SAMPLE_ID}_analyzed.h5ad     # processed AnnData
│   ├── {SAMPLE_ID}_common_genes_used.csv  # genes used for analysis
│   ├── {SAMPLE_ID}_all_spots_correlations.csv  # spot-level correlations
│   ├── {SAMPLE_ID}_selected_spots_correlations.csv  # selected spots
│   ├── {SAMPLE_ID}_top_genes_spot_correlations.csv  # top genes correlations
│   ├── clustering_results.csv
│   └── marker_genes.csv
├── visualizations/                   # **Core: Spatial gene expression plots**
│   ├── {SAMPLE_ID}_spatial_*_comparison.png # **TRUE vs PRED spatial patterns**
│   ├── {SAMPLE_ID}_spatial_true_summary.png # **TRUE spatial summary**
│   ├── {SAMPLE_ID}_spatial_pred_summary.png # **PRED spatial summary**
│   ├── {SAMPLE_ID}_spot_scatter_*.png       # individual spot scatter plots
│   ├── {SAMPLE_ID}_multi_spot_scatter_plots.png  # multi-spot summary
│   ├── {SAMPLE_ID}_spatial_correlation_map.png   # spatial correlation map
│   ├── {SAMPLE_ID}_spot_correlation_distributions.png  # correlation distributions
│   ├── {SAMPLE_ID}_true_umap_shared.png     # (Optional) TRUE UMAP
│   └── {SAMPLE_ID}_pred_umap_shared.png     # (Optional) PRED UMAP
└── logs/                             # pipeline logs

Commands and scripts

Core Pipeline

  • run_prediction.sh SAMPLE_ID - shell wrapper for inference
  • run_analysis.sh SAMPLE_ID - shell wrapper for downstream analysis
  • predict_hest_universal.py - universal prediction (loads .h5ad + .tif, builds KNN graph, runs Hist2ST)
  • analyze_hest_universal.py - QC, HVG, PCA/UMAP/t-SNE, clustering, DE, spatial plots

Core Analysis Scripts (Spatial Visualization)

  • spatial_gene_expression_comparison.R - 🎯 PRIMARY: Spatial gene expression visualization

    • Compares TRUE vs PREDICTED spatial patterns for selected genes
    • Generates spatial feature plots for top variable genes
    • No UMAP - focuses purely on spatial visualization
    • Usage: Rscript spatial_gene_expression_comparison.R SAMPLE_ID output NUM_GENES
  • spot_based_gene_correlation.R - 🎯 PRIMARY: Spot-level gene correlation analysis

    • X-axis: predicted gene expression, Y-axis: true gene expression
    • Each point represents a gene within a specific spot
    • Analyzes correlation patterns across spatial locations
    • Generates spatial correlation maps
    • Usage: Rscript spot_based_gene_correlation.R SAMPLE_ID output NUM_SPOTS

Optional Analysis Scripts

  • seurat_compare_true_vs_pred.R - Optional: UMAP comparison using identical gene sets
    • Uses exactly the same genes for both datasets
    • Generates UMAP plots, correlation histograms
    • Ensures fair comparison across samples
    • Usage: Rscript seurat_compare_true_vs_pred.R SAMPLE_ID output

Advanced (Python flags):

python predict_hest_universal.py SAMPLE_ID \
  --device auto --data_dir data/hest_data --output_dir output

Analysis Workflow

1. Core Analysis (Recommended)

# Run prediction and basic analysis
./run_prediction.sh MEND159
./run_analysis.sh MEND159

# **Primary: Spatial gene expression visualization**
Rscript spatial_gene_expression_comparison.R MEND159 output 10

# **Primary: Spot-based gene correlation analysis**
Rscript spot_based_gene_correlation.R MEND159 output 20

2. Optional UMAP Analysis

# UMAP comparison with identical gene sets (optional)
Rscript seurat_compare_true_vs_pred.R MEND159 output

3. Custom Analysis

# Analyze different number of genes/spots
Rscript spatial_gene_expression_comparison.R MEND159 output 15
Rscript spot_based_gene_correlation.R MEND159 output 50

4. Successfully Tested Examples

# INT22 (665 common genes) - Complete workflow
./run_prediction.sh INT22
./run_analysis.sh INT22
Rscript streamlined_spatial_analysis.R INT22 output 6
Rscript generate_qc_plots.R INT22
Rscript analyze_ncount_differences.R INT22

# MEND159 (779 common genes) - Complete workflow  
./run_prediction.sh MEND159
./run_analysis.sh MEND159
Rscript streamlined_spatial_analysis.R MEND159 output 6
Rscript generate_qc_plots.R MEND159
Rscript analyze_ncount_differences.R MEND159
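
When several samples need the same sequence, the tested workflow can be scripted. This is a minimal sketch under stated assumptions: `workflow_commands` and `run_workflow` are hypothetical names, and the commands are copied from the examples above rather than from any driver shipped with the repository.

```python
import subprocess


def workflow_commands(sample_id, output_dir="output", num_genes=6):
    """Build the command sequence from the tested examples for one sample."""
    return [
        ["./run_prediction.sh", sample_id],
        ["./run_analysis.sh", sample_id],
        ["Rscript", "streamlined_spatial_analysis.R", sample_id, output_dir, str(num_genes)],
        ["Rscript", "generate_qc_plots.R", sample_id],
        ["Rscript", "analyze_ncount_differences.R", sample_id],
    ]


def run_workflow(sample_id, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for cmd in workflow_commands(sample_id):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

Calling `run_workflow("INT22")` only prints the five commands. Passing `dry_run=False` runs them in order from the repository root and raises as soon as one step fails.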

Key Features

🎯 Core Focus: Spatial Visualization

  • Spatial Gene Expression: TRUE vs PREDICTED patterns across tissue
  • Spot-level Analysis: Individual spatial point correlation analysis
  • Spatial Correlation Maps: Geographic distribution of prediction quality
  • No UMAP Required: Focus on spatial patterns, not dimensionality reduction

🔧 Robust Gene Set Management

  • Unified Analysis: All scripts ensure TRUE and PREDICTED data use identical gene sets
  • Sample Agnostic: Works with any sample regardless of gene count
  • Consistent Comparison: Fair evaluation across different datasets

📊 Comprehensive Analysis Types

  1. Spatial Visualization: Gene expression patterns across tissue (Primary)
  2. Spot-level Correlation: Individual spatial point analysis (Primary)
  3. Spatial Correlation Maps: Geographic distribution of prediction quality (Primary)
  4. UMAP Comparison: Clustering analysis with shared features (Optional)

📈 Quality Metrics

  • Gene-wise correlations: Pearson/Spearman across all genes
  • Spot-wise correlations: Individual spatial point performance
  • Spatial patterns: Geographic distribution of prediction accuracy
  • Distribution analysis: Overall model performance statistics

Technical notes

  • Config: 5-7-2-8-4-16-32, n_genes=785, dropout=0.2
  • Weights: loaded with strict=False to allow partial compatibility
  • Graph: k=6; dynamically switches pruneTag (Grid/NA) by coordinate range
  • Coordinates: normalized to integer indices (0–63) for embeddings
  • Seeds fixed (12000) for reproducibility
  • Gene Set Consistency: All analyses use intersect() to ensure identical genes
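
The coordinate and graph notes above can be illustrated with plain NumPy. This is a sketch of the general technique, not the code in `predict_hest_universal.py`: the min-max rescaling, the brute-force distance matrix, the symmetrization and the absence of self-loops are all assumptions, and the `pruneTag` switching is not reproduced.

```python
import numpy as np


def normalize_coords(xy, n_bins=64):
    """Rescale raw spot coordinates to integer indices in [0, n_bins - 1]."""
    xy = np.asarray(xy, dtype=float)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid division by zero
    idx = np.floor((xy - lo) / span * (n_bins - 1) + 0.5)
    return idx.astype(np.int64)


def knn_adjacency(xy, k=6):
    """Symmetric 0/1 adjacency linking each spot to its k nearest neighbours."""
    xy = np.asarray(xy, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # [N, N] Euclidean distances
    np.fill_diagonal(dist, np.inf)              # a spot is not its own neighbour
    k = min(k, len(xy) - 1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=np.float32)
    rows = np.repeat(np.arange(len(xy)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # make the graph undirected
```

The dense distance matrix is fine for a few thousand spots. A tree-based neighbour search would be the usual replacement for larger sections.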

Minimal model usage (reference)

import torch
from HIST2ST import Hist2ST

model = Hist2ST(depth1=2, depth2=8, depth3=4,
                n_genes=785, kernel_size=5, patch_size=7,
                heads=16, channel=32, dropout=0.2,
                zinb=0.25, nb=False, bake=5, lamb=0.5)
# patches: [B, N, 3, H, W]
# coords:  [B, N, 2] (long indices 0..63)
# adj:     [N, N]
# out:     [B, N, n_genes]

Requirements

  • Python >= 3.7, PyTorch >= 1.10, pytorch-lightning >= 1.4, scanpy >= 1.8, scipy, Pillow (PIL), tqdm
  • R >= 4.0, Seurat >= 5.0, ggplot2, dplyr, gridExtra, corrplot

Troubleshooting

  • "Pre-trained model not found": put 5-Hist2ST.ckpt under model/
  • "No overlapping genes":
    • Most common cause: Non-human sample (mouse, rat, etc.)
    • Check species: Verify sample is human-derived
    • Check gene overlap: Run the compatibility check from the "Compatibility Check" section
    • Solution: Use only human samples with >200 overlapping genes
  • "Species incompatibility":
    • Error: "No overlapping genes between model and {SAMPLE_ID}"
    • Cause: Model trained on human genes, sample has different species genes
    • Example: MEND73 (mouse, 32,285 genes) failed with only 18 overlapping genes
    • Solution: Use human samples only
  • Very low correlations: expected in zero-shot cross-dataset; predictions can still be useful
  • PIL DecompressionBombWarning: expected for large WSIs; safe to ignore when the image comes from a trusted source
  • R package errors: Install required R packages: install.packages(c("Seurat", "ggplot2", "dplyr", "gridExtra", "corrplot"))

Datasets (upstream)

  • HER2+ breast tumor ST: https://github.com/almaan/her2st
  • cSCC 10x Visium (GSE144240)
  • Synapse mirror of trained models and data indices (see upstream paper)

Citation (upstream)

Please cite the original authors:

@article{zengys,
  title={Spatial Transcriptomics Prediction from Histology jointly through Transformer and Graph Neural Networks},
  author={Yuansong Zeng and Zhuoyi Wei and Weijiang Yu and Rui Yin and Bingling Li and Zhonghui Tang and Yutong Lu and Yuedong Yang},
  journal={bioRxiv},
  year={2021},
  publisher={Cold Spring Harbor Laboratory}
}

⚠️ CRITICAL MODEL LIMITATIONS & COMPATIBILITY

Species Compatibility

  • ✅ SUPPORTED: Human samples only
  • ❌ NOT SUPPORTED: Mouse, rat, non-human primate, or any non-human species
  • Reason: Model trained exclusively on human HER2+ breast cancer samples

Gene Set Requirements

  • Required genes: 18,085 human genes from her_hvg_cut_1000.npy
  • Prediction output: Fixed 785 genes (first 785 from required gene set)
  • Minimum overlap: >200 overlapping genes for meaningful prediction
  • Gene naming: Must use human gene identifiers (e.g., SAMD11, NOC2L)

Compatibility Check

Before running prediction, verify sample compatibility:

# Check if sample is human and has sufficient gene overlap
python -c "
import scanpy as sc
import numpy as np
ad = sc.read_h5ad('data/hest_data/st/{SAMPLE_ID}.h5ad')
expected = np.load('data/her_hvg_cut_1000.npy', allow_pickle=True)
overlap = set(ad.var_names) & set(expected)
print(f'Sample genes: {len(ad.var_names)}')
print(f'Expected genes: {len(expected)}')
print(f'Overlap: {len(overlap)} genes')
print(f'Compatible: {len(overlap) > 200}')
"
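
The same check can be wrapped in a reusable function. This is a minimal sketch, assuming gene symbols are compared as exact, case-sensitive strings. `gene_overlap_report` is a hypothetical helper name, and the 200-gene threshold is the one stated above.

```python
def gene_overlap_report(sample_genes, model_genes, min_overlap=200):
    """Summarise how many model genes are present in a sample.

    Hypothetical helper; matching is exact and case-sensitive.
    """
    sample_set = {str(g) for g in sample_genes}
    model_unique = list(dict.fromkeys(str(g) for g in model_genes))  # keep model order
    shared = [g for g in model_unique if g in sample_set]
    return {
        "n_sample": len(sample_set),
        "n_model": len(model_unique),
        "n_overlap": len(shared),
        "compatible": len(shared) > min_overlap,
        "shared_genes": shared,
    }
```

With the objects from the command above, `gene_overlap_report(ad.var_names, expected)` returns the counts together with the shared genes in model order. Because matching is case-sensitive, a symbol written as `Samd11` does not match `SAMD11`.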

Known Incompatible Samples

  • MEND73: Mouse sample, 32,285 genes, only 18 overlapping → ❌ FAILED
  • Other mouse/rat samples: Expected to fail due to species mismatch

Known Compatible Samples (Successfully Tested)

  • INT22: Human sample → ✅ SUCCESS (665 common genes, 3,382 spots)
  • MEND159: Human sample → ✅ SUCCESS (779 common genes, 3,382 spots)

Success Rate Summary

  • Human samples: 2/2 (100% success)
  • Non-human samples: 0/1 (0% success)
  • Overall: 67% (2/3 samples)

Recommended Workflow

  1. Always check compatibility first using the command above
  2. Use only human samples for prediction
  3. Verify gene overlap >200 genes before proceeding
  4. Have alternative methods ready for non-human samples

Alternative Solutions for Non-Human Samples

  • Species-specific models: Train new models on target species
  • Gene orthology mapping: Map genes between species
  • Custom training: Retrain Hist2ST on target datasets
  • Other tools: Use species-agnostic spatial analysis methods
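
As an illustration of the gene orthology mapping option, the sketch below renames symbols through a one-to-one lookup table and drops anything unmapped. The function name and the three-entry table are illustrative assumptions; a real table would come from an orthology resource. Renaming genes does not change the fact that the weights were trained on human tissue.

```python
def map_to_human_symbols(genes, ortholog_table):
    """Translate gene symbols via a one-to-one ortholog table.

    Returns (kept_indices, human_symbols) so that expression columns can be
    subset in the same order. Unmapped genes and repeated targets are dropped.
    """
    kept, human, seen = [], [], set()
    for i, gene in enumerate(genes):
        target = ortholog_table.get(gene)
        if target is None or target in seen:  # unmapped, or many-to-one duplicate
            continue
        seen.add(target)
        kept.append(i)
        human.append(target)
    return kept, human


# Tiny illustrative table (mouse symbol -> human symbol).
example_table = {"Erbb2": "ERBB2", "Trp53": "TP53", "Esr1": "ESR1"}
kept, human = map_to_human_symbols(["Trp53", "UnmappedGene", "Erbb2"], example_table)
```

Here `kept` holds the column indices to retain and `human` the renamed symbols, which can then be passed to the compatibility check.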

Evaluation Protocol (Unified)

  • TRUE: library-size normalization + log1p
  • PRED: used as produced by Hist2ST (continuous; may include small negatives)
  • Metrics:
    • Gene-wise Pearson (across spots) as primary (paper-consistent)
    • Spot-wise Pearson (across genes) as complementary
  • Strict intersection of genes and spots; identical ordering ensured
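
The protocol above can be sketched in NumPy for matrices of shape [spots, genes] that have already been restricted to the shared genes and spots. This is not the repository's analysis code: the function names are hypothetical, and the `target_sum=1e4` scale factor is an assumption because the normalization target is not stated here.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Library-size normalise each spot to target_sum, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0                          # leave empty spots at zero
    return np.log1p(counts / lib * target_sum)


def pearson_along(true, pred, axis):
    """Pearson r between matching slices of two [spots, genes] matrices.

    axis=0 correlates across spots (one r per gene);
    axis=1 correlates across genes (one r per spot).
    """
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    t = true - true.mean(axis=axis, keepdims=True)
    p = pred - pred.mean(axis=axis, keepdims=True)
    denom = np.sqrt((t ** 2).sum(axis=axis) * (p ** 2).sum(axis=axis))
    with np.errstate(invalid="ignore", divide="ignore"):
        return (t * p).sum(axis=axis) / denom    # NaN where a slice is constant
```

Gene-wise values come from `pearson_along(true, pred, axis=0)` and spot-wise values from `axis=1`. Summary statistics such as the mean and median can then be taken with `np.nanmean` and `np.nanmedian`.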

Current Results Summary

Sample,Type,Mean,Median,Min,Max,N
A2,Gene-wise (Pearson),0.1500535903892705,0.1458614533150753,-0.1009585743451398,0.5605070053736968,785
A2,Spot-wise (Pearson),0.5211877134030255,0.527795645381485,-0.004018497100146,0.591066113056179,325
INT22,Gene-wise (Pearson),-0.013081627379449,-0.0125343126866175,-0.1120029450309284,0.0938526612949563,737
INT22,Spot-wise (Pearson),-0.0064532180821096,-0.0063908200822004,-0.107613361046767,0.115441141597783,3829
MEND159,Gene-wise (Pearson),-0.0005007817283359,0.0004339587172078,-0.0654807693105191,0.061827421947797,665
MEND159,Spot-wise (Pearson),-0.0089678414943827,-0.0115978723766035,-0.100886699853229,0.152289124472616,3382
TENX13,Gene-wise (Pearson),-0.0392731324528164,-0.0381869765576874,-0.1563039807115384,0.0736455655134945,713
TENX13,Spot-wise (Pearson),-0.0228815994741047,-0.0238930556419763,-0.106568189364158,0.10093894225413,3813

Reproduce on your machine

  • Prediction: python predict_hest_universal.py SAMPLE_ID --data_dir data/hest_data --output_dir output --device auto
  • Align for analysis: python align_data_for_analysis.py SAMPLE_ID
  • Spot-wise analysis: Rscript spot_based_gene_correlation.R SAMPLE_ID output 8 hist2st
  • QC plots: Rscript generate_qc_plots.R SAMPLE_ID

Notes

  • Large raw data (h5ad/WSI/ckpt) are excluded by .gitignore.
  • Analysis outputs (CSVs, PNGs) are included when size permits.

About

Universal inference pipeline for Hist2ST model with comprehensive spatial transcriptomics analysis tools
