Development Strategy: Xenium-First Image-Gene GPT for Spatial Transcriptomics

This document captures the development direction for stGPT after reviewing closely related spatial transcriptomics, histopathology, agentic workbench, and foundation-model methods. It is a working engineering guide, not a full literature review. The landscape snapshot is current as of 2026-04-30.

Project Positioning

stGPT should be developed as a Xenium-first morpho-molecular foundation-model backend for spatial transcriptomics. The model should treat each cell or local spatial unit as a multimodal sequence built from:

gene identity tokens and expression-value/bin embeddings
trainable H&E patch embeddings
spatial coordinate embeddings
optional structure or pathology-context tokens from spatho-style evidence
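As a rough illustration of the sequence described above, the sketch below assembles one unit's tokens in plain Python. The `Token` layout, the `build_sequence` helper, and the token ordering are assumptions made for this example; they are not taken from the stGPT codebase, and a real model would map each token to a learned embedding.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    """One element of the multimodal sequence (hypothetical layout)."""
    kind: str                       # "image", "spatial", "context", or "gene"
    key: str                        # patch ID, "centroid", structure label, or gene symbol
    values: Tuple[float, ...] = ()  # coordinates or an expression bin


def build_sequence(
    gene_bins: Sequence[Tuple[str, int]],
    patch_id: str,
    centroid_xy: Tuple[float, float],
    structure_label: Optional[str] = None,
) -> List[Token]:
    """Order tokens as image, spatial, optional context, then genes."""
    tokens = [
        Token("image", patch_id),
        Token("spatial", "centroid", (float(centroid_xy[0]), float(centroid_xy[1]))),
    ]
    if structure_label is not None:
        # Context tokens are optional, matching the list above.
        tokens.append(Token("context", structure_label))
    tokens.extend(Token("gene", gene, (float(b),)) for gene, b in gene_bins)
    return tokens


seq = build_sequence([("EPCAM", 3), ("KRT8", 5)], "patch_0001", (120.5, 88.0), "tumor_nest")
print([t.kind for t in seq])
```

Keeping the modality in an explicit `kind` field is one way to let a single Transformer apply modality-specific embedding tables while still attending across all tokens.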

The core pretraining and fine-tuning objectives, and the outputs they produce, should remain aligned with this positioning:

masked gene reconstruction to learn within-cell molecular structure
neighborhood reconstruction to capture spatial co-localization and tissue context
image-gene contrastive learning to align morphology with expression
compact cell or region embeddings for downstream pathology and spatial biology workflows
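To make the masked gene reconstruction objective concrete, here is a minimal sketch of the masking step only. The `MASK_BIN` sentinel, the 15% default mask fraction, and the `mask_gene_bins` helper are illustrative assumptions, not values or functions confirmed by the project.

```python
import random
from typing import Dict, Tuple

MASK_BIN = -1  # hypothetical sentinel standing in for a learned [MASK] embedding


def mask_gene_bins(
    gene_bins: Dict[str, int],
    mask_fraction: float = 0.15,
    seed: int = 0,
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Hide a seeded random subset of expression bins.

    Returns (inputs, targets): inputs carry MASK_BIN for hidden genes,
    targets hold the true bins the model should reconstruct.
    """
    if not 0.0 < mask_fraction < 1.0:
        raise ValueError("mask_fraction must be in (0, 1)")
    if not gene_bins:
        return {}, {}
    genes = sorted(gene_bins)  # sort so the result does not depend on dict order
    n_mask = min(len(genes), max(1, round(len(genes) * mask_fraction)))
    masked = set(random.Random(seed).sample(genes, n_mask))
    inputs = {g: (MASK_BIN if g in masked else b) for g, b in gene_bins.items()}
    targets = {g: gene_bins[g] for g in masked}
    return inputs, targets


cell = {f"GENE{i}": i % 5 for i in range(10)}
inputs, targets = mask_gene_bins(cell, mask_fraction=0.3, seed=42)
print(len(targets), "genes masked")
```

Seeding the mask makes evaluation runs repeatable; during training one would typically vary the seed per step so the model sees different masks.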

This makes stGPT different from a pure H&E-to-expression regressor. The goal is to build a reusable representation model for Xenium-centered spatial pathology, with expression prediction as one important evaluation task rather than the whole product.

The stronger platform framing is a closed evidence loop:

Model -> Evidence -> Agent -> Human Review -> Better Model

In that loop, stGPT owns representation learning and measured evaluation, while spatho owns the agentic spatial pathology workbench that turns embeddings, QC, retrieval, and benchmarks into auditable biological evidence. This mirrors current production-agent practice: tools need schemas, traces, guardrails, evaluation, and human review rather than only a chat interface. Relevant references for the workbench layer include Google ADK, the MCP tools specification, OpenAI Agents tracing, OpenAI guardrails, and MLflow GenAI evaluation judges.

The shared platform sentence for the two repositories is:

stGPT learns reusable contour/region morpho-molecular representations; spatho plans, validates, and turns them into auditable spatial pathology evidence.

Product Layers

stgpt.foundation: training, model architecture, checkpoint loading, embedding, and model packaging.
stgpt.evidence: QC, deterministic splits, evaluation, ablations, domain-shift checks, and failure analysis.
stgpt.runtime: schema-first tool API for downstream systems, starting with validate_case, embed_regions, summarize_structures, evaluate_checkpoint, package_model, and export_spatho_artifacts.
spatho.workbench: the agentic workflow layer that plans analysis, checks guardrails, calls tools, and organizes evidence.
spatho.reports: reproducible report assembly that distinguishes measured data from model-derived evidence.

Agentic Runtime Contract

stgpt.runtime should be treated as a typed tool surface, not just a collection of CLI scripts. The implementation should remain callable from Python, mirrored by CLI commands, and later wrapped as MCP or agent tools only after the Python contract is stable.

Recommended stable tool surface:

validate_case: validate one case, write QC reports, and return fatal errors, warnings, fingerprints, and split references.
embed_regions: embed contour or spatial-region units and write region-first artifacts.
summarize_structures: aggregate region evidence into structure-level summaries for workbench consumption.
retrieve_regions: find similar regions from a region, marker query, image crop, or embedding.
compare_regions: compare regions with molecular, morphology, spatial, and QC summaries.
score_niche: score configured niche signatures with measured and model-derived fields separated.
explain_region: produce an evidence-ID-linked explanation for one region without making clinical claims.
export_spatho_artifacts: write the spatho-compatible artifact package and remain the first stable handshake with spatho.

Every runtime output should carry evidence IDs, input/config/checkpoint fingerprints, QC verdicts, warnings, and audit metadata. This lets spatho act as a planner, critic, reporter, and human-handoff layer instead of reading raw vectors directly.
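One possible shape for such an output envelope is sketched below. The `RuntimeResult` class, its field names, the verdict strings, and the `fingerprint` helper are hypothetical; the source only states which kinds of metadata each output should carry.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


def fingerprint(payload: Dict[str, Any]) -> str:
    """Short, key-order-independent SHA-256 fingerprint of a JSON-serializable dict."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RuntimeResult:
    """Hypothetical envelope returned by every runtime tool call."""
    tool: str
    evidence_ids: List[str]
    fingerprints: Dict[str, str]   # assumed keys: "input", "config", "checkpoint"
    qc_verdict: str                # assumed values: "pass", "warn", or "fatal"
    warnings: List[str] = field(default_factory=list)
    audit: Dict[str, str] = field(default_factory=dict)

    def is_reportable(self) -> bool:
        # A downstream planner should stop on fatal QC or missing evidence IDs.
        return self.qc_verdict != "fatal" and len(self.evidence_ids) > 0


result = RuntimeResult(
    tool="embed_regions",
    evidence_ids=["ev-0001"],
    fingerprints={"config": fingerprint({"batch_size": 32, "device": "auto"})},
    qc_verdict="warn",
    warnings=["3 regions lacked an H&E patch"],
)
print(result.is_reportable())
```

Serializing with sorted keys before hashing means two configs with the same content produce the same fingerprint regardless of key order, which is what makes the fingerprint usable as a comparability check.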

The downstream loop is:

Plan -> Tool Calls -> QC/Critic -> Evidence Graph -> Report -> Human Review -> Model Improvement

Current Method Landscape

Closest strategic neighbors

scGPT-spatial extends scGPT through continual pretraining for spatial transcriptomics, with spatially aware sampling and neighborhood-oriented objectives. The relevant preprint is scGPT-spatial: Continual Pretraining of Single-Cell Foundation Model for Spatial Transcriptomics. This is the closest gene-token foundation-model reference, but it does not make trainable H&E patch context the central input.
STPath is a generative foundation model for integrating spatial transcriptomics and whole-slide images. It uses a geometry-aware Transformer and masked gene expression prediction over large-scale WSI-ST data. It validates that masked generative objectives are now a strong direction for ST-pathology models.
STORM is a multimodal foundation model of spatial transcriptomics and histology for biological discovery and clinical prediction. Its platform-agnostic framing across Visium, Xenium, Visium HD, and CosMx is an important signal that cross-platform evaluation will matter.
ST-Align is an image-gene alignment foundation model for spatial transcriptomics. It emphasizes spatial context, spot-niche alignment, multi-scale alignment, and few-shot/zero-shot transfer. This supports the need for image-gene alignment in stGPT, while stGPT should keep a tighter Xenium-native reconstruction objective.
OmiCLIP/Loki builds a visual-omics foundation model that bridges H&E histology and spatial transcriptomics, then uses the aligned space for tissue alignment, annotation, retrieval, cell-type decomposition, and ST expression prediction. This is the strongest CLIP-style reference for cross-modal image-expression retrieval.
SEAL performs Spatial Expression-Aligned Learning as parameter-efficient ST-guided fine-tuning of pathology vision encoders. The gated model card is available at MahmoodLab/SEAL. SEAL supports the idea that localized molecular supervision improves pathology encoders, but its primary product is a better vision model rather than a gene-token GPT.

Xenium-specific and task-specific neighbors

H&Enium aligns H&E image embeddings and transcriptomic foundation embeddings at single-cell resolution with contrastive learning. It is an important single-cell Xenium-adjacent reference for alignment, but it is closer to an embedding-alignment framework than a unified generative sequence model.
xMINT is a Multimodal Integration Transformer for Xenium gene imputation. It is directly relevant to Xenium panel expansion and imputation, but its task scope is narrower than the desired stGPT representation-learning agenda.
DiffBulk uses diffusion-based training to improve spatial transcriptomic prediction. It is useful as a generative baseline and a reminder that expression-space generation may compete with Transformer reconstruction objectives.
PAST is a multimodal single-cell foundation model for histopathology and spatial transcriptomics in cancer. It overlaps with stGPT on high-resolution image-to-expression prediction and virtual molecular staining, but is broader in pan-cancer scope.

Related spatial foundation models

SToFM is a multi-scale foundation model for spatial transcriptomics that highlights macro tissue morphology, microenvironment, and gene-scale modeling.
Nicheformer is a foundation model for single-cell and spatial omics that transfers spatial context into cell representations.
Novae is a graph-based foundation model for spatial transcriptomics, trained across large multi-tissue cell collections.
CellNiche represents cellular microenvironments in atlas-scale spatial omics data with contrastive learning.

These methods may not all use H&E as a core modality, but they define the baseline expectations for spatial context, niche representation, graph structure, and cross-tissue generalization.

Benchmarks and datasets to track

HESCAPE is a large-scale benchmark for cross-modal learning between histology and gene expression in spatial transcriptomics. The Hugging Face dataset is Peng-AI/hescape-pyarrow. Its key warning for stGPT is that gene encoders and batch effects can dominate cross-modal learning quality.
HEST-1k provides a large histology-ST dataset with aligned whole-slide images and spatial transcriptomics profiles.
STimage-1K4M provides histopathology image-gene expression pairs for spatial transcriptomics research.

Strategic Judgments

stGPT should not compete as another generic H&E-to-expression predictor. That space already includes strong large-scale models and specialized prediction methods.
The project should differentiate on Xenium-native modeling: cell-level or subcellular-resolution assumptions, panel-aware gene vocabularies, imaging-based ST quirks, and practical adapter quality.
The model should fuse H&E, spatial, structure/context, and gene tokens inside a unified Transformer rather than relying only on late feature concatenation.
The training recipe should preserve both reconstruction and alignment objectives: masked gene reconstruction, neighborhood reconstruction, and image-gene contrastive loss should remain first-class.
spatho-derived H&E patch manifests and structure assignments should become a strategic advantage, because they provide explicit pathology context that many image-gene models leave implicit.
Benchmarking should separate three claims: expression prediction, representation quality, and pathology/spatial biology utility. A single aggregate score will hide important failures.
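The image-gene contrastive loss mentioned above is not specified further in this document. A common choice in CLIP-style work is a symmetric InfoNCE loss, and the dependency-free sketch below assumes that form. The function names and the 0.07 temperature are illustrative assumptions; a real training loop would use a tensor library instead of nested lists.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean over rows i of -log softmax(logits[i])[i], using log-sum-exp for stability."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse - row[i]
    return total / len(logits)


def symmetric_info_nce(
    image_emb: Sequence[Sequence[float]],
    gene_emb: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Row i of image_emb and row i of gene_emb are treated as the positive pair."""
    if len(image_emb) != len(gene_emb) or len(image_emb) == 0:
        raise ValueError("need equally sized, non-empty batches")
    img = [_normalize(v) for v in image_emb]
    gen = [_normalize(v) for v in gene_emb]
    # Cosine similarities scaled by temperature: images as rows, genes as columns.
    logits = [[sum(a * b for a, b in zip(i, g)) / temperature for g in gen] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each image embedding matches its own expression embedding and large when pairs are swapped, which is the behavior the alignment objective relies on.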

Development Priorities

Build robust Xenium ingestion and validation first: coordinates, gene names, panel metadata, cell IDs, optional morphology assets, and reproducible AnnData export.
Make XeniumSlide the canonical real-data contract before model scaling: sparse cell-gene matrix, centroid/boundary geometry, panel metadata, aligned H&E transform metadata, contour polygons, cell-to-contour assignments, and batch/slide/patient/organ/stain/scanner metadata should travel together.
Use contour-segmented H&E crops as the first real image learning units for Atera Xenium, not per-cell crops. This keeps image context auditable and aligned with spatho-style structure evidence.
Treat stgpt validate-data as the first real-data gate: it should write a case manifest, QC reports, and deterministic splits before any paper-facing training run.
Treat stgpt evaluate as the second gate: it should consume the QC split file and write reconstruction, retrieval, and embedding-quality artifacts for every paper-facing checkpoint.
Package successful checkpoints as spatho-compatible model backends with stgpt package-model, stgpt spatho-embed, and the stgpt.runtime API, keeping the external spatho package optional.
Treat stgpt.runtime.export_spatho_artifacts(config, checkpoint, output_dir, batch_size=32, device="auto") as the first stable integration point for spatho.
Make patch and structure manifests reproducible: every embedding should be traceable to image coordinates, patch extraction parameters, registration metadata, and any spatho-derived structure labels.
Prefer region-first exports for workbench use: region_embeddings.parquet, region_cell_membership.parquet, region_molecular_summary.parquet, region_image_manifest.json, region_qc_report.json, and evidence_manifest.json.
Implement baseline comparisons against the closest method families: scGPT-spatial-style gene/spatial objectives, STPath/STORM-style masked expression prediction, ST-Align/OmiCLIP-style contrastive alignment, and xMINT-style Xenium imputation.
Treat objective ablations as required evidence: stgpt train --ablation should be run for each of gene_only, image_only, spatial_only, image_gene, image_gene_spatial, and full against the same data split before making claims.
Add explicit handling for batch effects and domain shift: case-level splits, slide-level splits, organ/tissue holdouts, platform holdouts where possible, and staining variation checks.
Define a panel and vocabulary strategy: fixed panel vocabularies for Xenium smoke tests, configurable gene vocabularies for real studies, and clear behavior for missing or out-of-panel genes.
Keep failure analysis next to metrics: every evaluation should report patch coverage, missing images, registration traceability, panel mismatch, and available batch/slide/domain keys.
Keep the public package practical: CPU smoke tests, small synthetic fixtures, documented real-data adapters, and compact exported embeddings for downstream pathology workflows.
Keep guardrails explicit: spatho should not generate biological conclusions from stGPT evidence when QC reports fatal errors, and model-derived imputation or reconstruction must never be labeled as measured expression.
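The priorities above call for deterministic splits at the case, slide, or patient level. One simple way to get them is to hash the grouping key rather than shuffle rows, as sketched below. The `assign_split` and `split_cases` helpers, the 80/10/10 fractions, and the example case names are assumptions for illustration and do not describe what stgpt validate-data actually writes.

```python
import hashlib
from typing import Dict, Tuple


def assign_split(
    group_id: str,
    seed: int = 0,
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> str:
    """Map a grouping key (for example a patient ID) to train/val/test deterministically."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform-looking value in [0, 1)
    train, val, _ = fractions
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


def split_cases(case_to_group: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Every case inherits the split of its group, so a group never straddles splits."""
    return {case: assign_split(group, seed) for case, group in case_to_group.items()}


cases = {"slide_A1": "patient_01", "slide_A2": "patient_01", "slide_B1": "patient_02"}
splits = split_cases(cases, seed=7)
print(splits)
```

Because assignment depends only on the seed and the group key, adding new cases never moves existing ones between splits. The fractions hold only approximately, so on small cohorts a split can end up empty and should be checked in the QC report.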

Risks and Evaluation

Batch effects may dominate image-gene alignment. HESCAPE-style evaluation should be used to detect whether the model learns biology or site/platform artifacts.
Platform heterogeneity matters. Visium spots, Visium HD bins, Xenium cells, CosMx cells, and MERFISH-style assays differ in resolution, panel design, sparsity, segmentation, and image registration assumptions.
Xenium is not whole-transcriptome by default. Gene reconstruction and imputation claims must distinguish panel reconstruction from whole-transcriptome prediction.
H&E registration quality is a major failure mode. The development workflow should record image alignment assumptions and expose quality-control hooks rather than treating image patches as automatically correct.
Ablation comparisons are only valid when they reuse the same QC-generated split file, seed, panel policy, and patch provenance contract.
Gated and non-commercial datasets/models may limit reproducibility. Public smoke tests and open synthetic fixtures should remain part of the core repo even when larger benchmarks use restricted assets.
Large foundation models may outperform stGPT on generic expression prediction. The project should win by being transparent, Xenium-aware, easy to run, and useful for downstream spatial pathology evidence generation.
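The ablation-validity condition above (same split file, seed, panel policy, and patch provenance) can be checked mechanically before any comparison table is written. The sketch below assumes a hypothetical `RunProvenance` record per run; the field names and example values are illustrative and not part of a documented stGPT schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunProvenance:
    """Hypothetical per-run record of everything that must match across ablations."""
    ablation: str
    split_fingerprint: str
    seed: int
    panel_policy: str
    patch_provenance: str


SHARED_FIELDS = ("split_fingerprint", "seed", "panel_policy", "patch_provenance")


def comparability_issues(runs: List[RunProvenance]) -> List[str]:
    """Return one message per shared field that differs across runs (empty if comparable)."""
    issues = []
    for name in SHARED_FIELDS:
        values = sorted({str(getattr(run, name)) for run in runs})
        if len(values) > 1:
            issues.append(f"{name} differs across runs: {values}")
    return issues


runs = [
    RunProvenance("gene_only", "a1b2c3", 0, "fixed_panel", "manifest_v1"),
    RunProvenance("image_gene", "a1b2c3", 0, "fixed_panel", "manifest_v1"),
    RunProvenance("full", "a1b2c3", 1, "fixed_panel", "manifest_v1"),
]
for issue in comparability_issues(runs):
    print(issue)
```

Returning a list of issues rather than raising on the first mismatch lets a failure-analysis artifact record every violated condition at once.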

Near-Term Development Definition of Done

The next development phase should be considered successful when stGPT can:

load a real Xenium case through the optional adapter
build the Atera Breast and Cervical WTA cases into XeniumSlide stores before training
validate the case with stgpt validate-data and inspect the QC report before training
attach reproducible H&E patch and structure/context metadata
train the image-gene Transformer with reconstruction and contrastive objectives
evaluate the checkpoint with the QC split file instead of ad hoc random splits
write a failure-analysis artifact covering patch, registration, panel, and split/domain risks
package the checkpoint and export spatho-compatible cell, patch, and structure embedding artifacts
export cell or region embeddings with enough metadata for downstream analysis
run smoke tests without private data
report baseline and ablation results that make the strategic claims above testable
