This document captures the development direction for stGPT after reviewing closely related spatial transcriptomics, histopathology, agentic workbench, and foundation-model methods. It is a working engineering guide, not a full literature review. The landscape snapshot is current as of 2026-04-30.
stGPT, which is still in development, should be built as a Xenium-first morpho-molecular foundation-model backend for spatial transcriptomics. The model should treat each cell or local spatial unit as a multimodal sequence built from:
- gene identity tokens and expression-value/bin embeddings
- trainable H&E patch embeddings
- spatial coordinate embeddings
- optional structure or pathology-context tokens from spatho-style evidence
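The token types listed above can be sketched as a small sequence builder. This is a minimal illustration, not the stGPT implementation: `Token`, `build_cell_sequence`, and the field names are hypothetical, and real embeddings would replace the plain identifiers used here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Token:
    """One element of a cell-level multimodal sequence (hypothetical layout)."""
    kind: str                    # "gene", "image", "spatial", or "context"
    ident: object                # gene id, patch id, coordinate tuple, or structure label
    value: Optional[int] = None  # expression bin, used by gene tokens only


def build_cell_sequence(
    gene_ids: Sequence[str],
    expression_bins: Sequence[int],
    patch_id: str,
    xy: Sequence[float],
    structure_label: Optional[str] = None,
) -> list:
    """Assemble gene, image, spatial, and optional context tokens for one cell."""
    if len(gene_ids) != len(expression_bins):
        raise ValueError("gene_ids and expression_bins must have equal length")
    # Gene identity paired with its binned expression value.
    tokens = [Token("gene", g, b) for g, b in zip(gene_ids, expression_bins)]
    # One H&E patch reference and one coordinate token per cell.
    tokens.append(Token("image", patch_id))
    tokens.append(Token("spatial", tuple(xy)))
    # Structure or pathology context is optional, so it is appended only if given.
    if structure_label is not None:
        tokens.append(Token("context", structure_label))
    return tokens
```

Keeping the context token optional means the same builder works for cases with and without spatho-style structure evidence.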
The core pretraining and fine-tuning objectives should remain aligned with this positioning:
- masked gene reconstruction to learn within-cell molecular structure
- neighborhood reconstruction to capture spatial co-localization and tissue context
- image-gene contrastive learning to align morphology with expression
- compact cell or region embeddings for downstream pathology and spatial biology workflows
This makes stGPT different from a pure H&E-to-expression regressor. The goal is to build a reusable representation model for Xenium-centered spatial pathology, with expression prediction as one important evaluation task rather than the whole product.
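Two of the objectives above, masked gene reconstruction and image-gene contrastive learning, can be illustrated in plain Python. This is a sketch under stated assumptions: the `MASK` sentinel, the function names, and the temperature of 0.07 are illustrative choices, and the contrastive term is a standard symmetric InfoNCE loss rather than a confirmed stGPT loss.

```python
import math
import random

MASK = -1  # hypothetical sentinel; assumes real expression bins are non-negative


def mask_expression_bins(bins, mask_fraction, rng):
    """Hide a random subset of bins; return the masked copy and the targets."""
    n_mask = max(1, round(len(bins) * mask_fraction))
    positions = rng.sample(range(len(bins)), n_mask)
    masked = list(bins)
    targets = {}
    for pos in positions:
        targets[pos] = bins[pos]  # value the model should reconstruct
        masked[pos] = MASK
    return masked, targets


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(image_embs, gene_embs, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (row i of each list is a pair)."""
    img = [_normalize(v) for v in image_embs]
    gen = [_normalize(v) for v in gene_embs]
    logits = [
        [sum(a * b for a, b in zip(i_vec, g_vec)) / temperature for g_vec in gen]
        for i_vec in img
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-gene and gene-to-image directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))
```

The contrastive loss is low when each image embedding is closest to its own gene embedding and rises when pairs are mismatched, which is the behavior the alignment objective relies on.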
The stronger platform framing is a closed evidence loop:
Model -> Evidence -> Agent -> Human Review -> Better Model
In that loop, stGPT owns representation learning and measured evaluation, while spatho owns the agentic spatial pathology workbench that turns embeddings, QC, retrieval, and benchmarks into auditable biological evidence. This mirrors current production-agent practice: tools need schemas, traces, guardrails, evaluation, and human review rather than only a chat interface. Relevant references for the workbench layer include Google ADK, the MCP tools specification, OpenAI Agents tracing, OpenAI guardrails, and MLflow GenAI evaluation judges.
The shared platform sentence for the two repositories is:
stGPT learns reusable contour/region morpho-molecular representations; spatho plans, validates, and turns them into auditable spatial pathology evidence.
- stgpt.foundation: training, model architecture, checkpoint loading, embedding, and model packaging.
- stgpt.evidence: QC, deterministic splits, evaluation, ablations, domain-shift checks, and failure analysis.
- stgpt.runtime: schema-first tool API for downstream systems, starting with validate_case, embed_regions, summarize_structures, evaluate_checkpoint, package_model, and export_spatho_artifacts.
- spatho.workbench: the agentic workflow layer that plans analysis, checks guardrails, calls tools, and organizes evidence.
- spatho.reports: reproducible report assembly that distinguishes measured data from model-derived evidence.
stgpt.runtime should be treated as a typed tool surface, not just a collection of CLI scripts. The implementation should remain callable from Python, mirrored by CLI commands, and later wrapped as MCP or agent tools only after the Python contract is stable.
Recommended stable tool surface:
- validate_case: validate one case, write QC reports, and return fatal errors, warnings, fingerprints, and split references.
- embed_regions: embed contour or spatial-region units and write region-first artifacts.
- summarize_structures: aggregate region evidence into structure-level summaries for workbench consumption.
- retrieve_regions: find similar regions from a region, marker query, image crop, or embedding.
- compare_regions: compare regions with molecular, morphology, spatial, and QC summaries.
- score_niche: score configured niche signatures with measured and model-derived fields separated.
- explain_region: produce an evidence-ID-linked explanation for one region without making clinical claims.
- export_spatho_artifacts: write the spatho-compatible artifact package and remain the first stable handshake with spatho.
Every runtime output should carry evidence IDs, input/config/checkpoint fingerprints, QC verdicts, warnings, and audit metadata. This lets spatho act as a planner, critic, reporter, and human-handoff layer instead of reading raw vectors directly.
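The output contract described above can be sketched as a small envelope type. This is an illustrative assumption rather than the actual stgpt.runtime schema: `EvidenceEnvelope`, `fingerprint`, `make_envelope`, and the verdict values `"pass"`, `"warn"`, and `"fatal"` are hypothetical names chosen for the example.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(payload: dict) -> str:
    """Short, order-independent hash of a JSON-serializable payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class EvidenceEnvelope:
    """Metadata that would travel with every runtime output (hypothetical)."""
    evidence_id: str
    tool: str
    fingerprints: dict
    qc_verdict: str                      # assumed values: "pass", "warn", "fatal"
    warnings: list = field(default_factory=list)
    audit: dict = field(default_factory=dict)

    def allows_interpretation(self) -> bool:
        # A fatal QC verdict should block downstream biological conclusions.
        return self.qc_verdict != "fatal"


def make_envelope(tool, inputs, config, checkpoint, qc_verdict, warnings=()):
    """Build an envelope whose evidence ID is derived from its fingerprints."""
    prints = {
        "input": fingerprint(inputs),
        "config": fingerprint(config),
        "checkpoint": fingerprint(checkpoint),
    }
    evidence_id = f"{tool}:{fingerprint(prints)}"
    return EvidenceEnvelope(evidence_id, tool, prints, qc_verdict, list(warnings))
```

Deriving the evidence ID from the three fingerprints means that the same inputs, config, and checkpoint always yield the same ID, which keeps reports traceable without a central registry.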
The downstream loop is:
Plan -> Tool Calls -> QC/Critic -> Evidence Graph -> Report -> Human Review -> Model Improvement
- scGPT-spatial extends scGPT through continual pretraining for spatial transcriptomics, with spatially aware sampling and neighborhood-oriented objectives. The relevant preprint is scGPT-spatial: Continual Pretraining of Single-Cell Foundation Model for Spatial Transcriptomics. This is the closest gene-token foundation-model reference, but it does not make trainable H&E patch context the central input.
- STPath is a generative foundation model for integrating spatial transcriptomics and whole-slide images. It uses a geometry-aware Transformer and masked gene expression prediction over large-scale WSI-ST data. It validates that masked generative objectives are now a strong direction for ST-pathology models.
- STORM is a multimodal foundation model of spatial transcriptomics and histology for biological discovery and clinical prediction. Its platform-agnostic framing across Visium, Xenium, Visium HD, and CosMx is an important signal that cross-platform evaluation will matter.
- ST-Align is an image-gene alignment foundation model for spatial transcriptomics. It emphasizes spatial context, spot-niche alignment, multi-scale alignment, and few-shot/zero-shot transfer. This supports the need for image-gene alignment in stGPT, while stGPT should keep a tighter Xenium-native reconstruction objective.
- OmiCLIP/Loki builds a visual-omics foundation model that bridges H&E histology and spatial transcriptomics, then uses the aligned space for tissue alignment, annotation, retrieval, cell-type decomposition, and ST expression prediction. This is the strongest CLIP-style reference for cross-modal image-expression retrieval.
- SEAL performs Spatial Expression-Aligned Learning as parameter-efficient ST-guided fine-tuning of pathology vision encoders. The gated model card is available at MahmoodLab/SEAL. SEAL supports the idea that localized molecular supervision improves pathology encoders, but its primary product is a better vision model rather than a gene-token GPT.
- H&Enium aligns H&E image embeddings and transcriptomic foundation embeddings at single-cell resolution with contrastive learning. It is an important single-cell Xenium-adjacent reference for alignment, but it is closer to an embedding-alignment framework than a unified generative sequence model.
- xMINT is a Multimodal Integration Transformer for Xenium gene imputation. It is directly relevant to Xenium panel expansion and imputation, but its task scope is narrower than the desired stGPT representation-learning agenda.
- DiffBulk uses diffusion-based training to improve spatial transcriptomic prediction. It is useful as a generative baseline and a reminder that expression-space generation may compete with Transformer reconstruction objectives.
- PAST is a multimodal single-cell foundation model for histopathology and spatial transcriptomics in cancer. It overlaps with stGPT on high-resolution image-to-expression prediction and virtual molecular staining, but is broader in pan-cancer scope.
- SToFM is a multi-scale foundation model for spatial transcriptomics that highlights macro tissue morphology, microenvironment, and gene-scale modeling.
- Nicheformer is a foundation model for single-cell and spatial omics that transfers spatial context into cell representations.
- Novae is a graph-based foundation model for spatial transcriptomics, trained across large multi-tissue cell collections.
- CellNiche represents cellular microenvironments in atlas-scale spatial omics data with contrastive learning.
These methods may not all use H&E as a core modality, but they define the baseline expectations for spatial context, niche representation, graph structure, and cross-tissue generalization.
- HESCAPE is a large-scale benchmark for cross-modal learning between histology and gene expression in spatial transcriptomics. The Hugging Face dataset is Peng-AI/hescape-pyarrow. Its key warning for stGPT is that gene encoders and batch effects can dominate cross-modal learning quality.
- HEST-1k provides a large histology-ST dataset with aligned whole-slide images and spatial transcriptomics profiles.
- STimage-1K4M provides histopathology image-gene expression pairs for spatial transcriptomics research.
- stGPT should not compete as another generic H&E-to-expression predictor. That space already includes strong large-scale models and specialized prediction methods.
- The project should differentiate on Xenium-native modeling: cell-level or subcellular-resolution assumptions, panel-aware gene vocabularies, imaging-based ST quirks, and practical adapter quality.
- The model should fuse H&E, spatial, structure/context, and gene tokens inside a unified Transformer rather than relying only on late feature concatenation.
- The training recipe should preserve both reconstruction and alignment objectives: masked gene reconstruction, neighborhood reconstruction, and image-gene contrastive loss should remain first-class.
- spatho-derived H&E patch manifests and structure assignments should become a strategic advantage, because they provide explicit pathology context that many image-gene models leave implicit.
- Benchmarking should separate three claims: expression prediction, representation quality, and pathology/spatial biology utility. A single aggregate score will hide important failures.
- Build robust Xenium ingestion and validation first: coordinates, gene names, panel metadata, cell IDs, optional morphology assets, and reproducible AnnData export.
- Make XeniumSlide the canonical real-data contract before model scaling: sparse cell-gene matrix, centroid/boundary geometry, panel metadata, aligned H&E transform metadata, contour polygons, cell-to-contour assignments, and batch/slide/patient/organ/stain/scanner metadata should travel together.
- Use contour-segmented H&E crops as the first real image learning units for Atera Xenium, not per-cell crops. This keeps image context auditable and aligned with spatho-style structure evidence.
- Treat stgpt validate-data as the first real-data gate: it should write a case manifest, QC reports, and deterministic splits before any paper-facing training run.
- Treat stgpt evaluate as the second gate: it should consume the QC split file and write reconstruction, retrieval, and embedding-quality artifacts for every paper-facing checkpoint.
- Package successful checkpoints as spatho-compatible model backends with stgpt package-model, stgpt spatho-embed, and the stgpt.runtime API, keeping the external spatho package optional.
- Treat stgpt.runtime.export_spatho_artifacts(config, checkpoint, output_dir, batch_size=32, device="auto") as the first stable integration point for spatho.
- Make patch and structure manifests reproducible: every embedding should be traceable to image coordinates, patch extraction parameters, registration metadata, and any spatho-derived structure labels.
- Prefer region-first exports for workbench use: region_embeddings.parquet, region_cell_membership.parquet, region_molecular_summary.parquet, region_image_manifest.json, region_qc_report.json, and evidence_manifest.json.
- Implement baseline comparisons against the closest method families: scGPT-spatial-style gene/spatial objectives, STPath/STORM-style masked expression prediction, ST-Align/OmiCLIP-style contrastive alignment, and xMINT-style Xenium imputation.
- Treat objective ablations as required evidence: stgpt train --ablation gene_only, image_only, spatial_only, image_gene, image_gene_spatial, and full should be run from the same data split before making claims.
- Add explicit handling for batch effects and domain shift: case-level splits, slide-level splits, organ/tissue holdouts, platform holdouts where possible, and staining variation checks.
- Define a panel and vocabulary strategy: fixed panel vocabularies for Xenium smoke tests, configurable gene vocabularies for real studies, and clear behavior for missing or out-of-panel genes.
- Keep failure analysis next to metrics: every evaluation should report patch coverage, missing images, registration traceability, panel mismatch, and available batch/slide/domain keys.
- Keep the public package practical: CPU smoke tests, small synthetic fixtures, documented real-data adapters, and compact exported embeddings for downstream pathology workflows.
- Keep guardrails explicit: spatho should not generate biological conclusions from stGPT evidence when QC reports fatal errors, and model-derived imputation or reconstruction must never be labeled as measured expression.
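The deterministic case-level and slide-level splits called for in the list above can be sketched with a hash-based assignment. This is one possible approach under stated assumptions, not the behavior of stgpt validate-data: `assign_split`, `split_cells_by_group`, and the default fractions are hypothetical.

```python
import hashlib


def assign_split(group_id: str, seed: int = 0,
                 val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Map a case or slide ID to a split using a seeded hash, not random state."""
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_fraction:
        return "test"
    if u < test_fraction + val_fraction:
        return "val"
    return "train"


def split_cells_by_group(cell_to_group: dict, seed: int = 0) -> dict:
    """Assign each cell the split of its group so no slide straddles two splits."""
    return {cell: assign_split(group, seed) for cell, group in cell_to_group.items()}
```

Because the assignment depends only on the seed and the group ID, adding new cases later does not move existing cases between splits, and every cell from one slide lands in the same split.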
- Batch effects may dominate image-gene alignment. HESCAPE-style evaluation should be used to detect whether the model learns biology or site/platform artifacts.
- Platform heterogeneity matters. Visium spots, Visium HD bins, Xenium cells, CosMx cells, and MERFISH-style assays differ in resolution, panel design, sparsity, segmentation, and image registration assumptions.
- Xenium is not whole-transcriptome by default. Gene reconstruction and imputation claims must distinguish panel reconstruction from whole-transcriptome prediction.
- H&E registration quality is a major failure mode. The development workflow should record image alignment assumptions and expose quality-control hooks rather than treating image patches as automatically correct.
- Ablation comparisons are only valid when they reuse the same QC-generated split file, seed, panel policy, and patch provenance contract.
- Gated and non-commercial datasets/models may limit reproducibility. Public smoke tests and open synthetic fixtures should remain part of the core repo even when larger benchmarks use restricted assets.
- Large foundation models may outperform stGPT on generic expression prediction. The project should win by being transparent, Xenium-aware, easy to run, and useful for downstream spatial pathology evidence generation.
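The ablation-validity condition in the list above (same QC-generated split file, seed, panel policy, and patch provenance contract) can be checked mechanically before results are compared. The sketch below assumes each run writes a small manifest dictionary; the key names and `comparability_issues` are hypothetical.

```python
# Hypothetical manifest keys that must match across ablation runs.
REQUIRED_KEYS = ("split_fingerprint", "seed", "panel_policy", "patch_provenance")


def comparability_issues(run_manifests: dict) -> list:
    """Return reasons why the given ablation runs are not directly comparable."""
    issues = []
    if not run_manifests:
        return issues
    # Use the first run as the reference that all others must match.
    reference_name, reference = next(iter(run_manifests.items()))
    for key in REQUIRED_KEYS:
        for name, manifest in run_manifests.items():
            if key not in manifest:
                issues.append(f"{name}: missing {key}")
            elif manifest[key] != reference.get(key):
                issues.append(f"{name}: {key} differs from {reference_name}")
    return issues
```

An empty list means the runs share the same evaluation contract; any entry is a reason to rerun before reporting an ablation table.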
The next development phase should be considered successful when stGPT can:
- load a real Xenium case through the optional adapter
- build the Atera Breast and Cervical WTA cases into XeniumSlide stores before training
- validate the case with stgpt validate-data and inspect the QC report before training
- attach reproducible H&E patch and structure/context metadata
- train the image-gene Transformer with reconstruction and contrastive objectives
- evaluate the checkpoint with the QC split file instead of ad hoc random splits
- write a failure-analysis artifact covering patch, registration, panel, and split/domain risks
- package the checkpoint and export spatho-compatible cell, patch, and structure embedding artifacts
- export cell or region embeddings with enough metadata for downstream analysis
- run smoke tests without private data
- report baseline and ablation results that make the strategic claims above testable