spacenumbat is a haplotype-aware copy-number alteration (CNA) inference library for single-cell and spatial transcriptomics data.
spacenumbat is a Python port of the R implementation of Numbat, originally developed by Teng Gao and colleagues at the Kharchenko Lab.
Our implementation extends the original algorithm with an optional spatial signal enhancement step for the analysis of spatial transcriptomics data.
spacenumbat is compatible with the scverse ecosystem, and is developed by the λ Lab.
Like the original R implementation, spacenumbat infers tumor subclones and their CNA genotypes by combining:
- Expression-derived CNA signal (gene-level count shifts),
- Allele-specific signal (allelic imbalance),
- Phylogenetic structure (clone relationships inferred from per-cell CNA posteriors).
To denoise segment-level CNA signals across spatial transcriptomics spots, we implemented graph-based diffusion on a spatially constrained affinity graph. It is selected by passing `spatial_method="cpr"` to the main pipeline function, `spacenumbat.run_spacenumbat()`.
Spots were connected using the tissue graph adjacency map, and edge weights were modulated by a kernel applied to the pairwise distance between the CNA probability vectors of connected spots.
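Let $W$ denote the resulting weighted adjacency matrix, and let $d_i = \sum_j W_{ij}$ be the node degrees, with $D = \mathrm{diag}(d_1, \dots, d_n)$. To reduce bias induced by nonuniform sampling density, we applied the anisotropic normalization of Coifman, $W^{(\alpha)} = D^{-\alpha} W D^{-\alpha}$, followed by row normalization to obtain a Markov transition matrix $P = (D^{(\alpha)})^{-1} W^{(\alpha)}$, where $D^{(\alpha)}$ is the diagonal matrix of row sums of $W^{(\alpha)}$. The signal was then propagated as $X^{(t+1)} = (1 - r)\, P X^{(t)} + r X^{(0)}$, with restart weight $r$, initialized at $X^{(0)}$, the per-spot CNA probability matrix.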
This procedure is a random walk with restart. It yields a density-corrected, locality-preserving smoother that borrows information across neighboring spots while retaining fidelity to the original measurements. The regularization aims to enhance spatially coherent clonal CNA patterns and reduce technical noise without enforcing global homogenization.
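The normalization and restart steps described above can be sketched in NumPy. This is a minimal illustration and not spacenumbat's implementation: the function name `cpr_smooth`, the parameters `alpha`, `restart` and `n_steps`, and their default values are assumptions chosen for the example.

```python
import numpy as np


def cpr_smooth(W, X, alpha=1.0, restart=0.5, n_steps=10):
    """Density-corrected diffusion with restart on a weighted graph.

    W : (n, n) symmetric, non-negative affinity matrix between spots.
    X : (n, k) per-spot signal, e.g. CNA probabilities per segment.
    alpha, restart, n_steps : illustrative values, not library defaults.
    """
    W = np.asarray(W, dtype=float)
    X0 = np.asarray(X, dtype=float)

    # Node degrees; isolated nodes get degree 1 to avoid division by zero.
    d = W.sum(axis=1)
    d[d == 0] = 1.0

    # Anisotropic normalization: W_a = D^-alpha W D^-alpha.
    W_a = W / np.outer(d ** alpha, d ** alpha)

    # Row normalization gives a Markov transition matrix P.
    row = W_a.sum(axis=1)
    isolated = np.flatnonzero(row == 0)
    row[isolated] = 1.0
    P = W_a / row[:, None]
    P[isolated, isolated] = 1.0  # isolated spots keep their own value

    # Random walk with restart toward the original measurements.
    Xt = X0.copy()
    for _ in range(n_steps):
        Xt = (1.0 - restart) * (P @ Xt) + restart * X0
    return Xt


# Four spots on a line; the third spot carries a discordant value.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[0.9], [0.8], [0.1], [0.85]])
smoothed = cpr_smooth(W, X)
```

The restart term pulls each spot back toward its own measurement at every step, which is what keeps the smoother from flattening the whole tissue to a single value.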
spacenumbat is currently available for download at its GitHub repo.
Installation in a miniforge environment is suggested.
An env called `space` can be created with:

    conda create -n space python=3.13 pip

The env can be accessed with:

    conda activate space

Once in your env, the library can be installed using pip in two ways:
Clone the library from GitHub with:

    git clone https://github.com/lillux/spacenumbat.git
    cd spacenumbat
    pip install -e .

Install spacenumbat directly from GitHub:

    pip install git+https://github.com/lillux/spacenumbat.git#egg=spacenumbat

To run the preprocessing step, which consists of SNP pileup and allele phasing, the following tools are required:

- samtools
- cellsnp-lite
- Eagle2
samtools and cellsnp-lite can be installed with conda in your active env:

    conda install samtools cellsnp-lite -c conda-forge

Eagle2 can be obtained from the Eagle2 website, where the `Eagle_v2.4.1.tar.gz` file can be downloaded.
It contains the `eagle` executable and the tables required by spacenumbat preprocessing.
As of April 2026 some dependencies are outdated on conda, but can be installed through pip, specifically:

    pip install spatialdata spatialdata_io spatialdata_plot squidpy

To perform SNP pileup and allele phasing, two reference panels are required:
1000G SNP VCF (pileup target loci for cellsnp-lite):

    # hg38
    wget https://sourceforge.net/projects/cellsnp/files/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz

1000G Reference Panel (used by Eagle2 for phasing):

    # hg38
    wget http://pklab.med.harvard.edu/teng/data/1000G_hg38.zip

The script `spacenumbat/preprocessing/pileup_n_phase.py` is used to perform allele data preprocessing.
`pileup_n_phase.py` has the following arguments:

- `--label`: Label for the current run. One per run.
- `--samples`: Sample name(s). Used to create per-sample pileup directories and to name final output files.
- `--bams`: Path(s) to input BAM file(s). Always required. The interpretation depends on the selected mode: one BAM per sample in default and bulk modes, or a BAM list in `--smartseq` mode.
- `--barcodes`: Path(s) to barcode file(s). Required in default single-cell mode and for spatial transcriptomics data. Passed differently in `--smartseq` mode. Ignored in `--bulk` mode.
- `--gmap`: Path to the genetic map file. Used both by Eagle2 during phasing and later by Python to interpolate centiMorgan (cM) positions for SNPs. It is provided with the Eagle2 download described above, at `Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz`.
- `--eagle`: Path to the Eagle2 executable. The default assumes `eagle` is available in the shell PATH. If it is not, the correct path to the `eagle` executable should be given.
- `--snpvcf`: Path to the candidate 1000G SNP VCF used by cellsnp-lite as the pileup target loci.
- `--paneldir`: Directory containing Eagle2 reference panel files, expected as `chr1.genotypes.bcf` through `chr22.genotypes.bcf`. This is the directory into which the 1000G Reference Panel downloaded above was decompressed.
- `--outdir`: Output directory where the script writes pileup results, phasing files, logs, and final allele-count tables.
- `--ncores`: Number of threads to use for both cellsnp-lite and Eagle2.
Example command to run the script in single-cell mode (this also works for spatial transcriptomics and scATAC data):

    python spacenumbat/preprocessing/pileup_n_phase.py \
        --label sample1 \
        --samples sample1 \
        --bams sample1/outs/possorted_genome_bam.bam \
        --barcodes sample1/outs/filtered_feature_bc_matrix/barcodes.tsv \
        --gmap Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
        --eagle Eagle_v2.4.1/eagle \
        --snpvcf genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz \
        --paneldir 1000G_hg38 \
        --outdir path/to/out \
        --ncores 16

After a successful preprocessing run, the directory given by `--outdir` contains several directories and files, including `{--samples}_allele_counts.tsv.gz`, which is required by the spacenumbat pipeline.
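Before launching the preprocessing script, it can help to check that the external tools and the reference panel files are in place. The sketch below is a hypothetical pre-flight helper, not part of spacenumbat: the function names are made up for illustration, and the expected panel file names are taken from the `--paneldir` description.

```python
import shutil
from pathlib import Path

# Executable names assumed to be on PATH; pass a different tuple if
# eagle is referenced by an explicit path instead.
REQUIRED_TOOLS = ("samtools", "cellsnp-lite", "eagle")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_panel_files(paneldir):
    """Return expected chr1..chr22 panel files absent from `paneldir`."""
    paneldir = Path(paneldir)
    expected = [f"chr{i}.genotypes.bcf" for i in range(1, 23)]
    return [name for name in expected if not (paneldir / name).is_file()]


def allele_counts_path(outdir, sample):
    """Path of the allele-count table written for `sample` in `outdir`."""
    return Path(outdir) / f"{sample}_allele_counts.tsv.gz"
```

For example, `missing_panel_files("1000G_hg38")` returns an empty list when the panel archive was decompressed completely, and `allele_counts_path("path/to/out", "sample1")` gives the file to pass on to the main pipeline.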
The main entry point is `spacenumbat.run_spacenumbat(...)`, implemented in `spacenumbat/main.py`.
The following code is an example of running the spacenumbat pipeline after preprocessing:

    import pandas as pd
    import spacenumbat
    import spatialdata_io

    # Input paths and sample identifier
    spaceranger_10x_outs_path = "sample1/outs"
    sample_id = "sample1"
    df_allele_path = "sample1_allele_counts.tsv.gz"  # output file of the preprocessing step

    # Load the Visium dataset and extract the expression table (AnnData)
    counts_mat_space = spatialdata_io.visium(
        spaceranger_10x_outs_path,
        dataset_id=sample_id,
        var_names_make_unique=False,
    )
    counts_mat = counts_mat_space.tables["table"].copy()

    # Packaged reference expression profile and per-cell allele counts
    lambdas_ref = spacenumbat.data.ref_hca.copy()
    df_allele = pd.read_table(df_allele_path, sep="\t")

    current_out_path = "path/to/sample1_out"
    ncores = 16

    sn_out = spacenumbat.run_spacenumbat(
        count_mat=counts_mat.copy(),
        lambdas_ref=lambdas_ref.copy(),
        df_allele=df_allele.copy(),
        genome="hg38",
        ncores=ncores,
        call_clonal_loh=True,
        filter_hla_hg38=True,
        out_dir=current_out_path,
        max_entropy=0.8,
        ncores_nni=ncores,
        spatial=True,
    )

Inputs:

- `count_mat` (`anndata.AnnData`): expression count matrix (cells × genes in `AnnData` convention).
- `lambdas_ref` (DataFrame): reference normalized expression profile(s). A reference profile is bundled with the library and can be found at `spacenumbat.data.ref_hca`. It is recommended to use reference profiles from euploid samples obtained with the same sequencing technology as the samples to be analyzed.
- `df_allele` (DataFrame): per-cell allele counts from the allele preprocessing workflow.

Genome annotation:

- `gtf=None` and `genome` in `{"hg38", "hg19", "mm10"}`: packaged annotation tables are used.
- If a custom `gtf` is provided, it is validated and used directly.

Key parameters:

- `min_LLR`: confidence threshold for CNA retention (higher = stricter).
- `min_overlap`: agreement requirement when deriving consensus segments.
- `max_entropy`: filters uncertain single-cell CNA calls before phylogeny. Defaults to 0.5. It is recommended to increase it (e.g. to 0.8) when analyzing low-resolution spatial transcriptomics samples, where each large spot captures signal from multiple cells (e.g. 10X Visium).
- `min_genes`: minimum genes per segment for stable calls.
- `gamma`, `t`, `nu`: model parameters controlling allele dispersion, transition rate, and phase switching behavior.
- `multi_allelic`, `p_multi`: enables and thresholds multi-allelic CNA detection.
- `min_cells`: drops very small groups to avoid unstable HMM and phylogeny reconstruction steps.
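To give an intuition for what an entropy threshold such as `max_entropy` does, the sketch below scores each segment by the binary entropy (in bits) of its per-cell CNA posterior, averaged over cells, and keeps segments below the threshold. This is an assumption about how such a filter can work, modeled on a common approach. The exact computation inside spacenumbat is not shown in this document, and the function names here are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Binary entropy in bits; 0 for certain calls, 1 for p = 0.5."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))


def confident_segments(p_cnv, max_entropy=0.5):
    """Flag segments whose average per-cell entropy is below the threshold.

    p_cnv : (n_cells, n_segments) posterior probability that each cell
            (or spot) carries the CNA on each segment.
    """
    avg_entropy = binary_entropy(p_cnv).mean(axis=0)
    return avg_entropy < max_entropy


# Two spots, three segments: confident, moderately mixed, uninformative.
p_cnv = np.array([[0.99, 0.85, 0.5],
                  [0.01, 0.85, 0.5]])
strict = confident_segments(p_cnv, max_entropy=0.5)
relaxed = confident_segments(p_cnv, max_entropy=0.8)
```

In this toy example the middle segment, where every spot sits at 0.85, fails the stricter threshold but passes the relaxed one. This mirrors the situation of multi-cell spots, whose posteriors are pulled away from 0 and 1 by mixed signal.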
Set `spatial=True` to integrate the spatial graph connectivity structure into the posterior smoothing. The key options are the distance-to-weight kernel and the smoothing method.

Distance-to-weight kernels transform a dissimilarity matrix (for example, a distance matrix) into an affinity matrix.

| kind | Behavior | Use when |
|---|---|---|
| `"gaussian"` | Very local, fast decay | Sharp local structure, boundary preservation |
| `"exp"` | Broader than Gaussian | Slightly smoother local borrowing |
| `"invdist"` | Strong nearest-neighbor emphasis | Nearest neighbors should dominate |
| `"cauchy"` | Robust, moderate tail | Noisy or heterogeneous distances |
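The qualitative differences between the kernels can be seen with textbook forms of each function. The formulas, the `scale` and `eps` parameters, and the function name `distance_to_weight` below are assumptions for illustration. The parametrization used inside spacenumbat may differ.

```python
import numpy as np


def distance_to_weight(dist, kind="gaussian", scale=1.0, eps=1e-6):
    """Map distances to affinities using common kernel forms (assumed)."""
    d = np.asarray(dist, dtype=float)
    if kind == "gaussian":
        # Fast decay: weights vanish quickly beyond `scale`.
        return np.exp(-(d ** 2) / (2.0 * scale ** 2))
    if kind == "exp":
        # Exponential decay: broader than the Gaussian at large distance.
        return np.exp(-d / scale)
    if kind == "invdist":
        # Inverse distance: very large weight for the closest neighbors.
        return 1.0 / (d + eps)
    if kind == "cauchy":
        # Heavy tail: distant or noisy pairs keep a moderate weight.
        return 1.0 / (1.0 + (d / scale) ** 2)
    raise ValueError(f"unknown kernel kind: {kind!r}")


distances = np.array([0.5, 1.0, 3.0])
weights = {k: distance_to_weight(distances, kind=k)
           for k in ("gaussian", "exp", "invdist", "cauchy")}
```

With these forms and `scale=1.0`, the weight at distance 3 is smallest for the Gaussian, larger for the exponential, and largest for the Cauchy kernel, which matches the ordering of tails described in the table.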
The `spatial_method` argument selects the method used to perform spatial smoothing of the CNA probability graph.

| method | Behavior | Use when |
|---|---|---|
| `"degree"` | One-hop local average | Mild local smoothing is enough |
| `"diffuse"` | Multi-step diffusion with restart | Spatially coherent signal needs denoising |
| `"cpr"` | Geometry-aware diffusion, less density bias | Graph density is uneven |
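The contrast between a one-hop average and a multi-step diffusion can be illustrated on a toy graph. The sketch below is an assumption-laden illustration, not the library's code: the function names and the `restart` and `n_steps` defaults are hypothetical, and whether spacenumbat includes a spot's own value in the one-hop average is not stated here.

```python
import numpy as np


def row_normalize(W):
    """Divide each row by its sum (rows of zeros are left as zeros)."""
    W = np.asarray(W, dtype=float)
    row = W.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0
    return W / row


def one_hop_average(W, X):
    """Replace each spot's signal by the weighted mean of its neighbors."""
    return row_normalize(W) @ np.asarray(X, dtype=float)


def diffuse_with_restart(W, X, restart=0.5, n_steps=10):
    """Repeat the neighbor average, mixing the original signal back in."""
    P = row_normalize(W)
    X0 = np.asarray(X, dtype=float)
    Xt = X0.copy()
    for _ in range(n_steps):
        Xt = (1.0 - restart) * (P @ Xt) + restart * X0
    return Xt


# Three spots on a line with signal 1, 0, 1.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [1.0]])
local = one_hop_average(W, X)
diffused = diffuse_with_restart(W, X)
```

In this toy case the one-hop average simply swaps each value for its neighbors' mean, while the diffusion settles on intermediate values because the restart term keeps part of the original measurement.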
During execution, `run_spacenumbat` writes intermediate and final files such as:

- `sc_refs.tsv`: per-cell (or per-spot) reference assignment: for each barcode, the reference profile column from `lambdas_ref` selected as best matching by correlation.
- `bulk_subtrees_*.tsv`, `bulk_subtrees_retest_*.tsv`: iteration-level pseudobulk profiles for current subtrees (cell groups).
  - `bulk_subtrees_{i}.tsv`: output after HMM-based group analysis.
  - `bulk_subtrees_retest_{i}.tsv`: same bulks after re-annotation/retest against consensus segments; low-support calls are reset to neutral based on `min_LLR`.
- `bulk_clones_*.tsv`, `bulk_clones_final.tsv`: iteration-level and final pseudobulk profiles for inferred clones.
  - `bulk_clones_{i}.tsv`: clone bulks after HMM + retest in iteration i.
  - `bulk_clones_final.tsv`: final rerun on the end-of-workflow clone definitions (final clone pseudobulk CNA profiles).
- `segs_consensus_*.tsv`: iteration-level consensus CNA segment table built across groups/samples: merged CNV intervals, overlap-resolved consensus calls, optional retest intervals, and neutral segments filled in; includes segment-level CNA states and prior.
- `exp_post_*.tsv`, `allele_post_*.tsv`, `joint_post_*.tsv`: per-cell, per-segment, per-state posterior tables:
  - `exp_post`: expression-only evidence + segment priors -> posterior CNV probabilities.
  - `allele_post`: allele-count evidence + segment priors -> posterior CNV probabilities.
  - `joint_post`: merged expression + allele evidence (optionally spatially smoothed) with recomputed joint posterior/state calls.
- `clone_post_*.tsv`: per-spot clone assignment labels and probabilities, and tumor/normal labels and probabilities.
- `geno_*.tsv`: per-spot CNA probability matrix.
- Optional plots (`*.jpg`, `*.png`) when `plot_results=True`.
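Because most outputs are written once per iteration, a small helper can pick the file from the last iteration for downstream analysis. The sketch below assumes that the `*` in the file names is the integer iteration index, as in `bulk_clones_{i}.tsv`. The helper name `latest_iteration_file` is hypothetical and not part of spacenumbat.

```python
import re
from pathlib import Path


def latest_iteration_file(out_dir, prefix):
    """Return the `<prefix>_<i>.tsv` file with the largest integer i.

    Returns None when no file matches. Names such as
    `bulk_clones_final.tsv` are ignored because they carry no index.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.tsv$")
    best = None
    for path in Path(out_dir).iterdir():
        match = pattern.match(path.name)
        if match is None:
            continue
        iteration = int(match.group(1))
        # Compare numerically so that iteration 10 sorts after iteration 2.
        if best is None or iteration > best[0]:
            best = (iteration, path)
    return None if best is None else best[1]
```

The returned path can then be read as a tab-separated table, for instance with `pandas.read_table`, to inspect the joint posteriors or clone assignments of the final iteration.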
At a high level, spacenumbat predicts CNAs by iteratively:
- Validating and harmonizing inputs across expression, allele counts, and genome annotation.
- Building initial cell groupings from smoothed expression profiles.
- Calling group-level CNAs with HMM-based segmentation.
- Deriving consensus segments and retesting them.
- Computing per-cell posterior probabilities from expression and allele evidence.
- Combining evidence into joint CNA posteriors (optionally spatially smoothed).
- Inferring clone phylogeny, reassigning cells, and refining clone/subtree definitions.
- Repeating for `max_iter` iterations, then writing final clone-level profiles and outputs.
spacenumbat is developed by the λ Lab.
This project is an independent Python implementation of the ideas described in the Numbat publications and software ecosystem originally developed by Teng Gao and colleagues at the Kharchenko Lab.
