EvoFATE is a computational toolkit for analyzing high-throughput long-read single-cell RNA sequencing (LR-scRNAseq) data to jointly reconstruct Evolutionary (point mutation) and Fate (gene expression) trajectories in individual cells.
# Install from source
pip install -e .
# Or install with development dependencies
pip install -e ".[dev]"

EvoFATE includes three major steps:
- Genetic Graph Constructor (evofate.genetic)
  - Builds genotype graphs from single-cell mutation data using Node2Vec
  - Captures relationships among cells based on shared mutations
- Evolutionary Lineage Tracer (evofate.utils)
  - Calculates genetic timing
  - Infers clone lineage layouts
  - Projects cells in an evolution-informed embedding space
- EvoFATE Integrator (evofate.integration)
  - Integrates transcriptomic profiles with genetic structure using BGRL with a GAT backbone
  - Performs co-projection of modalities
  - Computes EvoFATE time for ordered trajectory reconstruction
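To make "relationships among cells based on shared mutations" concrete, here is a minimal sketch of one way a cell-cell graph could be derived from a mutation matrix. It is an illustration only: `shared_mutation_graph` and `min_shared` are hypothetical names, and this is not necessarily how `evofate.genetic` builds its genotype graph before applying Node2Vec. It assumes the encoding described below (1 = mutant, -1 = wildtype, 0 = missing).

```python
import numpy as np

def shared_mutation_graph(X, min_shared=1):
    """Weighted cell-cell adjacency from counts of shared mutant sites.

    Hypothetical helper for illustration; not part of the EvoFATE API.
    """
    # Only confident mutant calls (1) contribute; wildtype and missing do not.
    mutant = (np.asarray(X) == 1).astype(np.int64)
    # shared[i, j] = number of sites that are mutant in both cell i and cell j.
    shared = mutant @ mutant.T
    np.fill_diagonal(shared, 0)        # remove self-loops
    shared[shared < min_shared] = 0    # drop edges below the threshold
    return shared

# Three cells, four mutation sites.
X = np.array([
    [ 1,  1, -1,  0],
    [ 1,  1,  1, -1],
    [-1,  0,  1,  1],
])
adj = shared_mutation_graph(X)
print(adj)
```

Here cells 0 and 1 share two mutant sites, cells 1 and 2 share one, and cells 0 and 2 share none, so only the first two pairs are connected.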
EvoFATE requires two main input datasets stored as AnnData objects:
The mutation profile matrix should be stored in .X of the AnnData object.
Format:
- Rows: Individual cells
- Columns: Mutation sites/positions
- Data type: numpy.ndarray or scipy.sparse.spmatrix
- Shape: (n_cells, n_mutations)
Encoding:
- 1: Mutant (MT) - mutation present in the cell
- -1: Wildtype (WT) - reference allele, no mutation
- 0: Missing data - site not covered or uncertain
Requirements:
- Cell names should be stored in .obs_names
- Mutation/site names should be stored in .var_names
- Missing data should remain as 0 (do not impute)
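The encoding rules above are easy to check before running the pipeline. The following is a minimal sketch, assuming only NumPy; `validate_mutation_matrix`, `ALLOWED_CODES`, and `expected_shape` are hypothetical names and not part of EvoFATE. Sparse inputs are detected by duck typing (a `tocsr` method) so the snippet does not need to import SciPy.

```python
import numpy as np

ALLOWED_CODES = {-1, 0, 1}  # wildtype, missing, mutant

def validate_mutation_matrix(X, expected_shape=None):
    """Return the set of values in X that fall outside the -1/0/1 encoding.

    Hypothetical helper for illustration; not part of the EvoFATE API.
    Raises ValueError if X is not 2-D or does not match expected_shape.
    """
    is_sparse = hasattr(X, "tocsr")  # duck-typed check for SciPy sparse matrices
    if not is_sparse:
        X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError("mutation matrix must be 2-D (n_cells, n_mutations)")
    if expected_shape is not None and X.shape != tuple(expected_shape):
        raise ValueError(f"expected shape {tuple(expected_shape)}, got {X.shape}")
    # In a sparse matrix the implicit entries are 0 (missing),
    # so only the explicitly stored values need to be inspected.
    values = X.tocsr().data if is_sparse else X.ravel()
    return set(np.unique(values).tolist()) - ALLOWED_CODES

good = np.array([[1, -1, 0], [0, 1, -1]])
bad = np.array([[1, 2, 0], [0, 1, -1]])

print(validate_mutation_matrix(good, expected_shape=(2, 3)))
print(validate_mutation_matrix(bad))
```

An empty result means every entry uses the expected encoding; any returned values (such as the `2` in `bad`) point to calls that should be recoded before analysis.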
Example:
import numpy as np
import anndata as ad
import pandas as pd
# Create mutation matrix
n_cells = 1000
n_mutations = 500
mutation_matrix = np.random.choice([1, -1, 0], size=(n_cells, n_mutations),
                                   p=[0.1, 0.8, 0.1])
# Create AnnData object
cell_names = [f"Cell_{i}" for i in range(n_cells)]
mutation_names = [f"Mut_{j}" for j in range(n_mutations)]
adata_mut = ad.AnnData(
    X=mutation_matrix,
    obs=pd.DataFrame(index=cell_names),
    var=pd.DataFrame(index=mutation_names)
)

The second input is a standard single-cell RNA sequencing count matrix.
Format:
- Rows: Individual cells (must match mutation data cell order)
- Columns: Genes
- Data type: Count matrix (typically sparse)
- Shape: (n_cells, n_genes)
Important Notes:
- Both datasets must have matching cell identifiers/barcodes for proper integration
- Mutation calls should come from high-quality variant calling or long-read sequencing
- Expression data should be from the same cells as mutation data
- Missing mutation data (0) should not be imputed
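Because both datasets must refer to the same cells in the same order, it can help to compute the shared barcodes explicitly before integration. The sketch below uses plain Python lists of cell names; `shared_cell_order` is a hypothetical helper, not an EvoFATE function.

```python
def shared_cell_order(mut_cells, expr_cells):
    """Find barcodes present in both datasets, in mutation-data order.

    Hypothetical helper for illustration; not part of the EvoFATE API.
    Returns (shared, mut_idx, expr_idx), where mut_idx and expr_idx are
    row indices that put both datasets in the same cell order.
    """
    expr_pos = {name: i for i, name in enumerate(expr_cells)}
    if len(expr_pos) != len(expr_cells):
        raise ValueError("duplicate barcodes in expression data")
    shared, mut_idx, expr_idx = [], [], []
    for i, name in enumerate(mut_cells):
        if name in expr_pos:
            shared.append(name)
            mut_idx.append(i)
            expr_idx.append(expr_pos[name])
    return shared, mut_idx, expr_idx

mut_cells = ["Cell_0", "Cell_1", "Cell_2"]
expr_cells = ["Cell_2", "Cell_0", "Cell_3"]
shared, mut_idx, expr_idx = shared_cell_order(mut_cells, expr_cells)
print(shared, mut_idx, expr_idx)
```

With AnnData objects, the cell names would come from `.obs_names`, and the resulting index lists could be used to subset and reorder the rows of each object so that the two line up.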
- cal_genetic_embedding(adata_mut, ...): Calculate genetic embeddings using Node2Vec
- cal_evofate_embedding(adata_mut, ...): Calculate EvoFATE embeddings using BGRL
- BGRL: Bootstrapped Graph Representation Learning model class
Clone Analysis:
- define_clones(adata_mut, ...): Define clonal populations from mutation data
- cal_timing(adata_mut, key): Calculate evolutionary timing
- cal_clone_connectivity(adata_mut): Calculate clone connectivity graph
- cal_tree_layout(adata_mut, ...): Calculate lineage tree layout
Projections:
- cal_linear_projection(adata_mut, key): Linear projection onto tree coordinates
- cal_guided_umap_projection(adata_mut, ...): UMAP projection guided by timing
- CCA_projection(feature_matrix, x): CCA-based projection
- guided_residual_projection(X_2d, y): Guided residual projection
Visualization:
- plot_consensus_profile(adata_mut, ...): Plot consensus mutation profiles
- plot_lineage_tree(adata_mut, ...): Plot lineage tree
- plot_lineage_tree_w_piechart(adata_mut, label, ...): Plot tree with pie charts
- plot_embedding(adata_mut, basis, labels, ...): Plot embeddings