Commit effb57c

Update scvi-tools
1 parent 450905e commit effb57c

9 files changed

Lines changed: 317 additions & 270 deletions

docs/skills.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@
 - **Pathway Enrichment** - Pathway and gene-set enrichment analysis on gene lists or ranked gene data, with result interpretation. Supports over-representation analysis (ORA via Enrichr/Fisher's exact/hypergeometric), preranked and standard Gene Set Enrichment Analysis (GSEA), and single-sample scoring (ssGSEA/GSVA) using gseapy and the official g:Profiler client. Covers gene-set libraries (GO Biological Process/Molecular Function/Cellular Component, KEGG, Reactome, WikiPathways, and MSigDB collections including Hallmark, C2 canonical pathways, C5 ontology, and C7 immune signatures), gene-ID mapping (Ensembl/Entrez to symbols via Biomart, g:Convert, or mygene) and organism handling, choice of the statistical background universe, multiple-testing correction (Benjamini-Hochberg FDR vs g:Profiler g:SCS vs Bonferroni), redundancy reduction (enrichment maps, leading-edge genes, term clustering), and publication-ready tables plus dotplots/bar plots/GSEA running-score plots. Includes a CLI helper (run_enrichment.py) that runs ORA or preranked GSEA end-to-end — automatically building the ranking metric from a DESeq2 results table — and writes a results table and dotplot. Cross-references PyDESeq2 and Scanpy upstream (sources of differentially expressed genes and cluster markers) and database-lookup/gget for gene-ID mapping and Reactome/KEGG/STRING APIs. Use cases: functional interpretation of differentially expressed genes, CRISPR-screen hits, and single-cell cluster markers; GO/KEGG/Reactome/WikiPathways enrichment; preranked GSEA from DESeq2 statistics; pathway activity scoring per sample or cell; and building defensible, reproducible enrichment analyses that avoid common pitfalls (gene-ID/organism mismatch, wrong background, thresholding before GSEA)
 - **Scanpy** - Comprehensive Python toolkit for single-cell RNA-seq data analysis built on AnnData (scanpy 1.12.x; Python 3.12+). Provides end-to-end workflows for preprocessing (quality control, scrublet doublet detection, normalization, log transformation), dimensionality reduction (PCA, UMAP, t-SNE), Leiden clustering, marker gene identification, pseudobulk aggregation via `get.aggregate()`, trajectory inference (PAGA, diffusion maps), and visualization. Key features include: efficient handling of large datasets using sparse matrices and experimental Dask out-of-core support, integration with scvi-tools for advanced analysis, batch correction methods (ComBat), and publication-quality plotting. Optional GPU acceleration via rapids-singlecell. Use cases: single-cell RNA-seq analysis, cell-type identification, exploratory cluster markers, pseudobulk DE workflows (with pydeseq2), trajectory analysis, and comprehensive single-cell genomics workflows
 - **scVelo** - RNA velocity analysis for estimating cell state transitions from unspliced/spliced mRNA dynamics. Infers trajectory directions, computes latent time, and identifies driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference, enabling the study of cellular differentiation dynamics and lineage decisions at single-cell resolution
-- **scvi-tools** - Probabilistic deep learning models for single-cell omics analysis. PyTorch-based framework providing variational autoencoders (VAEs) for dimensionality reduction, batch correction, differential expression, and data integration across modalities. Includes 25+ models: scVI/scANVI (RNA-seq integration and cell type annotation), totalVI (CITE-seq protein+RNA), MultiVI (multiome RNA+ATAC integration), PeakVI (ATAC-seq analysis), DestVI/Stereoscope/Tangram (spatial transcriptomics deconvolution), MethylVI (methylation), CytoVI (flow/mass cytometry), VeloVI (RNA velocity), contrastiveVI (perturbation studies), and Solo (doublet detection). Supports seamless integration with Scanpy/AnnData ecosystem, GPU acceleration, reference mapping (scArches), and probabilistic differential expression with uncertainty quantification
+- **scvi-tools** - Probabilistic deep learning models for single-cell omics analysis. PyTorch-based framework providing variational autoencoders (VAEs) for dimensionality reduction, batch correction, differential expression, and data integration across modalities. Includes 30+ models: scVI/scANVI (RNA-seq integration and cell type annotation), totalVI/totalANVI (CITE-seq protein+RNA), MultiVI (multiome RNA+ATAC integration), PeakVI (ATAC-seq analysis), DestVI/Stereoscope/Tangram (spatial transcriptomics deconvolution), MethylVI (methylation), CytoVI (flow/mass cytometry), VeloVI (RNA velocity), contrastiveVI (perturbation studies), and Solo (doublet detection). Supports seamless integration with Scanpy/AnnData ecosystem, GPU acceleration, reference mapping (scArches), and probabilistic differential expression with uncertainty quantification
 - **scikit-bio** - Python library for bioinformatics providing data structures, algorithms, and parsers for biological sequence analysis. Built on NumPy, SciPy, and pandas. Key features include: sequence objects (DNA, RNA, protein sequences) with biological alphabet validation, sequence alignment algorithms (local, global, semiglobal), phylogenetic tree manipulation, diversity metrics (alpha diversity, beta diversity, phylogenetic diversity), distance metrics for sequences and communities, file format parsers (FASTA, FASTQ, QIIME formats, Newick), and statistical analysis tools. Provides scikit-learn compatible transformers for machine learning workflows. Supports efficient processing of large sequence datasets. Use cases: sequence analysis, microbial ecology (16S rRNA analysis), metagenomics, phylogenetic analysis, and bioinformatics research requiring sequence manipulation and diversity calculations
 - **TileDB-VCF** - High-performance C++ library with Python and CLI interfaces for efficient storage and retrieval of genomic variant-call data using TileDB multidimensional sparse array technology. Enables scalable VCF/BCF ingestion with incremental sample addition, compressed storage, parallel queries across genomic regions and samples, and export capabilities for population genomics workflows. Key features include: memory-efficient queries, cloud storage integration (S3, Azure, GCS), and CLI tools for dataset creation, sample ingestion, data export, and statistics. Supports building variant databases for large cohorts, population-scale genomics studies, and association analysis. Use cases: population genomics databases, cohort studies, variant discovery workflows, genomic data warehousing, and scaling to enterprise-level analysis with TileDB-Cloud platform
 - **Zarr** - Python library (Zarr-Python 3.x) implementing chunked, compressed N-dimensional arrays for local disk and cloud object storage (S3, GCS via fsspec). Supports Zarr format 2 and 3, `zarr.codecs` compression (Blosc, gzip, zstd), partial chunk reads, consolidated metadata, sharding, and integration with NumPy, Dask, and Xarray. Use for out-of-core arrays, cloud-native pipelines, and large scientific datasets (genomics, imaging, climate). Skill: `zarr-python`

skills/scvi-tools/SKILL.md

Lines changed: 16 additions & 4 deletions
@@ -3,15 +3,17 @@ name: scvi-tools
 description: Deep generative models for single-cell omics. Use when you need probabilistic batch correction (scVI), transfer learning, differential expression with uncertainty, or multi-modal integration (TOTALVI, MultiVI). Best for advanced modeling, batch effects, multimodal data. For standard analysis pipelines use scanpy.
 license: BSD-3-Clause license
 metadata:
-  version: "1.0"
+  version: "1.1"
   skill-author: K-Dense Inc.
 ---

 # scvi-tools

 ## Overview

-scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities.
+scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities. Current stable release: **scvi-tools 1.4.3** (May 2026).
+
+**Model namespaces matter:** core models (scVI, scANVI, totalVI, MultiVI, PeakVI, AUTOZI, CondSCVI, DestVI, LinearSCVI, AmortizedLDA, JaxSCVI) live under `scvi.model`. Most other models (VeloVI, contrastiveVI, CellAssign, PoissonVI, scBasset, MrVI, MethylVI/MethylANVI, CytoVI, SysVI, Decipher, gimVI, scVIVA, ResolVI, Stereoscope, Solo, totalANVI, DIAGVI) live under `scvi.external`. The reference files specify the correct namespace per model.

 ## When to Use This Skill

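Reviewer note: the namespace split described in the added "Model namespaces matter" paragraph can be captured as a small lookup table, which may help when checking that each reference file uses the right import path. The sketch below is built only from the two lists in that paragraph. The names `CORE_MODELS`, `EXTERNAL_MODELS`, `MODEL_NAMESPACE`, and `namespace_for` are hypothetical and are not part of scvi-tools; the code does not import scvi-tools and does not verify that the classes exist.

```python
# Lookup table built only from the model lists in the added SKILL.md paragraph.
# All names defined here are hypothetical helpers, not scvi-tools API.

CORE_MODELS = [
    "scVI", "scANVI", "totalVI", "MultiVI", "PeakVI", "AUTOZI", "CondSCVI",
    "DestVI", "LinearSCVI", "AmortizedLDA", "JaxSCVI",
]
EXTERNAL_MODELS = [
    "VeloVI", "contrastiveVI", "CellAssign", "PoissonVI", "scBasset", "MrVI",
    "MethylVI", "MethylANVI", "CytoVI", "SysVI", "Decipher", "gimVI", "scVIVA",
    "ResolVI", "Stereoscope", "Solo", "totalANVI", "DIAGVI",
]

# Keys are lower-cased so that "POISSONVI" and "PoissonVI" resolve identically.
MODEL_NAMESPACE = {name.lower(): "scvi.model" for name in CORE_MODELS}
MODEL_NAMESPACE.update({name.lower(): "scvi.external" for name in EXTERNAL_MODELS})


def namespace_for(model_name: str) -> str:
    """Return the namespace the SKILL.md paragraph lists for a model name."""
    key = model_name.strip().lower()
    if key not in MODEL_NAMESPACE:
        raise KeyError(f"{model_name!r} is not in the lists from the SKILL.md paragraph")
    return MODEL_NAMESPACE[key]
```

For example, `namespace_for("PoissonVI")` returns `"scvi.external"`, which is consistent with the `scvi.model.POISSONVI` to `scvi.external.POISSONVI` change in `models-atac-seq.md`.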
@@ -46,8 +48,10 @@ Models for analyzing single-cell chromatin data. See `references/models-atac-seq
 ### 3. Multimodal & Multi-omics Integration
 Joint analysis of multiple data types. See `references/models-multimodal.md` for:
 - **totalVI**: CITE-seq protein and RNA joint modeling
-- **MultiVI**: Paired and unpaired multi-omic integration
+- **totalANVI**: Semi-supervised CITE-seq (totalVI with cell-type labels)
+- **MultiVI**: Paired and unpaired multi-omic integration (MuData-based)
 - **MrVI**: Multi-resolution cross-sample analysis
+- **DIAGVI**: Diagonal integration of unpaired single-cell datasets (added in 1.4.3)

 ### 4. Spatial Transcriptomics
 Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for:
@@ -171,12 +175,20 @@ See `references/theoretical-foundations.md` for detailed background on the mathe

 ## Installation

+Requires Python **3.12+** (scvi-tools 1.4 dropped older versions).
+
 ```bash
 uv pip install scvi-tools
 # For GPU support
-uv pip install scvi-tools[cuda]
+uv pip install "scvi-tools[cuda]"
 ```

+For reproducible environments, pin a version: `uv pip install scvi-tools==1.4.3`.
+
+**Compute backends:** training defaults to PyTorch (CPU/GPU/TPU). A JAX backend
+(`scvi.model.JaxSCVI`) and an experimental MLX backend for Apple silicon
+(`scvi.model.mlxSCVI`) are available for select models.
+
 ## Best Practices

 1. **Use raw counts**: Always provide unnormalized count data to models
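
Reviewer note: the Python floor and the version pin added to the Installation section could be checked programmatically before any model code runs. The following is a minimal sketch using only the standard library (`sys`, `importlib.metadata`). The constants come from the added text (3.12 and 1.4.3); `python_ok`, `installed_version`, and `environment_report` are hypothetical helper names, and the exact-match comparison against the pin is an assumption about how strict a project would want to be.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 12)      # floor stated in the added Installation text
PINNED_VERSION = "1.4.3"  # pin suggested in the same hunk


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True if the (major, minor) of version_info meets the stated floor."""
    return tuple(version_info[:2]) >= minimum


def installed_version(dist_name: str = "scvi-tools"):
    """Return the installed distribution version string, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> list:
    """Collect human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not python_ok():
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    found = installed_version()
    if found is None:
        problems.append("scvi-tools is not installed")
    elif found != PINNED_VERSION:
        problems.append(f"scvi-tools {found} is installed, expected {PINNED_VERSION}")
    return problems
```

Calling `environment_report()` at the top of a notebook or script would surface a mismatched interpreter or an unpinned install early, before a long training run starts.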

skills/scvi-tools/references/differential-expression.md

Lines changed: 16 additions & 0 deletions
@@ -245,6 +245,22 @@ large_effect = de_results[

 ## Advanced Usage

+### Differential Abundance
+
+In addition to differential *expression*, models exposing the `VAEMixin` API
+provide `differential_abundance()` and `get_aggregated_posterior()` (added in
+v1.4.2) to test how cell-state abundance shifts between conditions in the
+learned latent space:
+
+```python
+# Compare the latent-space abundance of two conditions
+da = model.differential_abundance(
+    groupby="condition",
+    group1="disease",
+    group2="healthy",
+)
+```
+
 ### DE Within Specific Cells

 ```python

skills/scvi-tools/references/models-atac-seq.md

Lines changed: 36 additions & 28 deletions
@@ -95,19 +95,19 @@ da_results = model.differential_accessibility(
 - Fragment count matrix (cells × genomic regions)
 - Count data (not binary)

-**Basic Usage**:
+**Basic Usage** (PoissonVI lives in `scvi.external`):
 ```python
-scvi.model.POISSONVI.setup_anndata(
+scvi.external.POISSONVI.setup_anndata(
     adata,
     batch_key="batch"
 )

-model = scvi.model.POISSONVI(adata)
+model = scvi.external.POISSONVI(adata)
 model.train()

 # Get results
 latent = model.get_latent_representation()
-accessibility = model.get_accessibility_estimates()
+accessibility = model.get_normalized_accessibility()
 ```

 **Key Differences from PeakVI**:
@@ -143,27 +143,24 @@ accessibility = model.get_accessibility_estimates()
 - Peak accessibility matrix
 - Genome reference (for sequence extraction)

-**Basic Usage**:
+**Basic Usage** (scBasset lives in `scvi.external`):
 ```python
-# scBasset requires sequence information
-# First, extract sequences for peaks
-from scbasset import utils
-sequences = utils.fetch_sequences(adata, genome="hg38")
-
-# Setup and train
-scvi.model.SCBASSET.setup_anndata(
+# scBasset needs per-peak DNA sequences. Add them to the AnnData first;
+# this downloads the genome (once) and stores one-hot codes in adata.varm.
+scvi.data.add_dna_sequence(
     adata,
-    batch_key="batch"
+    genome_name="hg38",
+    install_genome=True,
 )

-model = scvi.model.SCBASSET(adata, sequences=sequences)
+# Register the per-peak sequence code, then train
+scvi.external.SCBASSET.setup_anndata(adata, dna_code_key="dna_code")
+
+model = scvi.external.SCBASSET(adata)
 model.train()

-# Get latent representation
+# Cell embeddings (low-dimensional latent representation)
 latent = model.get_latent_representation()
-
-# Interpret model: which sequences/motifs are important
-importance_scores = model.get_feature_importance()
 ```

 **Key Parameters**:
@@ -179,12 +176,16 @@ importance_scores = model.get_feature_importance()
 - **Transfer learning**: Fine-tune on new datasets

 **Interpretability Tools**:
-```python
-# Get importance scores for sequences
-importance = model.get_sequence_importance(region_indices=[0, 1, 2])

-# Predict accessibility for new sequences
-predictions = model.predict_accessibility(new_sequences)
+scBasset learns sequence-aware cell and peak embeddings. Transcription-factor
+activity is assessed by scoring motif sequences against the trained model rather
+than calling a single importance function. See the
+[scBasset user guide](https://docs.scvi-tools.org/en/stable/user_guide/models/scbasset.html)
+for the current motif-injection / TF-activity workflow.
+
+```python
+# Cell embeddings for clustering / visualization
+cell_embedding = model.get_latent_representation()
 ```

 ## Model Selection for ATAC-seq
@@ -275,14 +276,21 @@ model.save("peakvi_model")
 For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead:

 ```python
-# For 10x Multiome or similar paired data
-scvi.model.MULTIVI.setup_anndata(
-    adata,
+from mudata import MuData
+
+# MultiVI is configured from a MuData object (setup_anndata was removed in v1.3)
+mdata = MuData({"rna": rna_adata, "atac": atac_adata})
+scvi.model.MULTIVI.setup_mudata(
+    mdata,
     batch_key="sample",
-    modality_key="modality"  # "RNA" or "ATAC"
+    modalities={"rna_layer": "rna", "atac_layer": "atac"},
 )

-model = scvi.model.MULTIVI(adata)
+model = scvi.model.MULTIVI(
+    mdata,
+    n_genes=rna_adata.n_vars,
+    n_regions=atac_adata.n_vars,
+)
 model.train()

 # Get joint latent space
