Tertiary component for Single-Cell Expression Atlas workflows, focused on post-processing and advanced analyses like normalization, PCA, clustering, t-SNE, and UMAP visualizations.
This Nextflow workflow performs the analysis downstream of quantifying expression counts from raw single-cell RNA sequencing (scRNA-seq) data. It takes processed data (an expression matrix and metadata) as input, normalizes and scales the data, identifies variable genes, runs principal component analysis (PCA), corrects batch effects using Harmony, calculates cell neighborhoods, finds clusters, and produces UMAP and t-SNE visualizations. It also finds marker genes for cell groupings.
The workflow runs these analyses using Scanpy, leveraging the scanpy-scripts package to run individual steps of the Scanpy workflow.
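For orientation, the steps listed above can be sketched directly against the Scanpy API. This is a rough illustration only, not the workflow's actual implementation: the function name `run_tertiary_sketch`, the `batch_key` argument, every parameter value, and the choice of Leiden clustering are assumptions, and the real workflow drives these steps through scanpy-scripts with its own settings.

```python
def run_tertiary_sketch(adata, batch_key=None):
    """Apply the analysis steps named above to an AnnData object, in place.

    Hypothetical helper for illustration; parameter values are assumptions.
    """
    # Third-party dependency, imported here so this file loads without it.
    import scanpy as sc

    sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
    sc.pp.log1p(adata)                            # log-transform the counts
    adata.raw = adata.copy()                      # keep unscaled values for marker tests
    sc.pp.highly_variable_genes(adata)            # flag variable genes
    sc.pp.scale(adata, max_value=10)              # scale each gene
    sc.tl.pca(adata)                              # PCA, stored in adata.obsm["X_pca"]

    rep = "X_pca"
    if batch_key is not None:
        # Harmony batch correction (requires the harmonypy package);
        # the corrected embedding is stored in adata.obsm["X_pca_harmony"].
        sc.external.pp.harmony_integrate(adata, key=batch_key)
        rep = "X_pca_harmony"

    sc.pp.neighbors(adata, use_rep=rep)           # cell neighborhood graph
    sc.tl.leiden(adata)                           # clustering (assumed: Leiden)
    sc.tl.umap(adata)                             # UMAP embedding
    sc.tl.tsne(adata, use_rep=rep)                # t-SNE embedding
    # Marker genes per cluster; uses adata.raw because it is set above.
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    return adata
```

Calling `run_tertiary_sketch(adata, batch_key="batch")` assumes that `adata.obs` contains a column named `batch`; with `batch_key=None` the Harmony step is skipped.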
To perform a tertiary analysis, all required datasets must be stored in a single directory, which should be specified using the --dir_path parameter.
Ensure the following files are present in the directory:
- genes_metadata.tsv – Metadata information for genes
- genes.tsv – List of gene identifiers
- barcodes.tsv – Cell barcode identifiers
- matrix.mtx – Expression matrix in Matrix Market format
- cell_metadata.tsv – Metadata information for individual cells
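Before launching a run, it can help to confirm that the directory contains every file listed above. The following is a minimal sketch using only the Python standard library; the helper name `missing_input_files` is hypothetical and is not part of the workflow.

```python
from pathlib import Path

# File names taken from the list of required inputs above.
REQUIRED_FILES = (
    "genes_metadata.tsv",
    "genes.tsv",
    "barcodes.tsv",
    "matrix.mtx",
    "cell_metadata.tsv",
)


def missing_input_files(dir_path):
    """Return the required file names that are absent from dir_path."""
    directory = Path(dir_path)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]
```

An empty list means all five files are present; otherwise the returned names show what still needs to be added to the directory passed as --dir_path.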
- Nextflow
- Singularity or Docker
This workflow can be run on high-performance computing (HPC) clusters.
- SLURM: for SLURM job scheduling, use the --slurm option.
The workflow can be executed for two types of scRNA-seq technologies: plate-based and droplet-based.
Plate-based:
nextflow run main.nf --slurm -resume --dir_path <EXP-ID with path> [--output_path <PATH>] [--scanpy_scripts_container <container_id>] [--celltype_field <celltype_field>]

Droplet-based:
nextflow run main.nf --slurm -resume --dir_path <EXP-ID with path> --technology droplet [--output_path <PATH>] [--scanpy_scripts_container <container_id>] [--celltype_field <celltype_field>]

- --technology droplet: Specifies that the data is droplet-based. This enables additional steps for multiplet detection (using Scrublet) and doublet removal.
- The remaining parameters are the same as for the plate-based run.
The workflow uses Singularity by default; add -profile docker to the Nextflow command to run with Docker instead.
If running for Single-cell Expression Atlas, specify the Atlas-specific profile by adding -profile atlas to the Nextflow command.
If --output_path <PATH> is not specified, results are written to the <EXP-ID with path>/results directory.
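The options described above can be combined programmatically, for example when launching many experiments from a script. The sketch below assembles the argument list and resolves the results directory using only flags that appear in this README. The helper names `build_nextflow_command` and `results_dir` and their parameters are hypothetical, and the optional --scanpy_scripts_container and --celltype_field flags are left out for brevity.

```python
from pathlib import Path


def build_nextflow_command(dir_path, technology="plate", output_path=None,
                           profile=None, slurm=False):
    """Assemble the `nextflow run` argument list described in this README."""
    if technology not in ("plate", "droplet"):
        raise ValueError("technology must be 'plate' or 'droplet'")
    cmd = ["nextflow", "run", "main.nf"]
    if slurm:
        cmd.append("--slurm")
    cmd += ["-resume", "--dir_path", str(dir_path)]
    # The plate-based command in this README carries no --technology flag.
    if technology == "droplet":
        cmd += ["--technology", "droplet"]
    if output_path is not None:
        cmd += ["--output_path", str(output_path)]
    if profile is not None:
        cmd += ["-profile", profile]  # e.g. "docker" or "atlas"
    return cmd


def results_dir(dir_path, output_path=None):
    """Return where results end up: output_path if given, else <dir_path>/results."""
    if output_path is not None:
        return Path(output_path)
    return Path(dir_path) / "results"
```

The returned list can be passed to `subprocess.run`, or joined with `shlex.join` to print the command for inspection before running it.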
ebi-gene-expression-group/scxa-tertiary-workflow was originally written by Anil Thanki, Iris Yu and Pedro Madrigal, based on a previous Galaxy workflow developed by Pablo Moreno and Jonathan Manning.
We thank the following people and teams for their extensive assistance in the development of this pipeline:
For now, if you use the workflow for your analysis, please cite it using the following DOI: 10.1093/nar/gkad1021
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
In addition, references of tools and data used in this pipeline are as follows: