scxa-tertiary-workflow

Introduction

Tertiary component for Single-Cell Expression Atlas workflows, focused on post-processing and advanced analyses like normalization, PCA, clustering, t-SNE, and UMAP visualizations.

Overview

This Nextflow workflow performs the analysis downstream of expression quantification from raw single-cell RNA sequencing (scRNA-seq) data. It takes processed data (an expression matrix and metadata) as input, normalizes and scales the data, identifies variable genes, runs principal component analysis (PCA), corrects batch effects using Harmony, computes cell neighborhoods, finds clusters, and produces UMAP and t-SNE visualizations. It also finds marker genes for cell groupings.

The workflow runs these analyses using Scanpy, leveraging the scanpy-scripts package to run individual steps of the Scanpy workflow.

How to run the workflow

Prepare the data

To perform a tertiary analysis, all input files must be stored in a single directory, which is passed to the workflow with the --dir_path parameter.

Ensure the following files are present in the directory:

  • genes_metadata.tsv – Metadata information for genes
  • genes.tsv – List of gene identifiers
  • barcodes.tsv – Cell barcode identifiers
  • matrix.mtx – Expression matrix in Matrix Market format
  • cell_metadata.tsv – Metadata information for individual cells
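
A quick pre-flight check can catch a missing input before the workflow is launched. The sketch below is a minimal example and not part of the workflow: the function name check_scxa_inputs is hypothetical, and it only tests that the five files listed above exist, not that their contents are valid.

```shell
# Hypothetical helper (not shipped with the workflow): confirm that a
# directory contains the five input files listed in this README.
check_scxa_inputs() {
    local dir="$1"
    local missing=0
    local f
    for f in genes_metadata.tsv genes.tsv barcodes.tsv matrix.mtx cell_metadata.tsv; do
        if [ ! -f "$dir/$f" ]; then
            # Report every missing file instead of stopping at the first one
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 if all files are present, 1 otherwise
    return "$missing"
}
```

It could be called as check_scxa_inputs <EXP-ID with path> and its exit status used to decide whether to start the run.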

Requirements

  • Nextflow
  • Singularity or Docker

High-performance computing

This workflow can be run on high-performance computing (HPC) clusters.

  • SLURM: to submit jobs through the SLURM scheduler, use the --slurm option.

Running the workflow

The workflow can be executed for two types of scRNA-seq technologies: plate-based and droplet-based.

For plate-based data:

nextflow run main.nf --slurm -resume --dir_path <EXP-ID with path> [--output_path <PATH>] [--scanpy_scripts_container <container_id>] [--celltype_field <celltype_field>]

For droplet-based data:

nextflow run main.nf --slurm -resume --dir_path <EXP-ID with path> --technology droplet [--output_path <PATH>] [--scanpy_scripts_container <container_id>] [--celltype_field <celltype_field>]
  • --technology droplet: Specifies that the data is droplet-based. This enables additional steps for multiplet detection (using Scrublet) and doublet removal.
  • The remaining parameters are the same as for the plate-based run.

The workflow uses Singularity by default. To run with Docker instead, add -profile docker to the Nextflow command.
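
The difference between the plate-based and droplet-based invocations, and the optional profile, can be summarized in a small wrapper. This is a sketch under stated assumptions: build_scxa_command is a hypothetical name, the keyword "plate" is only this helper's own label for the default case (the plate-based command above passes no --technology flag), and the function prints the command rather than running it. Because it returns a plain string, paths containing spaces are not handled.

```shell
# Hypothetical wrapper (not part of the repository): print the nextflow
# command for an input directory, a technology and an optional profile.
# Only flags that appear in this README are used.
build_scxa_command() {
    local dir_path="$1"
    local technology="${2:-plate}"   # helper-level label: "plate" (default) or "droplet"
    local profile="${3:-}"           # e.g. "docker"; empty keeps the Singularity default
    local cmd="nextflow run main.nf --slurm -resume --dir_path $dir_path"

    case "$technology" in
        plate)   ;;                                  # plate-based: no extra flag
        droplet) cmd="$cmd --technology droplet" ;;  # adds the Scrublet-based steps
        *) echo "unknown technology: $technology" >&2; return 1 ;;
    esac

    if [ -n "$profile" ]; then
        cmd="$cmd -profile $profile"
    fi
    printf '%s\n' "$cmd"
}
```

For example, build_scxa_command /data/E-TEST-1 droplet docker (with /data/E-TEST-1 as a made-up path) prints the droplet-based command with -profile docker appended, which can be inspected before it is run.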

Running the workflow for SCEA

When running for the Single-Cell Expression Atlas (SCEA), select the Atlas-specific profile by adding -profile atlas to the Nextflow command.

Output

If --output_path <PATH> is not specified, results are written to the <EXP-ID with path>/results directory.
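
The default rule can be mirrored in a script that needs to locate the results afterwards. The helper below is a hypothetical sketch of that rule as described here, not the workflow's own implementation; the name resolve_output_path and the trailing-slash handling are assumptions.

```shell
# Hypothetical helper: return the directory where results are expected,
# following the rule described in this README.
resolve_output_path() {
    local dir_path="${1%/}"      # drop one trailing slash, if any
    local output_path="${2:-}"   # value given to --output_path, if any
    if [ -n "$output_path" ]; then
        printf '%s\n' "$output_path"
    else
        printf '%s\n' "$dir_path/results"
    fi
}
```

With a made-up input directory /data/E-TEST-1 and no second argument, this prints /data/E-TEST-1/results.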

Credits

ebi-gene-expression-group/scxa-tertiary-workflow was originally written by Anil Thanki, Iris Yu and Pedro Madrigal, based on a previous Galaxy workflow developed by Pablo Moreno and Jonathan Manning.

Citations

For now, if you use the workflow for your analysis, please cite it using the following DOI: 10.1093/nar/gkad1021

Acknowledging nf-core

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references for the tools and data used in this pipeline are as follows: