Skip to content

Automate mitochondrial gene retrieval for SCTK QC #904

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: develop
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,10 @@ All changed fall under either one of these types: `Added`, `Changed`, `Deprecate

## [Unreleased]

### Added

- automated mitochondrial gene retrieval with genomepy for SCTK quality control

## [0.9.6] - 2022-10-31

### Changed
Expand Down
10 changes: 3 additions & 7 deletions docs/content/workflows/scrna_seq.md
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,8 @@ barcodefile: "barcodes.tab"
#### scRNA-seq pre-processing and quality control
The seq2science scRNA workflow provides the option to perform automated pre-processing of raw scRNA-seq UMI count matrices. This is achieved by incorporating several quality control steps from the [singleCellTK](https://camplab.net/sctk/v2.4.1/index.html) Bioconductor package, such as cell-calling, doublet detection and assessment of cell-wise mitochondrial RNA content.

Genomepy will attempt to extract the correct list of mitochondrial genes for quality control automatically and store it in `path/to/results/scrna-preprocess/{quantifier}`. In case the list could not be extracted from the genome annotation, this step will be skipped and needs to be repeated manually.

The QC results are reported in comprehensive R Markdown reports and processed UMI count matrices are stored as [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) S4 objects. Any sample level metadata that has been added to the `samples.tsv` file will be transferred to the `colData` slot of the corresponding object and assigned to each cell identifier.

After running seq2science, these objects can be directly imported into R with the `readRDS` function for further down-stream analysis with your favorite R packages.
Expand All @@ -192,8 +194,6 @@ sc_preprocess:
use_alt_expr: False
alt_exp_name: ""
alt_exp_reg: ""
sctk_mito_set: human-ensembl
sctk_detect_mito: True
sctk_detect_cell: False
sctk_cell_calling: Knee
```
Expand All @@ -211,12 +211,8 @@ We do not perform any gene/cell level filtering, except for empty droplets that

#### Advanced settings

To perform additional (optional) QC steps, consider the following parameters:
To perform additional (optional) QC steps, consider the following parameters.

* `sctk_detect_mito` (*default*: `True`)<br/>
Quantify the percentage of mitochondrial RNA for each cell.
* `sctk_mito_set` (*default*: `human-ensembl`)<br/>
The mitochondrial gene set to use for quantification with syntax `[human,mouse]-[ensembl,entrez,symbol]`. At the moment, only human and mouse gene annotations are supported. This option is only considered when `sctk_detect_mito=True`.
* `sctk_detect_cell` (*default*: `True`)<br/>
Perform cell-calling for droplet based experiments. Empty droplets will not be removed if set to `False`.
* `sctk_cell_calling` (*default*: `Knee`)<br/>
Expand Down
17 changes: 16 additions & 1 deletion seq2science/rules/qc_scRNA.smk
Original file line number Diff line number Diff line change
Expand Up @@ -43,12 +43,27 @@ rule export_sce_obj:
f"{config['rule_dir']}/../scripts/singlecell/read_kb_counts.R"


rule get_mt_genes:
"""
Extract MT genes from gtf annotation for QC
"""
input:
gtf=rules.get_genome_annotation.output.gtf,
output:
dir=expand("{result_dir}/scrna-preprocess/{quantifier}/{{assembly}}-mt.txt", **config),
log:
expand("{log_dir}/scrna-preprocess/{quantifier}/{{assembly}}-mt.log", **config),
script:
f"{config['rule_dir']}/../scripts/genomepy/get_genome_mt.py"


rule sctk_qc:
"""
Perform scRNA QC with singleCellTK, store output in SingleCellExperiment object and export to RData format.
"""
input:
rds_raw=rules.export_sce_obj.output.dir[0]
rds_raw=rules.export_sce_obj.output.dir[0],
mt_genes=rules.get_mt_genes.output[0]
output:
dir=expand(
"{result_dir}/scrna-preprocess/{quantifier}/sctk/{{assembly}}-{{sample}}/{file}",
Expand Down
12 changes: 0 additions & 12 deletions seq2science/schemas/config/scrna.schema.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -54,14 +54,6 @@ properties:
description: Type of UMI count matrix, either cell or droplet counts
default: 'cell'
enum: ['cell','droplet']
sctk_detect_mito:
description: Calculate mitochondrial gene ratio for quality control
default: True
type: boolean
sctk_mito_set:
description: Mitochondrial gene set to use for quality control
default: 'human-symbol'
enum: ['human-ensembl', 'mouse-ensembl', 'human-entrez', 'mouse-entrez', 'human-symbol', 'mouse-symbol']
sctk_detect_cell:
description: Perform cell calling for droplet based scRNA-seq assays
default: True
Expand All @@ -74,10 +66,6 @@ properties:
description: File formats for SingleCellExperiment object export
default: ['Seurat']
type: array
sctk_qc_algos:
description: QC algorithms for CellQC (debug only)
default: ["QCMetrics", "scDblFinder", "decontX"]
type: array
use_alt_expr:
description: Process alternative experiments (if present) such as ERCC Spike-in's
default: False
Expand Down
48 changes: 48 additions & 0 deletions seq2science/scripts/genomepy/get_genome_mt.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
import genomepy
import contextlib


annotation_path=snakemake.input[0]
logfile = snakemake.log[0]
output = snakemake.output[0]


def mito_genes(annotation_path):
ann = genomepy.annotation.Annotation(annotation_path)
gtf = ann.named_gtf

# very quick, very dirty way to find the name for the mitochondrion.
# could also try to read this from the assembly report, but that's also not perfect.
mt = gtf[gtf["seqname"].str.contains("chrM", case=False, regex=False)]["seqname"].unique()
if len(mt) != 1:
mt = gtf[gtf["seqname"].str.contains("MT", case=False, regex=False)]["seqname"].unique()
if len(mt) != 1:
mt = gtf[gtf["seqname"].str.contains("mito", case=False, regex=False)]["seqname"].unique()
if len(mt) != 1:
mt = gtf[gtf["seqname"].str.contains("m", case=False, regex=False)]["seqname"].unique()
if len(mt) != 1:
print("Could not extract mitochondrial gene list from annotation. We tried.....")
return {'NONE'}

genes = set(gtf[gtf["seqname"] == mt[0]].index)
return genes


# redirect all messages to a logfile
with open(logfile, "w") as log:
with contextlib.redirect_stdout(log), contextlib.redirect_stderr(log):
genomepy.logger.remove()
genomepy.logger.add(
logfile,
format="<green>{time:HH:mm:ss}</green> <bold>|</bold> <blue>{level}</blue> <bold>|</bold> {message}",
level="INFO",
)
try:
genes = mito_genes(annotation_path)
#Write genes to file
with open(output, 'w') as f:
for line in genes:
f.write(f"{line}\n")
except Exception as e:
print(e)
print("\nSomething went wrong while extracting the MT genes from your annotation (see error message above). ")
51 changes: 32 additions & 19 deletions seq2science/scripts/singlecell/sctk_qc.R
Original file line number Diff line number Diff line change
Expand Up @@ -16,15 +16,13 @@ sample <- snakemake@params$sample
isvelo <- snakemake@params$isvelo
replicates <- snakemake@params$replicates
data_type <- snakemake@config$sc_preprocess$sctk_data_type
mito_set <- snakemake@config$sc_preprocess$sctk_mito_set
detect_cell <- snakemake@config$sc_preprocess$sctk_detect_cell
detect_mito <- snakemake@config$sc_preprocess$sctk_detect_mito
cell_calling <- snakemake@config$sc_preprocess$sctk_cell_calling
use_alt_exp <- snakemake@config$sc_preprocess$use_alt_expr
pdf_out <- file.path(out_dir, "SCTK_DropletQC_figures.pdf", fsep = "/")
numCores <- snakemake@threads
export_formats <- snakemake@config$sc_preprocess$sctk_export_formats
cellQCAlgos <- snakemake@config$sc_preprocess$sctk_qc_algos
mt_list <- snakemake@input$mt_genes

# Log all console output
log <- file(log_file, open = "wt")
Expand All @@ -35,6 +33,8 @@ sink(log, type = "message")
scrna_utils <- file.path(scripts_dir, "singlecell", "utils.R")
source(scrna_utils)

# Set QC algos
cellQCAlgos <- c("QCMetrics", "scDblFinder", "decontX")

# Log all variables for debugging purposes
cat("# variables used for this analysis:\n")
Expand All @@ -47,13 +47,12 @@ cat('rds_in <- "', rds_in, '"\n', sep = "")
cat('out_dir <- "', out_dir, '"\n', sep = "")
cat('pdf_out <- "', pdf_out, '"\n', sep = "")
cat('data_type <- "', data_type, '"\n', sep = "")
cat('detect_mito <- "', detect_mito, '"\n', sep = "")
cat('mito_set <- "', mito_set, '"\n', sep = "")
cat('detect_cell <- "', detect_cell, '"\n', sep = "")
cat('cell_calling <- "', cell_calling, '"\n', sep = "")
cat('use_alt_exp <- "', use_alt_exp, '"\n', sep = "")
cat('cellQCAlgos <- "', toString(cellQCAlgos), '"\n', sep = "")
cat('export_formats <- "', toString(export_formats), '"\n', sep = "")
cat('mt_list <- "', mt_list, '"\n', sep = "")
cat('\n')

cat('Sessioninfo:\n')
Expand Down Expand Up @@ -86,13 +85,20 @@ collectionName <- NULL
if (tolower(data_type) == "cell") {
message(paste0(date(), " .. Running CellQC"))
# Import mitochondrial gene collection
if (isTRUE(detect_mito)) {
# Import mitoset
mitoset <- strsplit(mito_set, "-")
subset_name <- stringr::str_to_title(mitoset[[1]])
subset_name <- paste(c(subset_name, "Mito"), collapse = "")
collectionName <- subset_name
sce <- importMitoGeneSet(sce, reference = mitoset[[1]][1], id = mitoset[[1]][2], by = "rownames", collectionName = collectionName)
geneSetList <- read.table(mt_list, quote = "\"", comment.char = "")
if (!isTRUE(geneSetList$V1 == "NONE")) {
geneSetList <- list(geneSetList$V1)
names(geneSetList) <- "Mito"
collectionName <- "Mito"
# Read gene set collection from list
sce <- importGeneSetsFromList(
sce,
geneSetList,
collectionName = "Mito",
by = "rownames"
)
} else {
message(paste0(date(), "No mitochondrial genes found. Skipping QC!"))
}
# Run QC with mitochondrial gene collection
cellSCE <- runCellQC(sce,
Expand All @@ -119,13 +125,20 @@ if (tolower(data_type) == "droplet") {
cellSCE <- dropletSCE[, ix]
message(paste0(date(), " .. Running CellQC"))
# Detect mitochondrial genes
if (isTRUE(detect_mito)) {
# Import mitoset
mitoset <- strsplit(mito_set, "-")
subset_name <- stringr::str_to_title(mitoset[[1]])
subset_name <- paste(c(subset_name, "Mito"), collapse = "")
collectionName <- subset_name
cellSCE <- importMitoGeneSet(cellSCE, reference = mitoset[[1]][1], id = mitoset[[1]][2], by = "rownames", collectionName = collectionName)
geneSetList <- read.table(mt_list, quote = "\"", comment.char = "")
if (!isTRUE(geneSetList$V1 == "NONE")) {
geneSetList <- list(geneSetList$V1)
names(geneSetList) <- "Mito"
collectionName <- "Mito"
# Read gene set collection from list
cellSCE <- importGeneSetsFromList(
cellSCE,
geneSetList,
collectionName = "Mito",
by = "rownames"
)
} else {
message(paste0(date(), "No mitochondrial genes found. Skipping QC!"))
}
# Run QC with mitochondrial gene collection
cellSCE <- runCellQC(cellSCE,
Expand Down