Commit 6a45e72

Update README
Add info on new ss_out parameter, more details on run_qc step, update changelog and small fixes
1 parent 25cec18 commit 6a45e72

1 file changed

+19
-16
lines changed

README.md

Lines changed: 19 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,7 @@ nextflow run main.nf \
3030
--metadata /path/to/metadata/file \
3131
--ss_prefix /path/to/starsolo-results \
3232
--cb_prefix /path/to/cellbender-results \
33+
--ss_out Gene \ # to specify which STARsolo output folder to use (Gene or GeneFull)
3334
--project_tag test1 \ # to specify the run to add to the end of output folder (e.g. scautoqc-results-test1)
3435
--batch_key sampleID \ # batch key to use in scVI integration
3536
--ansi-log false \
@@ -63,25 +64,28 @@ output 9: final h5ad object
6364

6465
### 1. `gather_matrices`
6566

66-
This step requires three inputs:
67-
* STARsolo output folder named "Gene"
68-
* STARsolo output folder named "Velocyto"
69-
* Cellbender output in h5 format (if Cellbender output doesn't exist, change mode to cb+normal) (option to change mode will be added in the future)
67+
The inputs for the first step depend on which STARsolo output folder is specified with `ss_out`:
68+
* By default, the "Gene" folder is used (assuming the data is single-cell).
69+
* The "GeneFull" folder is used if the data is single-nucleus.
7070

71-
`gather_matrices` step combines the matrices from three inputs into one h5ad object with four layers (raw, spliced, unspliced, ambiguous). Main expression matrix, cell and gene metadata are retrieved from Cellbender output. Raw matrix is retrieved from the expression matrix of STARsolo output folder named Gene. Spliced, unspliced and ambiguous matrices are all retrieved from the expression matrices of STARsolo output folder named Velocyto.
71+
This step requires three inputs:
72+
* STARsolo output folder named "Gene" (or "GeneFull")
73+
* STARsolo output folder named "Velocyto" (ignored if "GeneFull")
74+
* Cellbender output in h5 format (if Cellbender output doesn't exist, change mode to cb+normal) (option to change mode will be added in the future)
75+
76+
The `gather_matrices` step combines the matrices from the three inputs into one h5ad object with multiple layers: raw, spliced, unspliced and ambiguous (only the raw layer is used if the "GeneFull" option is specified). The main expression matrix and the cell and gene metadata are retrieved from the Cellbender output. The raw matrix is retrieved from the expression matrix of the STARsolo output folder named "Gene" (or "GeneFull"). The spliced, unspliced and ambiguous matrices are all retrieved from the expression matrices of the STARsolo output folder named "Velocyto".
7277

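As a rough illustration of the layer logic described above, the sketch below maps an `ss_out` value to the layers expected in the gathered h5ad object. The helper name `expected_layers` is hypothetical and is not taken from the pipeline's scripts; only the layer names and the `Gene`/`GeneFull` values come from this README.

```python
def expected_layers(ss_out: str = "Gene") -> list:
    """Return the layer names expected after gather_matrices.

    Hypothetical helper for illustration only; not part of nf-scautoqc.
    """
    if ss_out == "Gene":
        # Single-cell data: raw counts plus the three Velocyto matrices.
        return ["raw", "spliced", "unspliced", "ambiguous"]
    if ss_out == "GeneFull":
        # Single-nucleus data: the Velocyto folder is ignored.
        return ["raw"]
    raise ValueError(f"ss_out must be 'Gene' or 'GeneFull', got {ss_out!r}")


layers_sc = expected_layers()             # default, single-cell
layers_sn = expected_layers("GeneFull")   # single-nucleus
```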
7378
### 2. `run_qc`
7479

7580
This step requires the output of `gather_matrices` step which is the h5ad object with four layers.
7681

77-
`run_qc` step uses main automatic QC workflow which is summarised [here](https://teichlab.github.io/sctk/notebooks/automatic_qc.html). It applies the QC based on QC metrics, and run CellTypist based on four models which are specified below and defined as default in this pipeline:
82+
The `run_qc` step uses the main automatic QC workflow, which is summarised [here](https://teichlab.github.io/sctk/notebooks/automatic_qc.html). It applies QC based on 8 QC metrics (log1p_n_counts, log1p_n_genes, percent_mito, percent_ribo, percent_hb, percent_top50, percent_soup, percent_spliced; the last one is ignored if the "GeneFull" option is specified), and runs CellTypist with the five models listed below, which are the defaults in this pipeline:
7883
* **cecilia22_predH:** CellTypist model from the immune populations combined from 20 tissues of 18 studies, includes 32 cell types (ref: [Domínguez-Conde et al, 2022](https://doi.org/10.1126/science.abl5197))
7984
* **cecilia22_predL:** CellTypist model from the immune sub-populations combined from 20 tissues of 18 studies, includes 98 cell types (ref: [Domínguez-Conde et al, 2022](https://doi.org/10.1126/science.abl5197))
8085
* **elmentaite21_pred:** CellTypist model from the intestinal cells from fetal, pediatric (healthy and Crohn's disease) and adult human gut, includes 134 cell types (ref: [Elmentaite et al, 2021](https://doi.org/10.1038/s41586-021-03852-1))
8186
* **suo22_pred:** CellTypist model from the stromal and immune populations from the human fetus, includes 138 cell types (ref: [Suo et al, 2022](https://doi.org/10.1126/science.abo0510))
8287
* **megagut_pred:** CellTypist model from all cells in the Pan-GI study, includes 89 cell types (ref: [Oliver et al, 2024 (in press)])
8388

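The metric selection described for `run_qc` can be sketched as follows. This is a minimal illustration under the assumption that the metric list is simply filtered by the `ss_out` value; the names `QC_METRICS` and `qc_metrics_for` are hypothetical and do not come from the pipeline's code.

```python
# The 8 QC metrics named in the README, in the order they are listed.
QC_METRICS = [
    "log1p_n_counts",
    "log1p_n_genes",
    "percent_mito",
    "percent_ribo",
    "percent_hb",
    "percent_top50",
    "percent_soup",
    "percent_spliced",
]


def qc_metrics_for(ss_out: str = "Gene") -> list:
    """Return the QC metrics to use (hypothetical helper, illustration only)."""
    if ss_out == "GeneFull":
        # No spliced/unspliced layers are available, so drop percent_spliced.
        return [m for m in QC_METRICS if m != "percent_spliced"]
    return list(QC_METRICS)


metrics_sc = qc_metrics_for("Gene")
metrics_sn = qc_metrics_for("GeneFull")
```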
84-
8589
### 3. `find_doublets`
8690

8791
This step requires the output of `run_qc` step which is the h5ad object with postqc columns.
@@ -103,7 +107,7 @@ This step requires the h5ad output from `pool_all` and the scrublet csv outputs
103107
### 6. `integrate`
104108

105109
This step requires the h5ad object from `add_metadata` step.
106-
`integrate` step removes stringent doublets (doublet score higher than 0.3, and bh score lower than 0.05) applies scVI integration to all samples by using "sampleID" as a batch key, and "log1p_n_counts" and "percent_mito" columns as categorical covariates. The final integrated object is given as the output of all of this pipeline. The steps below are applied before running integration:
110+
The `integrate` step removes stringent doublets (doublet score higher than 0.3 and bh score lower than 0.05), then applies scVI integration to all samples, using "sampleID" as the batch key (by default) and the "log1p_n_counts" and "percent_mito" columns as categorical covariates. The final integrated object is the output of the whole pipeline. The steps below are applied before running integration:
107111
* Stringent doublets are removed.
108112
* 7500 highly variable genes are chosen.
109113
* All cell cycle genes are removed.
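The stringent doublet rule stated above (doublet score higher than 0.3 and bh score lower than 0.05) can be sketched in plain Python. The function name, the dictionary keys and the toy values are assumptions made for illustration; the pipeline itself works on h5ad objects and may name these columns differently.

```python
def is_stringent_doublet(doublet_score, bh_score,
                         score_cutoff=0.3, bh_cutoff=0.05):
    """True when both thresholds from the README are met (hypothetical helper)."""
    return doublet_score > score_cutoff and bh_score < bh_cutoff


# Toy per-cell values, invented for this example only.
cells = [
    {"cell": "c1", "doublet_score": 0.45, "bh_score": 0.01},  # both criteria met
    {"cell": "c2", "doublet_score": 0.45, "bh_score": 0.20},  # bh score too high
    {"cell": "c3", "doublet_score": 0.10, "bh_score": 0.01},  # doublet score too low
]

# Keep every cell that is not flagged as a stringent doublet.
kept = [
    c["cell"]
    for c in cells
    if not is_stringent_doublet(c["doublet_score"], c["bh_score"])
]
```

Only cells that meet both criteria are dropped, so in the toy data `c1` is removed while `c2` and `c3` are kept.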
@@ -112,13 +116,6 @@ This step requires the h5ad object from `add_metadata` step.
112116

113117
## Future plans
114118

115-
### Add support for multiome and single-nucleus samples
116-
117-
* nf-scautoqc currently uses "Gene" output folder from STARsolo.
118-
* For the analysis of multiome and single-nucleus samples, GeneFull output folder from STARsolo is preferred. This matrix includes reads from introns.
119-
* This pipeline currently doesn't support GeneFull matrices.
120-
* In the future, the user will be able to specify STARsolo matrix (Gene or GeneFull).
121-
122119
### Add run_cellbender process
123120

124121
* Current version of pipeline assumes that the CellBender outputs exist.
@@ -141,8 +138,14 @@ This step requires the h5ad object from `add_metadata` step.
141138

142139
## Changelog
143140

141+
### v0.4.0
142+
* Added support for single-nuc samples
143+
* Improvements in scripts
144+
* integration.py now removes the columns which were created in previous steps
145+
* RESUME scripts have been reorganised
146+
144147
### v0.3.0
145-
* New workflow: after_qc
148+
* <ins>**New workflow:**</ins> after_qc
146149
* It is now easier to work with samples which have been processed with the scAutoQC pipeline before.
147150
* Improvements in nextflow pipeline and python scripts
148151
* Created new RESUME script for afterqc workflow.
