README.md (+19 -16: 19 additions & 16 deletions)
@@ -30,6 +30,7 @@ nextflow run main.nf \
30
30
--metadata /path/to/metadata/file \
31
31
--ss_prefix /path/to/starsolo-results \
32
32
--cb_prefix /path/to/cellbender-results \
33
+
--ss_out Gene \ # to specify which STARsolo output folder to use (Gene or GeneFull)
33
34
--project_tag test1 \ # to specify the run to add to the end of output folder (e.g. scautoqc-results-test1)
34
35
--batch_key sampleID \ # batch key to use in scVI integration
35
36
--ansi-log false \
@@ -63,25 +64,28 @@ output 9: final h5ad object
63
64
64
65
### 1. `gather_matrices`
65
66
66
-
This step requires three inputs:
67
-
* STARsolo output folder named "Gene"
68
-
* STARsolo output folder named "Velocyto"
69
-
* Cellbender output in h5 format (if Cellbender output doesn't exist, change mode to cb+normal) (option to change mode will be added in the future)
67
+
The inputs for the first step depend on which STARsolo output folder is specified with `ss_out`:
68
+
* By default, the "Gene" folder is used (assuming the data is single-cell).
69
+
* The "GeneFull" folder is used if the data is single-nucleus.
70
70
71
-
`gather_matrices` step combines the matrices from three inputs into one h5ad object with four layers (raw, spliced, unspliced, ambiguous). Main expression matrix, cell and gene metadata are retrieved from Cellbender output. Raw matrix is retrieved from the expression matrix of STARsolo output folder named Gene. Spliced, unspliced and ambiguous matrices are all retrieved from the expression matrices of STARsolo output folder named Velocyto.
71
+
This step requires three inputs:
72
+
* STARsolo output folder named "Gene" (or "GeneFull")
73
+
* STARsolo output folder named "Velocyto" (ignored if "GeneFull")
74
+
* Cellbender output in h5 format (if Cellbender output doesn't exist, change mode to cb+normal) (option to change mode will be added in the future)
75
+
76
+
The `gather_matrices` step combines the matrices from the three inputs into one h5ad object with multiple layers: raw, spliced, unspliced and ambiguous (only the raw layer is considered when the "GeneFull" option is specified). The main expression matrix and the cell and gene metadata are retrieved from the Cellbender output. The raw matrix is retrieved from the expression matrix of the STARsolo output folder named "Gene". The spliced, unspliced and ambiguous matrices are all retrieved from the expression matrices of the STARsolo output folder named "Velocyto".
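As a rough illustration of how the `ss_out` choice affects this step, the sketch below maps the option to the STARsolo folders read and the layers built. The function name `plan_gather` and the returned dictionary are hypothetical and are not taken from the nf-scautoqc scripts; only the folder and layer names come from the text above.

```python
# Hypothetical helper for illustration only; not part of nf-scautoqc.
def plan_gather(ss_out="Gene"):
    """Return the STARsolo folders to read and the layers to build."""
    if ss_out not in ("Gene", "GeneFull"):
        raise ValueError(f"ss_out must be 'Gene' or 'GeneFull', got {ss_out!r}")
    if ss_out == "Gene":
        # Single-cell default: the Velocyto folder supplies the splicing layers.
        return {
            "folders": ["Gene", "Velocyto"],
            "layers": ["raw", "spliced", "unspliced", "ambiguous"],
        }
    # Single-nucleus: the Velocyto folder is ignored, so only the raw layer is built.
    return {"folders": ["GeneFull"], "layers": ["raw"]}


print(plan_gather("GeneFull"))
```

In both cases the Cellbender h5 output would still be needed for the main expression matrix and the metadata.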
72
77
73
78
### 2. `run_qc`
74
79
75
80
This step requires the output of `gather_matrices` step which is the h5ad object with four layers.
76
81
77
-
`run_qc` step uses main automatic QC workflow which is summarised [here](https://teichlab.github.io/sctk/notebooks/automatic_qc.html). It applies the QC based on QC metrics, and run CellTypist based on four models which are specified below and defined as default in this pipeline:
82
+
The `run_qc` step uses the main automatic QC workflow, which is summarised [here](https://teichlab.github.io/sctk/notebooks/automatic_qc.html). It applies QC based on 8 QC metrics (log1p_n_counts, log1p_n_genes, percent_mito, percent_ribo, percent_hb, percent_top50, percent_soup, percent_spliced; the last one is ignored if the "GeneFull" option is specified), and runs CellTypist with the models listed below, which are the defaults in this pipeline:
78
83
* **cecilia22_predH:** CellTypist model from the immune populations combined from 20 tissues of 18 studies, includes 32 cell types (ref: [Domínguez-Conde et al, 2022](https://doi.org/10.1126/science.abl5197))
79
84
* **cecilia22_predL:** CellTypist model from the immune sub-populations combined from 20 tissues of 18 studies, includes 98 cell types (ref: [Domínguez-Conde et al, 2022](https://doi.org/10.1126/science.abl5197))
80
85
* **elmentaite21_pred:** CellTypist model from the intestinal cells from fetal, pediatric (healthy and Crohn's disease) and adult human gut, includes 134 cell types (ref: [Elmentaite et al, 2021](https://doi.org/10.1038/s41586-021-03852-1))
81
86
* **suo22_pred:** CellTypist model from the stromal and immune populations from the human fetus, includes 138 cell types (ref: [Suo et al, 2022](https://doi.org/10.1126/science.abo0510))
82
87
* **megagut_pred:** CellTypist model from all the cells in the Pan-GI study, includes 89 cell types (ref: [Oliver et al, 2024 (in press)])
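The effect of `ss_out` on the QC metrics used by `run_qc` can be sketched as follows. The metric names are the ones listed in the step description; the constant `QC_METRICS` and the function `metrics_for` are hypothetical names for illustration, not identifiers from the pipeline.

```python
# Metric names come from the README text; the helper itself is an assumption.
QC_METRICS = [
    "log1p_n_counts",
    "log1p_n_genes",
    "percent_mito",
    "percent_ribo",
    "percent_hb",
    "percent_top50",
    "percent_soup",
    "percent_spliced",
]


def metrics_for(ss_out="Gene"):
    """Return the QC metrics to use for the chosen STARsolo output folder."""
    if ss_out == "GeneFull":
        # No spliced/unspliced layers exist for GeneFull, so percent_spliced is skipped.
        return [m for m in QC_METRICS if m != "percent_spliced"]
    return list(QC_METRICS)


print(len(metrics_for("Gene")), len(metrics_for("GeneFull")))
```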
83
88
84
-
85
89
### 3. `find_doublets`
86
90
87
91
This step requires the output of `run_qc` step which is the h5ad object with postqc columns.
@@ -103,7 +107,7 @@ This step requires the h5ad output from `pool_all` and the scrublet csv outputs
103
107
### 6. `integrate`
104
108
105
109
This step requires the h5ad object from `add_metadata` step.
106
-
`integrate` step removes stringent doublets (doublet score higher than 0.3, and bh score lower than 0.05) applies scVI integration to all samples by using "sampleID" as a batch key, and "log1p_n_counts" and "percent_mito" columns as categorical covariates. The final integrated object is given as the output of all of this pipeline. The steps below are applied before running integration:
110
+
The `integrate` step removes stringent doublets (doublet score higher than 0.3, and bh score lower than 0.05), then applies scVI integration to all samples using "sampleID" as the batch key (by default), and the "log1p_n_counts" and "percent_mito" columns as categorical covariates. The final integrated object is the output of the whole pipeline. The steps below are applied before running integration:
107
111
* Stringent doublets are removed.
108
112
* 7500 highly variable genes are chosen.
109
113
* All cell cycle genes are removed.
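The stringent doublet rule described for `integrate` (doublet score higher than 0.3 and bh score lower than 0.05) can be sketched in plain Python. The field names `doublet_score` and `bh_score` and the example cells are made up for illustration; the actual column names in the pipeline's h5ad object may differ.

```python
# Thresholds are from the README text; field names and data are hypothetical.
def is_stringent_doublet(doublet_score, bh_score, score_cutoff=0.3, bh_cutoff=0.05):
    """A cell is a stringent doublet only if both conditions hold."""
    return doublet_score > score_cutoff and bh_score < bh_cutoff


cells = [
    {"id": "cell_a", "doublet_score": 0.45, "bh_score": 0.01},  # removed
    {"id": "cell_b", "doublet_score": 0.45, "bh_score": 0.20},  # kept: bh too high
    {"id": "cell_c", "doublet_score": 0.10, "bh_score": 0.01},  # kept: score too low
]

kept = [
    c["id"]
    for c in cells
    if not is_stringent_doublet(c["doublet_score"], c["bh_score"])
]
print(kept)
```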
@@ -112,13 +116,6 @@ This step requires the h5ad object from `add_metadata` step.
112
116
113
117
## Future plans
114
118
115
-
### Add support for multiome and single-nucleus samples
116
-
117
-
* nf-scautoqc currently uses "Gene" output folder from STARsolo.
118
-
* For the analysis of multiome and single-nucleus samples, GeneFull output folder from STARsolo is preferred. This matrix includes reads from introns.
119
-
* This pipeline currently doesn't support GeneFull matrices.
120
-
* In the future, the user will be able to specify STARsolo matrix (Gene or GeneFull).
121
-
122
119
### Add run_cellbender process
123
120
124
121
* Current version of pipeline assumes that the CellBender outputs exist.
@@ -141,8 +138,14 @@ This step requires the h5ad object from `add_metadata` step.
141
138
142
139
## Changelog
143
140
141
+
### v0.4.0
142
+
* Added support for single-nuc samples
143
+
* Improvements in scripts
144
+
* integration.py now removes the columns which were created in previous steps
145
+
* RESUME scripts have been reorganised
146
+
144
147
### v0.3.0
145
-
* New workflow: after_qc
148
+
* <ins>**New workflow:**</ins> after_qc
146
149
* It is now easier to work with samples which have been processed with the scAutoQC pipeline before.
147
150
* Improvements in nextflow pipeline and python scripts