Commit e335192

Update README
1 parent be85b48 commit e335192

3 files changed

+37
-24
lines changed

CHANGELOG.md

Lines changed: 18 additions & 1 deletion
@@ -1,13 +1,30 @@
 ## Changelog
 
+### 25-091
+This update introduces a new workflow and multiple enhancements based on user feedback:
+
+* Added a new workflow:
+* `subset` workflow: Enables running the pipeline by subsetting objects using predefined cutoffs instead of the automatic QC workflow (steps 2a-3-4-5a). More details are available in the [README](README.md#Different-steps-for-`subset`-mode).
+* Added support for Cell Ranger outputs in addition to STARsolo.
+* Improvements in nextflow pipeline:
+* Updated the Singularity image for better compatibility.
+* Renamed certain output files for clarity.
+* Optimised the RESUME functionality to improve reliability.
+* Introduced smart memory allocation for the `pool_all` and `add_metadata` steps based on input size.
+* Optimised resource allocation for other processes.
+* Enabled the pipeline to work seamlessly with symbolic links in the input.
+* Optimisations in scripts:
+* Removed unused lines, characters, and packages for cleaner code.
+* Fixed hardcoded paths to improve flexibility.
+* Optimised memory usage in the `pool_all` process.
+
 ### 25-064
 * Added two new workflows:
 * `until_integrate` workflow makes it easier to run the steps until integration (1-2-3-4-5)
 * `only_integrate` workflow makes it easier to run the integration step only (6)
 * Improvements and changes in scripts:
 * Folder names in the outputs were renamed.
 
-
 ### 24-143
 * <ins>**New workflow:**</ins> `only_qc`
 * It is now easier to run the pipeline until the pooling step.
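
The 25-091 entry mentions memory allocation for `pool_all` and `add_metadata` that depends on input size, but the diff does not show how it is computed. The sketch below is one plausible way to derive a memory request from input file sizes. The function name, the multiplier, the floor, and the retry scaling are all hypothetical and are not taken from the pipeline.

```python
import math
from typing import Iterable


def estimate_memory_mb(
    input_sizes_bytes: Iterable[int],
    multiplier: float = 4.0,   # hypothetical: assumed RAM needed per MB of input
    floor_mb: int = 4000,      # hypothetical: minimum request for small inputs
    attempt: int = 1,          # retry counter; later attempts request more memory
) -> int:
    """Return a memory request in MB that grows with total input size.

    Illustrative heuristic only, not the pipeline's actual rule.
    """
    total_mb = sum(input_sizes_bytes) / (1024 * 1024)
    base = max(floor_mb, total_mb * multiplier)
    return int(math.ceil(base * attempt))


# Example: two 1000 MiB inputs on the first attempt and on a retry.
sizes = [1000 * 1024 * 1024, 1000 * 1024 * 1024]
first_try = estimate_memory_mb(sizes)
second_try = estimate_memory_mb(sizes, attempt=2)
```

Scaling the request by the attempt number is a common retry pattern, so a job that runs out of memory asks for more on the next try instead of repeating the same request.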

README.md

Lines changed: 19 additions & 23 deletions
@@ -8,23 +8,30 @@ The recommended way to use nextflow is to run it in a screen session. These step
 
 1. Start a screen session: `screen -S nf_run1`
 2. Start a small interactive job for nextflow: `bsub -G cellgeni -n1 -R"span[hosts=1]" -Is -q long -R"select[mem>2000] rusage[mem=2000]" -M2000 bash`
-3. Modify one of RESUME scripts (pre-made Nextflow run scripts)
+3. Modify one of RESUME scripts in examples folder (pre-made Nextflow run scripts)
 4. Run the RESUME scripts you modified: `./RESUME-scautoqc-all`
 5. You can leave your screen session and let it run in the background: `Ctrl+A, D`
 
 ## Files:
 
 * `main.nf` - the Nextflow pipeline that executes scAutoQC pipeline.
 * `nextflow.config` - the configuration script that allows the processes to be submitted to IBM LSF on Sanger's HPC and ensures correct environment is set via singularity container (this is an absolute path). Global default parameters are also set in this file and some contain absolute paths.
-* `RESUME-scautoqc-all` - an example run script that executes the whole pipeline.
-* `RESUME-scautoqc-afterqc` - an example run script that executes the pipeline after run_qc and find_doublets steps.
-* `RESUME-scautoqc-onlyqc` - an example run script that executes the pipeline after until pooling step.
-* `bin/gather_matrices.py` - a Python script that gathers matrices from STARsolo, Velocyto and Cellbender outputs (used in step 1).
-* `bin/qc.py` - a Python script that runs automatic QC workflow (used in step 2).
-* `bin/flag_doublet.py` - a Python script that runs scrublet to find doublets (used in step 3).
-* `bin/pool_all.py` - a Python script that combines all of the output objects after QC step (used in step 4).
-* `bin/add_scrublet_meta.py` - a Python script that adds scrublet scores (and metadata if available) (used in step 5).
-* `bin/integration.py` - a Python script that runs scVI integration (used in step 6).
+* `examples/` - a folder that includes pre-made Nextflow run scripts for each workflow:
+* `RESUME-scautoqc-all`
+* `RESUME-scautoqc-onlyqc`
+* `RESUME-scautoqc-afterqc`
+* `RESUME-scautoqc-untilintegrate`
+* `RESUME-scautoqc-onlyintegrate`
+* `RESUME-scautoqc-subset`
+* `bin/` - a folder that includes Python scripts used in the pipeline:
+* `gather_matrices.py` - gathers matrices from STARsolo, Velocyto and Cellbender outputs (used in step 1).
+* `qc.py` - runs automatic QC workflow (used in step 2).
+* `subset.py` - subsets the input object (used in step 2a).
+* `flag_doublet.py` - runs scrublet to find doublets (used in step 3).
+* `pool_all.py` - combines all of the output objects after QC step (used in step 4).
+* `add_scrublet_meta.py` - adds scrublet scores (and metadata if available) (used in step 5).
+* `add_scrublet_meta_basic.py` - adds scrublet scores but doesn't remove any cells or samples (used in step 5a).
+* `integration.py` - runs scVI integration (used in step 6).
 * `genes_list/` - a folder that includes cell cycle, immunoglobulin and T cell receptor genes.
 * `Dockerfile` - a dockerfile to reproduce the environment used to run the pipeline.

@@ -192,6 +199,8 @@ This step requires three inputs:
 
 `gather_matrices` step combines the matrices from three inputs into one h5ad object with multiple layers for each sample: raw, spliced, unspliced, ambiguous (only raw layer is used if "GeneFull" mode is specified). Main expression matrix, cell and gene metadata are retrieved from Cellbender output. Raw matrix is retrieved from the expression matrix of STARsolo output folder named Gene. Spliced, unspliced and ambiguous matrices are all retrieved from the expression matrices of STARsolo output folder named Velocyto.
 
+This step can also use Cell Ranger inputs if `cr_prefix` option is provided instead of `ss_prefix` option, however this won't include Velocyto outputs.
+
 This step produces:
 * ***[output 1]:*** h5ad object with different layers
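
The hunk above describes which layers `gather_matrices` writes depending on the input type. The sketch below restates that rule as a small function. It is a reading of the README text, not code from `bin/gather_matrices.py`. The function name and the `gene_full` flag are hypothetical, and returning only the raw layer for Cell Ranger input is an assumption inferred from the note that Velocyto outputs are not included.

```python
from typing import List, Optional


def expected_layers(
    ss_prefix: Optional[str] = None,
    cr_prefix: Optional[str] = None,
    gene_full: bool = False,
) -> List[str]:
    """List the layers expected in the per-sample h5ad object.

    Exactly one of ss_prefix (STARsolo) or cr_prefix (Cell Ranger) must be set.
    """
    if (ss_prefix is None) == (cr_prefix is None):
        raise ValueError("provide exactly one of ss_prefix or cr_prefix")
    # Assumption: without Velocyto outputs only the raw layer is available.
    if cr_prefix is not None or gene_full:
        return ["raw"]
    return ["raw", "spliced", "unspliced", "ambiguous"]


starsolo_layers = expected_layers(ss_prefix="/path/to/starsolo")
cellranger_layers = expected_layers(cr_prefix="/path/to/cellranger")
```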

@@ -282,19 +291,6 @@ This step requires the h5ad output from the `pool_all` step and the scrublet CSV
 
 The `add_metadata_basic` step is also exclusive to the `subset` mode and replaces the `add_metadata` step from the main pipeline. The key differences are that it does not perform QC scoring per sample and does not remove any cells or samples.
 
-
-## Future plans
-
-### Add run_cellbender process
-
-* Current version of pipeline assumes that the Cellbender outputs exist.
-* This addition will allow the pipeline to run Cellbender if the inputs do not exist.
-
-### Smart memory allocation
-
-* This addition will estimate the average memory needed for pool_all step, so it won't need to try multiple times until it runs well.
-
-
 ## Original workflow scheme
 
 ![](images/scautoqc-original-diagram.png)

images/workflow_modes.png

-7.33 KB
