Commit 6483a46

Merge branch 'main' into deprecate_download_file
2 parents b8cbd8c + 640f52b commit 6483a46

90 files changed

Lines changed: 769 additions & 802 deletions

.lintr

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
exclusions: list(
2+
"README.qmd"
3+
)

CHANGELOG.md

Lines changed: 19 additions & 1 deletion
@@ -2,7 +2,13 @@
22

33
## BREAKING CHANGES
44

5-
* `download_file` has been deprecated and will be removed in a future release (PR #1015).
5+
* Removed `split_h5mu_train_test` component (PR #1020).
6+
7+
* `compress_h5mu`: rename `compression` argument to `output_compression` (PR #1017, PR #1018).
8+
9+
* `delimit_fraction`: remove unused `layer` argument (PR #1018).
10+
11+
* `download_file` has been deprecated and will be removed in openpipeline 3.0 (PR #1015).
612

713
## MAJOR CHANGES
814

@@ -12,6 +18,18 @@
1218

1319
* Remove `workflows` directory (PR #993). The workflows which were at one point in this directory were all deprecated and moved to `src/workflows`.
1420

21+
* Move output file compression argument for AnnData and MuData files to a base config file (`src/base/h5_compression_argument.yaml`) (PR #1017).
22+
23+
* Add missing descriptions to components and arguments (PR #1018).
24+
25+
## BUG FIXES
26+
27+
* Bump viash to 0.9.4. This adds support for nextflow versions starting from major version 25.01 and fixes an issue where an integer being passed to an argument with `type: double` resulted in an error (PR #1016).
28+
29+
* Fix running `neigbors_leiden_umap` workflow with `-stub` enabled (PR #1026).
30+
31+
* Add missing CUDA enabled `jaxlib` to components that use `scvi-tools` (`scanvi`, `scarches`, `scvi` and `totalvi`) (PR #1028)
32+
1533
# openpipelines 2.1.0
1634

1735
## BREAKING CHANGES

README.md

Lines changed: 253 additions & 8 deletions
@@ -1,13 +1,258 @@
1-
OpenPipeline
2-
================
31

4-
<!-- README.md is generated by running 'quarto render README.qmd' -->
2+
3+
# OpenPipeline
54

65
Extensible single cell analysis pipelines for reproducible and
76
large-scale single cell processing using Viash and Nextflow.
87

9-
The provided pipelines are built using the [Viash
10-
framework](http://www.viash.io) on top of the nextflow workflow system.
11-
For more information on Nextflow please visit the [Nextflow github
12-
page](https://github.com/nextflow-io/nextflow) and the [Nextflow read
13-
the docs page](https://www.nextflow.io/docs/latest/index.html).
8+
[![ViashHub](https://img.shields.io/badge/ViashHub-openpipeline-7a4baa.svg)](https://www.viash-hub.com/packages/openpipeline)
9+
[![GitHub](https://img.shields.io/badge/GitHub-viash--hub%2Fopenpipeline-blue.svg)](https://github.com/openpipelines-bio/openpipeline)
10+
[![GitHub
11+
License](https://img.shields.io/github/license/openpipelines-bio/openpipeline.svg)](https://github.com/openpipelines-bio/openpipeline/blob/main/LICENSE)
12+
[![GitHub
13+
Issues](https://img.shields.io/github/issues/openpipelines-bio/openpipeline.svg)](https://github.com/openpipelines-bio/openpipeline/issues)
14+
[![Viash
15+
version](https://img.shields.io/badge/Viash-v0.9.3-blue.svg)](https://viash.io)
16+
17+
## Documentation
18+
19+
Please find more in-depth documentation on [the
20+
website](https://openpipelines.bio/).
21+
22+
## Functionality Overview
23+
24+
Openpipelines execute a list of predefined tasks. These discrete steps
25+
are also provided as standalone components that can be executed
26+
individually, with a standardized interface. This is especially useful
27+
when a particular step wraps a tool that you do not necessarily always
28+
need to execute in a workflow context.
29+
30+
In terms of workflows, the following functionality is provided:
31+
32+
- Demultiplexing: conversion of raw sequencing data to FASTQ objects.
33+
- [Ingestion](https://openpipelines.bio/fundamentals/architecture.html#sec-ingestion):
34+
Read mapping and generating a count matrix.
35+
- [Single sample
36+
processing](https://openpipelines.bio/fundamentals/architecture.html#sec-single-sample):
37+
cell filtering and doublet detection.
38+
- [Multisample
39+
processing](https://openpipelines.bio/fundamentals/architecture.html#sec-multisample-processing):
40+
Count transformation, normalization, QC metric calculations.
41+
- [Integration](https://openpipelines.bio/fundamentals/architecture.html#sec-intergration):
42+
Clustering, integration and batch correction using single and
43+
multimodal methods.
44+
- Downstream analysis workflows
45+
46+
``` mermaid lang="mermaid"
47+
flowchart LR
48+
demultiplexing["Step 1: Demultiplexing"]
49+
ingestion["Step 2: Ingestion"]
50+
process_samples["Step 3: Process Samples"]
51+
integration["Step 4: Integration"]
52+
downstream["Step 5: Downstream"]
53+
demultiplexing-->ingestion-->process_samples-->integration-->downstream
54+
```
55+
56+
## Guided execution using Viash Hub (CLI and Seqera cloud)
57+
58+
Openpipelines is now available on [Viash
59+
Hub](https://www.viash-hub.com/packages/openpipeline/latest). Viash Hub
60+
provides a list of components and workflows, together with a graphical
61+
interface that guides you through the steps of running a workflow or
62+
standalone component. Instructions are provided for using a local viash
63+
or nextflow executable (requires using a linux based OS), but connecting
64+
to a Seqera cloud instance is also supported.
65+
66+
## Execution using the nextflow executable
67+
68+
Executing a workflow is a bit more involved and requires familiarity
69+
with the command line interface (CLI).
70+
71+
### Setup
72+
73+
In order to use the workflows in this package on your local computer,
74+
you’ll need to do the following:
75+
76+
- Install [nextflow](https://www.nextflow.io/docs/latest/install.html)
77+
- Install a nextflow compatible executor. This workflow provides a
78+
profile for [docker](https://docs.docker.com/get-started/).
79+
80+
### Location of the workflow scripts
81+
82+
Nextflow workflow scripts, schemas and configuration files can be found
83+
in the `target/nextflow` folder. On the `main` branch however, only the
84+
source code that needs to be built into the functioning workflows and
85+
components can be found. Instead, please refer to the `main_build`
86+
branch or any of the tags to find the `target` folders. Components and
87+
workflows are organized into namespaces, which can be nested. Workflows
88+
are located at `target/nextflow/workflows`, while components that
89+
execute individual workflow steps are
90+
91+
A reference of workflows and modules is also provided in the
92+
[documentation](https://openpipelines.bio/components/).
93+
94+
### Retrieving a list of workflow parameters
95+
96+
A list of workflow arguments can be consulted in multiple ways:
97+
98+
- On [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest)
99+
- In the [reference
100+
documentation](https://openpipelines.bio/components/)
101+
- The config YAML file lists the arguments for each workflow and
102+
component
103+
- In the `target/nextflow` folder, a nextflow schema JSON file
104+
(`nextflow_schema.json`) is provided next to each workflow `.nf` file.
105+
- Using nextflow on the CLI:
106+
107+
``` bash
108+
nextflow run openpipelines-bio/openpipeline \
109+
-r 2.1.1 \
110+
-main-script target/nextflow/workflows/ingestion/demux/main.nf \
111+
--help
112+
```
113+
114+
### Resource usage tuning
115+
116+
Nextflow’s labels can be used to specify the amount of resources a
117+
process can use. This workflow uses the following labels for CPU, memory
118+
and disk:
119+
120+
- `verylowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem`
121+
- `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu`
122+
- `lowdisk`, `middisk`, `highdisk`, `veryhighdisk`
123+
124+
The defaults for these labels can be found at
125+
`src/workflows/utils/labels.config`. Nextflow checks that the specified
126+
resources for a process do not exceed what is available on the machine
127+
and will not start if they do. Create your own config file to tune the
128+
labels to your needs, for example:
129+
130+
// Resource labels
131+
withLabel: verylowcpu { cpus = 2 }
132+
withLabel: lowcpu { cpus = 8 }
133+
withLabel: midcpu { cpus = 16 }
134+
withLabel: highcpu { cpus = 16 }
135+
136+
withLabel: verylowmem { memory = 4.GB }
137+
withLabel: lowmem { memory = 8.GB }
138+
withLabel: midmem { memory = 16.GB }
139+
withLabel: highmem { memory = 32.GB }
140+
141+
When starting nextflow using the CLI, you can use `-c` to provide the
142+
file to nextflow and overwrite the defaults.
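
The `withLabel` lines above are process selectors, which Nextflow expects inside a `process { ... }` scope in the config file. As a minimal sketch, the Python snippet below renders such a file from two dictionaries. The CPU and memory values and the function name `render_labels_config` are illustrative assumptions, not defaults taken from `src/workflows/utils/labels.config`.

```python
# Hypothetical resource values; tune these to what your machine offers.
CPU_LABELS = {"verylowcpu": 2, "lowcpu": 4, "midcpu": 8, "highcpu": 16}
MEM_LABELS_GB = {"verylowmem": 4, "lowmem": 8, "midmem": 16, "highmem": 32}


def render_labels_config(cpus: dict, memory_gb: dict) -> str:
    """Render withLabel selectors wrapped in a Nextflow `process` scope."""
    lines = ["process {"]
    for label, count in cpus.items():
        lines.append(f"  withLabel: {label} {{ cpus = {count} }}")
    for label, gigabytes in memory_gb.items():
        lines.append(f"  withLabel: {label} {{ memory = {gigabytes}.GB }}")
    lines.append("}")
    return "\n".join(lines) + "\n"


config_text = render_labels_config(CPU_LABELS, MEM_LABELS_GB)
print(config_text)
```

Saving the printed text to a file and passing that file with `-c` would then override the label defaults for the run.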
143+
144+
### Demultiplexing example
145+
146+
Here, generating FASTQ files from raw sequencing data is demonstrated,
147+
based on data generated using 10x Genomics protocols. However, BD
148+
genomics data is also supported by Openpipeline. If you wish to try it
149+
out yourself, test data is available at
150+
`s3://openpipelines-data/cellranger_tiny_bcl/bcl`.
151+
152+
``` bash
153+
nextflow run openpipelines-bio/openpipeline \
154+
-r 2.1.1 \
155+
-main-script target/nextflow/workflows/ingestion/demux/main.nf \
156+
-c "<path to resource config file>" \
157+
-profile docker \
158+
--publish_dir "<path to output directory>" \
159+
--id "cellranger_tiny_bcl" \
160+
--input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \
161+
--sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \
162+
--demultiplexer "mkfastq"
163+
```
164+
165+
### Mapping and read counting
166+
167+
FASTQ files can be mapped to a reference genome and the resulting mapped
168+
reads can be counted in order to generate a count matrix. Both
169+
`BD Rhapsody` and `Cell Ranger` are supported. Here, we demonstrate
170+
using Cell Ranger multi on test data available at
171+
`s3://openpipelines-data/10x_5k_anticmv`.
172+
173+
In order to facilitate passing multiple argument values, the parameters
174+
can be specified using a YAML file.
175+
176+
``` yaml
177+
input:
178+
- "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz"
179+
- "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz"
180+
- "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz"
181+
gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz"
182+
vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz"
183+
feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv"
184+
library_id:
185+
- "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset"
186+
- "5k_human_antiCMV_T_TBNK_connect_AB_subset"
187+
- "5k_human_antiCMV_T_TBNK_connect_VDJ_subset"
188+
library_type:
189+
- "Gene Expression"
190+
- "Antibody Capture"
191+
- "VDJ"
192+
```
193+
194+
You can pass this file to nextflow using `-params-file`:
195+
196+
``` bash
197+
nextflow run openpipelines-bio/openpipeline \
198+
-r 2.1.1 \
199+
-main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \
200+
-c "<path to resource config file>" \
201+
-profile docker \
202+
-params-file "<path to your parameter YAML file>" \
203+
--publish_dir "<path to output directory>"
204+
```
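
Nextflow's `-params-file` option accepts JSON as well as YAML, so the parameter file can also be generated from a script. The sketch below, using only the Python standard library, writes the parameters from the YAML example to a JSON file and assembles the matching `nextflow run` argument list without executing it. The helper names, the `output` directory and the `my_labels.config` file name are hypothetical placeholders.

```python
import json
import shlex
import tempfile
from pathlib import Path

RAW = "s3://openpipelines-data/10x_5k_anticmv/raw"
PREFIX = "5k_human_antiCMV_T_TBNK_connect"

# Parameter values copied from the YAML example above.
params = {
    "input": [
        f"{RAW}/{PREFIX}_GEX_*.fastq.gz",
        f"{RAW}/{PREFIX}_AB_*.fastq.gz",
        f"{RAW}/{PREFIX}_VDJ_*.fastq.gz",
    ],
    "gex_reference": "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz",
    "vdj_reference": f"{RAW}/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz",
    "feature_reference": f"{RAW}/feature_reference.csv",
    "library_id": [f"{PREFIX}_GEX_1_subset", f"{PREFIX}_AB_subset", f"{PREFIX}_VDJ_subset"],
    "library_type": ["Gene Expression", "Antibody Capture", "VDJ"],
}


def write_params_file(values: dict, directory: str) -> Path:
    """Write the parameters as JSON, which -params-file also accepts."""
    path = Path(directory) / "params.json"
    path.write_text(json.dumps(values, indent=2))
    return path


def build_command(params_file: Path, publish_dir: str, resource_config: str) -> list:
    """Assemble the nextflow invocation shown above as an argument list."""
    return [
        "nextflow", "run", "openpipelines-bio/openpipeline",
        "-r", "2.1.1",
        "-main-script", "target/nextflow/workflows/ingestion/cellranger_multi/main.nf",
        "-c", resource_config,
        "-profile", "docker",
        "-params-file", str(params_file),
        "--publish_dir", publish_dir,
    ]


params_file = write_params_file(params, tempfile.mkdtemp())
# "output" and "my_labels.config" are placeholder paths for illustration.
command = build_command(params_file, "output", "my_labels.config")
print(shlex.join(command))
```

The printed command can be pasted into a shell, or the list can be handed to `subprocess.run` once nextflow and docker are installed.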
205+
206+
### Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration)
207+
208+
Once you have a MuData object for each of your samples, you can process
209+
it into a multisample file that is ready for integration and other
210+
downstream analyses. This can be done using the `process_samples`
211+
workflow. Here is an example, but please keep in mind that the exact
212+
parameters that need to be provided differ depending on your data. A lot
213+
of functionality for this pipeline can be customized, including the name
214+
of the output slots where data is being stored.
215+
216+
``` yaml
217+
param_list:
218+
- id: "sample_1"
219+
  input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
220+
  rna_min_counts: 2
221+
- id: "sample_2"
222+
  input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu"
223+
  rna_min_counts: 1
224+
rna_max_counts: 1000000
225+
rna_min_genes_per_cell: 1
226+
rna_max_genes_per_cell: 1000000
227+
rna_min_cells_per_gene: 1
228+
rna_min_fraction_mito: 0.0
229+
rna_max_fraction_mito: 1.0
230+
```
231+
232+
In order to provide multiple samples to the pipeline, `param_list` is
233+
used. Using `param_list` it is possible to specify arguments per sample.
234+
However, it is still possible to define arguments for all samples
235+
together by listing those outside the `param_list` block.
236+
237+
``` bash
238+
nextflow run openpipelines-bio/openpipeline \
239+
-r 2.1.1 \
240+
-main-script target/nextflow/workflows/multiomics/process_samples/main.nf \
241+
-c "<path to resource config file>" \
242+
-profile docker \
243+
-params-file "<path to your parameter YAML file>" \
244+
--publish_dir "<path to output directory>"
245+
```
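
To illustrate how per-sample and shared arguments relate, the Python sketch below expands a parameter dictionary into one argument set per sample. It is a conceptual model only, not the actual Viash implementation. It assumes that a value given inside a `param_list` entry takes precedence over a shared value of the same name, and `expand_param_list` is a hypothetical helper name.

```python
def expand_param_list(params: dict) -> list:
    """Combine shared arguments with each param_list entry.

    Assumption: a key set inside a param_list entry overrides a
    shared key of the same name defined outside the block.
    """
    shared = {key: value for key, value in params.items() if key != "param_list"}
    return [{**shared, **entry} for entry in params.get("param_list", [])]


# Trimmed-down version of the parameter file above (input paths omitted).
params = {
    "param_list": [
        {"id": "sample_1", "rna_min_counts": 2},
        {"id": "sample_2", "rna_min_counts": 1},
    ],
    "rna_max_counts": 1000000,
}

for sample in expand_param_list(params):
    print(sample)
```

Each sample keeps its own `rna_min_counts`, while `rna_max_counts` is applied to both.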
246+
247+
## Executing standalone components using the Viash executable
248+
249+
Another option to execute individual modules on the CLI is to use
250+
`viash run`. All you need to do is download viash, clone the
251+
Openpipeline repository and point viash to a config file. However, keep
252+
in mind that using `viash run` for workflows is currently not supported.
253+
Please see `viash run --help` for more information on how to use the
254+
command, but here is an example:
255+
256+
``` bash
257+
viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help
258+
```
