|
1 | | -OpenPipeline |
2 | | -================ |
3 | 1 |
|
4 | | -<!-- README.md is generated by running 'quarto render README.qmd' --> |
| 2 | + |
| 3 | +# OpenPipeline |
5 | 4 |
|
6 | 5 | Extensible single cell analysis pipelines for reproducible and |
7 | 6 | large-scale single cell processing using Viash and Nextflow. |
8 | 7 |
|
9 | | -The provided pipelines are built using the [Viash |
10 | | -framework](http://www.viash.io) on top of the nextflow workflow system. |
11 | | -For more information on Nextflow please visit the [Nextflow github |
12 | | -page](https://github.com/nextflow-io/nextflow) and the [Nextflow read |
13 | | -the docs page](https://www.nextflow.io/docs/latest/index.html). |
| 8 | +[](https://www.viash-hub.com/packages/openpipeline) |
| 9 | +[](https://github.com/openpipelines-bio/openpipeline) |
| 10 | +[](https://github.com/openpipelines-bio/openpipeline/blob/main/LICENSE) |
| 12 | +[](https://github.com/openpipelines-bio/openpipeline/issues) |
| 14 | +[](https://viash.io) |
| 16 | + |
| 17 | +## Documentation |
| 18 | + |
| 19 | +Please find more in-depth documentation on [the |
| 20 | +website](https://openpipelines.bio/). |
| 21 | + |
| 22 | +## Functionality Overview |
| 23 | + |
| 24 | +Openpipelines execute a list of predefined tasks. These discrete steps |
| 25 | +are also provided as standalone components that can be executed |
| 26 | +individually, with a standardized interface. This is especially useful |
| 27 | +when a particular step wraps a tool that you do not necessarily always |
| 28 | +need to execute in a workflow context. |
| 29 | + |
| 30 | +In terms of workflows, the following functionality is provided: |
| 31 | + |
| 32 | +- Demultiplexing: conversion of raw sequencing data to FASTQ objects. |
| 33 | +- [Ingestion](https://openpipelines.bio/fundamentals/architecture.html#sec-ingestion): |
| 34 | + Read mapping and generating a count matrix. |
| 35 | +- [Single sample |
| 36 | + processing](https://openpipelines.bio/fundamentals/architecture.html#sec-single-sample): |
| 37 | + cell filtering and doublet detection. |
| 38 | +- [Multisample |
| 39 | + processing](https://openpipelines.bio/fundamentals/architecture.html#sec-multisample-processing): |
| 40 | + Count transformation, normalization, QC metric calculations. |
| 41 | +- [Integration](https://openpipelines.bio/fundamentals/architecture.html#sec-intergration): |
| 42 | + Clustering, integration and batch correction using single and |
| 43 | + multimodal methods. |
| 44 | +- Downstream analysis workflows |
| 45 | + |
| 46 | +``` mermaid lang="mermaid" |
| 47 | +flowchart LR |
| 48 | + demultiplexing["Step 1: Demultiplexing"] |
| 49 | + ingestion["Step 2: Ingestion"] |
| 50 | + process_samples["Step 3: Process Samples"] |
| 51 | + integration["Step 4: Integration"] |
| 52 | + downstream["Step 5: Downstream"] |
| 53 | + demultiplexing-->ingestion-->process_samples-->integration-->downstream |
| 54 | +``` |
| 55 | + |
| 56 | +## Guided execution using Viash Hub (CLI and Seqera cloud) |
| 57 | + |
| 58 | +Openpipelines is now available on [Viash |
| 59 | +Hub](https://www.viash-hub.com/packages/openpipeline/latest). Viash Hub |
| 60 | +provides a list of components and workflows, together with a graphical |
| 61 | +interface that guides you through the steps of running a workflow or |
| 62 | +standalone component. Instructions are provided for using a local viash |
| 63 | +or nextflow executable (requires using a Linux-based OS), but connecting |
| 64 | +to a Seqera cloud instance is also supported. |
| 65 | + |
| 66 | +## Execution using the nextflow executable |
| 67 | + |
| 68 | +Executing a workflow is a bit more involved and requires familiarity |
| 69 | +with the command line interface (CLI). |
| 70 | + |
| 71 | +### Setup |
| 72 | + |
| 73 | +In order to use the workflows in this package on your local computer, |
| 74 | +you’ll need to do the following: |
| 75 | + |
| 76 | +- Install [nextflow](https://www.nextflow.io/docs/latest/install.html) |
| 77 | +- Install a nextflow compatible executor. This workflow provides a |
| 78 | + profile for [docker](https://docs.docker.com/get-started/). |
| 79 | + |
| 80 | +### Location of the workflow scripts |
| 81 | + |
| 82 | +Nextflow workflow scripts, schemas and configuration files can be found |
| 83 | +in the `target/nextflow` folder. On the `main` branch however, only the |
| 84 | +source code that needs to be built into the functioning workflows and |
| 85 | +components can be found. Instead, please refer to the `main_build` |
| 86 | +branch or any of the tags to find the `target` folders. Components and |
| 87 | +workflows are organized into namespaces, which can be nested. Workflows |
| 88 | +are located at `target/nextflow/workflows`, while components that |
| 89 | +execute individual workflow steps are |
| 90 | + |
| 91 | +A reference of workflows and modules is also provided in the |
| 92 | +[documentation](https://openpipelines.bio/components/). |
| 93 | + |
| 94 | +### Retrieving a list of workflow parameters |
| 95 | + |
| 96 | +A list of workflow arguments can be consulted in multiple ways: |
| 97 | + |
| 98 | +- On [Viash Hub](https://www.viash-hub.com/packages/openpipeline/latest) |
| 99 | +- In the [reference |
| 100 | + documentation](https://openpipelines.bio/components/) |
| 101 | +- The config YAML file lists the arguments for each workflow and |
| 102 | + component |
| 103 | +- In the `target/nextflow` folder, a nextflow schema JSON file |
| 104 | + (`nextflow_schema.json`) is provided next to each workflow `.nf` file. |
| 105 | +- Using nextflow on the CLI: |
| 106 | + |
| 107 | +``` bash |
| 108 | +nextflow run openpipelines-bio/openpipeline \ |
| 109 | + -r 2.1.1 \ |
| 110 | + -main-script target/nextflow/workflows/ingestion/demux/main.nf \ |
| 111 | + --help |
| 112 | +``` |
| 113 | + |
| 114 | +### Resource usage tuning |
| 115 | + |
| 116 | +Nextflow’s labels can be used to specify the amount of resources a |
| 117 | +process can use. This workflow uses the following labels for CPU, memory |
| 118 | +and disk: |
| 119 | + |
| 120 | +- `verylowmem`, `lowmem`, `midmem`, `highmem`, `veryhighmem` |
| 121 | +- `verylowcpu`, `lowcpu`, `midcpu`, `highcpu`, `veryhighcpu` |
| 122 | +- `lowdisk`, `middisk`, `highdisk`, `veryhighdisk` |
| 123 | + |
| 124 | +The defaults for these labels can be found at |
| 125 | +`src/workflows/utils/labels.config`. Nextflow checks that the specified |
| 126 | +resources for a process do not exceed what is available on the machine |
| 127 | +and will not start if they do. Create your own config file to tune the |
| 128 | +labels to your needs, for example: |
| 129 | + |
| 130 | + // Resource labels |
| 131 | + withLabel: verylowcpu { cpus = 2 } |
| 132 | + withLabel: lowcpu { cpus = 8 } |
| 133 | + withLabel: midcpu { cpus = 16 } |
| 134 | + withLabel: highcpu { cpus = 16 } |
| 135 | + |
| 136 | + withLabel: verylowmem { memory = 4.GB } |
| 137 | + withLabel: lowmem { memory = 8.GB } |
| 138 | + withLabel: midmem { memory = 16.GB } |
| 139 | + withLabel: highmem { memory = 32.GB } |
| 140 | + |
| 141 | +When starting nextflow using the CLI, you can use `-c` to provide the |
| 142 | +file to nextflow and overwrite the defaults. |
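
A minimal sketch of a complete override file, assuming the `withLabel` selectors are wrapped in a `process` scope (where Nextflow expects process selectors). The file name `custom_labels.config` is hypothetical and the resource values are illustrative, not recommendations:

```bash
# Write a hypothetical override file; name and values are assumptions.
cat > custom_labels.config <<'EOF'
process {
  // Override only the labels you need; other labels keep their defaults.
  withLabel: lowcpu { cpus = 4 }
  withLabel: midcpu { cpus = 8 }
  withLabel: lowmem { memory = 8.GB }
  withLabel: midmem { memory = 16.GB }
}
EOF

# Inspect the file before handing it to nextflow.
cat custom_labels.config
```

The resulting file can then be supplied as `-c custom_labels.config` in place of the `-c "<path to resource config file>"` placeholder used in the `nextflow run` commands in this README.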
| 143 | + |
| 144 | +### Demultiplexing example |
| 145 | + |
| 146 | +Here, generating FASTQ files from raw sequencing data is demonstrated, |
| 147 | +based on data generated using 10x Genomics protocols. However, BD |
| 148 | +genomics data is also supported by Openpipeline. If you wish to try it |
| 149 | +out yourself, test data is available at |
| 150 | +`s3://openpipelines-data/cellranger_tiny_bcl/bcl`. |
| 151 | + |
| 152 | +``` bash |
| 153 | +nextflow run openpipelines-bio/openpipeline \ |
| 154 | + -r 2.1.1 \ |
| 155 | + -main-script target/nextflow/workflows/ingestion/demux/main.nf \ |
| 156 | + -c "<path to resource config file>" \ |
| 157 | + -profile docker \ |
| 158 | + --publish_dir "<path to output directory>" \ |
| 159 | + --id "cellranger_tiny_bcl" \ |
| 160 | + --input "s3://openpipelines-data/cellranger_tiny_bcl/bcl" \ |
| 161 | + --sample_sheet "s3://openpipelines-data/cellranger_tiny_bcl/bcl/sample_sheet.csv" \ |
| 162 | + --demultiplexer "mkfastq" |
| 163 | +``` |
| 164 | + |
| 165 | +### Mapping and read counting |
| 166 | + |
| 167 | +FASTQ files can be mapped to a reference genome and the resulting mapped |
| 168 | +reads can be counted in order to generate a count matrix. Both |
| 169 | +`BD Rhapsody` and `Cell Ranger` are supported. Here, we demonstrate |
| 170 | +using Cell Ranger multi on test data available at |
| 171 | +`s3://openpipelines-data/10x_5k_anticmv`. |
| 172 | + |
| 173 | +In order to facilitate passing multiple argument values, the parameters |
| 174 | +can be specified using a YAML file. |
| 175 | + |
| 176 | +``` yaml |
| 177 | +input: |
| 178 | + - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_GEX_*.fastq.gz" |
| 179 | + - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_AB_*.fastq.gz" |
| 180 | + - "s3://openpipelines-data/10x_5k_anticmv/raw/5k_human_antiCMV_T_TBNK_connect_VDJ_*.fastq.gz" |
| 181 | +gex_reference: "s3://openpipelines-data/reference_gencodev41_chr1/reference_cellranger.tar.gz" |
| 182 | +vdj_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.0.0.tar.gz" |
| 183 | +feature_reference: "s3://openpipelines-data/10x_5k_anticmv/raw/feature_reference.csv" |
| 184 | +library_id: |
| 185 | + - "5k_human_antiCMV_T_TBNK_connect_GEX_1_subset" |
| 186 | + - "5k_human_antiCMV_T_TBNK_connect_AB_subset" |
| 187 | + - "5k_human_antiCMV_T_TBNK_connect_VDJ_subset" |
| 188 | +library_type: |
| 189 | + - "Gene Expression" |
| 190 | + - "Antibody Capture" |
| 191 | + - "VDJ" |
| 192 | +``` |
| 193 | +
|
| 194 | +You can pass this file to nextflow using `-params-file`: |
| 195 | + |
| 196 | +``` bash |
| 197 | +nextflow run openpipelines-bio/openpipeline \ |
| 198 | + -r 2.1.1 \ |
| 199 | + -main-script target/nextflow/workflows/ingestion/cellranger_multi/main.nf \ |
| 200 | + -c "<path to resource config file>" \ |
| 201 | + -profile docker \ |
| 212 | +parameters that need to be provided differ depending on your data. A lot |
| 203 | + --publish_dir "<path to output directory>" |
| 204 | +``` |
| 205 | + |
| 206 | +### Filtering, normalization, clustering, dimensionality reduction and QC calculations (w/o integration) |
| 207 | + |
| 208 | +Once you have an MuData object for each of your samples, you can process |
| 209 | +it into a multisample file that is ready for integration and other |
| 210 | +downstream analyses. This can be done using the `process_samples` |
| 211 | +workflow. Here is an example, but please keep in mind that the exact |
| 212 | +parameters that need to be provided differ depending on you data. A lot |
| 213 | +of functionality for this pipeline can be customized, including the name |
| 214 | +of the output slots where data is being stored. |
| 215 | + |
| 216 | +``` yaml |
| 217 | +param_list: |
| 218 | + - id: "sample_1" |
| 219 | + input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu" |
| 220 | + rna_min_counts: 2 |
| 221 | + - id: "sample_2" |
| 222 | + input: "s3://openpipelines-data/concat_test_data/e18_mouse_brain_fresh_5k_filtered_feature_bc_matrix_subset_unique_obs.h5mu" |
| 223 | + rna_min_counts: 1 |
| 224 | +rna_max_counts: 1000000 |
| 225 | +rna_min_genes_per_cell: 1 |
| 226 | +rna_max_genes_per_cell: 1000000 |
| 227 | +rna_min_cells_per_gene: 1 |
| 228 | +rna_min_fraction_mito: 0.0 |
| 229 | +rna_max_fraction_mito: 1.0 |
| 230 | +``` |
| 231 | + |
| 232 | +In order to provide multiple samples to the pipeline, `param_list` is |
| 233 | +used. Using `param_list` it is possible to specify arguments per sample. |
| 234 | +However, it is still possible to define arguments for all samples |
| 235 | +together by listing those outside the `param_list` block. |
| 236 | + |
| 237 | +``` bash |
| 238 | +nextflow run openpipelines-bio/openpipeline \ |
| 239 | + -r 2.1.1 \ |
| 240 | + -main-script target/nextflow/workflows/multiomics/process_samples/main.nf \ |
| 241 | + -c "<path to resource config file>" \ |
| 242 | + -profile docker \ |
| 243 | + -params-file "<path to your parameter YAML file>" \ |
| 244 | + --publish_dir "<path to output directory>" |
| 245 | +``` |
| 246 | + |
| 247 | +## Executing standalone components using the Viash executable |
| 248 | + |
| 249 | +Another option to execute individual modules on the CLI is to use |
| 250 | +`viash run`. All you need to do is download viash, clone the |
| 251 | +Openpipeline repository and point viash to a config file. However, keep |
| 252 | +in mind that using `viash run` for workflows is currently not supported. |
| 253 | +Please see `viash run --help` for more information on how to use the |
| 254 | +command, but here is an example: |
| 255 | + |
| 256 | +``` bash |
| 257 | +viash run --engine docker src/mapping/cellranger_multi/config.vsh.yaml --help |
| 258 | +``` |