Commit 96bc13a

Merge pull request #28 from emo-bon/develop

This PR leads to the 1st release of metaGOflow.

2 parents: d8499a7 + 843f079

File tree: 118 files changed (+166002, −3843 lines)


.github/workflows/conda.yml

Lines changed: 12 additions & 2 deletions

```diff
@@ -3,14 +3,24 @@ on: [push]
 jobs:
   cwl_tests:
     name: Run cwl_tests.sh
-    runs-on: ubuntu-18.04
+    runs-on: ubuntu-20.04
     defaults:
       run:
         shell: bash -l {0}
     steps:
+      - name: Install dependencies
+        run: |
+          sudo apt-get install -y python3-pip
+          pip install cwltool lockfile
+
       - uses: actions/checkout@v2
       - run: |
-          ls
+          ls
+
+      - name: Validate workflow
+        run: |
+          cwltool --validate workflows/gos_wf.cwl
+
 #    - uses: conda-incubator/setup-miniconda@v2
 #      with:
 #        activate-environment: anaconda-client-env
```

.gitignore

Lines changed: 7 additions & 0 deletions

```diff
@@ -116,8 +116,15 @@ venv.bak/
 
 # Ignore real-world test samples
 test_input/SRR*
+test_input/DB*
 
 # Ignore dev output
 TEST_*/
 *.output
 
+STELIOS_TEST/
+marine_sediment_dbh/
+
+slurm_run.sh
+
+
```
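The new ignore entries follow standard glob semantics. As a rough sketch of what the hunk above ignores, Python's `fnmatch` can approximate the matching (illustrative only: git's pattern rules for directories and anchoring differ from plain `fnmatch`, and the directory patterns are shown without their trailing slash):

```python
from fnmatch import fnmatch

# Patterns taken from the .gitignore hunk above (directory patterns
# written without the trailing slash, since fnmatch has no notion
# of directories).
PATTERNS = ["test_input/SRR*", "test_input/DB*", "TEST_*", "*.output", "slurm_run.sh"]

def is_ignored(path: str) -> bool:
    """Return True if `path` matches any of the ignore patterns."""
    return any(fnmatch(path, pattern) for pattern in PATTERNS)
```

For example, `is_ignored("test_input/SRR1620013_1.fastq.gz")` is true, while `is_ignored("config.yml")` is false.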

Installation/download_dbs.sh

Lines changed: 3 additions & 3 deletions

```diff
@@ -125,6 +125,7 @@ wget $FTP_DBS/kofam_ko_desc.tsv
 echo 'Download eggnog dbs'
 wget http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
 wget http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
+gunzip eggnog.db.gz eggnog_proteins.dmnd.gz
 mkdir eggnog && mv eggnog_proteins.dmnd eggnog.db eggnog
 
 # Diamond
@@ -139,6 +140,5 @@ echo 'Download pathways data'
 wget ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/graphs-20200805.pkl.gz \
     ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_class.txt.gz \
     ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipeline-5.0/ref-dbs/all_pathways_names.txt.gz
-gunzip graphs.pkl.gz all_pathways_class.txt.gz all_pathways_names.txt.gz
-mkdir kegg_pathways && mv graphs.pkl all_pathways_class.txt all_pathways_names.txt kegg_pathways
-
+gunzip graphs-20200805.pkl.gz all_pathways_class.txt.gz all_pathways_names.txt.gz
+mkdir kegg_pathways && mv graphs-20200805.pkl all_pathways_class.txt all_pathways_names.txt kegg_pathways
```
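After `download_dbs.sh` finishes, the reference-database directory should contain the entries listed in this commit's README. A minimal sketch of a sanity check (hypothetical helper, not part of the repo; the entry names are taken from the README in this commit and may differ in later releases):

```python
from pathlib import Path

# Top-level entries the README shows under ref-dbs/ (assumption:
# names as listed in this commit; versions may change later).
EXPECTED = [
    "db_kofam", "diamond", "eggnog", "GO-slim",
    "interproscan-5.57-90.0", "kegg_pathways",
    "kofam_ko_desc.tsv", "Rfam", "silva_lsu", "silva_ssu",
]

def missing_dbs(ref_dbs: str) -> list:
    """Return the expected entries that are not present under ref_dbs.

    Works for both subdirectories and plain files (kofam_ko_desc.tsv),
    and also follows symbolic links, so a symlinked database counts
    as present.
    """
    root = Path(ref_dbs)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty result means the layout matches; anything returned still needs downloading or symlinking.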

README.md

Lines changed: 119 additions & 32 deletions

````diff
@@ -1,79 +1,166 @@
-# metaGOflow: A workflow for marine Genomic Observatories data analysis
+# metaGOflow: A workflow for marine Genomic Observatories' data analysis
 
-## An EOSC-Life project
+![logo](https://raw.githubusercontent.com/hariszaf/metaGOflow-use-case/gh-pages/assets/img/metaGOflow_logo_italics.png)
 
-[![Build Status](https://travis-ci.org/EBI-Metagenomics/pipeline-v5.svg?branch=master)](https://travis-ci.com/EBI-Metagenomics/pipeline-v5)
 
-The workflows developed in the framework of this project are based on `pipeline-v5` of the MGnify resource.
+## An EOSC-Life project
 
-> This branch is a child of the [`pipeline_5.1`](https://github.com/hariszaf/pipeline-v5/tree/pipeline_5.1) branch
-that contains all CWL descriptions of the MGnify pipeline version 5.1.
+The workflows developed in the framework of this project are based on `pipeline-v5` of the MGnify resource.
 
+> This branch is a child of the [`pipeline_5.1`](https://github.com/hariszaf/pipeline-v5/tree/pipeline_5.1) branch
+> that contains all CWL descriptions of the MGnify pipeline version 5.1.
 
 ## Dependencies
 
-- python3 [v 3.7+]
-- [Docker](https://www.docker.com) [v 19.+] or [Singularity](https://apptainer.org)
-- [cwltool](https://github.com/common-workflow-language/cwltool) [v 3.+]
+To run metaGOflow, first make sure the following are set up on your computing environment:
+
+- python3 [v 3.8+]
+- [Docker](https://www.docker.com) [v 19.+] or [Singularity](https://apptainer.org) [v 3.7.+]/[Apptainer](https://apptainer.org) [v 1.+]
+- [cwltool](https://github.com/common-workflow-language/cwltool) [v 3.+]
+- [rdflib](https://rdflib.readthedocs.io/en/stable/) [v 6.+]
+- [rdflib-jsonld](https://pypi.org/project/rdflib-jsonld/) [v 0.6.2]
+- [ro-crate-py](https://github.com/ResearchObject/ro-crate-py) [v 0.7.0]
+- [pyyaml](https://pypi.org/project/PyYAML/) [v 6.0]
+- [Node.js](https://nodejs.org/) [v 10.24.0+]
+- Available storage ~235GB for databases
 
-Depending on the analysis you are about to run, disk requirements vary.
+### Storage while running
+
+Depending on the analysis you are about to run, disk requirements vary.
 Indicatively, you may have a look at the metaGOflow publication for computing resources used in various cases.
 
+## Installation
 
 ### Get the EOSC-Life marine GOs workflow
 
 ```bash
-git clone https://github.com/emo-bon/pipeline-v5.git
-cd pipeline-v5
+git clone https://github.com/emo-bon/MetaGOflow
+cd MetaGOflow
 ```
 
-
-### Download necessary databases
+### Download necessary databases (~235GB)
 
 You can download databases for the EOSC-Life GOs workflow by running the
 `download_dbs.sh` script under the `Installation` folder.
 
-If you have one or more already in your system, then create a symbolic link pointing
-at the `ref-dbs` folder.
+```bash
+bash Installation/download_dbs.sh -f [Output Directory e.g. ref-dbs]
+```
+If you have one or more already in your system, then create a symbolic link pointing
+at the `ref-dbs` folder or at one of its subfolders/files.
+
+The final structure of the DB directory should be like the following:
 
+````bash
+user@server:~/MetaGOflow: ls ref-dbs/
+db_kofam/  diamond/  eggnog/  GO-slim/  interproscan-5.57-90.0/  kegg_pathways/  kofam_ko_desc.tsv  Rfam/  silva_lsu/  silva_ssu/
+````
 
 ## How to run
 
+### Ensure that `Node.js` is installed on your system before running metaGOflow
+
+If you have root access on your system, you can run the commands below to install it:
+
+##### DEBIAN/UBUNTU
+```bash
+sudo apt-get update -y
+sudo apt-get install -y nodejs
+```
+
+##### RH/CentOS
+```bash
+sudo yum install rh-nodejs<stream version> (e.g. rh-nodejs10)
+```
 
-- Edit the `config.yml` file to set the parameter values of your choice.
+### Set up the environment
 
-- Make a job file (e.g., SBATCH file) and
+#### Run once - Setup environment
 
-   - enable Singularity, e.g. `module load Singularity`
+- ```bash
+  conda create -n EOSC-CWL python=3.8
+  ```
+
+- ```bash
+  conda activate EOSC-CWL
+  ```
+
+- ```bash
+  pip install cwlref-runner cwltool[all] rdflib-jsonld rocrate pyyaml
+  ```
+
+#### Run every time
+
+```bash
+conda activate EOSC-CWL
+```
+
+### Run the workflow
+
+- Edit the `config.yml` file to set the parameter values of your choice. To run all the steps, set the variables in lines 2-6 to `true`.
+
+#### Using Singularity
+
+##### Standalone
+- run:
+```bash
+./run_wf.sh -s -n osd-short -d short-test-case -f test_input/wgs-paired-SRR1620013_1.fastq.gz -r test_input/wgs-paired-SRR1620013_2.fastq.gz
+```
+
+##### Using a cluster with a queueing system (e.g. SLURM)
+
+- Create a job file (e.g., SBATCH file)
+
+- Enable Singularity, e.g. `module load Singularity`, and all other dependencies
+
+- Add the run line to the job file
+
+
+#### Using Docker
+
+##### Standalone
+- run:
+```bash
+./run_wf.sh -n osd-short -d short-test-case -f test_input/wgs-paired-SRR1620013_1.fastq.gz -r test_input/wgs-paired-SRR1620013_2.fastq.gz
+```
+HINT: If you are using Docker, you may need to run the above command without the `-s` flag.
+
+## Testing samples
+The samples are available in the `test_input` folder.
+
+We provide metaGOflow with partial samples from the Human Metagenome Project ([SRR1620013](https://www.ebi.ac.uk/ena/browser/view/SRR1620013) and [SRR1620014](https://www.ebi.ac.uk/ena/browser/view/SRR1620014)).
+They are partial in that only a small subset of their sequences has been kept, so that the pipeline can be tested quickly.
 
-- run:
-```
-./run_wf.sh -n false -n osd-short -d short-test-case -f test_input/wgs-paired-SRR1620013_1.fastq.gz -r test_input/wgs-paired-SRR1620013_2.fastq.gz
-```
 
 ## Hints and tips
 
 1. In case you are using Docker, it is strongly recommended to **avoid** installing it through `snap`.
 
-2. `RuntimeError`: slurm currently does not support shared caching, because it does not support cleaning up a worker after the last job finishes.
-Set the `--disableCaching` flag if you want to use this batch system.
+2. `RuntimeError`: slurm currently does not support shared caching, because it does not support cleaning up a worker
+after the last job finishes.
+Set the `--disableCaching` flag if you want to use this batch system.
+
+3. In case you are having errors like:
 
-3. In case you are having errors like:
 ```
-wltool.errors.WorkflowException: Singularity is not available for this tool
+cwltool.errors.WorkflowException: Singularity is not available for this tool
 ```
+
 You may run the following command:
+
 ```
 singularity pull --force --name debian:stable-slim.sif docker://debian:stable-sli
 ```
 
-
 ## Contribution
 
-To make contribution to the project a bit easier, all the MGnify `conditionals` and `subworkflows` under the `workflows/` directory that are not used in the metaGOflow framework, have been removed.
-However, all the MGnify `tools/` and `utils/` are available in this repo, even if they are not invoked in the current version of metaGOflow.
-This way, we hope we encourage people to implement their own `conditionals` and/or `subworkflows` by exploiting the currently supported `tools` and `utils` as well as by developing new `tools` and/or `utils`.
-
+To make contributing to the project a bit easier, all the MGnify `conditionals` and `subworkflows` under
+the `workflows/` directory that are not used in the metaGOflow framework have been removed.
+However, all the MGnify `tools/` and `utils/` are available in this repo, even if they are not invoked in the current
+version of metaGOflow.
+This way, we hope to encourage people to implement their own `conditionals` and/or `subworkflows` by exploiting the
+currently supported `tools` and `utils` as well as by developing new `tools` and/or `utils`.
 
 
 <!-- cwltool --print-dot my-wf.cwl | dot -Tsvg > my-wf.svg -->
````
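The README's run examples pass the same flags in both the Singularity and Docker cases: `-n` run name, `-d` output directory, `-f`/`-r` forward/reverse reads, and `-s` to enable Singularity. A small sketch that assembles such an invocation (hypothetical helper; the flag meanings are inferred from the examples above, not from `run_wf.sh` itself):

```python
import shlex

def build_run_cmd(name, out_dir, forward, reverse, use_singularity=False):
    """Assemble a ./run_wf.sh command line from the flags shown in
    the README examples. Flag semantics are an assumption based on
    those examples: -s Singularity, -n name, -d dir, -f/-r reads."""
    cmd = ["./run_wf.sh"]
    if use_singularity:
        cmd.append("-s")
    cmd += ["-n", name, "-d", out_dir, "-f", forward, "-r", reverse]
    return shlex.join(cmd)  # shell-safe quoting (Python 3.8+)
```

For instance, `build_run_cmd("osd-short", "short-test-case", "test_input/wgs-paired-SRR1620013_1.fastq.gz", "test_input/wgs-paired-SRR1620013_2.fastq.gz", use_singularity=True)` reproduces the Singularity standalone example.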
Binary file not shown.

config.yml

Lines changed: 15 additions & 12 deletions

```diff
@@ -1,12 +1,15 @@
 # Steps to go for
 qc_and_merge_step: true
-taxonomic_inventory: false
-cgc_step: false
-reads_functional_annotation: false
+taxonomic_inventory: true
+cgc_step: true
+reads_functional_annotation: true
 assemble: false
 
 # Global
-threads: 20
+threads: 40
+
+# As a rule of thumb, keep this at floor(threads/8), where threads is the previous parameter
+interproscan_threads: 4
 
 # fastp parameters
 detect_adapter_for_pe: false
@@ -28,8 +31,8 @@ min-contig-len: 200
 # Combined Gene Caller // the size is in MB
 cgc_chunk_size: 200
 
-# Taxonomic inference using Diamond and the contigs
-diamond_maxTargetSeqs: 1
+# # Taxonomic inference using Diamond and the contigs
+# diamond_maxTargetSeqs: 1
 
 # Functional annotation
 protein_chunk_size_IPS: 2000000
@@ -57,21 +60,21 @@ protein_chunk_size_hmm: 50000
 processed_reads: {
   class: File,
   format: "edam:format_1929",
-  path: /home1/gmoro/pipeline-v5/test_input/pseudo.merged.fasta
+  path: workflows/pseudo_files/pseudo.merged.fasta
 }
 
 # Mandatory for running the taxonomy inventory step
 input_for_motus: {
   class: File,
-  path: /home1/gmoro/pipeline-v5/test_input/pseudo.merged.unfiltered.fasta
+  path: workflows/pseudo_files/pseudo.merged.unfiltered.fasta
 }
 
 
 # Mandatory for running the functional annotation steps
 # If produced previously from metaGOflow, will have a suffix like: .cmsearch.all.tblout.deoverlapped
 maskfile: {
   class: File,
-  path: /home1/gmoro/pipeline-v5/test_input/pseudo.merged.cmsearch.all.tblout.deoverlapped
+  path: workflows/pseudo_files/pseudo.merged.cmsearch.all.tblout.deoverlapped
 }
 
 # Mandatory for the functional annotation step
@@ -84,13 +87,13 @@ count_faa_from_previous_run:
 predicted_faa_from_previous_run: {
   class: File,
   format: "edam:format_1929",
-  path: /home1/gmoro/pipeline-v5/test_input/pseudo.merged_CDS.faa
+  path: workflows/pseudo_files/pseudo.merged_CDS.faa
 }
 
 # Mandatory for running the assembly step
 processed_read_files:
   - class: File
-    path: /home1/gmoro/pipeline-v5/test_input/pseudo_1_clean.fastq.trimmed.fasta
+    path: workflows/pseudo_files/pseudo_1_clean.fastq.trimmed.fasta
   - class: File
-    path: /home1/gmoro/pipeline-v5/test_input/pseudo_2_clean.fastq.trimmed.fasta
+    path: workflows/pseudo_files/pseudo_2_clean.fastq.trimmed.fasta
```
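The new `interproscan_threads` parameter comes with a rule of thumb in the config comment: floor(threads/8). A one-liner sketch of that rule (hypothetical helper; the lower bound of 1 is my assumption, not stated in the config, and note the config above chooses 4 with `threads: 40`, slightly below the rule's floor(40/8) = 5):

```python
def interproscan_threads(threads: int) -> int:
    """Rule of thumb from the config.yml comment: floor(threads / 8).
    Clamped to at least 1 thread (assumption, not in the config)."""
    return max(1, threads // 8)
```

For example, the previous default of `threads: 20` would give `interproscan_threads(20) == 2`.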

dependencies.md

Lines changed: 0 additions & 56 deletions
This file was deleted.
