Conversation
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
jfy133
left a comment
I'll do a more thorough review when I'm in my other office (train just turned around halfway to the planned one for today 🙃)
subworkflows/local/bgc.nf
Outdated
```nextflow
// BIGSLICE
if (params.bgc_bigslice_run) {

    if (params.bgc_skip_antismash && (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) {
```
This check, and the error in the if/else below, should go in input validation in subworkflows/local/utils_nfcore_funcscan<...>, as we want the pipeline to fail at the beginning if it's missing with this particular parameter combination - not halfway through the pipeline execution :)
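Moving the guard into the initialisation subworkflow might look roughly like this (a hedged sketch only; the parameter names come from the diff in this review, while the error message wording is illustrative):

```nextflow
// Sketch for subworkflows/local/utils_nfcore_funcscan_pipeline: fail at
// launch, not halfway through execution, when BiG-SLiCE is requested but
// no compatible BGC input source is enabled.
if (params.bgc_bigslice_run &&
    params.bgc_skip_antismash &&
    (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) {
    error("[nf-core/funcscan] ERROR: --bgc_bigslice_run requires antiSMASH to be enabled, or GECCO with --bgc_gecco_runconvert and --bgc_gecco_convertformat bigslice.")
}
```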
```nextflow
        ch_bigslice_input = ch_bigslice_input.mix(
            GECCO_CONVERT.out.bigslice
        )
    }
```
I think you could simplify this into a single if/else statement, with each condition assigning ch_bigslice_input, rather than making an empty input channel and mixing in.
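One way to read this suggestion is the following sketch (hypothetical only: `GECCO_CONVERT.out.bigslice` appears in the diff, while `ch_antismash_gbk` and the exact conditions are placeholder names):

```nextflow
// Hypothetical sketch: assign ch_bigslice_input once per branch instead of
// starting from an empty channel and mixing into it.
def gecco_bigslice = !params.bgc_skip_gecco && params.bgc_gecco_runconvert && params.bgc_gecco_convertformat == 'bigslice'

if (!params.bgc_skip_antismash && gecco_bigslice) {
    ch_bigslice_input = ch_antismash_gbk.mix(GECCO_CONVERT.out.bigslice) // ch_antismash_gbk is a placeholder name
}
else if (!params.bgc_skip_antismash) {
    ch_bigslice_input = ch_antismash_gbk
}
else {
    ch_bigslice_input = GECCO_CONVERT.out.bigslice
}
```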
docs/usage.md
Outdated
[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

- antiSMASH (default BGC tool), **or**
```diff
- - antiSMASH (default BGC tool), **or**
+ - antiSMASH (default BGC tool).
```
docs/usage.md
Outdated
### BiGSLiCE

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
```diff
- [BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
+ [BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs).
+ It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
```
docs/usage.md
Outdated
- antiSMASH (default BGC tool), **or**
- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
```diff
- BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
+ BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
+ The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
```
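For context, activating this in the usage docs would presumably look something like the following command (a hedged sketch; `--run_bgc_screening`, the samplesheet, and the profile are assumed values and may differ from the pipeline's actual defaults):

```
nextflow run nf-core/funcscan \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --run_bgc_screening \
    --bgc_bigslice_run \
    --bgc_bigslice_db /path/to/bigslice_db
```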
docs/output.md
Outdated
</details>

[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
```diff
- [BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
+ [BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
+ It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
+ BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
```
Are there no output files other than `data.db`? I'm sort of surprised.
BiG-SLiCE dumps everything into `data.db` by design; it's meant to be queried as a database.
According to their latest article, there is also a setting called --export-tsv which was recently introduced in BiG-SLiCE v. 2.0:
In addition, in response to high user demand, we have introduced a new program parameter (‘--export-tsv’) that allows users to export all parsed BGC metadata, vectorized features, and clustering results as tab-separated text files (TSVs).
No worries, review when you have time
jfy133
left a comment
There was a problem hiding this comment.
Good first pass! Mostly missing a few things :)
Major things
All compared to the developer docs I recently added here: https://github.com/nf-core/funcscan/blob/dev/docs/usage/developing.md
- Missing updates to test configs and the nf-test test files/snapshots - how big is the BiGSLiCE database? You might need to make a smaller one
- Missing citation information in CITATIONS.md, and in utils_nfcore_funcscan_pipeline
- Don't forget to add yourself to the README and the `manifest` section of `nextflow.config`!
```nextflow
        ch_versions = ch_versions.mix(GECCO_CONVERT.out.versions)
    }
    // BIGSLICE
    if (params.bgc_bigslice_run) {
```
```diff
- if (params.bgc_bigslice_run) {
+ if (params.bgc_run_bigslice) {
```
And updated everywhere, because 1
Footnotes
```diff
@@ -159,14 +159,14 @@
    },
```
The changes here except the new parameters should be reverted!
```json
},
"bgc_bigslice_db": {
    "type": "string",
    "description": "Path to the pre-downloaded BiG-SLiCE HMM database directory."
```
A long-form help text saying where the database would come from would be nice. Also add the `Modifies tool parameter(s):` text, and list the database parameter the database goes into in BiG-SLiCE itself.
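Following nf-core schema conventions, the expanded entry might look something like this (a hedged sketch: the help text wording is illustrative, and the BiG-SLiCE-side database argument is deliberately left as a placeholder since it isn't named in this review):

```json
"bgc_bigslice_db": {
    "type": "string",
    "format": "directory-path",
    "description": "Path to the pre-downloaded BiG-SLiCE HMM database directory.",
    "help_text": "BiG-SLiCE requires its HMM database to be downloaded in advance; see https://github.com/medema-group/bigslice for how to obtain it. The pipeline does not download this database itself.\n\n> Modifies tool parameter(s):\n> - BiG-SLiCE: <corresponding BiG-SLiCE database argument, to be filled in>"
}
```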
### BiGSLiCE

BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`.
Is there a reason why you don't do this? It appears there is a tool in the package for this: https://github.com/medema-group/bigslice?tab=readme-ov-file#quick-start
I think this would be important given the models look like they could change more regularly than the package itself.
Also, hardcoding the URL is risky, given I don't see any further docs on BiG-SLiCE on how to get the database files other than the above.
(I think the 271 MB database file might be OK for our test runs, if it downloads fast - but you might need to look inside and see if you can make a smaller version.)
It looks like there are some docs here on how to do it: https://github.com/medema-group/bigslice/tree/master/bigslice/db/advanced
```nextflow
    }

    withName: BIGSLICE {
        publishDir = [
```
Are there no parameters a user may want to refine here that we should expose at the pipeline level?
Looking here, it looks to me like there are lots of 'threshold'-like parameters (e.g. `-n_ranks`, `--threshold`, etc.)
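In nf-core pipelines such options are usually passed through `ext.args` in `conf/modules.config`; a sketch of what exposing them could look like (the `--threshold`/`-n_ranks` flags are the ones named in this comment, and the `params.bgc_bigslice_*` pipeline parameters are hypothetical names that would also need adding to `nextflow_schema.json`):

```nextflow
// conf/modules.config sketch: forward optional user-facing BiG-SLiCE
// arguments via ext.args. Both params below are hypothetical.
withName: BIGSLICE {
    ext.args = [
        params.bgc_bigslice_threshold ? "--threshold ${params.bgc_bigslice_threshold}" : '',
        params.bgc_bigslice_nranks    ? "-n_ranks ${params.bgc_bigslice_nranks}"       : '',
    ].join(' ').trim()
}
```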
- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
```diff
- The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
+ The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#databases-and-reference-files) for details); it is not auto-downloaded by the pipeline.
```
[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
I would add a sentence after this saying what you would use the data.db file for... as it's not a standard thing you would use (arguably)
Closes #515
PR checklist
- Make sure your code lints (`nf-core pipelines lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- Usage documentation in `docs/usage.md` is updated.
- Output documentation in `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).