Add BiG-SLiCE to BGC subworkflow #519
base: dev
Changes from all commits
dc3dc26
2300eb6
722d4bb
69168d3
e254400
040dd98
a690c8b
0bde3ec
4f4f630
9322394
48e6043
879f1c6
534dab5
057d142
afb50a1
925ef7e
59c0733
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,7 +6,7 @@ The output of nf-core/funcscan provides reports for each of the functional group | |
|
|
||
| - **antibiotic resistance genes** (tools: [ABRicate](https://github.com/tseemann/abricate), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [fARGene](https://github.com/fannyhb/fargene), [RGI](https://card.mcmaster.ca/analyze/rgi) – summarised by [hAMRonization](https://github.com/pha4ge/hAMRonization). Results from ABRicate, AMRFinderPlus, and DeepARG are normalised to [ARO](https://obofoundry.org/ontology/aro.html) by [argNorm](https://github.com/BigDataBiology/argNorm).) | ||
| - **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/Darcy220606/AMPcombi)) | ||
| - **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) | ||
| - **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [BiG-SLiCE](https://github.com/medema-group/bigslice), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc)) | ||
|
|
||
| As a general workflow, we recommend first looking at the summary reports ([ARGs](#hamronization), [AMPs](#ampcombi), [BGCs](#combgc)) to get a general overview of what hits have been found across all the tools of each functional group. Afterwards, you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also include the output from the functional annotation steps of either [prokka](https://github.com/tseemann/prokka), [pyrodigal](https://github.com/althonos/pyrodigal), [prodigal](https://github.com/hyattpd/Prodigal), or [Bakta](https://github.com/oschwengers/bakta) if the `--save_annotations` flag was set. Additionally, taxonomic classifications from [MMseqs2](https://github.com/soedinglab/MMseqs2) are saved if the `--taxa_classification_mmseqs_db_savetmp` and `--taxa_classification_mmseqs_taxonomy_savetmp` flags are set. | ||
|
|
||
|
|
@@ -38,6 +38,7 @@ results/ | |
| | └── rgi/ | ||
| ├── bgc/ | ||
| | ├── antismash/ | ||
| | ├── bigslice/ | ||
| | ├── deepbgc/ | ||
| | ├── gecco/ | ||
| | └── hmmsearch/ | ||
|
|
@@ -98,6 +99,7 @@ Antimicrobial Peptides (AMPs): | |
| Biosynthetic Gene Clusters (BGCs): | ||
|
|
||
| - [antiSMASH](#antismash) – biosynthetic gene cluster detection. | ||
| - [BiG-SLiCE](#bigslice) – clustering of biosynthetic gene clusters into Gene Cluster Families (GCFs). | ||
| - [deepBGC](#deepbgc) - biosynthetic gene cluster detection, using a deep learning model. | ||
| - [GECCO](#gecco) – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs). | ||
| - [hmmsearch](#hmmsearch) – biosynthetic gene cluster detection, based on hidden Markov models. | ||
|
|
@@ -386,7 +388,7 @@ Output Summaries: | |
|
|
||
| ### BGC detection tools | ||
|
|
||
| [antiSMASH](#antismash), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). | ||
| [antiSMASH](#antismash), [BiGSLiCE](#bigslice), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch). | ||
|
|
||
| Note that the BGC tools are run on a set of annotations generated on only long contigs (3000 bp or longer) by default. These specific filtered FASTA files are under `bgc/seqkit/`, and annotation files are under `annotation/<annotation_tool>/long/`, if the corresponding saving flags are specified (see [parameter docs](https://nf-co.re/funcscan/parameters)). However, the same annotations _should_ also be present in the annotation files in the sister `all/` directory. | ||
|
|
||
|
|
@@ -428,6 +430,25 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation | |
|
|
||
| [antiSMASH](https://docs.antismash.secondarymetabolites.org) (**anti**biotics & **S**econdary **M**etabolite **A**nalysis **SH**ell) is a tool for rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial genomes. It identifies biosynthetic loci covering all currently known secondary metabolite compound classes in a rule-based fashion using profile HMMs and aligns the identified regions at the gene cluster level to their nearest relatives from a database containing experimentally verified gene clusters (MIBiG). | ||
|
|
||
| #### BiGSLiCE | ||
|
|
||
| <details markdown="1"> | ||
| <summary>Output files</summary> | ||
|
|
||
| - `bigslice/` | ||
| - `<samplename>/` | ||
| - `result/` | ||
| - `data.db`: SQLite database containing results for BGCs, CDSs, Gene Cluster Families (GCFs), HMMs and HSPs. | ||
| - `tmp/` | ||
| - `<run_id>/` | ||
| - `*.fa`: predicted biosynthetic features as FASTA files, one file per hit HMM. | ||
|
|
||
| </details> | ||
|
|
||
| [BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). | ||
| It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. | ||
| BiG-SLiCE is activated with `--bgc_bigslice_run` and requires the HMM database to be supplied via `--bgc_bigslice_db`. At least one input source must also be enabled: antiSMASH, or GECCO with its convert step set to produce BiG-SLiCE-format output. | ||
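As a minimal sketch of how the per-sample result databases could be located after a run, the helper below searches a results directory for `data.db` files. The function name `list_bigslice_dbs` is hypothetical, and the assumed layout (`<results>/bgc/bigslice/<samplename>/result/data.db`) is inferred from the output listing above rather than confirmed against a real run.

```bash
# List every BiG-SLiCE result database under a funcscan results directory.
# Assumption: results live under <results>/bgc/bigslice/<samplename>/result/.
list_bigslice_dbs() {
  results_dir="$1"
  # Match on the file name only, so the exact nesting depth does not matter.
  find "$results_dir/bgc/bigslice" -type f -name 'data.db' | sort
}
```

Calling `list_bigslice_dbs results` prints one path per sample. If the `sqlite3` command-line shell is installed, each file can then be opened with it, for example using the `.tables` command to see which tables are present before writing any queries.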
|
Member
I would add a sentence after this saying what you would use the
||
|
|
||
| #### deepBGC | ||
|
|
||
| <details markdown="1"> | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -155,6 +155,17 @@ When the annotation is run with Prokka, the resulting `.gbk` file passed to anti | |||||
| If antiSMASH is run for BGC detection, we recommend to **not** run Prokka for annotation but instead use the default annotation tool (Pyrodigal), or switch to Prodigal or (for bacteria only!) Bakta. | ||||||
| ::: | ||||||
|
|
||||||
| ### BiGSLiCE | ||||||
|
|
||||||
| [BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). | ||||||
| It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled: | ||||||
|
|
||||||
| - antiSMASH (default BGC tool). | ||||||
| - GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice` | ||||||
|
|
||||||
| BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. | ||||||
| The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline. | ||||||
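The flags above could be combined roughly as follows. This is only a sketch that prints a candidate command line for review rather than running it: the helper name `print_bigslice_cmd` is hypothetical, and `--input`, `--outdir`, `-profile docker`, and `--run_bgc_screening` are assumed from typical nf-core/funcscan usage rather than taken from this section.

```bash
# Print (without running) a funcscan command line that enables BiG-SLiCE.
# Assumptions: --input, --outdir, -profile and --run_bgc_screening follow
# typical nf-core/funcscan usage; the --bgc_* flags are those described above.
print_bigslice_cmd() {
  db_dir="$1"
  # Backslash-newlines inside the double quotes join this into a single line.
  printf '%s\n' "nextflow run nf-core/funcscan -profile docker \
--input samplesheet.csv --outdir results \
--run_bgc_screening \
--bgc_bigslice_run --bgc_bigslice_db '$db_dir' \
--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice"
}
```

For example, `print_bigslice_cmd /path/to/bigslice-models` prints the full command on one line. The three GECCO convert flags are only needed when GECCO output should be passed to BiG-SLiCE.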
|
Member
Suggested change
|
||||||
|
|
||||||
| ## Databases and reference files | ||||||
|
|
||||||
| Various tools of nf-core/funcscan use databases and reference files to operate. | ||||||
|
|
@@ -513,6 +524,25 @@ deepbgc_db/ | |||||
| └── myDetectors*.pkl | ||||||
| ``` | ||||||
|
|
||||||
| ### BiGSLiCE | ||||||
|
|
||||||
| BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`. | ||||||
|
Member
Is there a reason why you don't do this? It appears there is a tool in the package for this: https://github.com/medema-group/bigslice?tab=readme-ov-file#quick-start I think this would be important given the models look like they could change more regularly than the package itself. Also, hardcoding the URL is risky, given that I don't see any further docs in BiG-SLiCE on how to get the database files other than the above. (I think the 271MB database file might be OK for our test runs, if it downloads fast, but you might need to look inside and see if you can make a smaller version. It looks like there are some docs here on how to do it: https://github.com/medema-group/bigslice/tree/master/bigslice/db/advanced)
||||||
|
|
||||||
| Download the pre-built database archive from the BiG-SLiCE GitHub releases page: | ||||||
|
|
||||||
| ```bash | ||||||
| wget https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz | ||||||
| tar -xzf bigslice-models.2022-11-30.tar.gz | ||||||
| ``` | ||||||
|
|
||||||
| Then supply the extracted directory to the pipeline: | ||||||
|
|
||||||
| ```bash | ||||||
| --bgc_bigslice_db '/<path>/<to>/<bigslice-models>/' | ||||||
| ``` | ||||||
|
|
||||||
| The database directory should contain subdirectories such as `biosynthetic_pfams/` and `sub_pfams/` at its top level. | ||||||
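A quick pre-flight check of the extracted directory might look like the sketch below. The helper name `check_bigslice_db` is hypothetical, and it only tests for the two subdirectories named above, which the text gives as examples and not as a complete list.

```bash
# Check that an extracted BiG-SLiCE database directory has the expected layout.
# Assumption: only the two subdirectories named in the docs are checked.
check_bigslice_db() {
  db_dir="$1"
  status=0
  for sub in biosynthetic_pfams sub_pfams; do
    if [ ! -d "$db_dir/$sub" ]; then
      # Report each missing subdirectory on stderr and remember the failure.
      echo "missing: $db_dir/$sub" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running `check_bigslice_db /path/to/bigslice-models` returns a non-zero exit status if either subdirectory is absent, so it can be used to stop early before the path is passed to `--bgc_bigslice_db`.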
|
|
||||||
| ### InterProScan | ||||||
|
|
||||||
| [InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow with `--run_protein_annotation` will download and unzip the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/) version 5.72-103.0. The database can be saved in the output directory `<output_directory>/databases/interproscan/` if the `--save_db` is turned on. | ||||||
|
|
||||||
Are there no parameters a user may want to refine here that we should expose at the pipeline level?
Looking here, it looks to me like there are lots of 'threshold'-like parameters (e.g. `-n_ranks`, `--threshold`, etc.)