1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#506](https://github.com/nf-core/funcscan/pull/506) Added support for GECCO convert to generate additional files useful for downstream analysis (by @SkyLexS)
- [#507](https://github.com/nf-core/funcscan/pull/507) Updated to nf-core template v3.5.1 (by @jfy133)
- [#510](https://github.com/nf-core/funcscan/pull/510) Fixed code to make Nextflow strict-syntax compliant (by @jfy133)
- [#519](https://github.com/nf-core/funcscan/pull/519) Added BiG-SLiCE (`bigslice`) as a new BGC clustering tool in the BGC subworkflow. BiG-SLiCE clusters BGC sequences detected by antiSMASH and/or GECCO into Gene Cluster Families (GCFs) using an HMM-based approach. Activated with `--bgc_bigslice_run` and requires `--bgc_bigslice_db`. (by @SkyLexS)

### `Fixed`

2 changes: 1 addition & 1 deletion README.md
@@ -39,7 +39,7 @@ The nf-core/funcscan AWS full test dataset are contigs generated by the MGnify s
4. Annotation of coding sequences from 3. to obtain general protein families and domains with [`InterProScan`](https://github.com/ebi-pf-team/interproscan)
5. Screening contigs for antimicrobial peptide-like sequences with [`ampir`](https://cran.r-project.org/web/packages/ampir/index.html), [`Macrel`](https://github.com/BigDataBiology/macrel), [`HMMER`](http://hmmer.org/), [`AMPlify`](https://github.com/bcgsc/AMPlify)
6. Screening contigs for antibiotic resistant gene-like sequences with [`ABRicate`](https://github.com/tseemann/abricate), [`AMRFinderPlus`](https://github.com/ncbi/amr), [`fARGene`](https://github.com/fannyhb/fargene), [`RGI`](https://card.mcmaster.ca/analyze/rgi), [`DeepARG`](https://bench.cs.vt.edu/deeparg). [`argNorm`](https://github.com/BigDataBiology/argNorm) is used to map the outputs of `DeepARG`, `AMRFinderPlus`, and `ABRicate` to the [`Antibiotic Resistance Ontology`](https://www.ebi.ac.uk/ols4/ontologies/aro) for consistent ARG classification terms.
7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
7. Screening contigs for biosynthetic gene cluster-like sequences with [`antiSMASH`](https://antismash.secondarymetabolites.org), [`BiG-SLiCE`](https://github.com/medema-group/bigslice), [`DeepBGC`](https://github.com/Merck/deepbgc), [`GECCO`](https://gecco.embl.de/), [`HMMER`](http://hmmer.org/)
8. Creating aggregated reports for all samples across the workflows with [`AMPcombi`](https://github.com/Darcy220606/AMPcombi) for AMPs, [`hAMRonization`](https://github.com/pha4ge/hAMRonization) for ARGs, and [`comBGC`](https://raw.githubusercontent.com/nf-core/funcscan/master/bin/comBGC.py) for BGCs
9. Software version and methods text reporting with [`MultiQC`](http://multiqc.info/)

8 changes: 8 additions & 0 deletions conf/modules.config
@@ -540,6 +540,14 @@ process {
]
}

withName: BIGSLICE {
publishDir = [
Review comment (Member):

Are there no parameters here that a user may want to refine and that we should expose at the pipeline level?

Looking here, it looks to me like there are lots of 'threshold'-like parameters (e.g. -n_ranks, --threshold, etc.).

path: { "${params.outdir}/bgc/bigslice/" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
]
}

withName: HAMRONIZATION_ABRICATE {
publishDir = [
path: { "${params.outdir}/arg/hamronization/abricate" },
25 changes: 23 additions & 2 deletions docs/output.md
@@ -6,7 +6,7 @@ The output of nf-core/funcscan provides reports for each of the functional group

- **antibiotic resistance genes** (tools: [ABRicate](https://github.com/tseemann/abricate), [AMRFinderPlus](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder), [DeepARG](https://bitbucket.org/gusphdproj/deeparg-ss/src/master), [fARGene](https://github.com/fannyhb/fargene), [RGI](https://card.mcmaster.ca/analyze/rgi) – summarised by [hAMRonization](https://github.com/pha4ge/hAMRonization). Results from ABRicate, AMRFinderPlus, and DeepARG are normalised to [ARO](https://obofoundry.org/ontology/aro.html) by [argNorm](https://github.com/BigDataBiology/argNorm).)
- **antimicrobial peptides** (tools: [Macrel](https://github.com/BigDataBiology/macrel), [AMPlify](https://github.com/bcgsc/AMPlify), [ampir](https://ampir.marine-omics.net), [hmmsearch](http://hmmer.org) – summarised by [AMPcombi](https://github.com/Darcy220606/AMPcombi))
- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc))
- **biosynthetic gene clusters** (tools: [antiSMASH](https://docs.antismash.secondarymetabolites.org), [BiGSLiCE](https://github.com/medema-group/bigslice), [DeepBGC](https://github.com/Merck/deepbgc), [GECCO](https://gecco.embl.de), [hmmsearch](http://hmmer.org) – summarised by [comBGC](#combgc))

As a general workflow, we recommend first looking at the summary reports ([ARGs](#hamronization), [AMPs](#ampcombi), [BGCs](#combgc)) to get a general overview of what hits have been found across all the tools of each functional group. After that, you can explore the specific output directories of each tool to get more detailed information about each result. The tool-specific output directories also include the output from the functional annotation steps of either [prokka](https://github.com/tseemann/prokka), [pyrodigal](https://github.com/althonos/pyrodigal), [prodigal](https://github.com/hyattpd/Prodigal), or [Bakta](https://github.com/oschwengers/bakta) if the `--save_annotations` flag was set. Additionally, taxonomic classifications from [MMseqs2](https://github.com/soedinglab/MMseqs2) are saved if the `--taxa_classification_mmseqs_db_savetmp` and `--taxa_classification_mmseqs_taxonomy_savetmp` flags are set.

@@ -38,6 +38,7 @@ results/
| └── rgi/
├── bgc/
| ├── antismash/
| ├── bigslice/
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
@@ -98,6 +99,7 @@ Antimicrobial Peptides (AMPs):
Biosynthetic Gene Clusters (BGCs):

- [antiSMASH](#antismash) – biosynthetic gene cluster detection.
- [BiGSLiCE](#bigslice) – biosynthetic gene cluster super-linear clustering engine.
- [deepBGC](#deepbgc) - biosynthetic gene cluster detection, using a deep learning model.
- [GECCO](#gecco) – biosynthetic gene cluster detection, using Conditional Random Fields (CRFs).
- [hmmsearch](#hmmsearch) – biosynthetic gene cluster detection, based on hidden Markov models.
@@ -386,7 +388,7 @@ Output Summaries:

### BGC detection tools

[antiSMASH](#antismash), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch).
[antiSMASH](#antismash), [BiGSLiCE](#bigslice), [deepBGC](#deepbgc), [GECCO](#gecco), [hmmsearch](#hmmsearch).

Note that the BGC tools are run on a set of annotations generated on only long contigs (3000 bp or longer) by default. These specific filtered FASTA files are under `bgc/seqkit/`, and annotation files are under `annotation/<annotation_tool>/long/`, if the corresponding saving flags are specified (see [parameter docs](https://nf-co.re/funcscan/parameters)). However, the same annotations _should_ also be present in the annotation files in the sister `all/` directory.

@@ -428,6 +430,25 @@ Note that filtered FASTA is only used for BGC workflow for run-time optimisation

[antiSMASH](https://docs.antismash.secondarymetabolites.org) (**anti**biotics & **S**econdary **M**etabolite **A**nalysis **SH**ell) is a tool for rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial genomes. It identifies biosynthetic loci covering all currently known secondary metabolite compound classes in a rule-based fashion using profile HMMs and aligns the identified regions at the gene cluster level to their nearest relatives from a database containing experimentally verified gene clusters (MIBiG).

#### BiGSLiCE

<details markdown="1">
<summary>Output files</summary>

- `bigslice/`
- `<samplename>/`
- `result/`
- `data.db`: SQLite database containing results for BGCs, CDSs, Gene Cluster Families (GCFs), HMMs and HSPs.
- `tmp/`
- `<run_id>/`
- `*.fa`: predicted biosynthetic features as FASTA files, one file per hit HMM.

</details>

[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
BiG-SLiCE is activated with `--bgc_bigslice_run` and requires the HMM database to be supplied via `--bgc_bigslice_db`. At least one of antiSMASH or GECCO (with GECCO convert set to the `bigslice` format) must also be enabled.
Review comment (Member):

I would add a sentence after this saying what you would use the `data.db` file for, as it's arguably not a standard thing you would use.
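
As one possible way to get a first look at `data.db`, the following is a minimal Python sketch that lists every table in an SQLite file together with its row count. It reads the table names from the database itself, so it makes no assumption about the BiG-SLiCE schema; the function name `summarise_sqlite` is hypothetical and not part of the pipeline or of BiG-SLiCE.

```python
import sqlite3
from pathlib import Path


def summarise_sqlite(db_path):
    """Return {table_name: row_count} for every table in an SQLite file.

    Hypothetical helper for illustration only; it is not shipped with
    nf-core/funcscan or BiG-SLiCE.
    """
    db_path = Path(db_path)
    if not db_path.is_file():
        raise FileNotFoundError(f"No database found at {db_path}")

    # Open read-only so the pipeline output is never modified.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in tables:
            # Table names come from the file, so quote them defensively.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

Following the directory layout listed above, it could be called as `summarise_sqlite("results/bgc/bigslice/<samplename>/result/data.db")`, assuming the default `results/` output directory.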


#### deepBGC

<details markdown="1">
30 changes: 30 additions & 0 deletions docs/usage.md
@@ -155,6 +155,17 @@ When the annotation is run with Prokka, the resulting `.gbk` file passed to anti
If antiSMASH is run for BGC detection, we recommend to **not** run Prokka for annotation but instead use the default annotation tool (Pyrodigal), or switch to Prodigal or (for bacteria only!) Bakta.
:::

### BiGSLiCE

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs).
It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

- antiSMASH (default BGC tool).
- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
Review comment (Member):

Suggested change (original line, then replacement):
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#databases-and-reference-files) for details); it is not auto-downloaded by the pipeline.
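
The activation rules in the BiGSLiCE usage section can be summarised in a short Python sketch. This is an illustration of the documented requirements, not the pipeline's actual validation code: the function `check_bigslice_params` and the `antismash_enabled` argument are hypothetical, and only the parameter names quoted in the text are taken from the pull request.

```python
def check_bigslice_params(params, antismash_enabled):
    """Return a list of problems with the BiG-SLiCE-related settings.

    `params` maps parameter names (without the leading `--`) to values.
    `antismash_enabled` is a plain boolean, because the text does not
    name the switch that controls antiSMASH.
    """
    problems = []
    if not params.get("bgc_bigslice_run"):
        return problems  # BiG-SLiCE is off, so there is nothing to check

    # The HMM database is never downloaded automatically.
    if not params.get("bgc_bigslice_db"):
        problems.append("--bgc_bigslice_db must be supplied")

    # GECCO only counts as a BGC source when convert emits bigslice-format GenBank files.
    gecco_ready = bool(
        params.get("bgc_gecco_runconvert")
        and params.get("bgc_gecco_convertmode") == "gbk"
        and params.get("bgc_gecco_convertformat") == "bigslice"
    )
    if not (antismash_enabled or gecco_ready):
        problems.append(
            "enable antiSMASH, or GECCO convert with mode 'gbk' and format 'bigslice'"
        )
    return problems
```

An empty list means the combination satisfies the requirements as described; each string in a non-empty list names one unmet requirement.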


## Databases and reference files

Various tools of nf-core/funcscan use databases and reference files to operate.
@@ -513,6 +524,25 @@ deepbgc_db/
└── myDetectors*.pkl
```

### BiGSLiCE

BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`.
Review comment (Member):

Is there a reason why you don't do this? It appears there is a tool in the package for this: https://github.com/medema-group/bigslice?tab=readme-ov-file#quick-start

I think this would be important given the models look like they could change more regularly than the package itself.

Also, hardcoding the URL is risky, especially given I don't see any further docs in bigslice on how to get the database files other than the above.

(I think the 271 MB database file might be OK for our test runs, if it downloads fast, but you might need to look inside and see if you can make a smaller version.)

It looks like there are some docs here on how to do it: https://github.com/medema-group/bigslice/tree/master/bigslice/db/advanced


Download the pre-built database archive from the BiG-SLiCE GitHub releases page:

```bash
wget https://github.com/medema-group/bigslice/releases/download/v2.0.0rc/bigslice-models.2022-11-30.tar.gz
tar -xzf bigslice-models.2022-11-30.tar.gz
```

Then supply the extracted directory to the pipeline:

```bash
--bgc_bigslice_db '/<path>/<to>/<bigslice-models>/'
```

The database directory should contain subdirectories such as `biosynthetic_pfams/` and `sub_pfams/` at the top level.
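
Before launching the pipeline, the extracted directory can be sanity-checked. The following Python sketch only looks for the two subdirectories named above; the complete layout of the archive is not documented here, so this is deliberately a partial check, and the helper name `missing_bigslice_subdirs` is hypothetical.

```python
from pathlib import Path

# Subdirectory names taken from the text above; the real archive may
# contain more entries, which this check ignores.
EXPECTED_SUBDIRS = ("biosynthetic_pfams", "sub_pfams")


def missing_bigslice_subdirs(db_dir):
    """Return the expected subdirectories that are absent from db_dir."""
    db_dir = Path(db_dir)
    return [name for name in EXPECTED_SUBDIRS if not (db_dir / name).is_dir()]
```

An empty list suggests the path is plausible to pass to `--bgc_bigslice_db`; a non-empty list usually means the path points one level too high or too low relative to the extracted archive.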

### InterProScan

[InterProScan](https://github.com/ebi-pf-team/interproscan) is used to provide more information about the proteins annotated on the contigs. By default, turning on this subworkflow with `--run_protein_annotation` will download and unzip the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.72-103.0/) version 5.72-103.0. The database can be saved in the output directory `<output_directory>/databases/interproscan/` if the `--save_db` is turned on.
5 changes: 5 additions & 0 deletions modules.json
@@ -70,6 +70,11 @@
"git_sha": "72c983560c9b9c2a02ff636451a5e5008f7d020b",
"installed_by": ["modules"]
},
"bigslice": {
"branch": "master",
"git_sha": "875cf13d1c974d62483fddd55a02456880363b5c",
"installed_by": ["modules"]
},
"deeparg/downloaddata": {
"branch": "master",
"git_sha": "81880787133db07d9b4c1febd152c090eb8325dc",
7 changes: 7 additions & 0 deletions modules/nf-core/bigslice/environment.yml

Some generated files are not rendered by default.

54 changes: 54 additions & 0 deletions modules/nf-core/bigslice/main.nf

Some generated files are not rendered by default.

90 changes: 90 additions & 0 deletions modules/nf-core/bigslice/meta.yml

Some generated files are not rendered by default.
