Add BiG-SLiCE to BGC subworkflow #519

Open
SkyLexS wants to merge 17 commits into nf-core:dev from SkyLexS:big_slice_imp
@SkyLexS commented Mar 10, 2026

Closes #515

PR checklist

  • Integrated bigslice into funcscan pipeline (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@nf-core-bot
Member

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

@jfy133 changed the title from "Big slice imp" to "Add BiG-SLiCE to BGC subworkflow" Mar 20, 2026
Member

@jfy133 left a comment

I'll do a more thorough review when I'm in my other office (train just turned around halfway to the planned one for today 🙃)

// BIGSLICE
if (params.bgc_bigslice_run) {

if (params.bgc_skip_antismash && (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) {
Member

This check, and the error in the if/else below, should go in the input validation in subworkflows/local/utils_nfcore_funcscan<...>, as we want the pipeline to fail at the beginning if the required input is missing with this particular parameter combination, not halfway through the pipeline execution :)
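
For illustration, the upfront check could be sketched roughly as below in plain Groovy. The helper name `bigsliceInputError` and the message text are hypothetical and do not come from the PR; only the parameter names and the condition itself are taken from the check above. Where exactly it would be called in the utils subworkflow is left open.

```groovy
// Hypothetical helper: the name and message text are illustrative, not from the PR.
// Returns an error message when BiG-SLiCE is requested without a usable BGC
// source, or null when the parameter combination is fine.
def bigsliceInputError(Map params) {
    if (!params.bgc_bigslice_run) {
        return null
    }
    // antiSMASH counts as a source unless it is skipped
    boolean antismashOk = !params.bgc_skip_antismash
    // GECCO only counts when convert is run and writes the bigslice format
    boolean geccoOk = !params.bgc_skip_gecco &&
        params.bgc_gecco_runconvert &&
        params.bgc_gecco_convertformat == 'bigslice'
    if (antismashOk || geccoOk) {
        return null
    }
    return "--bgc_bigslice_run needs antiSMASH, or GECCO with convert format 'bigslice', to be enabled"
}

// Example: both BGC sources disabled, so a message is returned
def msg = bigsliceInputError([
    bgc_bigslice_run  : true,
    bgc_skip_antismash: true,
    bgc_skip_gecco    : true,
])
if (msg) {
    println "ERROR: ${msg}"
}
```

In the pipeline itself the message would presumably be raised with Nextflow's `error()` during initialisation rather than printed, so the run stops before any process is launched.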

ch_bigslice_input = ch_bigslice_input.mix(
GECCO_CONVERT.out.bigslice
)
}
Member

I think you could simplify this into a single if/else statement, with each condition assigning ch_bigslice_input, rather than making an empty input channel and mixing in.
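
One possible shape for this, as a rough sketch only. `ch_antismash_gbk` and `use_gecco_bigslice` are placeholder names that do not appear in the PR; they stand in for whatever channel carries the antiSMASH GenBank output and for the GECCO convert condition. When both sources are enabled the two channels still have to be combined, so one `mix` remains in that branch. This is a fragment meant to sit inside the existing `if (params.bgc_bigslice_run)` block of the Nextflow subworkflow, not a standalone script.

```groovy
// Fragment for inside the BGC subworkflow; it relies on the surrounding
// Nextflow context (params, GECCO_CONVERT) and is not runnable on its own.

// use_gecco_bigslice (placeholder name): GECCO convert produces bigslice-format output
def use_gecco_bigslice = !params.bgc_skip_gecco &&
    params.bgc_gecco_runconvert &&
    params.bgc_gecco_convertformat == 'bigslice'

if (!params.bgc_skip_antismash && use_gecco_bigslice) {
    // Both sources enabled: combine them into one input channel.
    // ch_antismash_gbk is a placeholder for the antiSMASH GenBank output channel.
    ch_bigslice_input = ch_antismash_gbk.mix(GECCO_CONVERT.out.bigslice)
} else if (!params.bgc_skip_antismash) {
    // Only antiSMASH enabled
    ch_bigslice_input = ch_antismash_gbk
} else {
    // antiSMASH skipped: assumes upfront validation has already ensured
    // that GECCO convert with the bigslice format is enabled
    ch_bigslice_input = GECCO_CONVERT.out.bigslice
}
```

Each branch only references `GECCO_CONVERT.out` when that process has actually been invoked, which is the main reason for branching rather than always mixing.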

docs/usage.md Outdated

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

- antiSMASH (default BGC tool), **or**
Member

Suggested change
- antiSMASH (default BGC tool), **or**
- antiSMASH (default BGC tool).

docs/usage.md Outdated

### BiGSLiCE

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
Member

Suggested change
[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs).
It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

docs/usage.md Outdated
- antiSMASH (default BGC tool), **or**
- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
Member

Suggested change
BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.

docs/output.md Outdated

</details>

[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
Member

Suggested change
[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.

Are there no other output files besides data.db? I'm sort of surprised

Author

> Are there no other output files besides data.db? I'm sort of surprised

BiG-SLiCE dumps everything into data.db by design; it's meant to be queried as a database.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to their latest article, there is also a setting called --export-tsv which was recently introduced in BiG-SLiCE v. 2.0:

> In addition, in response to high user demand, we have introduced a new program parameter (‘--export-tsv’) that allows users to export all parsed BGC metadata, vectorized features, and clustering results as tab-separated text files (TSVs).

Author

@SkyLexS commented Mar 20, 2026

> I'll do a more thorough review when I'm in my other office (train just turned around halfway to the planned one for today 🙃)

No worries, review when you have time

Member

@jfy133 left a comment

Good first pass! Mostly missing a few things :)

Major things

All compared to the developer docs I recently added here: https://github.com/nf-core/funcscan/blob/dev/docs/usage/developing.md

  • Missing updates to test configs and the nf-test test files/snapshots - how big is the BiG-SLiCE database? You might need to make a smaller one
  • Missing citation information in CITATIONS.md, and in utils_nfcore_funcscan_pipeline
  • Don't forget to add yourself to the README and manifest section of nextflow.config!

ch_versions = ch_versions.mix(GECCO_CONVERT.out.versions)
}
// BIGSLICE
if (params.bgc_bigslice_run) {
Member

Suggested change
if (params.bgc_bigslice_run) {
if (params.bgc_run_bigslice) {

This should be updated everywhere, following the pipeline-specific conventions (see footnote 1).

Footnotes

  1. https://github.com/nf-core/funcscan/blob/dev/docs/usage/developing.md#pipeline-specific-conventions

@@ -159,14 +159,14 @@
},
Member

The changes here except the new parameters should be reverted!

},
"bgc_bigslice_db": {
"type": "string",
"description": "Path to the pre-downloaded BiG-SLiCE HMM database directory."
Member

A long-form help text saying where the database comes from would be nice. Also add the `Modifies tool parameter(s):` text and list the parameter in BiG-SLiCE itself that the database is passed to.


### BiGSLiCE

BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`.
Member

Is there a reason why you don't do this? It appears there is a tool in the package for this: https://github.com/medema-group/bigslice?tab=readme-ov-file#quick-start

I think this would be important given the models look like they could change more regularly than the package itself.

Hardcoding the URL is also risky, especially given I don't see any further BiG-SLiCE docs on how to get the database files other than the above.

(I think the 271MB database file might be OK for our test runs, if it downloads fast, but you might need to look inside and see if you can make a smaller version.)

It looks like there are some docs here on how to do it: https://github.com/medema-group/bigslice/tree/master/bigslice/db/advanced

}

withName: BIGSLICE {
publishDir = [
Member

Are there no parameters a user may want to refine here and we should expose to the user at a pipeline level?

Looking here, it looks to me like there are lots of 'threshold'-like parameters (e.g. -n_ranks, --threshold, etc.)

- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
Member

Suggested change
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.
The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#databases-and-reference-files) for details); it is not auto-downloaded by the pipeline.


[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.
Member

I would add a sentence after this saying what you would use the data.db file for, as it's (arguably) not a standard thing you would use.

4 participants