Add BiG-SLiCE to BGC subworkflow by SkyLexS · Pull Request #519 · nf-core/funcscan

SkyLexS · 2026-03-10T12:28:43Z

Closes #515

PR checklist

nf-core-bot · 2026-03-10T12:29:17Z

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

jfy133

I'll do a more thorough review when I'm in my other office (train just turned around halfway to the planned one for today 🙃)

jfy133 · 2026-03-20T08:46:03Z

subworkflows/local/bgc.nf

+    // BIGSLICE
+    if (params.bgc_bigslice_run) {
+
+        if (params.bgc_skip_antismash && (params.bgc_skip_gecco || !params.bgc_gecco_runconvert || params.bgc_gecco_convertformat != 'bigslice')) {


This check, and the error in the if/else below, should go in input validation in subworkflows/local/utils_nfcore_funcscan<...>, as we want the pipeline to fail at the beginning when this particular parameter combination is invalid, not halfway through the pipeline execution :)
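The fail-fast idea can be sketched independently of Nextflow. The snippet below uses Python purely for illustration (the pipeline itself is Nextflow/Groovy): the function name, the dict-based `params` and the error message are hypothetical, while the boolean condition mirrors the one in the diff above.

```python
def bigslice_input_error(params):
    """Return an error message if BiG-SLiCE would have no input, else None.

    Hypothetical helper: `params` is a plain dict standing in for the
    pipeline parameters; missing keys are treated as False/None.
    """
    if not params.get("bgc_bigslice_run", False):
        return None  # BiG-SLiCE not requested, nothing to validate

    # GECCO only feeds BiG-SLiCE when it runs AND its output is converted
    # to the bigslice format.
    gecco_feeds_bigslice = bool(
        not params.get("bgc_skip_gecco", False)
        and params.get("bgc_gecco_runconvert", False)
        and params.get("bgc_gecco_convertformat") == "bigslice"
    )

    # Same condition as in the diff: antiSMASH skipped and GECCO not usable.
    if params.get("bgc_skip_antismash", False) and not gecco_feeds_bigslice:
        return (
            "BiG-SLiCE needs antiSMASH output or GECCO output "
            "converted to the bigslice format."
        )
    return None
```

Calling such a check once at pipeline initialisation, and aborting if it returns a message, is what makes the run fail before any BGC tool has started.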

jfy133 · 2026-03-20T08:47:21Z

subworkflows/local/bgc.nf

+            ch_bigslice_input = ch_bigslice_input.mix(
+                GECCO_CONVERT.out.bigslice
+            )
+        }


I think you could simplify this into a single if/else statement, with each condition assigning ch_bigslice_input, rather than making an empty input channel and mixing in.

jfy133 · 2026-03-20T08:50:41Z

docs/usage.md

+
+[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:
+
+- antiSMASH (default BGC tool), **or**


Suggested change

- antiSMASH (default BGC tool), **or**

- antiSMASH (default BGC tool).

jfy133 · 2026-03-20T08:50:57Z

docs/usage.md


+### BiGSLiCE
+
+[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:


Suggested change

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs). It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

[BiG-SLiCE](https://github.com/medema-group/bigslice) clusters BGC sequences into Gene Cluster Families (GCFs).

It is activated with `--bgc_bigslice_run` and requires at least one BGC source to be enabled:

jfy133 · 2026-03-20T08:51:08Z

docs/usage.md

+- antiSMASH (default BGC tool), **or**
+- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`
+
+BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.


Suggested change

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input. The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.

BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.

The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.

jfy133 · 2026-03-20T08:52:34Z

docs/output.md

+
+</details>
+
+[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.


Suggested change

[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs). It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments. BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.

[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).

It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.

BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.

Are there no output files other than the data.db? I'm sort of surprised

BiG-SLiCE dumps everything into data.db by design, it's meant to be queried as a database.

According to their latest article, there is also a setting called --export-tsv, which was recently introduced in BiG-SLiCE v2.0:

In addition, in response to high user demand, we have introduced a new program parameter (‘--export-tsv’) that allows users to export all parsed BGC metadata, vectorized features, and clustering results as tab-separated text files (TSVs).
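Since data.db is an SQLite file, it can be inspected without knowing its schema in advance. The following is a minimal sketch using Python's standard sqlite3 module. The function name is hypothetical and nothing about the BiG-SLiCE table layout is assumed; it only lists whatever tables the file contains, with their row counts.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def summarise_sqlite(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    # Open read-only so the pipeline output is never modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Skip SQLite's internal bookkeeping tables.
        names = [n for n in names if not n.startswith("sqlite_")]
        return {
            n: conn.execute(f'SELECT COUNT(*) FROM "{n}"').fetchone()[0]
            for n in names
        }
```

Calling `summarise_sqlite` on the published data.db gives a quick overview of what was stored, which is a reasonable starting point before writing more specific queries.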

SkyLexS · 2026-03-20T09:40:24Z

No worries, review when you have time

jfy133

Good first pass! Mostly missing a few things :)

Major things

All compared to the developer docs I recently added here: https://github.com/nf-core/funcscan/blob/dev/docs/usage/developing.md

- Missing updates to the test configs and the nf-test test files/snapshots. How big is the BiG-SLiCE database? You might need to make a smaller one.
- Missing citation information in CITATIONS.md, and in utils_nfcore_funcscan_pipeline.
- Don't forget to add yourself to the README and the manifest section of nextflow.config!

jfy133 · 2026-03-20T09:45:15Z

subworkflows/local/bgc.nf

        ch_versions = ch_versions.mix(GECCO_CONVERT.out.versions)
    }
+    // BIGSLICE
+    if (params.bgc_bigslice_run) {


Suggested change

if (params.bgc_bigslice_run) {

if (params.bgc_run_bigslice) {

This should be updated everywhere, following the pipeline-specific conventions ¹

Footnotes

https://github.com/nf-core/funcscan/blob/dev/docs/usage/developing.md#pipeline-specific-conventions ↩

jfy133 · 2026-03-20T09:47:36Z

nextflow_schema.json

@@ -159,14 +159,14 @@
                },


The changes here except the new parameters should be reverted!

jfy133 · 2026-03-20T09:49:20Z

nextflow_schema.json

+                },
+                "bgc_bigslice_db": {
+                    "type": "string",
+                    "description": "Path to the pre-downloaded BiG-SLiCE HMM database directory."


A long-form help text saying where the database comes from would be nice. Please also add the "Modifies tool parameter(s):" text, listing the BiG-SLiCE parameter that this database path is passed to.

jfy133 · 2026-03-20T09:53:57Z

docs/usage.md


+### BiGSLiCE
+
+BiG-SLiCE requires its own HMM database. Unlike most other tools, the pipeline does **not** auto-download this database — it **must** be supplied manually with `--bgc_bigslice_db`.


Is there a reason why you don't do this? It appears there is a tool in the package for this: https://github.com/medema-group/bigslice?tab=readme-ov-file#quick-start

I think this would be important, given the models look like they could change more regularly than the package itself.

Hardcoding the URL is also risky, given that I don't see any other BiG-SLiCE docs on how to get the database files beyond the above.

(I think the 271MB database file might be OK for our test runs if it downloads fast, but you might need to look inside and see if you can make a smaller version.)

It looks like there are some docs here on how to do it : https://github.com/medema-group/bigslice/tree/master/bigslice/db/advanced

jfy133 · 2026-03-20T09:56:15Z

conf/modules.config

    }

+    withName: BIGSLICE {
+        publishDir = [


Are there no parameters a user may want to refine here and we should expose to the user at a pipeline level?

Looking here, it looks to me like there are lots of 'threshold'-like parameters (e.g. -n_ranks, --threshold etc.)

jfy133 · 2026-03-20T10:27:59Z

docs/usage.md

+- GECCO with `--bgc_gecco_runconvert --bgc_gecco_convertmode gbk --bgc_gecco_convertformat bigslice`
+
+BiG-SLiCE does **not** discover BGCs itself — it takes GenBank-format BGC regions produced by antiSMASH and/or GECCO convert as input.
+The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.


Suggested change

The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#bigslice-1) for details); it is not auto-downloaded by the pipeline.

The HMM database must be provided explicitly via `--bgc_bigslice_db` (see [BiGSLiCE database](#databases-and-reference-files) for details); it is not auto-downloaded by the pipeline.

jfy133 · 2026-03-20T10:30:15Z

docs/output.md

+
+[BiG-SLiCE](https://github.com/medema-group/bigslice) (**Bi**osynthetic **G**ene cluster **S**uper-**Li**near **C**lustering **E**ngine) is a highly scalable tool for the large-scale analysis and clustering of Biosynthetic Gene Clusters (BGCs) into Gene Cluster Families (GCFs).
+It takes BGC regions in GenBank format (e.g. output from antiSMASH or GECCO) along with an HMM database and produces an SQLite database of predicted BGC features and GCF assignments.
+BiG-SLiCE requires the HMM database to be supplied via `--bgc_bigslice_db` and is activated with `--bgc_bigslice_run`. It requires at least one of antiSMASH or GECCO (with convert in bigslice format) to be enabled.


I would add a sentence after this saying what you would use the data.db file for, as it's (arguably) not a standard file type users would know how to work with.

SkyLexS added 13 commits February 17, 2026 16:07

Implelenting bigslice into funcscan

dc3dc26

preping the input for bigslice

2300eb6

preping the input for bigslice

722d4bb

preping the input for bigslice

69168d3

preping the input for bigslice

e254400

preping the input for bigslice

040dd98

preping the input for bigslice

a690c8b

reverting the version genereting method

0bde3ec

Updating bigslice after the modules updates

4f4f630

Updating bgc versioning for bigslice

9322394

Updating bgc versioning for bigslice

48e6043

linting

879f1c6

Modification of the md files

534dab5

SkyLexS requested review from Darcy220606, jasmezz and jfy133 as code owners March 10, 2026 12:28

SkyLexS and others added 3 commits March 10, 2026 14:44

clearing conflicts

057d142

Merge branch 'dev' into big_slice_imp

afb50a1

trailing whitespaces

925ef7e

jfy133 changed the title ~~Big slice imp~~ Add BiG-SLiCE to BGC subworkflow Mar 20, 2026

jfy133 reviewed Mar 20, 2026

Move BigSLICE input validation to pipeline initialisation

59c0733

jfy133 reviewed Mar 20, 2026
