Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation by HaidYi · Pull Request #483 · nf-core/funcscan

HaidYi · 2025-07-02T00:16:24Z

PR checklist

Close #481.

The main changes include:

As with the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
Added an annotation step that generates the .gff files, plus aliases of the existing modules (e.g., PYRODIGAL_GFF). As a result, the input gbk column may also take a .gff file. Feel free to change this part, as it may need some tweaks in both the pipeline and the documentation.
Other utilities:
- ci/cd, testing profiles for dbcan, module.config, etc.
- documentation: README and output docs

Changes needed from the maintainers:

Add a changelog entry for this change under the next release version.
Add the dbcan screening step to the workflow schematic.

nf-core-bot · 2025-07-02T00:17:00Z

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

jasmezz

What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
- This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
- file tree in output docs
- test names
- nextflow_schema.json ...
The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.

jasmezz · 2025-07-10T12:37:06Z

conf/test_preannotated_dbcan.config

+    dbcan_skip_cgc             = true   // skip cgc as .gbk is used
+    dbcan_skip_substrate       = true   // skip substrate as .gbk is used


If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
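To make the idea concrete, here is a minimal sketch of how an optional `gff` column could sit next to `gbk` in the samplesheet. It is written in Python purely for illustration and is not the PR's implementation. The column names `sample`, `fasta`, `protein`, `gbk` and `gff`, the split into required and optional columns, and the helper `read_samplesheet` are all assumptions.

```python
import csv
import io

REQUIRED = ("sample", "fasta")           # assumed required columns
OPTIONAL = ("protein", "gbk", "gff")     # assumed optional columns; `gff` is the new one


def read_samplesheet(text):
    """Parse a CSV samplesheet; missing or empty optional columns become None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column in REQUIRED:
            if not record.get(column):
                raise ValueError(f"missing required column value: {column}")
        row = {column: record[column] for column in REQUIRED}
        for column in OPTIONAL:
            # Absent column and empty cell are treated the same way.
            row[column] = record.get(column) or None
        rows.append(row)
    return rows


example = """sample,fasta,protein,gbk,gff
sample_1,s1.fasta,s1.faa,s1.gbk,s1.gff
sample_2,s2.fasta,,,
"""

for row in read_samplesheet(example):
    # Report per sample whether a pre-annotated .gff was supplied.
    print(row["sample"], "has_gff" if row["gff"] else "no_gff")
```

Because the column is optional, an older samplesheet without a `gff` header still parses, and each sample simply reports no pre-annotated .gff.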

jasmezz · 2025-07-10T13:22:09Z

docs/output.md

+    - `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
+    - `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
+    - `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
+  - `cgc`


Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

jfy133

Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing

Some additional things to my direct comments:

Missing citations.md update
Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
Missing metromap update (but we can probably do this before release)
Missing nf-test test and snapshots for the new tests

conf/test_cazyme_pyrodigal.config

jfy133 · 2025-07-15T06:45:11Z

conf/test_preannotated_dbcan.config

+    run_bgc_screening          = false
+    run_cazyme_screening       = true
+
+    dbcan_skip_cgc             = true   // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet


We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which has the files and an updated samplesheet for the next funcscan version.

Yes, currently the cazyme screening can only use the .gff files in the pipeline. To support the pre-annotated case, I generated the .gff files with pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Can this be updated now you have the file?

jfy133 · 2025-07-15T06:46:43Z

docs/output.md

 |   ├── deepbgc/
 |   ├── gecco/
 |   └── hmmsearch/
+├── dbcan/


The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories

jfy133 · 2025-07-15T06:48:37Z

docs/output.md

+
+- `dbcan/`
+  - `cazyme`
+    - `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation


You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config)
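The layout requested in these two comments (molecule type, then tool, then output directory, then sample) can be sketched as a small path builder. This is an illustration in Python, not pipeline code: the function name `cazyme_output_path` is hypothetical, and the ordering of the step directory before the sample directory is an assumption taken from the path asserted in this PR's tests/test_preannotated_cazyme.nf.test.

```python
from pathlib import PurePosixPath


def cazyme_output_path(outdir, tool, step, sample_id, filename):
    """Build <outdir>/cazyme/<tool>/<step>/<sample_id>/<sample_id>_<filename>."""
    return (
        PurePosixPath(outdir)
        / "cazyme"            # top level: molecule/gene type
        / tool                # e.g. "dbcan"
        / step                # e.g. "cazyme_annotation", "cgc", "substrate"
        / sample_id           # per-sample subdirectory
        / f"{sample_id}_{filename}"
    )


print(cazyme_output_path("results", "dbcan", "cazyme_annotation", "sample_1", "overview.tsv"))
# prints results/cazyme/dbcan/cazyme_annotation/sample_1/sample_1_overview.tsv
```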

docs/usage.md

jfy133 · 2025-07-15T06:55:00Z

subworkflows/local/cazyme.nf

+        .join(ch_gffs_for_rundbcan)
+        .multiMap { meta, faa, gff ->
+            faa: [meta, faa]
+            gff: [meta, gff, 'prodigal']


Is the gff always from prodigal? Or is this a dummy value?

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the gff is generated in the pipeline, it is always prodigal. But if a pre-annotated file is provided, it could be NCBI_prok, JGI, NCBI_euk or prodigal, which complicates things. An easier way is to define a CLI parameter for this option, but that makes it hard to handle a batch with mixed types without modifying the samplesheet.
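The trade-off described above can be sketched as a per-sample resolution step. The following Python snippet is only an illustration under stated assumptions: the helper `resolve_gff_type` and the idea of a per-sample `declared_type` (for example from a hypothetical extra samplesheet column) are not part of the PR, and the four allowed values are the ones listed in the comment above.

```python
# Values named in the reply above; treated here as the full allowed set.
ALLOWED_GFF_TYPES = {"NCBI_prok", "JGI", "NCBI_euk", "prodigal"}


def resolve_gff_type(preannotated, declared_type=None):
    """Return the gff type to pass along with one sample's GFF file."""
    if not preannotated:
        # GFF generated inside the pipeline: per the comment above, always prodigal.
        return "prodigal"
    if declared_type is None:
        # A pre-annotated file could be any of the allowed types, so do not guess.
        raise ValueError("pre-annotated GFF needs an explicit gff type")
    if declared_type not in ALLOWED_GFF_TYPES:
        raise ValueError(f"unsupported gff type: {declared_type}")
    return declared_type


print(resolve_gff_type(preannotated=False))                        # prodigal
print(resolve_gff_type(preannotated=True, declared_type="JGI"))    # JGI
```

Resolving the type per sample is what allows a mixed batch; a single CLI parameter would force one value onto every pre-annotated file.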

HaidYi · 2025-07-17T03:58:29Z

@jfy133 Thank you for the comments and suggestions. I will fix the problems one by one. As I don't want this PR to break other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when all the issues are fixed.

HaidYi · 2026-02-04T15:48:03Z

@jfy133 I updated the rundbcan module to download the database from AWS (nf-core/modules#9768), so this PR no longer suffers from the slow database download. Please review again.

jfy133

OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!

Here are the last points/questions (which also summarise some of the specific comments). Otherwise the code looks great; I've checked it against our pipeline conventions (now on dev here) and you're already following them 💪:

Conceptual

Can you confirm there are no db_can <subcmd> options/arguments that we should expose to the user via a pipeline parameter? E.g. for run_dbcan the --mode or --methods parameters? Or for the cgc_finder the parameter --use_distance ?

Code

test_preannotated_cazyme.config: You are missing an nf-test test file and its snapshot for the new test config

Documentation

usage.md: missing documentation in the samplesheet section about the new gff column
nextflow_schema.json: missing the long-form help text(s) describing when you might want to skip the cgc and substrate detection
CHANGELOG.md: missing a changelog entry for the PR, and please also make sure to add the version of db_can as a new dependency (i.e., the previous version column can be empty)
README.md: don't forget to add yourself to the 'credits' list!
nextflow.config: don't forget to add yourself to the manifest section as a contributor!

jfy133 · 2026-02-20T12:38:53Z

conf/test_preannotated_cazyme.config

+    dbcan_skip_cgc             = false   // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
+    dbcan_skip_substrate       = false   // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet


Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue maybe it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet

But it would be nice if another test profile (maybe test_cazyme_prokka) still tested the dbcan_skip_cgc and dbcan_skip_substrate functionality?

nextflow_schema.json

conf/test_preannotated_cazyme.config

docs/usage.md

jfy133 · 2026-03-14T19:26:05Z

@nf-core-bot fix linting

jfy133 · 2026-03-14T19:26:58Z

Hrm, the test failure error is:

> [7a/0497d8] Submitted process > NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)
    > ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   run_dbcan easy_substrate \
    >       --mode protein \
    >       --db_dir dbcan_db \
    >       --input_raw_data sample_3_pyrodigal.faa \
    >       --output_dir . \
    >       --input_gff sample_3_pyrodigal.gff \
    >       --gff_type prodigal \
    >   
    >   
    >   mv overview.tsv             sample_3_overview.tsv
    >   mv dbCAN_hmm_results.tsv    sample_3_dbCAN_hmm_results.tsv
    >   mv dbCANsub_hmm_results.tsv sample_3_dbCANsub_hmm_results.tsv
    >   mv diamond.out              sample_3_diamond.out
    >   mv cgc.gff                  sample_3_cgc.gff
    >   mv cgc_standard_out.tsv     sample_3_cgc_standard_out.tsv
    >   mv diamond.out.tc           sample_3_diamond.out.tc
    >   mv STP_hmm_results.tsv      sample_3_STP_hmm_results.tsv
    >   mv total_cgc_info.tsv       sample_3_total_cgc_info.tsv
    >   mv CGC.faa                  sample_3_CGC.faa
    >   mv PUL_blast.out            sample_3_PUL_blast.out
    >   mv substrate_prediction.tsv sample_3_substrate_prediction.tsv
    >   mv synteny_pdf/             sample_3_synteny_pdf/
    >   if [ -f TF_hmm_results.tsv ]; then
    >       mv TF_hmm_results.tsv   sample_3_TF_hmm_results.tsv
    >   fi
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE":
    >       dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Command wrapper:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Work dir:
    >   /home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/work/7a/0497d8588d4be4dfe1db4f5e81ce36
    > 
    > Container:
    >   quay.io/biocontainers/dbcan:5.2.6--pyhdfd78af_0
    > 
    > Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

HaidYi · 2026-03-15T02:43:24Z

I don't think so. @Xinpeng021001 Could you explain why the substrate step takes so long on this test dataset?

Xinpeng021001 · 2026-03-15T02:46:00Z

The substrate process shouldn't take such a long time. I'm reviewing the error and will reply asap.

jfy133 · 2026-03-20T08:28:33Z

OK I guess there was some download slowdown @HaidYi @Xinpeng021001 , assuming one of you just restarted the tests?

jfy133

I think this is done, when I'm at my office I will push the update to the tests themselves, and then I will update the diagram and I think we are good to go for a merge!

I'll let you know when it's ready and you can hit the 'merge' button @HaidYi !

jfy133 · 2026-03-20T08:32:18Z

tests/test_preannotated_cazyme.nf.test

+                { assert new File("$outputDir/multiqc/multiqc_report.html").exists() },
+
+                // dbCAN annotation
+                { assert path("$outputDir/cazyme/dbcan/cazyme_annotation/sample_1/sample_1_dbCAN_hmm_results.tsv").text.contains("dbCAN") },


Each of these should be wrapped in its own snapshot, so it's easier to know what has changed, if anything (like the final snapshot you have at the end).

docs/usage.md

CHANGELOG.md

jfy133

🎉 🎉 🎉 We finally made it! Sorry that took so long @HaidYi ! (and @Xinpeng021001 !), I appreciate your patience, but I think it was worth it! Thank you for the effort!

I will do the metromap update in a follow up PR and do a quick release :)

HaidYi and others added 7 commits June 30, 2025 19:22

Add run_dbcan screening

6353679

fix missing gffs

15f2ef5

split dbcan results by meta.id

d5df4a1

rm constraints of annotation tool

f049e2f

add test config for rundbcan

8289bdb

add test profile for rundbcan in ci

d8af5e9

add dbcan in the refs

0a5e505

HaidYi self-assigned this Jul 2, 2025

HaidYi requested review from Darcy220606, jasmezz and jfy133 as code owners July 2, 2025 00:16

HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025

HaidYi mentioned this pull request Jul 2, 2025

Add rundbcan for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation #481

Closed

Suggestions from code review

01a573a

jasmezz reviewed Jul 10, 2025

View reviewed changes

HaidYi added 5 commits July 14, 2025 23:18

rm duplicate outputs

5c5ec66

add manual dbCAN database download

9fd005c

rename DBCAN to CAZYME

ea4b852

add gff column in samplesheet

62623a5

change run_dbcan_screening to run_cazyme_screening

0cad8f9

jfy133 reviewed Jul 15, 2025

View reviewed changes

HaidYi and others added 4 commits July 16, 2025 19:24

add missing identifier

b76e3a2

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

add missing identifier

0f5863a

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

add missing conda

f2d79d5

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

fix typo

625ced4

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

HaidYi added 3 commits July 16, 2025 23:01

re-organize the outdir structure of cazyme screening

58273f1

add citation

a638f32

add cazyme_skip_dbcan param

a5d692b

HaidYi and others added 4 commits February 3, 2026 13:41

update the default test snap

c0c76dc

remove one assertion for the upgraded package

7a921b4

update the test cazyme pyrodigal snap file

d2e8e94

Merge branch 'dev' into rundbcan

4b5a342

HaidYi requested a review from jfy133 February 4, 2026 16:04

jfy133 added 2 commits February 20, 2026 14:12

Apply suggestions from code review

737db6c

Apply suggestions from code review

3b9bbab

jfy133 reviewed Feb 20, 2026

View reviewed changes

HaidYi added 7 commits March 7, 2026 19:44

fix comments

b6c0aab

update the doc/usage

3c9c118

update contributor

49ef6d9

add help_texts

0618e35

add to contributor

8be0c33

update changelog

ebe8bc4

add more cazyme tests

80f7f1e

[automated] Fix code linting

b54a6e7

Apply suggestion from @jfy133

97c9bb9

jfy133 reviewed Mar 20, 2026

View reviewed changes

jfy133 added 3 commits March 20, 2026 11:35

Apply suggestions from code review

8103a8a

Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>

Restructure tests

419e389

Merge branch 'dev' into rundbcan

35f9c4b

jfy133 self-requested a review March 23, 2026 12:25

jfy133 approved these changes Mar 23, 2026

View reviewed changes

jfy133 merged commit b6fe2c0 into nf-core:dev Mar 23, 2026
64 of 66 checks passed
