Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation#483

Merged
jfy133 merged 81 commits into nf-core:dev from HaidYi:rundbcan on Mar 23, 2026

@HaidYi HaidYi commented Jul 2, 2025

PR checklist

Close #481.

The main changes include:

  • As with the other screening tools, added a dedicated subworkflow (subworkflows/dbcan.nf) to support run_dbcan screening.
  • Added an annotation step that generates .gff files, and added aliases for the existing modules (e.g., PYRODIGAL_GFF), so the input gbk column may also take a gff file. Feel free to change this part, as it may need some tweaks with regard to both the pipeline and the documentation.
  • Other utilities:
    • ci/cd, testing profiles for dbcan, module.config, etc.
    • documents: readme and output

Changes needed from the maintainers:

  • Add the changelog for this change in the next release version.
  • Add the dbcan screening step in the schematic workflow.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/funcscan branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@HaidYi HaidYi self-assigned this Jul 2, 2025
@HaidYi HaidYi added the enhancement Improvement for existing functionality label Jul 2, 2025
Member

nf-core-bot commented Jul 2, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.5.1.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

Collaborator

@jasmezz jasmezz left a comment

What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.

Some other comments we could consider:

  • Thinking about renaming the new dbcan subworkflow to cazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.
    • This would include changing the output dir in modules.config to ${params.outdir}/cazyme/cazyme_annotation, ${params.outdir}/cazyme/cgc, ${params.outdir}/cazyme/substrate
    • file tree in output docs
    • test names
    • nextflow_schema.json ...
  • The database download takes very long because of the low download rate (>2 GB at a rate of ~1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
  • Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
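
As a rough back-of-the-envelope check on the download time mentioned in the list above, the sketch below estimates how long a database of a given size takes at a given rate. The 2 GB size and 1 MB/s rate are the approximate figures from the comment; the function name is purely illustrative.

```python
def download_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Estimate download time in minutes (1 GB taken as 1000 MB)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 60


# Approximate figures from the review: at least 2 GB at roughly 1 MB/s.
print(f"{download_minutes(2, 1):.0f} min")  # prints "33 min"
```

Under these assumptions the download alone takes over half an hour, before any screening step starts.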

Comment on lines +35 to +36
dbcan_skip_cgc = true // skip cgc as .gbk is used
dbcan_skip_substrate = true // skip substrate as .gbk is used
Collaborator

If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
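
To make the idea concrete, here is a minimal Python sketch of how an optional `gff` column could sit alongside `gbk` in a samplesheet. The `fasta` column, the file names, and the helper `can_run_cgc` are assumptions for illustration only; this is not how the pipeline itself validates its input.

```python
import csv
import io

# Hypothetical samplesheet: `gff` is optional and may be left empty per sample.
SAMPLESHEET = """\
sample,fasta,gbk,gff
sample_1,sample_1.fasta,sample_1.gbk,sample_1.gff
sample_2,sample_2.fasta,sample_2.gbk,
"""


def can_run_cgc(row: dict) -> bool:
    """CGC and substrate steps need a GFF; a GBK alone is not enough."""
    return bool((row.get("gff") or "").strip())


rows = list(csv.DictReader(io.StringIO(SAMPLESHEET)))
for row in rows:
    status = "full CAZyme subworkflow" if can_run_cgc(row) else "skip cgc/substrate"
    print(f"{row['sample']}: {status}")
```

With a column like this, a sample can carry a `.gbk` for the other subworkflows and a `.gff` for CAZyme screening at the same time.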

docs/output.md Outdated
- `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation.
- `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation.
- `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation.
- `cgc`
Collaborator

Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).

Author

@jasmezz Thank you for reviewing the code. I will revise it based on your comments.

Member

@jfy133 jfy133 left a comment

Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing

Some additional things to my direct comments:

run_bgc_screening = false
run_cazyme_screening = true

dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
Member

We should probably add gff files!

You can generate them from a normal funcscan run, and make a PR against the funcscan branch of nf-core/test-datasets, which has the files and an updated samplesheet for the next funcscan version

Author

Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.

Member

Can this be updated now you have the file?

docs/output.md Outdated
| ├── deepbgc/
| ├── gecco/
| └── hmmsearch/
├── dbcan/
Member

The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories

docs/output.md Outdated

- `dbcan/`
- `cazyme`
- `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation
Member

You're missing the <sample.id> sample subdirectory underneath the tool name (according to your modules.config)
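
The layout requested in these comments (gene type, then tool, then output step, then sample) can be sketched as a small path builder. This is only an illustration in Python: the function name is hypothetical, and the step name `cazyme_annotation` is taken from the directories proposed earlier in the review.

```python
from pathlib import PurePosixPath


def cazyme_output_path(outdir: str, tool: str, step: str, sample_id: str, filename: str) -> PurePosixPath:
    """Build <outdir>/cazyme/<tool>/<step>/<sample_id>/<sample_id>_<filename>."""
    return PurePosixPath(outdir) / "cazyme" / tool / step / sample_id / f"{sample_id}_{filename}"


print(cazyme_output_path("results", "dbcan", "cazyme_annotation", "sample_1", "overview.tsv"))
# prints: results/cazyme/dbcan/cazyme_annotation/sample_1/sample_1_overview.tsv
```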

.join(ch_gffs_for_rundbcan)
.multiMap { meta, faa, gff ->
faa: [meta, faa]
gff: [meta, gff, 'prodigal']
Member

Is the gff always from prodigal? Or is this a dummy value?

Author

Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If the GFF is generated in the pipeline, it is always prodigal. But if a pre-annotated one is provided, it could be NCBI_prok, JGI, NCBI_euk, or prodigal, which complicates things. An easier way is to define a CLI parameter for this option, but then it is hard to handle a mixed batch without modifying the samplesheet.
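
One way to handle the mixed case would be a per-sample value with a fallback, sketched below in Python. The `gff_type` samplesheet column and the function name are hypothetical; only the four accepted values come from the comment above.

```python
# GFF types listed in the comment above for rundbcan_easycgc.
ALLOWED_GFF_TYPES = {"NCBI_prok", "JGI", "NCBI_euk", "prodigal"}


def resolve_gff_type(row: dict, default: str = "prodigal") -> str:
    """Return the per-sample GFF type, falling back to `default` when unset.

    `row` is one samplesheet record; the `gff_type` key is a hypothetical column.
    """
    value = (row.get("gff_type") or "").strip() or default
    if value not in ALLOWED_GFF_TYPES:
        raise ValueError(f"unsupported gff_type {value!r}")
    return value


batch = [
    {"sample": "sample_1", "gff_type": ""},           # annotated in the pipeline
    {"sample": "sample_2", "gff_type": "NCBI_prok"},  # pre-annotated input
]
for row in batch:
    print(row["sample"], resolve_gff_type(row))
```

The trade-off is the one already noted: a single CLI parameter is simpler, while a per-sample column covers mixed batches at the cost of a samplesheet change.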

HaidYi and others added 4 commits July 16, 2025 19:24
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Author

HaidYi commented Jul 17, 2025

@jfy133 Thank you for the comments and suggestions. I will fix all the problems one by one. As I don't want this PR to break other screening steps, I will do more comprehensive testing, which may take more time. I will let you know when I have fixed all the issues.

Author

HaidYi commented Feb 4, 2026

@jfy133 I updated the rundbcan module to use AWS for database downloading (nf-core/modules#9768), so this PR no longer suffers from the long database download times. Please review again.

@HaidYi HaidYi requested a review from jfy133 February 4, 2026 16:04
Member

@jfy133 jfy133 left a comment

OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!

Here are the last points/questions (to summarise some of the specific comments too), but otherwise the code looks great. I've checked against our pipeline conventions (now on dev here) and you're already following them 💪:

Conceptual

  1. Can you confirm there are no db_can <subcmd> options/arguments that we should expose to the user via a pipeline parameter? E.g. for run_dbcan the --mode or --methods parameters? Or for the cgc_finder the parameter --use_distance ?

Code

  1. test_preannotated_cazyme.conf: You are missing an nf-test test file and its snapshot for the new test config

Documentation

  1. usage.md: missing documentation in the samplesheet section about the new gff column
  2. nextflow_schema.json: missing the long-form helptext(s) describing when you would want to maybe skip the cgc and substrate detection
  3. CHANGELOG.md: missing a change log entry of the PR, but also please make sure to add the version of db_can as a new dependency (i.e., the previous version column can be empty)
  4. README.md: don't forget to add yourself to the 'credits' list!
  5. nextflow.config: don't forget to add yourself to the manifest section as a contributor!

Comment on lines +35 to +36
dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet
dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet
Member

Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue maybe it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet

But it would be nice if another test profile (maybe test_cazyme_prokka) still tested the dbcan_skip_cgc and dbcan_skip_substrate functionality.

Member

jfy133 commented Mar 14, 2026

@nf-core-bot fix linting

Member

jfy133 commented Mar 14, 2026

Hrm, the test failure error is:

> [7a/0497d8] Submitted process > NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)
    > ERROR ~ Error executing process > 'NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE (sample_3)'
    > 
    > Caused by:
    >   Process exceeded running time limit (1h)
    > 
    > 
    > Command executed:
    > 
    >   run_dbcan easy_substrate \
    >       --mode protein \
    >       --db_dir dbcan_db \
    >       --input_raw_data sample_3_pyrodigal.faa \
    >       --output_dir . \
    >       --input_gff sample_3_pyrodigal.gff \
    >       --gff_type prodigal \
    >   
    >   
    >   mv overview.tsv             sample_3_overview.tsv
    >   mv dbCAN_hmm_results.tsv    sample_3_dbCAN_hmm_results.tsv
    >   mv dbCANsub_hmm_results.tsv sample_3_dbCANsub_hmm_results.tsv
    >   mv diamond.out              sample_3_diamond.out
    >   mv cgc.gff                  sample_3_cgc.gff
    >   mv cgc_standard_out.tsv     sample_3_cgc_standard_out.tsv
    >   mv diamond.out.tc           sample_3_diamond.out.tc
    >   mv STP_hmm_results.tsv      sample_3_STP_hmm_results.tsv
    >   mv total_cgc_info.tsv       sample_3_total_cgc_info.tsv
    >   mv CGC.faa                  sample_3_CGC.faa
    >   mv PUL_blast.out            sample_3_PUL_blast.out
    >   mv substrate_prediction.tsv sample_3_substrate_prediction.tsv
    >   mv synteny_pdf/             sample_3_synteny_pdf/
    >   if [ -f TF_hmm_results.tsv ]; then
    >       mv TF_hmm_results.tsv   sample_3_TF_hmm_results.tsv
    >   fi
    >   
    >   cat <<-END_VERSIONS > versions.yml
    >   "NFCORE_FUNCSCAN:FUNCSCAN:CAZYME:RUNDBCAN_EASYSUBSTRATE":
    >       dbcan: $(echo $(run_dbcan version) | cut -f2 -d':' | cut -f2 -d' ')
    >   END_VERSIONS
    > 
    > Command exit status:
    >   -
    > 
    > Command output:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Command wrapper:
    >   step 1/4  CAZyme annotation...
    >   step 2/4  GFF processing...
    > 
    > Work dir:
    >   /home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/work/7a/0497d8588d4be4dfe1db4f5e81ce36
    > 
    > Container:
    >   quay.io/biocontainers/dbcan:5.2.6--pyhdfd78af_0
    > 
    > Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
    > 
    >  -- Check '/home/runner/_work/funcscan/funcscan/~/tests/663aa07f0ba76014e15144d46d75baaa/meta/nextflow.log' file for details
    > Execution cancelled -- Finishing pending tasks before exit
    > ERROR ~ Pipeline failed. Please refer to troubleshooting docs: https://nf-co.re/docs/usage/troubleshooting

Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

Author

HaidYi commented Mar 15, 2026

> Do you expect EASYSUBSTRATE to be so slow @HaidYi ?

I don't think so. @Xinpeng021001 Could you explain why the substrate step needs such a long time to finish on this test dataset?

@Xinpeng021001
Member

The substrate process shouldn't take such a long time. I'm reviewing the error and will reply asap.

Member

jfy133 commented Mar 20, 2026

OK I guess there was some download slowdown @HaidYi @Xinpeng021001 , assuming one of you just restarted the tests?

Member

@jfy133 jfy133 left a comment

I think this is done, when I'm at my office I will push the update to the tests themselves, and then I will update the diagram and I think we are good to go for a merge!

I'll let you know when it's ready and you can hit the 'merge' button @HaidYi !

{ assert new File("$outputDir/multiqc/multiqc_report.html").exists() },

// dbCAN annotation
{ assert path("$outputDir/cazyme/dbcan/cazyme_annotation/sample_1/sample_1_dbCAN_hmm_results.tsv").text.contains("dbCAN") },
Member

Each of these should be wrapped in its own snapshot, so it's easier to know what has changed, if anything (like the final snapshot you have at the end).

@jfy133 jfy133 self-requested a review March 23, 2026 12:25
Member

@jfy133 jfy133 left a comment

🎉 🎉 🎉 We finally made it! Sorry that took so long @HaidYi ! (and @Xinpeng021001 !), I appreciate your patience, but I think it was worth it! Thank you for the effort!

I will do the metromap update in a follow up PR and do a quick release :)

@jfy133 jfy133 merged commit b6fe2c0 into nf-core:dev Mar 23, 2026
64 of 66 checks passed