Add run_dbcan screening for the CAZyme (carbohydrate active enzyme) and CGC (CAZyme Gene Cluster) annotation#483
Conversation
|
Warning Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation. |
jasmezz
left a comment
There was a problem hiding this comment.
What a great addition! @HaidYi I really appreciate your effort, your PR is really clear and on point. Thank you very much for this contribution. During review I directly pushed some minor changes to your fork.
Some other comments we could consider:
- Thinking about renaming the new
dbcansubworkflow tocazyme. This would be more in line with previous naming, i.e. subworkflow names tell the purpose, not the tool.- This would include changing the output dir in
modules.configto${params.outdir}/cazyme/cazyme_annotation,${params.outdir}/cazyme/cgc,${params.outdir}/cazyme/substrate - file tree in output docs
- test names
- nextflow_schema.json ...
- This would include changing the output dir in
- The database download takes very long because of low download rate (>2 GB at at rate of ~ 1 MB/s). That is too long for the test profiles; we need to create a smaller database somehow...
- Adding manual dbCAN database download (via bioconda) to the respective section in usage docs.
conf/test_preannotated_dbcan.config
Outdated
| dbcan_skip_cgc = true // skip cgc as .gbk is used | ||
| dbcan_skip_substrate = true // skip substrate as .gbk is used |
There was a problem hiding this comment.
If we want to be able to run the complete CAZyme subworkflow with pre-annotated .gff files while also providing pre-annotated .gbk files for other subworkflows, we need an additional (optional) column in the samplesheet.
docs/output.md
Outdated
| - `*_dbCAN_hmm_results.tsv`: TSV file containing the detailed dbCAN HMM results for CAZyme annotation. | ||
| - `*_dbCANsub_hmm_results.tsv`: TSV file containing the detailed dbCAN subfamily results for CAZyme annotation. | ||
| - `*_diamond.out`: TSV file containing the detailed dbCAN diamond results for CAZyme annotation. | ||
| - `cgc` |
There was a problem hiding this comment.
Many of the files of the cgc and substrate section seem duplicated. Maybe we don't need to store those which are created in the cazyme step already? Can control this in modules.config (e.g. see RGI_MAIN entry).
There was a problem hiding this comment.
@jasmezz Thank you for reviewing the codes. I will revise it based on your comments.
jfy133
left a comment
There was a problem hiding this comment.
Really good first PR @HaidYi ! Clean and pretty much all of my comments are sort of minor/just polishing
Some additional things to my direct comments:
- Missing
citations.mdupdate - Missing the how to cite/methods text in this file: https://github.com/HaidYi/funcscan/blob/0cad8f95c553b3cdd3a59c34a0db107bd6df14f4/subworkflows/local/utils_nfcore_funcscan_pipeline/main.nf#L174
- Missing metromap update (but we can probably do this before release)
- Missing nf-test test and snapshots for the new tests
conf/test_preannotated_dbcan.config
Outdated
| run_bgc_screening = false | ||
| run_cazyme_screening = true | ||
|
|
||
| dbcan_skip_cgc = true // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
We should probably add gff files!
You can generate them from a normal funcscan fun, and make a PR against teh funscan branch of nf-core/testdatasets, which has the files and an updated samplesheet for the next funcscan version
There was a problem hiding this comment.
Yes, currently the cazyme screening can only use the .gff files in the pipeline. To use the pre-annotated one, I generated the .gff files from pyrodigal. The PR can be found at nf-core/test-datasets#1683.
There was a problem hiding this comment.
Can this be updated now you have the file?
docs/output.md
Outdated
| | ├── deepbgc/ | ||
| | ├── gecco/ | ||
| | └── hmmsearch/ | ||
| ├── dbcan/ |
There was a problem hiding this comment.
The top level should be the molecule/gene type (i.e., cazyme), then a subdirectory with each tool (in this case dbcan), and within that each of the different output directories
docs/output.md
Outdated
|
|
||
| - `dbcan/` | ||
| - `cazyme` | ||
| - `*_overview.tsv`: TSV file containing the results of dbCAN CAZyme annotation |
There was a problem hiding this comment.
You're missing the <sample.id> sample subdirectory underneath the tool name (accoeding to your modules.confg)
subworkflows/local/cazyme.nf
Outdated
| .join(ch_gffs_for_rundbcan) | ||
| .multiMap { meta, faa, gff -> | ||
| faa: [meta, faa] | ||
| gff: [meta, gff, 'prodigal'] |
There was a problem hiding this comment.
Is the gff always from prodigal? Or is this a dummy value?
There was a problem hiding this comment.
Refer to the module description: https://nf-co.re/modules/rundbcan_easycgc/. If it's the generated in the pipeline, it is always the prodigal. But if it's provided using the pre-annotated one, then it could be either NCBI_prok, JGI, NCBI_euk or prodigal. This makes things complicated. An easier way is to define a parameter in the cli for this option but it's kind of hard to deal with the mixed case in a batch without doing the modifications in the samplesheet.
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
|
@jfy133 Thank you for the comments and suggestions. I will fix all the problems one-by-one. As I don't want this PR corrupt other screening steps, I will do a more comprehensive testing, which may take more time. I will let you know when I fix all the issues. |
|
@jfy133 I updated the rundbcan module to aws for database downloading(nf-core/modules#9768). And this new PR now has no problems for the longtime db downloading problems. Please review again. |
jfy133
left a comment
There was a problem hiding this comment.
OK we are ALMOST DONE @HaidYi 🎉! Thank you for your patience!
Here are the last points/questions (to summarise some of the specific comments too), but otherwise code looks great, I've checked against our pipeline conventions (now on dev here and you're already following them already 💪:
Conceptual
- Can you confirm there are no
db_can <subcmd>options/arguments that we should expose to the user via a pipeline parameter? E.g. forrun_dbcanthe--modeor--methodsparameters? Or for thecgc_finderthe parameter--use_distance?
Code
test_preannotated_cazyme.conf: You are missing atestsnf-test test file and it's snapshot for the new test config
Documentation
usage.md: missing documentation in the sameplsheet section about the newgffcolumnnextflow_schema.json: missing the long-form helptext(s) describing when you would want to maybe skip thecgcandsubstratedetectionCHANGELOG.md: missing a change log entry of the PR, but also please make sure to add the version ofdb_canas a new dependency (i.e., the previous version column can be empty)README.md: don't forget to add yourself to the 'credits` list!nextflow.config: don't forget to add yourself to the manifest section as acontributor!
conf/test_preannotated_cazyme.config
Outdated
| dbcan_skip_cgc = false // Skip cgc annotation as .gbk (not .gff) is provided in samplesheet | ||
| dbcan_skip_substrate = false // Skip substrate annotation as .gbk (not .gff) is provided in samplesheet |
There was a problem hiding this comment.
Unless the GBK/GFF files are mutually exclusive as input to funcscan, I would argue maybe it would make sense to include the GFF file in the samplesheet_preannotated.csv samplesheet
But it would be nice in another test profile (maybe test_cazyme_prokka) you still also test skipping the dbcan_skip_cgc and dbcan_skip_substrate functionality?
|
@nf-core-bot fix linting |
|
Hrm, the test failure error is: Do you expect EASYSUBSTRATE to be so slow @HaidYi ? |
I don't think so. @Xinpeng021001 Could you answer the question why the substrate on this testing dataset needs a long time to finish ? |
|
The substrate process shouldn't take such a long time. I'm reviewing the error and will reply asap. |
|
OK I guess there as some download slowdown @HaidYi @Xinpeng021001 , assuming one of you just restarted the tests? |
jfy133
left a comment
There was a problem hiding this comment.
I think this is done, when I'm at my office I will push the update to the tests themselves, and then I will update the diagram and I think we are good to go for a merge!
I'll let you know when it's ready and you can hit the 'merge' button @HaidYi !
| { assert new File("$outputDir/multiqc/multiqc_report.html").exists() }, | ||
|
|
||
| // dbCAN annotation | ||
| { assert path("$outputDir/cazyme/dbcan/cazyme_annotation/sample_1/sample_1_dbCAN_hmm_results.tsv").text.contains("dbCAN") }, |
There was a problem hiding this comment.
EAch of theses should be wrapped in their own snapshot, so it's easier to know what has changed, if it has (like the final snapshot you have at teh end).
Co-authored-by: James A. Fellows Yates <jfy133@gmail.com>
There was a problem hiding this comment.
🎉 🎉 🎉 We finally made it! Sorry that took so long @HaidYi ! (and @Xinpeng021001 !), I appreciate your patience, but I think it was worth it! Thank you for the efforT!
I will do the metromap update in a follow up PR and do a quick release :)
PR checklist
Close #481.
The main changes include:
subworkflows/dbcan.nf) for the support of run_dbcan screening..gfffiles and added the alias of the current modules (e.g.,PYRODIGAL_GFF). So, the inputgbkcolumn may also usegfffile as input. Feel free to change this part as it may need some tweaks considering the both the pipeline and the document.module.config, etc.readmeandoutputThings that are needed the changes from the maintainer:
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).