Draft
Changes from 22 commits
43 commits
53aa47c
Install diamond/blastp
eweizy Mar 26, 2025
55b7632
Unfinished diamond/blastp integraton
eweizy Mar 26, 2025
3c5b661
Add diamond
eweizy Mar 26, 2025
b1fdc74
reinstalled diamond/blastp module. Installed blast/makeblastdb
tracelail May 21, 2025
9f1ea67
wrote draft integration of BLAST_MAKEBLASTDB and NCBIREFSEQDOWNLOAD i…
tracelail Jun 2, 2025
52506f7
installed diamond makedb
tracelail Jun 3, 2025
49ee17c
cleared nf-test logs and added tuple output for main.nf.test of diam…
tracelail Jun 3, 2025
6ab82f3
Merge branch 'dev' of https://github.com/tracelail/proteinannotator i…
tracelail Jun 3, 2025
89fb03e
Merge branch 'dev' of https://github.com/nf-core/proteinannotator int…
tracelail Jun 3, 2025
f12c619
removed blast/makeblastdb nf-core module and create local diamondprep…
tracelail Jun 10, 2025
4f4db82
finished up writing the ncbirefseqdownload process for the first draf…
tracelail Jun 17, 2025
3946ba5
edited ncbirefseqdownload script to a working stated where the nf-tes…
tracelail Jun 18, 2025
94ebe1b
created a working ncbirefseqdownload module with basic nf-test. Also …
tracelail Jun 26, 2025
06bf5e8
added more working tests to ncbirefseqdownload and organized.
tracelail Jun 27, 2025
7681f82
Added diamondpreparetaxa main.nf script and a process.success nf-test.
tracelail Jun 30, 2025
0cf27f7
Merge branch 'dev' into unfinished-diamond-blastp
tracelail Jul 1, 2025
92c847c
added working snapshot assertions for process.out match and versions …
tracelail Jul 2, 2025
043451f
Added output documentation for all seven Diamond subworkflow outputs.
tracelail Jul 7, 2025
f59c0b7
wrote a potential subworkflow for diamond as well as a test. Copied t…
tracelail Jul 9, 2025
cf259d3
added stub portion to ncbirefseqdownload. Added all output emits to d…
tracelail Jul 9, 2025
9fd5d3f
Added stub section to diamondpreparetaxa module.
tracelail Jul 9, 2025
fa390ce
created a simple flow diagram of the diamond subworkflow and it's mod…
tracelail Jul 9, 2025
4eb5ad1
Apply suggestions from code review
tracelail Jul 28, 2025
71ff9ef
Created workflow success tests for diamond subworkflow. Added nextflo…
tracelail Jul 29, 2025
a4f00be
Merge branch 'dev' of https://github.com/nf-core/proteinannotator int…
tracelail Jul 29, 2025
16e10b2
Merge remote-tracking branch 'refs/remotes/origin/unfinished-diamond-…
tracelail Jul 29, 2025
c970041
corrected typo in diamondpreparetaxa container
tracelail Jul 30, 2025
925ed89
updated diamond/makedb module
tracelail Jul 30, 2025
7b8a6e1
removed params.diamond_blast_columns = 'qseqid' to resolve testing co…
tracelail Jul 31, 2025
1f5ba7a
updated nf-core module diamond/blastp to match diamond/makedb
tracelail Jul 31, 2025
8edc405
working subworkflow nf-test with large prot.accession2taxid.gz taxonmap.
tracelail Aug 6, 2025
a3e661c
Updated diamond subworkflow main.nf.test with a smaller taxonmap for …
tracelail Aug 7, 2025
3fbcc6a
created a local diamond subworkflow that produces a diamond/blastp ou…
tracelail Aug 21, 2025
a4911fc
Added example outputs for DIAMOND subworkflow. Added default paramete…
tracelail Aug 26, 2025
64ed6dc
Added usage documentation for DIAMOND subworkflow.
tracelail Aug 26, 2025
556b3e3
Updated nextflow_schema and readme.
tracelail Aug 26, 2025
f5b63c2
minimal label edit to functional annotation workflow.
tracelail Aug 26, 2025
6cd3a20
updated nextflow_schema and nextflow.config
tracelail Aug 26, 2025
3ffbc1a
changed diamond_blast_columns values to null for string inputs
tracelail Oct 2, 2025
0b1df66
added meta.yml info and some tags for functional annotation subworkflow
tracelail Oct 2, 2025
2257719
made stub updates to diamondpreparetaxa and ncbirefseqdownload and ad…
tracelail Oct 2, 2025
7632328
Addeded diamond_blast_columns = "" back as being null caused issues. …
tracelail Oct 3, 2025
c1f0c63
deleted interproscan functional annotation subworkflow tests to remov…
tracelail Oct 3, 2025
9 changes: 9 additions & 0 deletions .nf-test.log
@@ -0,0 +1,9 @@
Jul-09 09:36:15.278 [main] INFO com.askimed.nf.test.App - nf-test 0.9.2
Jul-09 09:36:15.294 [main] INFO com.askimed.nf.test.App - Arguments: [test, subworkflows/local/diamond/tests/main.nf.tests]
Jul-09 09:36:16.153 [main] INFO com.askimed.nf.test.App - Nextflow Version: 24.10.6
Jul-09 09:36:16.155 [main] INFO com.askimed.nf.test.commands.RunTestsCommand - Load config from file /home/trace/projects/proteinannotator/nf-test.config...
Jul-09 09:36:16.663 [main] WARN com.askimed.nf.test.nextflow.NextflowScript - Module /home/trace/projects/proteinannotator/subworkflows/local/functional_annotation/main.nf: Dependency '/home/trace/projects/proteinannotator/subworkflows/local/functional_annotation/../../../modules/nf-core/blast/makeblastdb/main.nf' not found.
Jul-09 09:36:16.728 [main] INFO com.askimed.nf.test.lang.dependencies.DependencyResolver - Loaded 21 files from directory /home/trace/projects/proteinannotator in 0.081 sec
Jul-09 09:36:16.730 [main] INFO com.askimed.nf.test.lang.dependencies.DependencyResolver - Found 0 files containing tests.
Jul-09 09:36:16.730 [main] DEBUG com.askimed.nf.test.lang.dependencies.DependencyResolver - Found files: []
Jul-09 09:36:16.732 [main] INFO com.askimed.nf.test.commands.RunTestsCommand - Found 0 tests to execute.
5 changes: 4 additions & 1 deletion .vscode/settings.json
@@ -1,3 +1,6 @@
{
"markdown.styles": ["public/vscode_markdown.css"]
"markdown.styles": [
"public/vscode_markdown.css"
],
"nextflow.telemetry.enabled": true
}
8 changes: 8 additions & 0 deletions CITATIONS.md
@@ -10,6 +10,14 @@

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [DIAMOND](https://github.com/bbuchfink/diamond)

> Buchfink B, Xie C, Huson DH, "Fast and sensitive protein alignment using DIAMOND", Nature Methods 12, 59-60 (2015). doi:10.1038/nmeth.3176

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
4 changes: 4 additions & 0 deletions README.md
@@ -33,11 +33,15 @@
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->



1. Run ([`seqkit stats`](https://bioinf.shenwei.me/seqkit/usage/#stats)) to summarize input protein fasta files
2. Functional Annotation:
1. ([`InterProScan`](https://interproscan-docs.readthedocs.io/en/v5/)) a software tool used to analyze protein sequences by scanning them against the signatures of protein families, domains, and sites in the [InterPro](https://www.ebi.ac.uk/interpro/) database, helping to identify their functional characteristics.
2. ([`DIAMOND`](https://github.com/bbuchfink/diamond))
3. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))


![Protein annotator metromap. Protein fasta files are summarized with `seqkit stats`, then functionally annotated with InterProScan, DIAMOND-blastp, UniFire, and Kmerseek](assets/proteinannotator-metromap.excalidraw.png)

## Usage
112 changes: 111 additions & 1 deletion docs/output.md
@@ -14,6 +14,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

- [Functional Annotation](#functional-annotation) Annotate proteins with functional domains
- [InterProScan](#Interproscan) - Search the InterPro database for functional domains
- [Diamond](#diamond) - Provide ‘hits’ of potential homologous protein matches between species
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [SeqKit stats](#seqkit_stats) - Simple statistics for protein FASTA files
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
@@ -75,7 +76,7 @@ AKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAATIATESGVPVYICS

</details>

#### JavaScript Object Notation (JSON) Output
##### JavaScript Object Notation (JSON) Output

JSON representation of the matches - an alternative to XML format. As new releases are made public, the changes to the expected JSON format are documented in [Change log for InterProScan JSON output format](https://interproscan-docs.readthedocs.io/en/v5/JSONOutputFormatHistory.html#change-log-for-interproscan-json-output-format).

@@ -268,6 +269,115 @@ The XML Schema Definition (XSD) is available [here](http://ftp.ebi.ac.uk/pub/sof

</details>

#### Diamond

<details markdown="1">
<summary>Output files</summary>

- `functional_annotation/diamond`
- `*.blast`: BLAST (Basic Local Alignment Search Tool) pairwise format
- `*.xml`: BLAST Extensible Markup Language (XML) format
- `*.txt`: BLAST tabular format (default). This format can be customized: the format code `6` may be followed by a space-separated list of `blast_columns` keywords, each specifying a field of the output.
- `*.daa`: DIAMOND alignment archive (DAA). The DAA format is a proprietary binary format that can subsequently be used to generate other output formats using the view command. It is also supported by MEGAN and allows a quick import of results.
- `*.sam`: SAM format.
- `*.tsv`: Taxonomic classification. This format will not print alignments but only a taxonomic classification for each query using the LCA algorithm.
- `*.paf`: PAF format. The custom fields in the format are AS (bit score), ZR (raw score) and ZE (e-value)

</details>

[Diamond](https://github.com/bbuchfink/diamond) provides sensitive protein sequence alignment. The process reports ‘hits’, potential homologous protein matches between species that indicate an evolutionary relationship inferred from protein sequence similarity.

##### Pairwise Alignment Format (.blast) Output

The pairwise BLAST format is a human-readable format that is useful for visual inspection when full details of individual alignments are needed.

<details markdown="1">
<summary>Example Pairwise Alignment Format output</summary>

```

Review comment (Collaborator): I'm guessing these example outputs will be filled in, correct?

Review comment (Author): Correct. My thought process was to provide the output from running the test once I have it working.

```

</details>

##### BLAST Extensible Markup Language (XML) Output

The XML (Extensible Markup Language) file contains the same information as the pairwise file, but its structured, machine-readable layout makes it better suited to parsing by bioinformatics software and scripts.

<details markdown="1">
<summary>Example Extensible Markup Language (XML) output</summary>

```

```

</details>

##### Text File (TXT) Output (default)

The BLAST tabular format is the default output, and its columns can be modified depending on analysis needs. It is much smaller than the other BLAST formats, is compatible with most downstream processing, and is easily filtered and analyzed.

<details markdown="1">
<summary>Example Text File (TXT) output</summary>

```

```

</details>
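
As an illustration of how the tabular output can be filtered, here is a minimal shell sketch. It assumes the default 12-column BLAST tabular layout (`qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`); if the output columns are customized, the field positions change. The function name `filter_hits` and the default thresholds are hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Keep hits at or below an e-value cutoff and at or above a percent-identity cutoff.
# Assumption: default tabular columns, so pident is field 3 and evalue is field 11.
# $1: tabular file; $2: maximum e-value (default 1e-5); $3: minimum pident (default 30).
filter_hits() {
    local file="$1" max_evalue="${2:-1e-5}" min_pident="${3:-30}"
    # Adding 0 forces awk to compare the fields as numbers, not strings.
    awk -F '\t' -v e="$max_evalue" -v p="$min_pident" \
        '($11 + 0) <= (e + 0) && ($3 + 0) >= (p + 0)' "$file"
}
```

For example, `filter_hits sample.txt 1e-10 50 > sample.filtered.txt` would keep only hits with an e-value of at most 1e-10 and at least 50% identity.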

##### DIAMOND Alignment Archive (DAA) Output

DIAMOND alignment archive (DAA) is a compressed proprietary binary format that can be converted to any of the other output formats (.blast, .xml, .txt, .sam, .tsv, .paf) with the DIAMOND `view` command, without rerunning the pipeline. It can also be used in some metagenomic analysis software.

<details markdown="1">
<summary>Example DIAMOND Alignment Archive (DAA) output</summary>

```

```

</details>
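
The conversion from a DAA archive can be sketched as follows, assuming `diamond` is available on the `PATH` and using the `--daa`, `--outfmt` and `--out` options of `diamond view`. The helper name `daa_view_cmd` and the file names are hypothetical; the function only prints the command so that it can be inspected before it is run.

```shell
#!/usr/bin/env bash
# Print the `diamond view` command that converts a DAA archive to another format.
# $1: path to the .daa file
# $2: outfmt code (default 6 = BLAST tabular; 0 = pairwise, 5 = XML)
# $3: output path (default: the input path with .daa replaced by .txt)
daa_view_cmd() {
    local daa="$1" fmt="${2:-6}"
    local out="${3:-${daa%.daa}.txt}"
    printf 'diamond view --daa %s --outfmt %s --out %s\n' "$daa" "$fmt" "$out"
}
```

Calling `daa_view_cmd sample.daa 5 sample.xml` prints the command for an XML conversion; piping that line to `sh` would run it.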

##### Sequence Alignment/Map (SAM) Output

The SAM (Sequence Alignment/Map) output represents DIAMOND protein alignments in the same format used for genomic alignments. This allows easy integration into SAM/BAM pipelines and visualization of protein alignments in the IGV browser.

<details markdown="1">
<summary>Example Sequence Alignment/Map (SAM) output</summary>

```

```

</details>

##### Tab-Separated Values (TSV) Output

The taxonomic classification (.tsv) output provides taxonomic composition and is useful for biological interpretation rather than alignment comparison.

<details markdown="1">
<summary>Example Tab-Separated Values (TSV) output</summary>

```

```

</details>
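
To show how the taxonomic classification output might be summarized, here is a small shell sketch. It assumes three tab-separated columns (query ID, NCBI taxon ID, e-value of the best alignment) with a taxon ID of `0` marking an unclassified query; check this against an actual output file before relying on it. The function name `count_taxa` is hypothetical.

```shell
#!/usr/bin/env bash
# Count classified queries per NCBI taxon ID, most frequent first.
# Assumption: column 2 holds the taxon ID and 0 means "unclassified".
# Output: "<count><TAB><taxon ID>", one line per taxon.
count_taxa() {
    awk -F '\t' '$2 != 0 { n[$2]++ } END { for (t in n) print n[t] "\t" t }' "$1" |
        sort -k1,1nr -k2,2n
}
```

Running `count_taxa sample.tsv | head` gives a quick view of the dominant taxa in a sample.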

##### Pairwise Mapping Format (PAF)

PAF (Pairwise mApping Format) was originally designed for long-read sequencing. DIAMOND adds three fields, AS (bit score), ZR (raw alignment score), and ZE (E-value), to provide statistical evidence for each protein alignment. This format is useful when positional information and statistical significance are needed.

<details markdown="1">
<summary>Example Pairwise Mapping Format (PAF) output</summary>

```

```

</details>

### MultiQC

<details markdown="1">
36 changes: 24 additions & 12 deletions modules.json
@@ -5,24 +5,30 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"mmseqs/search": {
"diamond/blastp": {
Review comment (Collaborator): Hmmm, this is a bit concerning because all of the mmseqs/search, mtmalign/align (dev branch version), diamond/blastp and diamond/makedb modules should be present. Did something happen during the merge with the dev branch?

"branch": "master",
"git_sha": "81880787133db07d9b4c1febd152c090eb8325dc",
"installed_by": ["modules"]
"git_sha": "05954dab2ff481bcb999f24455da29a5828af08d",
"installed_by": [
"modules"
]
},
"mtmalign/align": {
"diamond/makedb": {
"branch": "master",
"git_sha": "c7cfb9446fb3098e525089198ff232d795c20ef2",
"installed_by": ["modules"]
"git_sha": "05954dab2ff481bcb999f24455da29a5828af08d",
"installed_by": [
"modules"
]
},
"multiqc": {
"branch": "master",
"git_sha": "f0719ae309075ae4a291533883847c3f7c441dad",
"installed_by": ["modules"]
"installed_by": [
"modules"
]
},
"seqkit/stats": {
"branch": "master",
"git_sha": "81880787133db07d9b4c1febd152c090eb8325dc",
"git_sha": "81880787133db07d9b4c1febd152c090eb8325dc",
"installed_by": ["modules"]
},
"untar": {
@@ -37,20 +43,26 @@
"utils_nextflow_pipeline": {
"branch": "master",
"git_sha": "c2b22d85f30a706a3073387f30380704fcae013b",
"installed_by": ["subworkflows"]
"installed_by": [
"subworkflows"
]
},
"utils_nfcore_pipeline": {
"branch": "master",
"git_sha": "51ae5406a030d4da1e49e4dab49756844fdd6c7a",
"installed_by": ["subworkflows"]
"installed_by": [
"subworkflows"
]
},
"utils_nfschema_plugin": {
"branch": "master",
"git_sha": "2fd2cd6d0e7b273747f32e465fdc6bcc3ae0814e",
"installed_by": ["subworkflows"]
"installed_by": [
"subworkflows"
]
}
}
}
}
}
}
}
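
The review comment on this file notes that `mmseqs/search`, `mtmalign/align`, `diamond/blastp` and `diamond/makedb` should all be present after the merge. A quick way to check this is sketched below. It is a plain fixed-string search rather than a real JSON parse, so it only confirms that each name appears as a key somewhere in the file; the function name `missing_modules` is hypothetical.

```shell
#!/usr/bin/env bash
# Print every module name that does not appear as a key in the given modules.json.
# $1: path to modules.json; remaining arguments: module names to look for.
# Prints nothing when all modules are found.
missing_modules() {
    local file="$1" mod
    shift
    for mod in "$@"; do
        # Match the quoted key followed by a colon, e.g. "diamond/blastp":
        grep -qF "\"${mod}\":" "$file" || printf '%s\n' "$mod"
    done
}
```

For example, `missing_modules modules.json mmseqs/search mtmalign/align diamond/blastp diamond/makedb` lists whichever of the four modules are absent.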
10 changes: 10 additions & 0 deletions modules/local/diamondpreparetaxa/environment.yml
@@ -0,0 +1,10 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
channels:
- conda-forge
- bioconda
dependencies:
# TODO nf-core: List required Conda package(s).
# Software MUST be pinned to channel (i.e. "bioconda"), version (i.e. "1.10").
# For Conda, the build (i.e. "h9402c20_2") must be EXCLUDED to support installation on different operating systems.
- "YOUR-TOOL-HERE"
57 changes: 57 additions & 0 deletions modules/local/diamondpreparetaxa/main.nf
@@ -0,0 +1,57 @@
process DIAMONDPREPARETAXA {

// tag "${taxondmp_zip.baseName}"
label 'process_low'

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/YOUR-TOOL-HERE':
'biocontainers/YOUR-TOOL-HERE' }"

// write the output files to a user specified directory via an input parameter
// publishDir "${params.outdir}/ncbi_refseq/", mode: 'copy'

input:
val taxondmp_zip // Add default of ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

output:
path("taxa/nodes.dmp"), emit: taxonnodes
path("taxa/names.dmp"), emit: taxonnames
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
// def prefix = task.ext.prefix ?: "${meta.id}"
// Omitting from script portion for now
// # $args \\
// # -@ $task.cpus \\
// # -o ${prefix}.bam \\

"""
mkdir -p taxa/
wget -q -O taxdump.tar.gz "${taxondmp_zip}"
tar -xzf taxdump.tar.gz -C taxa

cat <<-END_VERSIONS > versions.yml
"${task.process}":
wget: \$(wget --version | sed -n '1p' | cut -d ' ' -f 3)
END_VERSIONS
"""

stub:
// def args = task.ext.args ?: ''
// def prefix = task.ext.prefix ?: "${meta.id}"
"""

mkdir -p taxa/
touch taxa/nodes.dmp
touch taxa/names.dmp

cat <<-END_VERSIONS > versions.yml
"${task.process}":
wget: \$(wget --version | sed -n '1p' | cut -d ' ' -f 3)
END_VERSIONS
"""
}
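
The extraction step of this module can be sketched as a standalone shell function, which may help when testing it outside Nextflow. This is only an illustration under stated assumptions: the tarball has already been downloaded, and the function name `prepare_taxa` and its arguments are hypothetical. It extracts the archive and then confirms that the two files the process emits, `nodes.dmp` and `names.dmp`, exist and are non-empty.

```shell
#!/usr/bin/env bash
# Extract a taxdump tarball and confirm the two files the module emits are present.
# $1: path to an already-downloaded taxdump.tar.gz
# $2: output directory (default: taxa)
# Returns non-zero if extraction fails or either file is missing or empty.
prepare_taxa() {
    local tarball="$1" outdir="${2:-taxa}" f
    mkdir -p "$outdir"
    tar -xzf "$tarball" -C "$outdir" || return 1
    for f in nodes.dmp names.dmp; do
        if [ ! -s "$outdir/$f" ]; then
            echo "missing or empty: $outdir/$f" >&2
            return 1
        fi
    done
}
```

A call such as `prepare_taxa taxdump.tar.gz taxa` mirrors the `tar -xzf ... -C taxa` line in the process script, with an explicit failure if the expected files are not produced.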
68 changes: 68 additions & 0 deletions modules/local/diamondpreparetaxa/meta.yml
@@ -0,0 +1,68 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/meta-schema.json
name: "diamondpreparetaxa"
## TODO nf-core: Add a description of the module and list keywords
description: write your description here
keywords:
- sort
- example
- genomics
tools:
- "diamondpreparetaxa":
## TODO nf-core: Add a description and other details for the software below
description: ""
homepage: ""
documentation: ""
tool_dev_url: ""
doi: ""
licence:
identifier:

## TODO nf-core: Add a description of all of the variables used as input
input:
# Only when we have meta
- - meta:
type: map
description: |
Groovy Map containing sample information
e.g. `[ id:'sample1' ]`

## TODO nf-core: Delete / customise this example input
- bam:
type: file
description: Sorted BAM/CRAM/SAM file
pattern: "*.{bam,cram,sam}"
ontologies:
- edam: "http://edamontology.org/format_2572" # BAM
- edam: "http://edamontology.org/format_2573" # CRAM
- edam: "http://edamontology.org/format_3462" # SAM

## TODO nf-core: Add a description of all of the variables used as output
output:
- bam:
#Only when we have meta
- meta:
type: map
description: |
Groovy Map containing sample information
e.g. `[ id:'sample1' ]`
## TODO nf-core: Delete / customise this example output
- "*.bam":
type: file
description: Sorted BAM/CRAM/SAM file
pattern: "*.{bam,cram,sam}"
ontologies:
- edam: "http://edamontology.org/format_2572" # BAM
- edam: "http://edamontology.org/format_2573" # CRAM
- edam: "http://edamontology.org/format_3462" # SAM

- versions:
- "versions.yml":
type: file
description: File containing software versions
pattern: "versions.yml"

authors:
- "@tracelail"
maintainers:
- "@tracelail"