Description:
We want to test the performance of DIAMOND for detecting viruses and identifying the nearest genomic reference of pathogens in metagenomic samples. The goal is to evaluate how well DIAMOND performs in comparison to other commonly used tools such as Kraken2 and Mash, particularly for viral classification.
This will help us decide whether DIAMOND should be integrated into the pipeline and under what use cases it performs best.
Objectives:
- Use DIAMOND to align reads (or contigs) against a protein database (e.g., NCBI nr, viral subset, or custom)
- Extract taxonomic information and nearest-genomic reference(s) for detected pathogens
- Compare results with Kraken2 (k-mer based classification) and Mash (genome sketching)
- Perform benchmarking using the in silico dataset (see "Create a testing viral metagenomics dataset", #7)
Tasks:
- Run DIAMOND on simulated reads or assembled contigs
- Parse DIAMOND output to extract best hits (e.g., bit score, e-value, % identity)
- Run Kraken2 and Mash on the same data
- Compare:
  - Detection sensitivity and specificity
  - Precision of taxonomic assignment (ideally at species/strain level)
  - Runtime and memory usage
  - Ability to identify novel or divergent viruses
- Summarize benchmarking results in a table or plot
- Document tool versions, parameters, and databases used
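As a rough illustration of the parsing task, the sketch below keeps the highest-bit-score hit per query from DIAMOND tabular output. It assumes the run used `--outfmt 6 qseqid sseqid pident length evalue bitscore staxids`; the column list, the function name `best_hits`, and the sample rows are placeholders for illustration, not benchmark results.

```python
import io

# Assumed column order; it must match the field list passed to --outfmt 6.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def best_hits(handle, columns=COLUMNS):
    """Return {qseqid: hit} keeping the hit with the highest bit score per query."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        row = dict(zip(columns, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows (hypothetical read, protein, and taxon IDs) to show the behaviour.
example = io.StringIO(
    "read1\tprotA\t97.5\t120\t1e-50\t189.6\t111\n"
    "read1\tprotB\t80.0\t118\t1e-30\t120.0\t222\n"
    "read2\tprotC\t65.2\t90\t1e-10\t60.1\t333\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"], hit["evalue"], hit["pident"])
```

Ranking by bit score is one possible choice; ties or an e-value cutoff could be handled in the same loop if the benchmark needs them.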
Relevant Tools:
- [DIAMOND](https://github.com/bbuchfink/diamond)
  - High-speed aligner for translated DNA vs protein database (BLASTX alternative)
  - Useful for detecting distant homologs and novel viruses
  - Supports output in BLAST tabular format with taxonomy ID
- [Kraken2](https://github.com/DerrickWood/kraken2)
  - Ultra-fast k-mer based classifier using a prebuilt database (e.g., RefSeq)
  - Good for high-throughput classification, but may struggle with novel or highly divergent sequences
- [Mash](https://github.com/marbl/Mash)
  - Fast genome distance estimation using MinHash
  - Good for rapid comparison against reference genomes (useful for nearest-reference identification)
Reference Database & Taxonomy Mapping:
🔬 DIAMOND
- Input: Protein database in FASTA format (e.g., NCBI nr, RefSeq viral proteins, or a custom DB)
- Recommendation: Use NCBI RefSeq viral proteins or a curated subset for manageable size.
- Run `diamond makedb` to create the index.
- To retain taxonomy info:
  - Use DIAMOND's `--taxonmap` and `--taxonnodes` options (plus `--taxonnames` if taxon names are needed).
  - Use files from NCBI:
    - `prot.accession2taxid` (from [ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/))
    - `nodes.dmp`, `names.dmp` (from [ftp.ncbi.nih.gov/pub/taxonomy/](https://ftp.ncbi.nih.gov/pub/taxonomy/))
  - Or use MEGAN-compatible output (`--outfmt 100`) if integrating later with MEGAN for taxonomy binning.
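The DIAMOND setup above could be scripted roughly as follows. This is a sketch that only assembles the command lines (it does not run DIAMOND); all file names, the database prefix, and the helper names `makedb_cmd` and `blastx_cmd` are assumptions chosen for illustration.

```python
import shlex

# Fields requested in tabular output; staxids requires a database built with taxonomy.
OUTFMT_FIELDS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def makedb_cmd(proteins_fasta, db_prefix, taxonmap, taxonnodes, taxonnames=None):
    """Build the `diamond makedb` argument list with taxonomy mapping."""
    cmd = ["diamond", "makedb", "--in", proteins_fasta, "--db", db_prefix,
           "--taxonmap", taxonmap, "--taxonnodes", taxonnodes]
    if taxonnames:
        cmd += ["--taxonnames", taxonnames]
    return cmd


def blastx_cmd(query, db_prefix, out_tsv, threads=4):
    """Build the `diamond blastx` argument list for translated search."""
    return ["diamond", "blastx", "--query", query, "--db", db_prefix,
            "--out", out_tsv, "--outfmt", "6", *OUTFMT_FIELDS,
            "--threads", str(threads)]


# Placeholder paths; substitute the files actually used in the benchmark.
makedb = makedb_cmd("viral_proteins.faa", "viral_db",
                    "prot.accession2taxid.gz", "nodes.dmp", "names.dmp")
blastx = blastx_cmd("simulated_reads.fastq", "viral_db", "diamond_hits.tsv")

print(shlex.join(makedb))
print(shlex.join(blastx))
```

Keeping the commands as argument lists makes it straightforward to log the exact parameters for the benchmark report, or to pass them to `subprocess.run` once the inputs exist.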
🧬 Kraken2
- Prebuilt databases:
  - `standard` includes viral, bacterial, archaeal, and human genomes from RefSeq.
  - You can also build a custom database with viral-only sequences for focused benchmarks.
- Automatically includes taxonomy IDs.
- Kraken2 uses `kraken2-build --download-taxonomy` to fetch taxonomy files and link them.
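One possible way to lay out a viral-only Kraken2 build is sketched below. It only lists the `kraken2-build` steps in order and does not execute them; the database directory, thread count, and the helper name `kraken2_viral_db_plan` are hypothetical.

```python
import shlex


def kraken2_viral_db_plan(db_dir, threads=4, extra_fasta=()):
    """Return the ordered kraken2-build commands for a viral-only database."""
    plan = [
        # Fetch NCBI taxonomy files into the database directory.
        ["kraken2-build", "--download-taxonomy", "--db", db_dir],
        # Fetch the RefSeq viral library.
        ["kraken2-build", "--download-library", "viral", "--db", db_dir],
    ]
    # Optional extra genomes; their headers must carry an accession or taxid
    # that Kraken2 can map to the taxonomy.
    for fasta in extra_fasta:
        plan.append(["kraken2-build", "--add-to-library", fasta, "--db", db_dir])
    plan.append(["kraken2-build", "--build", "--db", db_dir, "--threads", str(threads)])
    return plan


# Placeholder directory and file names for illustration only.
plan = kraken2_viral_db_plan("kraken2_viral_db", threads=8,
                             extra_fasta=["custom_genomes.fna"])
for step in plan:
    print(shlex.join(step))
```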
🧪 Mash
- Works on genome assemblies or sketches.
- Reference set must be carefully curated and annotated.
- Use the NCBI RefSeq viral genomes, and ensure FASTA headers include identifying information (e.g., accession, species).
- Optionally link outputs back to TaxIDs using a reference lookup table.
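Linking Mash results back to TaxIDs might look like the sketch below. It assumes the five tab-separated columns that `mash dist` prints (reference ID, query ID, distance, p-value, shared hashes) and a lookup table from reference name to TaxID; the reference names, TaxIDs, and sample rows are invented placeholders.

```python
def nearest_references(lines, taxid_lookup):
    """Return {query: nearest hit} using the smallest Mash distance per query."""
    nearest = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ref, query, dist, pvalue, shared = line.split("\t")
        dist = float(dist)
        if query not in nearest or dist < nearest[query]["distance"]:
            nearest[query] = {
                "reference": ref,
                "distance": dist,
                "p_value": float(pvalue),
                "shared_hashes": shared,
                # None when the reference is missing from the lookup table.
                "taxid": taxid_lookup.get(ref),
            }
    return nearest


# Hypothetical lookup table and mash dist rows, for illustration only.
lookup = {"refA.fna": 1001, "refB.fna": 1002}
rows = [
    "refA.fna\tsample1.fna\t0.02\t0\t650/1000",
    "refB.fna\tsample1.fna\t0.15\t1e-20\t120/1000",
    "refC.fna\tsample2.fna\t0.30\t1e-5\t10/1000",
]
nearest = nearest_references(rows, lookup)
for query, hit in nearest.items():
    print(query, hit["reference"], hit["distance"], hit["taxid"])
```

References without a lookup entry come back with `taxid` set to `None`, which makes gaps in the annotation table easy to spot.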
Expected output:
- A benchmark table/report showing:
  - Which pathogens were detected by each tool
  - Their closest matching reference genome
  - Classification accuracy
  - Runtime and computational efficiency
- A recommendation on whether/how to integrate DIAMOND (e.g., post-assembly validation, secondary classifier, etc.)
| Sample | Tool | Pathogen Detected | Closest Reference | TaxID | % Identity / Score | Detection Accuracy | Notes |
|---|---|---|---|---|---|---|---|
| S1 | DIAMOND | Human adenovirus C | NC_001405.1 | 10508 | 97.5% / 189.6 | ✅ True Positive | Best hit matches expected strain |
| S1 | Kraken2 | Human adenovirus | TaxID 10508 | 10508 | - | ✅ True Positive | No strain resolution |
| S1 | Mash | Human adenovirus C | NC_001405.1 | 10508 | 0.02 Mash distance | ✅ True Positive | Correct nearest ref |
You could expand this to include multiple organisms per sample, or summarize results as a confusion matrix or precision-recall curve.
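For the confusion-matrix style summary, a minimal sketch is shown below. It assumes each tool's calls are reduced to a set of taxon labels per sample and that a fixed set of candidate taxa (for example, everything in the reference database) defines the negatives; the labels and the function name `detection_metrics` are hypothetical.

```python
def detection_metrics(expected, detected, candidates):
    """Compute confusion-matrix counts and summary rates from sets of taxa."""
    tp = len(expected & detected)               # present and reported
    fp = len(detected - expected)               # reported but not present
    fn = len(expected - detected)               # present but missed
    tn = len(candidates - expected - detected)  # absent and not reported
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Placeholder taxa: A, B, C were spiked in; the tool reported A, B, D.
expected = {"A", "B", "C"}
detected = {"A", "B", "D"}
candidates = {"A", "B", "C", "D", "E"}
metrics = detection_metrics(expected, detected, candidates)
print(metrics)
```

Running the same function on the DIAMOND, Kraken2, and Mash calls for each sample would give directly comparable rows for the benchmark table.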
Next Steps:
- Select or curate reference databases for each tool
- Document database creation and taxonomy mapping
- Run tools on in silico dataset (see #X)
- Parse and compare output
- Summarize detection performance and nearest-genome accuracy