Benchmark DIAMOND for pathogen detection and reference-level classification #8

@saramonzon

Description:

We want to benchmark DIAMOND for detecting viruses and identifying the nearest genomic reference of pathogens in metagenomic samples. The goal is to evaluate how DIAMOND compares with other commonly used tools such as Kraken2 and Mash, particularly for viral classification.

This will help us decide whether DIAMOND should be integrated into the pipeline and, if so, for which use cases it performs best.


Objectives:

  • Use DIAMOND to align reads (or contigs) against a protein database (e.g., NCBI nr, viral subset, or custom)
  • Extract taxonomic information and nearest-genomic reference(s) for detected pathogens
  • Compare results with Kraken2 (k-mer based classification) and Mash (genome sketching)
  • Perform benchmarking using the in silico dataset (see "Create a testing viral metagenomics dataset", #7)

Tasks:

  • Run DIAMOND on simulated reads or assembled contigs
  • Parse DIAMOND output to extract best hits (e.g., bit score, e-value, % identity)
  • Run Kraken2 and Mash on the same data
  • Compare:
    • Detection sensitivity and specificity
    • Precision of taxonomic assignment (ideally at species/strain level)
    • Runtime and memory usage
    • Ability to identify novel or divergent viruses
  • Summarize benchmarking results in a table or plot
  • Document tool versions, parameters, and databases used
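
For the "parse DIAMOND output to extract best hits" task, a minimal sketch could look like the one below. It assumes DIAMOND was run with a custom tabular layout (`--outfmt 6 qseqid sseqid pident length evalue bitscore staxids`); that column order, the `best_hits` helper, the e-value cutoff, and the toy rows are all illustrative assumptions, not results from the benchmark.

```python
import csv
import io

# Assumed column order, matching a hypothetical DIAMOND run with:
#   --outfmt 6 qseqid sseqid pident length evalue bitscore staxids
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: hit}, keeping the highest bit score per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # drop weak alignments (cutoff is an arbitrary example)
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Toy rows with made-up identifiers and placeholder TaxIDs.
example = (
    "read1\tprotA\t97.5\t120\t1e-50\t189.6\t900001\n"
    "read1\tprotB\t80.0\t110\t1e-20\t95.1\t900002\n"
    "read2\tprotC\t45.0\t60\t0.5\t30.2\t900003\n"
)
hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["staxids"], hit["pident"], hit["bitscore"])
```

Keeping only the top bit score per query is the simplest choice; a lowest-common-ancestor approach over near-best hits may be more robust for divergent viruses, but that is outside this sketch.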

Relevant Tools:

  1. [DIAMOND](https://github.com/bbuchfink/diamond)

    • High-speed aligner for translated DNA vs protein database (BLASTX alternative)
    • Useful for detecting distant homologs and novel viruses
    • Supports output in BLAST tabular format with taxonomy ID
  2. [Kraken2](https://github.com/DerrickWood/kraken2)

    • Ultra-fast k-mer based classifier using a prebuilt database (e.g., RefSeq)
    • Good for high-throughput classification, but may struggle with novel or highly divergent sequences
  3. [Mash](https://github.com/marbl/Mash)

    • Fast genome distance estimation using MinHash
    • Good for rapid comparison against reference genomes (useful for nearest-reference identification)

Reference Database & Taxonomy Mapping:

🔬 DIAMOND

🧬 Kraken2

  • Prebuilt databases:
    • The standard database includes viral, bacterial, archaeal, and human genomes from RefSeq.
    • You can also build a custom database with viral-only sequences for focused benchmarks.
  • Classifications automatically include NCBI taxonomy IDs.
  • When building a database, kraken2-build --download-taxonomy fetches the NCBI taxonomy files and links sequences to them.
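
To compare Kraken2 calls with the other tools, the species-level rows of a Kraken2 report can be pulled out with a few lines of Python. This is only a sketch: it assumes the standard six-column report written by `--report` (percentage, clade reads, direct reads, rank code, TaxID, indented name), and the `species_from_report` helper, the read threshold, and the toy report with placeholder TaxIDs are hypothetical.

```python
import io


def species_from_report(handle, min_clade_reads=10):
    """Return [(taxid, name, clade_reads)] for species-level report rows."""
    results = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a standard six-column report line
        clade_reads = int(fields[1])
        rank_code = fields[3].strip()
        taxid = int(fields[4])
        name = fields[5].strip()  # names are indented by taxonomic depth
        # Rank code "S" is species; sub-ranks such as "S1" are skipped here.
        if rank_code == "S" and clade_reads >= min_clade_reads:
            results.append((taxid, name, clade_reads))
    return results


# Toy report with placeholder taxa, not real benchmark output.
report = (
    " 60.00\t600\t600\tU\t0\tunclassified\n"
    " 40.00\t400\t0\tR\t1\troot\n"
    " 39.50\t395\t395\tS\t900001\t    Example virus A\n"
    "  0.50\t5\t5\tS\t900002\t    Example virus B\n"
)
species = species_from_report(io.StringIO(report))
print(species)
```

The minimum read count is a crude noise filter; the right threshold would need to be chosen from the in silico dataset.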

🧪 Mash

  • Works on genome assemblies or sketches.
  • Reference set must be carefully curated and annotated.
  • Use the NCBI RefSeq viral genomes, and ensure FASTA headers include identifying information (e.g., accession, species).
  • Optionally link outputs back to TaxIDs using a reference lookup table.
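
Linking Mash results back to TaxIDs could be sketched as follows. It assumes the five tab-separated columns that `mash dist` prints (reference ID, query ID, distance, p-value, shared hashes); the `nearest_references` helper, the distance cutoff, the file names, and the lookup table are hypothetical placeholders.

```python
import csv
import io


def nearest_references(handle, lookup, max_distance=0.1):
    """Return {query: (reference, distance, taxid)} for the closest reference."""
    nearest = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 5:
            continue  # expect: reference, query, distance, p-value, shared hashes
        reference, query, distance = row[0], row[1], float(row[2])
        if distance > max_distance:
            continue  # too distant to call a nearest reference (example cutoff)
        if query not in nearest or distance < nearest[query][1]:
            # TaxID is None when the reference is missing from the lookup table.
            nearest[query] = (reference, distance, lookup.get(reference))
    return nearest


# Hypothetical reference-to-TaxID lookup table and toy mash dist output.
taxid_lookup = {"refA.fasta": 900001, "refB.fasta": 900002}
mash_output = (
    "refA.fasta\tsample1.fasta\t0.02\t0\t450/1000\n"
    "refB.fasta\tsample1.fasta\t0.25\t1e-10\t12/1000\n"
)
nearest = nearest_references(io.StringIO(mash_output), taxid_lookup)
print(nearest)
```

In practice the lookup table would be built from the curated reference set, keyed by whatever identifier the sketch stores for each reference.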

Expected output:

  • A benchmark table/report showing:
    • Which pathogens were detected by each tool
    • Their closest matching reference genome
    • Classification accuracy
    • Runtime and computational efficiency
  • A recommendation on whether/how to integrate DIAMOND (e.g., post-assembly validation, secondary classifier, etc.)

| Sample | Tool | Pathogen Detected | Closest Reference | TaxID | % Identity / Score | Detection Accuracy | Notes |
|---|---|---|---|---|---|---|---|
| S1 | DIAMOND | Human adenovirus C | NC_001405.1 | 10508 | 97.5% / 189.6 | ✅ True Positive | Best hit matches expected strain |
| S1 | Kraken2 | Human adenovirus | TaxID 10508 | 10508 | - | ✅ True Positive | No strain resolution |
| S1 | Mash | Human adenovirus C | NC_001405.1 | 10508 | 0.02 Mash distance | ✅ True Positive | Correct nearest ref |

You could expand this to include multiple organisms per sample, or summarize results as a confusion matrix or precision-recall curve.
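
As a starting point for the confusion-matrix style summary, per-tool detection counts can be derived from the set of TaxIDs expected in a sample and the set each tool reports. The sketch below is an assumption about how the comparison might be set up: `detection_metrics`, the placeholder TaxIDs, and the per-tool call sets are invented for illustration. It reports precision and recall rather than specificity, since true negatives are not well defined without a fixed list of candidate taxa.

```python
def detection_metrics(expected, detected):
    """Compare expected vs detected TaxID sets for one sample and one tool."""
    expected, detected = set(expected), set(detected)
    tp = len(expected & detected)   # expected and reported
    fp = len(detected - expected)   # reported but not in the sample
    fn = len(expected - detected)   # in the sample but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Placeholder ground truth and tool calls, not real benchmark results.
truth = {900001, 900002, 900003}
calls = {
    "DIAMOND": {900001, 900002, 900004},
    "Kraken2": {900001},
    "Mash": {900001, 900002},
}
metrics = {tool: detection_metrics(truth, detected) for tool, detected in calls.items()}
for tool, m in metrics.items():
    print(f"{tool}: TP={m['tp']} FP={m['fp']} FN={m['fn']} "
          f"precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

Sweeping a score threshold (bit score, read count, or Mash distance) and recomputing these values at each step would give the points for a precision-recall curve.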


Next Steps:

  • Select or curate reference databases for each tool
  • Document database creation and taxonomy mapping
  • Run tools on in silico dataset (see #X)
  • Parse and compare output
  • Summarize detection performance and nearest-genome accuracy
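
For the runtime and memory comparison, one option is a small wrapper that runs each tool as a subprocess and records wall time, CPU time, and peak memory. This sketch uses only the Python standard library and works on Unix-like systems only (the `resource` module is not available on Windows). `run_timed` is a hypothetical helper, and the command shown is a stand-in for the real DIAMOND, Kraken2, or Mash invocations.

```python
import resource
import subprocess
import sys
import time


def run_timed(command):
    """Run a command and return its exit code, wall time, CPU time and peak RSS."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # Peak RSS of the largest child waited for so far in this process
        # (kilobytes on Linux, bytes on macOS), so run one tool per process
        # if per-tool memory figures are needed.
        "max_rss": after.ru_maxrss,
    }


# Stand-in command; replace with the actual tool invocation.
result = run_timed([sys.executable, "-c", "pass"])
print(result)
```

A workflow manager or `/usr/bin/time -v` would give similar numbers; the wrapper is just convenient when results are collected in the same script that parses the outputs.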

Labels: enhancement