Benchmark DIAMOND for pathogen detection and reference-level classification

**Description:**

We want to benchmark **DIAMOND** for detecting viruses and identifying the nearest genomic reference of pathogens in metagenomic samples. The goal is to evaluate how DIAMOND compares with other commonly used tools, such as **Kraken2** and **Mash**, particularly for viral classification.

This will help us decide whether DIAMOND should be integrated into the pipeline and for which use cases it performs best.

---

**Objectives:**

- Use DIAMOND to align reads (or contigs) against a protein database (e.g., NCBI nr, viral subset, or custom)
- Extract taxonomic information and nearest-genomic reference(s) for detected pathogens
- Compare results with Kraken2 (k-mer based classification) and Mash (genome sketching)
- Perform benchmarking using the **in silico dataset** (see BU-ISCIII/pikavirus#7)

---

**Tasks:**

- [ ] Run DIAMOND on simulated reads or assembled contigs
- [ ] Parse DIAMOND output to extract best hits (e.g., bit score, e-value, % identity)
- [ ] Run Kraken2 and Mash on the same data
- [ ] Compare:
  - Detection sensitivity and specificity
  - Precision of taxonomic assignment (ideally at species/strain level)
  - Runtime and memory usage
  - Ability to identify novel or divergent viruses
- [ ] Summarize benchmarking results in a table or plot
- [ ] Document tool versions, parameters, and databases used
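
For the parsing task, a minimal Python sketch of best-hit extraction could look like the following. It assumes DIAMOND was run with `--outfmt 6 qseqid sseqid pident length evalue bitscore staxids`. The column list, the `best_hits` helper, the e-value cutoff, and the example rows are illustrative assumptions, not existing pipeline code.

```python
import csv
import io

# Assumed column order; it must match the fields given to --outfmt 6.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: row}, keeping the highest-bit-score hit per query."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < len(COLUMNS):
            continue  # skip blank or truncated lines
        row = dict(zip(COLUMNS, fields))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] > max_evalue:
            continue  # hit is weaker than the (assumed) e-value cutoff
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only (hypothetical IDs and TaxIDs).
example = io.StringIO(
    "read1\tprotA\t80.0\t100\t1e-20\t95.2\t111\n"
    "read1\tprotB\t97.5\t100\t1e-50\t189.6\t222\n"
    "read2\tprotC\t40.0\t50\t0.5\t30.1\t333\n"
)
hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["bitscore"], hit["staxids"])
```

Ranking by bit score is only one possible choice. Ties, or queries with several near-equal hits, may be better handled with an LCA step.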

---

**Relevant Tools:**

1. **[DIAMOND](https://github.com/bbuchfink/diamond)**
   - High-speed aligner for translated DNA vs protein database (BLASTX alternative)
   - Useful for detecting distant homologs and novel viruses
   - Supports BLAST tabular output (`--outfmt 6`) with taxonomy IDs via the `staxids` field

2. **[Kraken2](https://github.com/DerrickWood/kraken2)**
   - Ultra-fast k-mer based classifier using a prebuilt database (e.g., RefSeq)
   - Good for high-throughput classification, but may struggle with novel or highly divergent sequences

3. **[Mash](https://github.com/marbl/Mash)**
   - Fast genome distance estimation using MinHash
   - Good for rapid comparison against reference genomes (useful for nearest-reference identification)

**Reference Database & Taxonomy Mapping:**

### 🔬 **DIAMOND**
- **Input:** Protein database in FASTA format (e.g., NCBI NR, RefSeq viral proteins, or a custom DB)
- **Recommendation:** Use **NCBI RefSeq viral proteins** or a curated subset for manageable size.
- Run `diamond makedb` to build the DIAMOND database from the FASTA file.
- To retain taxonomy info:
  - Pass **--taxonmap** and **--taxonnodes** (and optionally **--taxonnames**) to `diamond makedb`.
  - Use files from NCBI:
    - `prot.accession2taxid` (from [ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/))
    - `nodes.dmp`, `names.dmp` (from [ftp.ncbi.nih.gov/pub/taxonomy/](https://ftp.ncbi.nih.gov/pub/taxonomy/))
  - Or write **DAA output** (`--outfmt 100`), which MEGAN can import, if integrating later with MEGAN for taxonomy binning.
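
The database build and alignment steps could be scripted roughly as below. This is a sketch that only assembles the command lines. The helper names, file paths, and thread count are hypothetical, and the flags should be checked against the installed DIAMOND version before use.

```python
import shlex


def makedb_cmd(proteins_faa, db_prefix, taxonmap, taxonnodes, taxonnames=None):
    """Build the argument list for `diamond makedb` with taxonomy mapping."""
    cmd = [
        "diamond", "makedb",
        "--in", proteins_faa,
        "--db", db_prefix,
        "--taxonmap", taxonmap,
        "--taxonnodes", taxonnodes,
    ]
    if taxonnames:
        cmd += ["--taxonnames", taxonnames]
    return cmd


def blastx_cmd(query, db_prefix, out_tsv, threads=4):
    """Build the argument list for a translated search with tabular output."""
    return [
        "diamond", "blastx",
        "--query", query,
        "--db", db_prefix,
        "--out", out_tsv,
        "--outfmt", "6", "qseqid", "sseqid", "pident", "length",
        "evalue", "bitscore", "staxids",
        "--threads", str(threads),
    ]


# Hypothetical file names, shown only to illustrate the resulting commands.
print(shlex.join(makedb_cmd("viral_proteins.faa", "viral_db",
                            "prot.accession2taxid.gz", "nodes.dmp", "names.dmp")))
print(shlex.join(blastx_cmd("reads.fastq", "viral_db", "diamond_hits.tsv")))
```

Passing these lists to `subprocess.run(cmd, check=True)` would execute them. Printing them first makes it easy to record the exact parameters for the documentation task.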

### 🧬 **Kraken2**
- Databases:
  - The prebuilt `standard` database includes archaeal, bacterial, viral, plasmid, and human sequences from RefSeq, plus UniVec_Core.
  - You can also **build a custom database** with viral-only sequences for focused benchmarks.
- Classification output includes NCBI taxonomy IDs by default.
- For custom builds, `kraken2-build --download-taxonomy` fetches the NCBI taxonomy files that link sequences to TaxIDs.
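
To compare Kraken2 with the other tools, species-level calls can be pulled from the standard report (`--report`). The sketch below assumes the six-column, tab-separated report layout (percent, clade reads, direct reads, rank code, TaxID, name). The `species_from_report` helper, the read-count threshold, and the example lines are hypothetical.

```python
def species_from_report(lines, min_clade_reads=10):
    """Return {taxid: (name, clade_reads)} for species-rank report lines."""
    found = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(parts[1])
        rank = parts[3].strip()
        taxid = int(parts[4])
        name = parts[5].strip()  # names are indented by taxonomic depth
        if rank == "S" and clade_reads >= min_clade_reads:
            found[taxid] = (name, clade_reads)
    return found


# Made-up report lines with placeholder TaxIDs and names.
report = [
    " 60.00\t600\t600\tU\t0\tunclassified",
    " 40.00\t400\t0\tR\t1\troot",
    " 39.00\t390\t5\tG\t1000\t    Example genus",
    " 38.00\t380\t380\tS\t1001\t      Example virus A",
    "  0.30\t3\t3\tS\t1002\t      Example virus B",
]
calls = species_from_report(report)
print(calls)
```

The minimum read count is a tunable assumption. A low threshold increases sensitivity at the cost of more false positives from spurious k-mer matches.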

### 🧪 **Mash**
- Works on genome assemblies or sketches.
- Reference set must be carefully curated and annotated.
- Use the NCBI RefSeq viral genomes, and ensure FASTA headers include identifying information (e.g., accession, species).
- Optionally link outputs back to TaxIDs using a reference lookup table.
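
Linking Mash results back to TaxIDs could be done along these lines. The sketch assumes the tab-separated output of `mash dist` (reference ID, query ID, distance, p-value, shared hashes). The `nearest_reference` helper, the distance cutoff, the lookup table, and the example rows are hypothetical.

```python
def nearest_reference(lines, ref_to_taxid, max_distance=0.1):
    """Return the lowest-distance `mash dist` row as a dict, or None."""
    best = None
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        ref, query, dist, pval, shared = parts[:5]
        dist = float(dist)
        if dist > max_distance:
            continue  # too distant to count as a match under the assumed cutoff
        if best is None or dist < best["distance"]:
            best = {
                "reference": ref,
                "query": query,
                "distance": dist,
                "p_value": float(pval),
                "shared_hashes": shared,
                "taxid": ref_to_taxid.get(ref),  # None if the reference is unmapped
            }
    return best


# Made-up rows and lookup table for illustration only.
rows = [
    "refA.fasta\tsample1.fasta\t0.25\t1e-5\t20/1000",
    "refB.fasta\tsample1.fasta\t0.02\t0\t520/1000",
    "refC.fasta\tsample1.fasta\t0.05\t0\t300/1000",
]
lookup = {"refA.fasta": 111, "refB.fasta": 222, "refC.fasta": 333}
print(nearest_reference(rows, lookup))
```

Keeping the lookup table keyed by the exact reference ID that Mash prints (file name or FASTA header, depending on how the sketch was built) avoids a separate header-parsing step.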

**Expected output:**

- A benchmark table/report showing:
  - Which pathogens were detected by each tool
  - Their closest matching reference genome
  - Classification accuracy
  - Runtime and computational efficiency
- A recommendation on whether/how to integrate DIAMOND (e.g., post-assembly validation, secondary classifier, etc.)

| Sample | Tool     | Pathogen Detected | Closest Reference | TaxID | % Identity / Score | Detection Accuracy | Notes |
|--------|----------|-------------------|--------------------|-------|--------------------|---------------------|-------|
| S1     | DIAMOND  | Human adenovirus C | NC_001405.1        | 10508 | 97.5% / 189.6      | ✅ True Positive     | Best hit matches expected strain |
| S1     | Kraken2  | Human adenovirus   | TaxID 10508        | 10508 | -                  | ✅ True Positive     | No strain resolution |
| S1     | Mash     | Human adenovirus C | NC_001405.1        | 10508 | 0.02 Mash distance | ✅ True Positive     | Correct nearest ref |

The table can be expanded to include multiple organisms per sample, and the results can also be summarized as a confusion matrix or precision-recall curve.
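
As a starting point for the comparison, per-tool detection metrics can be computed from sets of TaxIDs. The sketch below assumes the in silico dataset provides the expected TaxIDs per sample and that a fixed set of candidate TaxIDs (for example, everything in the reference database) defines the negatives. The function name and the example numbers are hypothetical.

```python
def detection_metrics(expected, detected, candidates):
    """Compute confusion-matrix counts and rates from sets of TaxIDs."""
    tp = len(expected & detected)
    fp = len(detected - expected)
    fn = len(expected - detected)
    tn = len(candidates - expected - detected)

    def ratio(numerator, denominator):
        # Undefined rates are reported as None instead of raising an error.
        return numerator / denominator if denominator else None

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Placeholder TaxIDs, not real benchmark results.
expected = {1, 2, 3}
detected = {2, 3, 4}
candidates = set(range(1, 11))
metrics = detection_metrics(expected, detected, candidates)
print(metrics)
```

Running this once per sample and tool yields the rows for a confusion matrix. Sweeping a score threshold (bit score, read count, or Mash distance) before building `detected` gives the points for a precision-recall curve.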

---

**Next Steps:**

- [ ] Select or curate reference databases for each tool
- [ ] Document database creation and taxonomy mapping
- [ ] Run tools on in silico dataset (see #X)
- [ ] Parse and compare output
- [ ] Summarize detection performance and nearest-genome accuracy

