Create a testing viral metagenomics dataset

**Description:**

We need to generate a small, controlled in silico test dataset for the viral metagenomics pipeline. This will be used for validating functionality and benchmarking performance.

The dataset should include:

- A `metadata.csv` file with a list of organisms (viruses, bacteria, fungi, and host genomes like human or other mammals)
- Simulated FASTQ files for both **Illumina** and **Nanopore** sequencing platforms
- For each sample, we should define the expected **% of reads** from each organism (to later test detection/quantification accuracy)

The goal is to create synthetic samples that closely resemble real-world metagenomic data in complexity and composition.

---

**Deliverables:**

- `metadata.csv` with organism names and assigned abundance percentages per sample
- Paired-end FASTQ files for Illumina
- Single-end FASTQ files (or appropriate format) for Nanopore
- Clear documentation on how the data was generated (tools, parameters, etc.)
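
As a starting point for discussion, here is a minimal sketch of how `metadata.csv` could be generated in long format (one row per sample and organism). The column names, sample names, organism names, and percentages are all placeholders chosen for illustration, not a proposed design. The only rule it enforces is that each sample's percentages sum to 100.

```python
import csv
import io

# Hypothetical compositions: percent of reads per organism, per sample.
# All names and values here are placeholders.
SAMPLES = {
    "sample_1": {"virus_A": 5.0, "bacterium_B": 20.0, "fungus_C": 5.0, "host": 70.0},
    "sample_2": {"virus_A": 0.5, "bacterium_B": 9.5, "fungus_C": 0.0, "host": 90.0},
}


def validate(samples, tol=1e-6):
    """Raise ValueError if a sample has a negative value or does not sum to 100%."""
    for name, composition in samples.items():
        if any(pct < 0 for pct in composition.values()):
            raise ValueError(f"{name}: negative percentage")
        total = sum(composition.values())
        if abs(total - 100.0) > tol:
            raise ValueError(f"{name}: percentages sum to {total}, expected 100")


def to_csv(samples):
    """Return the metadata table as CSV text, one row per (sample, organism)."""
    validate(samples)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample", "organism", "percent_reads"])
    for name, composition in samples.items():
        for organism, pct in composition.items():
            writer.writerow([name, organism, pct])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLES), end="")
```

A long format like this makes it easy to join expected abundances against pipeline output later, but a wide table (one column per sample) would work just as well.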

---

**Useful tools for in silico dataset generation:**

Here are some open-source tools that can help with generating synthetic metagenomic data:

1. **[CAMISIM](https://github.com/CAMI-challenge/CAMISIM)**
   - A powerful tool for simulating microbial communities and metagenomic sequencing datasets.
   - Supports both Illumina and Nanopore.
   - Can simulate taxonomic compositions and include strain-level variation.

2. **[InSilicoSeq](https://github.com/HadrienG/InSilicoSeq)**
   - Focused on realistic Illumina read simulation from genomes.
   - Allows setting error models and read proportions.

3. **[NanoSim](https://github.com/bcgsc/NanoSim)**
   - Simulator for Oxford Nanopore reads.
   - Can be trained on real data or use predefined profiles.

4. **[ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm)**
   - A classic Illumina read simulator.
   - Good for simple and fast simulations.

5. **[Grinder](https://github.com/zlinsly/grinder)**
   - General-purpose read simulator for amplicon and shotgun sequencing.
   - Can generate mixed community samples with specific abundance profiles.

6. **[NeatSeq-Flow’s simulator module](https://neatseq-flow.readthedocs.io/en/latest/Modules.html#simulatefastq)**
   - Simple for small datasets, can complement other tools.
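
Whichever simulator is chosen, the per-sample percentages have to be turned into something the tool accepts, typically either a read count per genome or a file of relative abundances. The sketch below shows one way to do both. The function names and genome IDs are hypothetical, and the tab-separated `genome_id<TAB>fraction` layout is an assumption modeled on the abundance file that InSilicoSeq documents for its `--abundance_file` option; the exact format should be checked against the documentation of the tool we end up using.

```python
def reads_per_organism(percentages, total_reads):
    """Split total_reads across organisms so that the counts sum exactly
    to total_reads (largest-remainder rounding)."""
    raw = {org: pct * total_reads / 100.0 for org, pct in percentages.items()}
    counts = {org: int(value) for org, value in raw.items()}  # round down first
    leftover = total_reads - sum(counts.values())
    # Give the remaining reads to the organisms with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda org: raw[org] - counts[org], reverse=True)
    for org in by_remainder[:leftover]:
        counts[org] += 1
    return counts


def abundance_lines(percentages):
    """Format percentages as 'genome_id<TAB>fraction' lines (assumed layout)."""
    return [f"{org}\t{pct / 100.0:.6f}" for org, pct in percentages.items()]


if __name__ == "__main__":
    # Placeholder composition for one sample.
    composition = {"virus_A": 5.0, "bacterium_B": 20.0, "host": 75.0}
    print(reads_per_organism(composition, 1000))
    print("\n".join(abundance_lines(composition)))
```

Rounding with the largest-remainder rule keeps the total read count exact, which matters for very small test samples where a single read is a noticeable fraction of a low-abundance virus.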

---

**Next steps:**

- [ ] Decide on the list of organisms to include (and find/download their genomes)
- [ ] Define desired composition for a small number of test samples (e.g., 3-5 samples)
- [ ] Select tool(s) for simulation and document the choice
- [ ] Generate and validate synthetic reads
- [ ] Upload test data to the repo or a public bucket (Zenodo, S3, etc.)
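
For the "validate synthetic reads" step, one simple check is to count simulated reads per source genome and compare the observed percentages with the values in `metadata.csv`. The sketch below assumes that each read header starts with the source genome ID followed by `_<read index>`; that naming scheme is hypothetical, since header formats differ between simulators, so the `genome_of` function would need to be adapted. It also assumes plain four-line FASTQ records.

```python
import io
from collections import Counter


def observed_percentages(fastq_handle, genome_of):
    """Return the percent of reads per genome in a 4-line-per-record FASTQ stream.

    genome_of maps a read header (without the leading '@') to a genome ID.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0:  # first line of each record is the header
            counts[genome_of(line[1:].strip())] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genome: 100.0 * n / total for genome, n in counts.items()}


def max_abs_deviation(expected, observed):
    """Largest absolute difference (in percentage points) across all organisms."""
    organisms = set(expected) | set(observed)
    return max(abs(expected.get(o, 0.0) - observed.get(o, 0.0)) for o in organisms)


if __name__ == "__main__":
    # Tiny in-memory FASTQ with made-up reads: 3 host reads and 1 viral read.
    fastq = io.StringIO(
        "@host_0\nACGT\n+\nIIII\n"
        "@host_1\nACGT\n+\nIIII\n"
        "@virus_A_0\nACGT\n+\nIIII\n"
        "@host_2\nACGT\n+\nIIII\n"
    )
    # Assumed header scheme: '<genome_id>_<read index>'.
    observed = observed_percentages(fastq, lambda h: h.rsplit("_", 1)[0])
    print(observed)
    print(max_abs_deviation({"host": 75.0, "virus_A": 25.0}, observed))
```

For paired-end Illumina data, running this on one mate file should be enough, as long as both mates come from the same genome.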
