Description:
We need to generate a small, controlled in silico test dataset for the viral metagenomics pipeline. This will be used for validating functionality and benchmarking performance.
The dataset should include:
- A `metadata.csv` file with a list of organisms (viruses, bacteria, fungi, and host genomes such as human or other mammals)
- Simulated FASTQ files for both Illumina and Nanopore sequencing platforms
- For each sample, we should define the expected % of reads from each organism (to later test detection/quantification accuracy)
The goal is to create synthetic samples that closely resemble real-world metagenomic data in complexity and composition.
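To make the per-sample composition concrete, here is a minimal sketch of one possible `metadata.csv` layout and a sanity check that each sample's percentages sum to 100. The column names (`sample_id`, `organism`, `category`, `expected_percent`), the organisms, and the values are all placeholders for illustration, not an agreed schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: one row per (sample, organism) pair.
# Organisms and percentages are placeholders, not the final selection.
EXAMPLE_METADATA = """\
sample_id,organism,category,expected_percent
sample_01,Homo sapiens,host,90.0
sample_01,Escherichia coli,bacteria,7.0
sample_01,Candida albicans,fungi,2.0
sample_01,Human adenovirus C,virus,1.0
sample_02,Homo sapiens,host,99.0
sample_02,Escherichia coli,bacteria,0.9
sample_02,Human adenovirus C,virus,0.1
"""


def load_composition(handle):
    """Return {sample_id: {organism: expected_percent}} from a CSV handle."""
    composition = defaultdict(dict)
    for row in csv.DictReader(handle):
        composition[row["sample_id"]][row["organism"]] = float(row["expected_percent"])
    return dict(composition)


def check_totals(composition, tolerance=1e-6):
    """Return the sample IDs whose percentages do not sum to 100."""
    return [
        sample
        for sample, organisms in composition.items()
        if abs(sum(organisms.values()) - 100.0) > tolerance
    ]


composition = load_composition(io.StringIO(EXAMPLE_METADATA))
bad_samples = check_totals(composition)
```

A long format (one row per sample and organism) is only one option. A wide table with one column per sample would work too, but it is harder to extend when samples are added.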
Deliverables:
- `metadata.csv` with organism names and assigned abundance percentages per sample
- Paired-end FASTQ files for Illumina
- Single-end FASTQ files (or appropriate format) for Nanopore
- Clear documentation on how the data was generated (tools, parameters, etc.)
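Since abundance is defined as a percentage of reads, each sample's percentages have to be turned into whole read counts at some point. A possible sketch, assuming a fixed total number of reads per sample (for paired-end data this could equally be read pairs) and using largest-remainder rounding so the counts add up exactly; the function name and the example values are hypothetical:

```python
import math


def allocate_reads(percentages, total_reads):
    """Split total_reads across organisms so the counts sum exactly to total_reads."""
    total_percent = sum(percentages.values())
    exact = {name: total_reads * pct / total_percent for name, pct in percentages.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = total_reads - sum(counts.values())
    # Hand the remaining reads to the organisms with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Placeholder composition for a single sample.
counts = allocate_reads(
    {"host": 90.0, "bacterium": 7.5, "fungus": 2.0, "virus": 0.5},
    total_reads=1001,
)
```

Recording the resulting counts next to the percentages in the documentation would make it easier to compare detected against expected reads later on.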
Useful tools for in silico dataset generation:
Here are some open-source tools that can help with generating synthetic metagenomic data:
- [CAMISIM](https://github.com/CAMI-challenge/CAMISIM)
  - A powerful tool for simulating microbial communities and metagenomic sequencing datasets.
  - Supports both Illumina and Nanopore.
  - Can simulate taxonomic compositions and include strain-level variation.
- [InSilicoSeq](https://github.com/HadrienG/InSilicoSeq)
  - Focused on realistic Illumina read simulation from genomes.
  - Allows setting error models and read proportions.
- [NanoSim](https://github.com/bcgsc/NanoSim)
  - Simulator for Oxford Nanopore reads.
  - Can be trained on real data or use predefined profiles.
- [ART](https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm)
  - A classic Illumina read simulator.
  - Good for simple and fast simulations.
- [Grinder](https://github.com/zlinsly/grinder)
  - General-purpose read simulator for amplicon and shotgun sequencing.
  - Can generate mixed community samples with specific abundance profiles.
- [NeatSeq-Flow’s simulator module](https://neatseq-flow.readthedocs.io/en/latest/Modules.html#simulatefastq)
  - Simple for small datasets, can complement other tools.
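As an illustration of how a per-sample composition could feed one of these tools, the sketch below prepares input for InSilicoSeq. It assumes that `iss generate` takes a tab-separated abundance file of FASTA record IDs and read fractions via `--abundance_file`, along with `--genomes`, `--model`, `--n_reads`, and `--output`. Those options should be checked against `iss generate --help` for the installed version. The record IDs and file names are placeholders, and the command is only built, not executed.

```python
def abundance_lines(percent_by_record):
    """Format one 'record_id<TAB>fraction' line per FASTA record, with fractions summing to 1."""
    total = sum(percent_by_record.values())
    return [f"{record_id}\t{pct / total:.6f}" for record_id, pct in percent_by_record.items()]


def iss_command(genomes_fasta, abundance_file, output_prefix, n_reads, model="miseq"):
    """Build (but do not run) an InSilicoSeq command line as an argument list."""
    return [
        "iss", "generate",
        "--genomes", genomes_fasta,
        "--abundance_file", abundance_file,
        "--model", model,
        "--n_reads", str(n_reads),
        "--output", output_prefix,
    ]


# Placeholder record IDs; real ones must match the headers in the genomes FASTA.
lines = abundance_lines({"record_host": 99.0, "record_virus": 1.0})
command = iss_command("genomes.fasta", "sample_01_abundance.txt", "sample_01", 100000)
```

Saving the exact argument list for every sample would also cover part of the documentation deliverable (tools, parameters, etc.).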
Next steps:
- Decide on the list of organisms to include (and find/download their genomes)
- Define desired composition for a small number of test samples (e.g., 3-5 samples)
- Select tool(s) for simulation and document the choice
- Generate and validate synthetic reads
- Upload test data to the repo or a public bucket (Zenodo, S3, etc.)
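For the validation step, a first structural check could look like the sketch below: it counts records in an uncompressed four-line FASTQ stream, verifies that sequence and quality strings have equal length, and can be used to confirm that the R1 and R2 files of a paired-end sample contain the same number of reads. The function name and the inline example reads are made up for illustration, and gzip-compressed files would need to be opened with `gzip.open(path, "rt")` first.

```python
import io


def count_fastq_records(handle):
    """Count records in a 4-line FASTQ text stream, checking basic structure."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between or after records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record {count + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in record {count + 1}")
        count += 1
    return count


# Tiny made-up paired-end example.
R1 = "@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n"
R2 = "@read1/2\nTTGA\n+\nIIII\n@read2/2\nCCAA\n+\nIIII\n"
r1_count = count_fastq_records(io.StringIO(R1))
r2_count = count_fastq_records(io.StringIO(R2))
pairs_match = r1_count == r2_count
```

This only catches truncated or mismatched files. Checking that the simulated composition matches the expected percentages would still require the read origin, which most simulators encode in the read header or in a separate mapping file.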