- Author(s): Boas van der Putten, Roxanne Wolthuis
- Organization: Rijksinstituut voor Volksgezondheid en Milieu (RIVM)
- Department: Infektieziekteonderzoek, Diagnostiek en Laboratorium Surveillance (IDS), Bacteriologie (BPD)
- Start date: 07 - 04 - 2023
- Commissioned by: Thijs Bosch
Apollo-mapping is the first pipeline created in the Apollo pipeline series. The goal of these pipelines is to set up routine surveillance for fungi (A. fumigatus, Candida). The apollo-mapping pipeline is built with the juno-template and juno-library.
The input of the pipeline is raw Illumina paired-end data in the form of two fastq files per sample (with extension .fastq, .fastq.gz, .fq or .fq.gz), containing the forward and the reverse reads ('R1' and 'R2' must be part of the file name, respectively).
The pipeline uses the following tools (list not complete):
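The naming rule above can be checked before starting a run. The following is a minimal sketch, not the pipeline's own sample detection: the function names `pair_fastq_files` and `unpaired_samples` are hypothetical, and it assumes that 'R1'/'R2' is separated from the sample name by `_`, `-` or `.`, which is stricter than what the pipeline itself may accept.

```python
import re
from collections import defaultdict
from pathlib import Path

# Extensions listed in the input description above.
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")

# Assumption: the sample name is everything before a delimited R1/R2 tag,
# e.g. "sampleA_R1.fastq.gz" -> sample "sampleA", read "R1".
READ_TAG = re.compile(r"^(?P<sample>.+?)[._-](?P<read>R[12])(?=[._-])")


def pair_fastq_files(input_dir):
    """Group fastq files in input_dir as {sample: {"R1": path, "R2": path}}."""
    samples = defaultdict(dict)
    for path in sorted(Path(input_dir).iterdir()):
        # Subfolders are skipped: the input files must sit directly in input_dir.
        if not path.is_file() or not path.name.endswith(FASTQ_EXTENSIONS):
            continue
        match = READ_TAG.match(path.name)
        if match is None:
            continue  # no recognisable R1/R2 tag in the file name
        samples[match["sample"]][match["read"]] = path
    return dict(samples)


def unpaired_samples(samples):
    """Return the names of samples that lack either the R1 or the R2 file."""
    return sorted(
        name for name, reads in samples.items() if set(reads) != {"R1", "R2"}
    )
```

Calling `unpaired_samples(pair_fastq_files("path/to/input"))` lists the samples that would need attention before the directory is passed to the pipeline.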
- FastQC (Andrews, 2010) is used to assess the quality of the raw Illumina reads
- FastP (Chen, Zhou, Chen and Gu, 2018) is used to remove poor quality data and adapter sequences
- Picard determines the library fragment lengths
- MultiQC (Ewels, Magnusson, Lundin, & Käller, 2016) is used to summarize analysis results and quality assessments in a single report for dynamic visualization.
- Kraken2 and Bracken are used to identify fungal species.
- Linux environment
- (mini)conda
- Python 3.11
- Clone the repository.
git clone https://github.com/RIVM-bioinformatics/apollo-mapping.git
- Go to the pipeline directory.
cd apollo-mapping
- Create & activate mamba environment.
conda env update -f envs/mamba.yaml
conda activate mamba
- Create & activate apollo environment.
mamba env update -f envs/apollo_mapping.yaml
conda activate apollo_mapping
- Example of run:
python3 apollo_mapping.py -i [input] -o [output] -s [species]
- -h, --help: Shows the help of the pipeline.
- -i, --input: Relative or absolute path to the input directory. It must contain all the raw reads (fastq) files for all samples to be processed (not in subfolders).
- -s, --species: Species to use, choose from: ['candida_auris', 'aspergillus_fumigatus'].
- -o, --output: Relative or absolute path to the output directory. If none is given, an 'output' directory will be created in the current directory.
- -w, --workdir: Relative or absolute path to the working directory. If none is given, the current directory is used.
- -ex, --exclusionfile: Path to the file that contains samplenames to be excluded.
- -p, --prefix: Conda or singularity prefix. Basically a path to the place where you want to store the conda environments or the singularity images.
- -l, --local: If this flag is present, the pipeline will be run locally (not attempting to send the jobs to an HPC cluster). The default is to assume that you are working on a cluster. Note that currently only LSF clusters are supported.
- -tl, --time-limit: Time limit per job in minutes (passed as -W argument to bsub). Jobs will be killed if not finished in this time.
- -u, --unlock: Unlock output directory (passed to snakemake).
- -n, --dryrun: Dry run printing steps to be taken in the pipeline without actually running it (passed to snakemake).
- -q, --queue: Name of the queue that the job will be submitted to if working on a cluster.
- -mpt, --mean-quality-treshold: Phred score to be used as threshold for cleaning (filtering) fastq files.
- -ws, --window-size: Window size to use for cleaning (filtering) fastq files.
- -ml, --minimum-lenth: Minimum length for fastq reads to be kept after trimming.
- --no-containers: Use conda environments instead of containers.
- --snakemake-args: Extra arguments to be passed to snakemake API (https://snakemake.readthedocs.io/en/stable/api_reference/snakemake.html).
- --reference: Reference genome to use. The default is chosen based on the species argument; defaults per species can be found in: /mnt/db/apollo/mapping/[species]
- --db-dir: Kraken2 database directory (should include fungi!).
python3 apollo_mapping.py -i [dir/to/fastq_files] -s [species]
python3 apollo_mapping.py -i [dir/to/fastq_files] -o [/path/to/output/location] -s aspergillus_fumigatus
Detailed information about the pipeline can be found in the [documentation](link to other docs). This documentation is only suitable for users who have access to the RIVM Linux environment.
- audit_trail: Logs of conda, git and the pipeline, a sample sheet, the used parameters and a snakemake report.
- clean_fastq: Cleaned fastq files.
- identify_species: Output of kraken and bracken for species identification.
- log: Log with output and error file from the cluster for each Snakemake rule/step that is performed.
- mapped_reads: Mapping output.
- multiqc: MultiQC output and MultiQC HTML report.
- qc_clean_fastq: Quality control of clean fastq reads.
- qc_mapping: Quality control of mapping.
- reference: Reference genome used.
- variant: Variant calling results.
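After a run, the presence of these folders can be verified with a few lines of Python. This is only a sketch under the assumption that every folder listed above is a direct subdirectory of the output directory; `missing_output_dirs` is a hypothetical helper and not part of the pipeline.

```python
from pathlib import Path

# Folder names copied from the output list above.
EXPECTED_SUBDIRS = (
    "audit_trail",
    "clean_fastq",
    "identify_species",
    "log",
    "mapped_reads",
    "multiqc",
    "qc_clean_fastq",
    "qc_mapping",
    "reference",
    "variant",
)


def missing_output_dirs(output_dir):
    """Return the expected subdirectories that do not exist in output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list from `missing_output_dirs("output")` means all listed folders were created; it says nothing about whether their contents are complete.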
- This pipeline only works on the RIVM cluster.
- Make this pipeline available and user friendly for users outside RIVM.
This pipeline is licensed under an AGPL3 license. Detailed information can be found inside the 'LICENSE' file in this repository.
- Contact person: IDS-Bioinformatics
- Email: [email protected]
Apollo pipelines use a feature branch workflow. To work on a feature, create a branch from the main branch and make your changes there. This branch can be merged into the main branch via a pull request. Hotfixes for bugs can be committed directly to the main branch.
Please adhere to the Conventional Commits specification for commit messages. These commit messages can be picked up by release-please to create meaningful release messages.
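A commit message header can be checked against the basic `type(scope)!: description` shape of the Conventional Commits specification before committing. The sketch below is illustrative only: `is_conventional` is a hypothetical helper, and the set of allowed types is an assumption based on commonly used types, not a list defined by this repository.

```python
import re

# Assumption: commonly used types; the specification itself only defines
# "feat" and "fix" and allows projects to pick additional ones.
COMMON_TYPES = frozenset(
    {"feat", "fix", "docs", "style", "refactor", "perf",
     "test", "build", "ci", "chore", "revert"}
)

# Header shape: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def is_conventional(message, allowed_types=COMMON_TYPES):
    """Return True if the first line of message looks like a conventional commit."""
    lines = message.splitlines()
    if not lines:
        return False
    match = HEADER.match(lines[0])
    return match is not None and match["type"] in allowed_types
```

For example, `is_conventional("fix(mapping): handle empty fastq files")` passes the check, while `is_conventional("Fixed stuff")` does not. Only the header line is inspected; the body and footers are ignored.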