Main tool: Rasusa
Additional tools:
- none
Full documentation: https://github.com/mbhall88/rasusa
Randomly subsample sequencing reads to a specified coverage.
Note that rasusa requires certain parameters:
- -c, --coverage: can be replaced by (-b, --bases), (-n, --num), or (-f, --frac) for the reads command, but is required for the aln command.
- -g, --genome-size: can be replaced by (-b, --bases), (-n, --num), or (-f, --frac) for the reads command, and is not an argument for the aln command.
- input
  - Valid FASTA or FASTQ format for the reads command
  - Valid coordinate-sorted SAM/BAM/CRAM for the aln command
  - Can be compressed via gzip
  - If 2 inputs are passed to the reads command, it is assumed they are paired-end reads.
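As a rough illustration of how the coverage and genome size parameters relate to a base count: the target number of bases to keep is coverage multiplied by genome size. The sketch below works through that arithmetic with illustrative values. The mean read length is a hypothetical assumption used only to estimate a read count; it is not something rasusa takes as input.

```bash
# Illustrative values only; substitute your own.
coverage=100
genome_size=5000000      # 5M written out in full
mean_read_len=150        # hypothetical mean read length, for a rough estimate only

# Target number of bases to keep = coverage x genome size.
target_bases=$(( coverage * genome_size ))

# Rough estimate of how many reads that corresponds to (integer division).
approx_reads=$(( target_bases / mean_read_len ))

echo "target bases: ${target_bases}"
echo "approximate reads at ${mean_read_len} bp: ${approx_reads}"
```

Under these assumptions the target is 500000000 bases, which is the kind of value that could be passed to --bases instead of giving coverage and genome size separately.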
Note: Version 3.0.0 introduces a major change from versions < 3.0.0.
- deps: Subsampling results for a fixed seed will differ from versions < 3.0.0. This is caused by internal algorithmic changes in the rand crate (0.8.5 -> 0.10.0) and requires a major version bump.
# SARS-CoV-2 example, paired-end Illumina
#   -n 40434 : downsample to a specific number of reads per FASTQ file
#   -s 1     : set seed
#   -O g     : set output file compression format to gzip
rasusa reads \
  -n 40434 \
  -s 1 \
  -O g \
  -o SRR13957123_downsampled_1.fastq.gz -o SRR13957123_downsampled_2.fastq.gz \
  SRR13957123_1.fastq.gz SRR13957123_2.fastq.gz

# Salmonella enterica example, paired-end Illumina
#   --coverage 100   : downsample to 100X coverage
#   --genome-size 5M : genome size used to compute coverage (5 million bases)
#   -s 1             : set seed
#   -O g             : set output file compression format to gzip
rasusa reads \
  --coverage 100 \
  --genome-size 5M \
  -s 1 \
  -O g \
  -o SRR10992628_downsampled_1.fastq.gz -o SRR10992628_downsampled_2.fastq.gz \
  SRR10992628_1.fastq.gz SRR10992628_2.fastq.gz
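
One way to sanity-check the downsampled output is to count the records in each gzip-compressed FASTQ file and confirm that both mates of a pair contain the same number. The helper below is a minimal sketch, not part of rasusa: count_reads is a hypothetical function name, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```bash
# Hypothetical helper: print the number of records in a gzipped FASTQ file.
# Assumes exactly 4 lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Report read counts for the downsampled files from the example, if present.
for fq in SRR13957123_downsampled_1.fastq.gz SRR13957123_downsampled_2.fastq.gz; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_reads "$fq") reads"
    fi
done
```

For the SARS-CoV-2 example, each file would be expected to report the number of reads given to -n, provided the inputs contained at least that many.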