Downsample reads in large fastq files #20

Fastq files with very large numbers of reads can take a very long time to run through the workflow. To address this, we currently exclude any sample with 1500 megabases or more of sequence data, using the `mbases < 1500` clause of this `augur filter` parameter:

```
--query "(mbases > 180 & mbases < 1500 & (country != 'Uncalculated'))"
```

Instead, we could keep these samples and randomly downsample the reads in very large fastq files. One option is to downsample these files to about 1 gigabase (1000 megabases), which should be more than enough sequence for high-quality genotyping.
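As a rough illustration of the arithmetic, the fraction of reads to keep can be derived from the `mbases` value already present in the metadata. This is only a sketch: the function name `keep_fraction` and the constant `TARGET_MBASES` are hypothetical, and it assumes 1 gigabase is treated as 1000 megabases.

```python
# Hypothetical constant: the 1 gigabase target proposed above, in megabases.
TARGET_MBASES = 1000.0


def keep_fraction(mbases: float, target_mbases: float = TARGET_MBASES) -> float:
    """Return the fraction of reads to keep so the expected yield is
    roughly target_mbases. Samples already at or below the target are
    left untouched (fraction 1.0)."""
    if mbases <= 0:
        raise ValueError("mbases must be positive")
    return min(1.0, target_mbases / mbases)


if __name__ == "__main__":
    for mbases in (500, 1500, 4000):
        print(mbases, keep_fraction(mbases))
```

Because the fraction is capped at 1.0, samples below the target would pass through unchanged, so the same step could in principle be applied to every sample.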

The downsampling would need to account for the fact that samples differ in read length and in whether they have single-end or paired-end reads.
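One way to handle both concerns is to convert the base target into a number of sampling units (reads for single-end data, read pairs for paired-end data) using the observed mean read length, and then to apply the same random selection to both mate files so that pairs stay in sync. The sketch below is illustrative only: all names (`units_to_keep`, `choose_indices`, `downsample`) are hypothetical, and it works on in-memory lists of records, whereas real fastq files of this size would need a streaming approach.

```python
import random
from typing import List, Optional, Tuple

# A fastq record as (header, sequence, separator, qualities).
Record = Tuple[str, str, str, str]


def units_to_keep(target_bases: int, mean_read_length: float, paired: bool) -> int:
    """Number of reads (single-end) or read pairs (paired-end) whose
    combined length is about target_bases."""
    bases_per_unit = mean_read_length * (2 if paired else 1)
    return int(target_bases // bases_per_unit)


def choose_indices(n_total: int, n_keep: int, seed: int = 0) -> List[int]:
    """Pick n_keep distinct record indices at random, in file order."""
    if n_keep >= n_total:
        return list(range(n_total))
    return sorted(random.Random(seed).sample(range(n_total), n_keep))


def downsample(
    r1: List[Record],
    r2: Optional[List[Record]] = None,
    target_bases: int = 1_000_000_000,  # assumed 1 gigabase target
    seed: int = 0,
) -> Tuple[List[Record], Optional[List[Record]]]:
    """Downsample single-end (r2 is None) or paired-end records."""
    paired = r2 is not None
    if paired and len(r1) != len(r2):
        raise ValueError("paired files must contain the same number of records")
    if not r1:
        return [], ([] if paired else None)

    # Mean read length is measured from the data, since it differs by sample.
    total_bases = sum(len(rec[1]) for rec in r1)
    if paired:
        total_bases += sum(len(rec[1]) for rec in r2)
    mean_len = total_bases / (len(r1) * (2 if paired else 1))

    n_keep = units_to_keep(target_bases, mean_len, paired)
    idx = choose_indices(len(r1), n_keep, seed)

    # The same indices are applied to both mates so pairs are preserved.
    kept_r1 = [r1[i] for i in idx]
    kept_r2 = [r2[i] for i in idx] if paired else None
    return kept_r1, kept_r2
```

Selecting indices once and reusing them for both files is what keeps mates together; sampling each file independently, even with the same fraction, would break the pairing.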

This was discussed on [Slack](https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1744073242834039?thread_ts=1743792510.979549&cid=C01LCTT7JNN).
