FASTQ is a text-based format for storing biological sequence data together with per-base quality scores. Each sequencing read is represented by exactly four lines:
- Sequence identifier — begins with
@, followed by a unique read ID and optional description. - Nucleotide sequence — the raw base calls (
A,T,G,C,N). - Separator — a
+character, optionally followed by the sequence identifier repeated. - Quality string — a string of ASCII characters, the same length as the sequence, encoding a Phred quality score for each base.
Each character in the quality string maps to a score via its ASCII code. In the Sanger/Phred+33 encoding used here, subtracting 33 from the ASCII value gives the quality score. A score of 40 (ASCII I) means a 1-in-10,000 chance of a sequencing error; a score of 10 (ASCII +) means a 1-in-10 chance.
FASTQ files may be plain text (.fastq) or gzip-compressed (.fastq.gz).
def quality_per_position_boxplot_data(fastq_filename: str) -> list[Boxplot]Reads a FASTQ file and returns one boxplot data structure per read position, summarizing the distribution of quality scores across all reads at that position.
-
Loads quality scores — parses the file and converts every quality string into a list of integer scores, producing one list per read.
-
Transposes by position — groups scores from all reads by their position index. Reads can differ in length.
-
Computes statistics per position — for each position the non-missing scores are sorted and the following values are calculated:
Field Description centerLine.minLowest score at this position centerLine.maxHighest score at this position box.q125th percentile (lower quartile) box.q375th percentile (upper quartile) median50th percentile averageArithmetic mean Percentiles are computed with linear interpolation between adjacent sorted values.
-
Returns a
list[Boxplot], one entry per position (0-indexed, stored in thenamefield), ordered from the first to the last position present in any read.
class Boxplot(TypedDict):
centerLine: CenterLine # {"min": float, "max": float}
box: Box # {"q1": float, "q3": float}
median: float
average: float
name: str # string representation of the 0-based position indexboxplots = quality_per_position_boxplot_data("reads.fastq.gz")
# boxplots[0] -> statistics for the first base of every read
# boxplots[1] -> statistics for the second base of every read
# ...