Skip to content

BAM QUAL field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values) #34

@pinin4fjords

Description

@pinin4fjords

Summary

rustar-aligner's BAM writer stores the raw FASTQ ASCII quality bytes (Phred+33 encoded) in the BAM's binary QUAL field, where the SAM spec mandates raw Phred values (0-93, no offset). Every QUAL byte in a rustar BAM is therefore +33 above the SAM-spec value. samtools view then adds another +33 when rendering as SAM text, so what should be BBBBBFFFFBBBF... (Phred 33-37, typical Illumina) comes out as cccccggggcccg... (would imply Phred 66-70, biologically impossible on any real sequencer).

Visible downstream symptoms in MultiQC and samtools stats:

Metric (samtools stats) STAR rustar Delta
average quality 35.3-35.5 68.3-68.5 +33 (consistent across 5 samples)
error rate ~0.0091 0.0 (companion symptom of #29 NM/nM, separate bug)

average_quality doubling was the smoking-gun signal that flagged this as something distinct from the NM tag bug.

SAM spec reference

From SAM v1 spec §1.4, QUAL field:

ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}.

The SAM text format ASCII-encodes with the +33 offset; the BAM binary format stores the raw Phred values without the offset (SAM spec §4.2.3):

qual: phred base quality (a sequence of 0xFF if absent) ... ranges from 0 to 93.

So a FASTQ char B (ASCII 66) represents Phred 33. The corresponding BAM byte should be 0x21 (33), and samtools view re-adds 33 to display it as B.

rustar appears to write 0x42 (66) directly into the BAM binary, so samtools view displays 66+33 = 99 = 'c', and samtools stats reads the byte as Phred 66 -> doubles the average.

Root cause in rustar

src/io/fastq.rs:55-60 documents the in-memory EncodedRead.quality as the raw FASTQ ASCII bytes:

/// A read from a FASTQ file with encoded bases
#[derive(Debug, Clone)]
pub struct EncodedRead {
    pub name: String,
    pub sequence: Vec<u8>,
    /// Quality scores (raw FASTQ quality values)
    pub quality: Vec<u8>,
}

src/io/sam.rs:632-637 passes those raw FASTQ bytes straight into noodles::sam's QualityScores::from:

let mut qual = read_qual.to_vec();
qual.reverse();
*record.quality_scores_mut() = QualityScores::from(qual);
// (and the non-reverse branch at L639)
*record.quality_scores_mut() = QualityScores::from(read_qual.to_vec());

QualityScores::from(Vec<u8>) in noodles is a thin wrapper that does not apply the Phred+33 offset — it expects the bytes to already be raw Phred values (matching the BAM spec). Net result: FASTQ ASCII bytes go straight to the BAM binary QUAL field, +33 above what they should be.

Reproducer (paired STAR + rustar)

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-qual && cd /tmp/rustar-mre-qual

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# Pick a read present in both BAMs and compare QUAL strings
READ=SRR6357072.7414694
echo "=== Same read $READ, STAR QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/STAR.Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"
echo "=== Same read $READ, rustar QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/RUS./Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"

echo
echo "=== samtools stats: average_quality ==="
echo -n "rustar: "; docker run --rm -v $PWD:/w $STAR samtools stats /w/RUS./Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
echo -n "STAR  : "; docker run --rm -v $PWD:/w $STAR samtools stats /w/STAR.Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'

Observed (verified on commit 5f8ad08 + STAR 2.7.11b)

Same read SRR6357072.7414694 in both BAMs:

=== STAR QUAL ===
BBBBBFFFFBBBFFFBBFFFFFFBBBF<<FBBFBFFF<<7FFFF/<FB7BF/</<FBFFFFFFFF</7FBFFBFBFF<F////FFFB/FFFFFFFFFFFB/

=== rustar QUAL ===
cccccggggcccgggccggggggcccg]]gccgcggg]]XggggP]gcXcgP]P]gcgggggggg]PXgcggcgcgg]gPPPPgggcPgggggggggggcP

Every byte in rustar's string is exactly 33 above the corresponding STAR byte:

STAR char (Phred) rustar char (samtools view shows) ASCII diff
B (33) c (66) +33
F (37) g (70) +33
< (27) ] (60) +33
7 (22) X (55) +33
/ (14) P (47) +33

samtools stats average quality across the five nf-core/rnaseq test profile samples:

Sample STAR rustar
WT_REP1 35.4 68.4
WT_REP2 35.5 68.5
RAP1_IAA_30M_REP1 35.4 68.4
RAP1_UNINDUCED_REP1 35.3 68.3
RAP1_UNINDUCED_REP2 35.3 68.3

Consistent +33 across every sample, every record.

Suggested fix

In src/io/sam.rs:632-640 (and the analogous transcriptome-path code in build_transcriptome_records around L611-L640), subtract the Phred+33 offset before handing the buffer to noodles:

// FASTQ-ASCII -> raw Phred for BAM binary QUAL
let qual_raw: Vec<u8> = read_qual.iter().map(|&b| b.saturating_sub(33)).collect();
*record.quality_scores_mut() = QualityScores::from(qual_raw);

(saturating_sub because a malformed FASTQ byte < 33 should clamp to 0, not underflow.)

Same conversion needed wherever EncodedRead.quality is plumbed into a noodles RecordBuf. Three call sites in src/io/sam.rs need updating (genome-PE, genome-SE, transcriptome).

A worthwhile companion change: rename the field doc to be explicit, e.g. quality: Vec<u8> doc comment to "Quality scores (FASTQ ASCII bytes, Phred+33 encoded — subtract 33 before writing to BAM)". The current comment "raw FASTQ quality values" is ambiguous — the value is "raw FASTQ", but FASTQ's raw values are already +33 over Phred.

Test plan

  1. Run rustar on any paired-end FASTQ.
  2. Assert samtools stats <bam> average quality is in the realistic Illumina range (25-42), not 60+.
  3. Assert per-byte deltas vs STAR are zero on a record present in both BAMs.
  4. Add a small unit test in src/io/sam.rs that feeds known FASTQ ASCII quality (e.g. b"III" = Phred 40,40,40) and asserts the resulting RecordBuf quality bytes are [40, 40, 40].

Why this matters

  • samtools stats is consumed by every modern QC pipeline. With QUAL bytes +33 too high, average quality doubles, error rate is wrong (compounding the #29 NM/nM zero-error symptom), and every quality histogram is shifted.
  • MultiQC displays Samtools Stats numbers prominently; the "average quality 68" entry is biologically impossible and will be flagged by any operator familiar with Illumina output.
  • Quality-aware variant callers (bcftools, GATK, DeepVariant) treat QUAL bytes as Phred values directly. A +33 offset elevates every base score past the calibration range these callers expect; some will silently emit wrong likelihoods, others will clip at 93 and lose information.
  • Quality-aware trimmers that operate on the BAM (rather than re-trimming from FASTQ) will see all bases as "high quality" and do nothing.

This and #29 NM/nM together make samtools stats and any quality-aware downstream tool unsafe to use on rustar BAMs until both are fixed.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions