Skip to content

--quantMode TranscriptomeSAM paired-end BAM omits mate fields, breaking Salmon TPMs #22

@pinin4fjords

Description

@pinin4fjords

Summary

For paired-end input, rustar-aligner --quantMode TranscriptomeSAM writes Aligned.toTranscriptome.out.bam records with SEGMENTED + FIRST_SEGMENT / LAST_SEGMENT set but not PROPERLY_SEGMENTED (0x2), and with RNEXT=*, PNEXT=0, TLEN=0 on every record. Both mates project to the same transcript at correct positions; they just aren't recorded as mates of each other.

Downstream Salmon's alignment-mode quant cannot infer a fragment-length distribution from these records and falls back to its default prior (mean=250, sd=25), inflating TPMs of short transcripts. NumReads agreement with STAR stays high (gene Pearson ~0.9998); gene TPM Pearson drops to ~0.985 on identical FASTQ.

STAR reference behaviour

STAR's genome-space paired record builder sets PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, the mate ref / position fields, and the signed TLEN — see source/SAMfunctions.cpp and the BAM-attribute path in source/ReadAlign_outputTranscriptSAM.cpp. The transcriptome BAM emission uses the same per-pair bookkeeping by construction — once a pair projects to a shared transcript, the mate flags and TLEN are stamped.

Reproducer

A single bash script that runs both aligners on the same inputs and side-by-side compares the transcriptome BAM:

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-22 && cd /tmp/rustar-mre-22

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2
        --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# --- Verify: mate-field absence in rustar's transcriptome BAM ---
RT=RUS./Aligned.toTranscriptome.out.bam
ST=STAR.Aligned.toTranscriptome.out.bam
for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RT; else BAM=$ST; fi
    total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$BAM)
    proper=$(docker run --rm -v $PWD:/w $STAR samtools view -c -f 2 /w/$BAM)
    echo "$label: $proper / $total properly-paired"
done

echo
echo "First two records (NAME FLAG RNAME POS RNEXT PNEXT TLEN), rustar then STAR:"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RT | awk 'BEGIN{OFS="\t"} {print "rustar", $1, $2, $3, $4, $7, $8, $9}' | head -2
docker run --rm -v $PWD:/w $STAR samtools view /w/$ST | awk 'BEGIN{OFS="\t"} {print "STAR  ", $1, $2, $3, $4, $7, $8, $9}' | head -2

Observed (verified on commit 5f8ad08 + STAR 2.7.11b, fresh paired run)

rustar: 0 / 77300 properly-paired
STAR:   78886 / 78886 properly-paired

First record SRR6357072.1900203:

Field STAR (r1 / r2) rustar (r1 / r2)
FLAG 163 / 83 81 / 129
RNAME YAR010C YAR010C
POS 1045 / 1103 1103 / 1045
RNEXT = / = * / *
PNEXT 1103 / 1045 0 / 0
TLEN 159 / -159 0 / 0

Root cause

src/lib.rs::build_transcriptome_records_pe, the post-processing at the end of the function (lines ~762-768):

use noodles::sam::alignment::record::Flags;
for r in rec1s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::FIRST_SEGMENT;
}
for r in rec2s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::LAST_SEGMENT;
}

This is the only pair-aware fix-up applied after the two SamWriter::build_transcriptome_records calls (each of which builds its list as if it were a single-end alignment list). Never set on transcriptome records:

  • Flags::PROPERLY_SEGMENTED (0x2)
  • Flags::MATE_REVERSE_COMPLEMENTED (0x20)
  • record.mate_reference_sequence_id_mut() -> RNEXT stays *
  • record.mate_alignment_start_mut() -> PNEXT stays 0
  • record.template_length_mut() -> TLEN stays 0

The underlying SamWriter::build_transcriptome_records (src/io/sam.rs:566-660) only sets REVERSE_COMPLEMENTED and SECONDARY — it has no concept of "mate".

The genome-space paired path in src/io/sam.rs::build_paired_mate_record (lines 1163-1260) does set all of the above. The logic exists; it just isn't wired into the transcriptome path.

Suggested fix

In build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walk rec1s.iter_mut().zip(rec2s.iter_mut()) alongside p1s.iter().zip(p2s.iter()) and set on both mates:

  • Flags::PROPERLY_SEGMENTED (both mates are guaranteed to share a transcript by construction).
  • MATE_REVERSE_COMPLEMENTED iff the other mate's REVERSE_COMPLEMENTED flag is set.
  • mate_reference_sequence_id_mut(Some(other.chr_idx)).
  • mate_alignment_start_mut(Some(other.alignment_start)).
  • template_length_mut(signed_insert_size) — STAR convention: magnitude max(end1, end2) - min(start1, start2) + 1, leftmost mate +TLEN, rightmost -TLEN.

Alternatively, factor the mate-bookkeeping out of build_paired_mate_record into a helper used by both paths.

Why this matters

Any paired-aware downstream tool (Salmon alignment-mode, RSEM, custom fragment-size QC, tximeta wrappers) either fails or silently degrades. Salmon silently degrades — users get plausible-looking TPMs that are systematically biased toward short transcripts.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions