Skip to content

Transcriptome BAM (--quantMode TranscriptomeSAM) omits per-record RG:Z: tag #32

@pinin4fjords

Description

@pinin4fjords

Summary

The transcriptome BAM emitted by rustar-aligner --quantMode TranscriptomeSAM includes the @RG header line correctly, but no per-record RG:Z: tags — even when --outSAMattrRGline is supplied and the genome-space Aligned.out.bam has RG:Z: on every record.

BAM @RG header Records Records with RG:Z:
RUS./Aligned.out.bam (rustar, genome) yes 88 592 88 592
RUS./Aligned.toTranscriptome.out.bam (rustar, transcript) yes 77 300 0
STAR.Aligned.out.bam (STAR, genome) yes 88 072 88 072
STAR.Aligned.toTranscriptome.out.bam (STAR, transcript) yes 78 886 78 886

Breaks any tool that splits records in the transcriptome BAM by RG:Z: (multi-sample bundled BAMs, custom QC keyed on read group, downstream re-merge flows).

STAR reference behaviour

STAR maintains two independent attribute order lists, outSAMattrOrder (genome BAM) and outSAMattrOrderQuant (transcriptome BAM), and the RG-attribute promotion adds the tag to both.

source/Parameters_samAttributes.cpp:201-205:

if (outSAMattrRGline[0]!="-" && !outSAMattrPresent.RG) {
    outSAMattrOrder.push_back(ATTR_RG);
    outSAMattrOrderQuant.push_back(ATTR_RG);   // <-- same RG goes to both lists
    outSAMattrPresent.RG=true;
    inOut->logMain << "WARNING --outSAMattrRG defines a read group, therefore STAR will output RG attribute" <<endl;
};

And explicit RG handling at source/Parameters_samAttributes.cpp:96-99 pushes to both outSAMattrOrder and outSAMattrOrderQuant. The transcriptome SAM/BAM header is written in source/samHeaders.cpp:14-17 with the same @RG line. STAR's invariant: if @RG is in either header, the per-record RG:Z: tag is in every record of that BAM.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-32 && cd /tmp/rustar-mre-32

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

for label in rustar STAR; do
    if [ $label = rustar ]; then G=RUS./Aligned.out.bam; T=RUS./Aligned.toTranscriptome.out.bam
    else                         G=STAR.Aligned.out.bam; T=STAR.Aligned.toTranscriptome.out.bam; fi
    g_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$G | grep -c '^@RG')
    g_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$G | grep -c 'RG:Z:')
    t_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$T | grep -c '^@RG')
    t_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$T | grep -c 'RG:Z:')
    t_total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$T)
    printf "%-6s genome: @RG=%d, RG:Z:=%d | transcriptome: @RG=%d, RG:Z:=%d / %d\n" \
        $label $g_hdr $g_rec $t_hdr $t_rec $t_total
done

Observed (verified, fresh paired run)

rustar genome: @RG=1, RG:Z:=88592 | transcriptome: @RG=1, RG:Z:=0     / 77300
STAR   genome: @RG=1, RG:Z:=88072 | transcriptome: @RG=1, RG:Z:=78886 / 78886

Both aligners write the @RG header into the transcriptome BAM, but only STAR writes the per-record tag.

Suggested fix

In src/lib.rs::build_transcriptome_records_pe (and its SE counterpart), after building each record, stamp the RG tag from the user-supplied --outSAMattrRGline. The equivalent of STAR's Parameters_samAttributes.cpp:201-205 — adding ATTR_RG to the transcriptome attribute-order list whenever it's added to the genome list.

Alternatively, lift the RG-tag-write logic out of the genome-record builder into a shared helper that both paths call. Same general direction as the suggested fix for #22 (extracting paired-record bookkeeping into a shared helper).

Test plan

Once the fix lands:

  1. Run rustar with --outSAMattrRGline ID:foo SM:bar and --quantMode TranscriptomeSAM.
  2. Assert samtools view <transcriptome.bam> | grep -c 'RG:Z:foo' equals the total transcriptome record count.
  3. Cross-check the value matches the ID in the @RG header line.

Severity

Low. Salmon's alignment-mode quant ignores RG so nf-core/rnaseq isn't directly broken by this. But this combines with #22 transcriptome BAM mate-fields and other transcriptome-side gaps as the trio of "transcriptome BAM is a less complete sibling of the genome BAM". Worth tightening so the transcriptome BAM is byte-symmetric with the genome BAM for shared fields.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions