Skip to content

Log.final.out folds all unmapped reads into too short; other bucket always 0 #48

@pinin4fjords

Description

@pinin4fjords

Summary

Log.final.out reports two categories of unmapped reads — Number of reads unmapped: too short and Number of reads unmapped: other — using STAR's exact field labels. rustar folds both buckets into too short and always reports other = 0. Total unmapped count is preserved, but a downstream tool that parses the two fields separately (MultiQC's STAR module, custom QC dashboards) gets an inflated too short percentage and a zero other percentage.

On WT_REP2 (nf-core/rnaseq test profile):

Field STAR 2.7.11b rustar 0.1.0
Number of reads unmapped: too many mismatches 0 0
Number of reads unmapped: too short 1 540 (3.11 %) 4 193 (8.46 %)
Number of reads unmapped: other 2 656 (5.36 %) 0 (0.00 %)
Total unmapped 4 196 4 193

The totals match within RNG / tie-breaking tolerance — rustar isn't dropping reads. It's the bucketing that's wrong: every read STAR would label other (no seeds / no clusters / failed-stitch reason) is being mislabelled as too short (post-stitch length / score / match filter).

STAR reference behaviour

STAR's bucketing (per source/ReadAlign_assignAlignToWindow.cpp and the nUnmapped arrays in source/Stats.cpp):

  • too short: a transcript was generated but filtered out by score / match-count / read-length-coverage filters (outFilterScoreMin*, outFilterMatchNmin*).
  • other: no seeds, no clusters, no transcript — read couldn't be aligned at all.
  • too many mismatches: filtered by outFilterMismatchN*.

rustar's src/stats.rs::UnmappedReason enum has all four variants:

pub enum UnmappedReason {
    Other,
    TooShort,
    TooManyMismatches,
    TooManyLoci,
}

So the categorization exists in the code. The bug is at the classification call sites — wherever an unmapped read is recorded, the code is calling record_unmapped_reason(UnmappedReason::TooShort) instead of UnmappedReason::Other for reads that have no alignments at all.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-unmapped && cd /tmp/rustar-mre-unmapped

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar unmapped buckets ==="
grep -E 'unmapped' RUS./Log.final.out
echo
echo "=== STAR unmapped buckets ==="
grep -E 'unmapped' STAR.Log.final.out

Suggested investigation

Find every site that records an unmapped read in rustar — likely a record_unmapped_reason(UnmappedReason::...) call. Trace which one fires for reads that fail at different pipeline stages:

  • No seeds found → should be Other
  • Seeds found but no clusters / no extension → should be Other
  • Transcripts generated but filtered by outFilterScoreMin* / outFilterMatchNmin* → should be TooShort
  • Filtered by outFilterMismatchN* → should be TooManyMismatches

The fact that rustar's Other is always 0 strongly suggests record_unmapped_reason is never being called with UnmappedReason::Other — either the no-seeds / no-clusters path is silently falling through to the TooShort bucket, or the Other arm of the enum is unreachable in the current code.

Severity

Low-to-medium. Doesn't affect mapping quality or alignment correctness — totals match STAR. Affects downstream QC reading (MultiQC's STAR module pulls both fields; the inflated too short bar and zero other bar would look anomalous to a user comparing samples processed by different aligners). Anything that summarises "fraction of reads unmappable for non-trivial reasons" (STAR's other bucket captures harder-to-rescue reads) loses that signal entirely.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions