Skip to content

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27 #47

@pinin4fjords

Description

@pinin4fjords

Summary

Pass-1 alignment doesn't use GTF-derived junctions as alignment candidates. Even after #27 (annotation lookup at scoring time) lands, total splice count stays around half STAR's on the same data, with the non-canonical rate elevated — rustar is only discovering junctions from the reads themselves, where STAR also seeds pass-1 from the GTF-derived Gsj buffer baked into the SA-extended genome at index time.

Log.final.out field rustar STAR 2.7.11b
Number of splices: Total 366 801
Number of splices: Annotated (sjdb) 0 (95 after #27) 620
Number of splices: GT/AG 266 686
Number of splices: Non-canonical 93 108

Read alignments that would extend across a GTF junction fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to 101M) or fall to Reads unmapped: too short. Per-read example, fragment SRR6357072.32572100:

  • STAR: 96M14674N5M crossing one of the GTF-annotated yeast introns. NH=2.
  • rustar: 101M at the same exonic start. NH=1.

STAR reference behaviour

STAR's pass-1 uses the sjdb-extended genome — the Gsj buffer appended to the genome and indexed in the SA during genomeGenerate (see source/sjdbPrepare.cpp and source/sjdbBuildIndex.cpp). Seeds that land in the Gsj region get translated back to the corresponding genome donor/acceptor pair and the splice becomes a candidate junction without the read having to bridge the gap itself.

rustar's src/junction/sjdb_insert.rs ports the sjdb-into-Gsj logic at genomeGenerate time, and the startup log confirms the junctions are loaded:

[INFO  rustar_aligner::junction] Extracted 5 annotated junctions from GTF
[INFO  rustar_aligner::index::io] Junction database loaded: 5 annotated junctions

So the index machinery is there. The gap is at align time: SA hits in the Gsj region are dropped at Genome::position_to_chr (returns None for positions in the Gsj buffer) instead of being decoded into splice candidates that propagate into the seed → cluster → stitch pipeline.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-47 && cd /tmp/rustar-mre-47

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar splices ==="; grep -E 'splices' RUS./Log.final.out
echo "=== STAR splices ==="   ; grep -E 'splices' STAR.Log.final.out

Suggested fix

At alignment time, SA hits in the [n_genome_real, n_genome) region (i.e. inside Gsj) need to be decoded back into (donor_fwd, acceptor_fwd) genome positions and added as splice candidates feeding the seed → cluster → stitch DP. Pseudocode mirroring STAR's source/seedSearchStart.cpp / source/Aligner.cpp:

for each SA hit:
    if hit_pos >= n_genome_real:
        (donor, acceptor) = decode_gsj_hit(hit_pos)
        emit splice_candidate(donor, acceptor)

Concretely: add a decode_gsj_hit helper in src/junction/sjdb_insert.rs (or wherever the Gsj layout is owned), expose n_genome_real on Genome, and route hits through the helper in cluster_seeds / equivalent.

Severity

Medium. Mapping-rate impact on the yeast test profile is ~0.2 pp (~50 of 49 551 reads per sample shift from unique → unmapped). Per-read CIGAR impact is bigger: 234 of 340 same-position-different-CIGAR reads on WT_REP2 are this pattern. Scales with how many reads in the input cross GTF junctions — yeast has 1.5 %, mammalian samples much more.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions