pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27

## Summary

Pass-1 alignment doesn't use GTF-derived junctions as alignment candidates. Even after [#27](https://github.com/scverse/rustar-aligner/issues/27) (annotation lookup at scoring time) lands, total splice count stays around half STAR's on the same data, with the non-canonical rate elevated — rustar is only discovering junctions from the reads themselves, where STAR also seeds pass-1 from the GTF-derived `Gsj` buffer baked into the SA-extended genome at index time.

| `Log.final.out` field | rustar | STAR 2.7.11b |
|---|---:|---:|
| `Number of splices: Total`              | 366 | 801 |
| `Number of splices: Annotated (sjdb)`   | 0 (95 after #27) | 620 |
| `Number of splices: GT/AG`              | 266 | 686 |
| `Number of splices: Non-canonical`      | 93  | 108 |

Read alignments that would extend across a GTF junction fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to `101M`) or fall to `Reads unmapped: too short`. Per-read example, fragment `SRR6357072.32572100`:

- STAR: `96M14674N5M` crossing one of the GTF-annotated yeast introns. NH=2.
- rustar: `101M` at the same exonic start. NH=1.

## STAR reference behaviour

STAR's pass-1 uses the sjdb-extended genome — the `Gsj` buffer appended to the genome and indexed in the SA during `genomeGenerate` (see [`source/sjdbPrepare.cpp`](https://github.com/alexdobin/STAR/blob/master/source/sjdbPrepare.cpp) and [`source/sjdbBuildIndex.cpp`](https://github.com/alexdobin/STAR/blob/master/source/sjdbBuildIndex.cpp)). Seeds that land in the `Gsj` region get translated back to the corresponding genome donor/acceptor pair and the splice becomes a candidate junction without the read having to bridge the gap itself.

rustar's `src/junction/sjdb_insert.rs` ports the sjdb-into-`Gsj` logic at `genomeGenerate` time, and the startup log confirms the junctions are loaded:

```
[INFO  rustar_aligner::junction] Extracted 5 annotated junctions from GTF
[INFO  rustar_aligner::index::io] Junction database loaded: 5 annotated junctions
```

So the index machinery is there. The gap is at align time: SA hits in the `Gsj` region are dropped at `Genome::position_to_chr` (returns `None` for positions in the Gsj buffer) instead of being decoded into splice candidates that propagate into the seed → cluster → stitch pipeline.

## Reproducer

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-47 && cd /tmp/rustar-mre-47

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar splices ==="; grep -E 'splices' RUS./Log.final.out
echo "=== STAR splices ==="   ; grep -E 'splices' STAR.Log.final.out
```

## Suggested fix

At alignment time, SA hits in the `[n_genome_real, n_genome)` region (i.e. inside `Gsj`) need to be decoded back into `(donor_fwd, acceptor_fwd)` genome positions and added as splice candidates feeding the seed → cluster → stitch DP. Pseudocode mirroring STAR's `source/seedSearchStart.cpp` / `source/Aligner.cpp`:

```
for each SA hit:
    if hit_pos >= n_genome_real:
        (donor, acceptor) = decode_gsj_hit(hit_pos)
        emit splice_candidate(donor, acceptor)
```

Concretely: add a `decode_gsj_hit` helper in `src/junction/sjdb_insert.rs` (or wherever the Gsj layout is owned), expose `n_genome_real` on `Genome`, and route hits through the helper in `cluster_seeds` / equivalent.

## Severity

Medium. Mapping-rate impact on the yeast test profile is ~0.2 pp (~50 of 49 551 reads per sample shift from unique → unmapped). Per-read CIGAR impact is bigger: 234 of 340 same-position-different-CIGAR reads on `WT_REP2` are this pattern. Scales with how many reads in the input cross GTF junctions — yeast has 1.5 %, mammalian samples much more.

---

<sub>Filed during nf-core/rnaseq integration testing ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)).</sub>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27 #47

Summary

STAR reference behaviour

Reproducer

Suggested fix

Severity

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

`Log.final.out` field	rustar	STAR 2.7.11b
`Number of splices: Total`	366	801
`Number of splices: Annotated (sjdb)`	0 (95 after #27)	620
`Number of splices: GT/AG`	266	686
`Number of splices: Non-canonical`	93	108

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27 #47

Description

Summary

STAR reference behaviour

Reproducer

Suggested fix

Severity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27 #47