1- # salmon 2.0.2 (draft — in progress)
1+ # salmon 2.1.0 (draft — in progress)
22
3- A correctness-focused patch that closes the remaining selective-alignment
4- mapping/quantification gaps against C++ salmon (1.12.1), plus the uni-MEM default
5- seeding change. **No output-format changes**: `quant.sf` and inferential
6- replicates are produced as before. Indices built with 2.0.0/2.0.1 still load,
7- but rebuilding is recommended to pick up the uni-MEM seeding.
3+ A correctness-focused release that closes the remaining selective-alignment
4+ mapping/quantification gaps against C++ salmon (1.12.1), adds N-aware decoy
5+ indexing, and introduces an explicit salmon index *format* version. Released as a
6+ **minor** (not patch) version because of the index-format bump and the breadth of
7+ the decoy / short-transcript / duplicate-symmetry correctness changes.
8+
9+ **Index rebuild required.** salmon 2.1.0 writes (and requires) index format
10+ **v1**, recorded as `index_version` in `info.json`. Indices built by salmon
11+ 2.0.0/2.0.1 carry no format version and are **rejected on load** with a rebuild
12+ message; they predate the decoy / short-transcript contiguity guarantee below
13+ and could mis-classify references. `quant.sf` and the inferential-replicate
14+ formats are otherwise unchanged.
15+
16+ ## Index format version (`index_version`)
17+
18+ salmon now records its own on-disk index **format version** in `info.json`,
19+ independent of the software-release string (`salmon_version`) and of the
20+ underlying piscem/cf1-rs versions. `SalmonIndex::load` refuses to open an index
21+ older than the minimum it supports (currently v1) and prints an actionable
22+ `salmon index …` rebuild command instead of risking a silent mis-load. The field
23+ is bumped only when a layout/semantics change makes an older index unsafe to
24+ read, so index compatibility going forward is explicit rather than inferred.
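
The load-time check described above can be sketched roughly as follows. This is an illustration only, not the actual `SalmonIndex::load` code: the function name `check_index_version`, the constant `MIN_SUPPORTED_INDEX_VERSION`, and the error wording are all hypothetical, and the caller is assumed to have already parsed `index_version` from `info.json` (`None` when the field is absent, as in 2.0.0/2.0.1 indices).

```rust
/// Hypothetical constant: the oldest index format this build can read.
const MIN_SUPPORTED_INDEX_VERSION: u32 = 1;

/// Accepts or rejects an index based on its recorded format version.
/// `found` is the `index_version` value from `info.json`, or `None` if the
/// field is missing. Returns the accepted version or a rebuild message.
fn check_index_version(found: Option<u32>, index_dir: &str) -> Result<u32, String> {
    match found {
        // New enough: safe to load.
        Some(v) if v >= MIN_SUPPORTED_INDEX_VERSION => Ok(v),
        // Versioned, but older than the supported minimum.
        Some(v) => Err(format!(
            "index at {} has format v{}, older than the minimum v{}; rebuild it with `salmon index`",
            index_dir, v, MIN_SUPPORTED_INDEX_VERSION
        )),
        // No version recorded at all: treated as too old.
        None => Err(format!(
            "index at {} carries no format version; rebuild it with `salmon index`",
            index_dir
        )),
    }
}
```

Returning an error for the missing-field case, rather than defaulting it to some version, mirrors the stated goal of never inferring compatibility.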
825
926These changes were found and validated by a head-to-head parity study against
1027C++ salmon on simulated human data (polyester, 193,759 transcripts); see
@@ -94,8 +111,8 @@ defects are fixed:
94111 correct. On SRR1039508 (3 M-read subset, `--seqBias --gcBias`) the abundance
95112 phase now completes in seconds instead of stalling, and the reported mapping
96113 rate drops from an inflated 98.8 % to **93.7 %**, matching C++ salmon's 94.1 %.
97- Indices built before 2.0.2 **must be rebuilt** to pick up the corrected
98- metadata (older indices fall back to the previous suffix interpretation).
114+ Indices built before 2.1.0 **must be rebuilt** (they are now rejected on load by
115+ the index-format-version check above, rather than silently mis-interpreted).
99116
100117- **Short transcripts in `quant.sf`.** The sub-`k` short transcripts are recorded
101118 by name and length and reported in `quant.sf` with 0 reads / 0 TPM (rather than
@@ -118,6 +135,73 @@ PR #1020 is also folded in.
118135 and reported at 0 reads. A build-time guard verifies decoy contiguity and aborts
119136 with a clear error rather than risk a silent mis-classification.
120137
138+ - **Deterministic, input-order index build.** The cDBG builder (cf1-rs) now emits
139+ its reference tiling in input order via a bounded reorder buffer
140+ (`synchronize_output`), so the built reference numbering is deterministic and
141+ decoys, which are always last in the input, stay one contiguous block by construction
142+ (the prior task-completion order could scatter a small decoy among transcripts).
143+ This makes the O(1) decoy range check and the contiguity guard correct by design.
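
A minimal sketch of how a contiguity guard and an O(1) decoy range check fit together, assuming references are numbered in input order and each carries a decoy flag. The function names and the `&[bool]` representation are hypothetical and are not taken from the salmon or cf1-rs sources.

```rust
/// Build-time guard: verifies that all decoys form one trailing block and
/// returns the id of the first decoy (or the reference count if there are none).
fn first_decoy_index(is_decoy: &[bool]) -> Result<usize, String> {
    // Position of the first decoy; with no decoys the "block" is empty.
    let first = is_decoy.iter().position(|&d| d).unwrap_or(is_decoy.len());
    // Any transcript after that point breaks contiguity.
    if let Some(off) = is_decoy[first..].iter().position(|&d| !d) {
        return Err(format!(
            "reference {} is a transcript but follows a decoy; decoys must be one trailing block",
            first + off
        ));
    }
    Ok(first)
}

/// Query-time check: a single comparison once the block start is known.
fn is_decoy_ref(ref_id: usize, first_decoy: usize) -> bool {
    ref_id >= first_decoy
}
```

The single-comparison check is only valid because the guard has already ruled out interleaved decoys, which is why the two are described together.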
144+
145+ ## N-aware decoy indexing (cf1-rs 0.5)
146+
147+ Decoy sequences now **retain their ambiguous (`N`) bases** instead of having them
148+ replaced with pseudo-random ACGT. cf1-rs splits the de Bruijn graph on `N` runs
149+ natively (recording the gaps in the tiling), which avoids seeding spurious k-mers
150+ across assembly gaps and yields a less tangled, smaller, faster-to-build graph;
151+ the reference store keeps the raw bytes and the aligner encodes `N` as a mismatch
152+ (dna5 code 4). Transcripts are still `N`-replaced (matching salmon `FixFasta`). On
153+ the full GRCh38 gentrome (≈5 % N) this removes ~151 M spurious k-mers (−5.7 %),
154+ shrinks the index ~1.3 %, and trims build time/peak-RSS by a couple of percent; the
155+ savings scale with N content. (`cf1-rs` ≥ 0.5 also adds `--poly-N-stretch` gating:
156+ without it an `N`-containing input now fails loudly rather than corrupting.)
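
To illustrate why splitting on `N` runs avoids spurious k-mers, here is a small sketch that cuts a sequence into `N`-free segments and keeps only those long enough to hold at least one k-mer. It is not cf1-rs's implementation; the function name and the `(offset, bytes)` return shape are assumptions made for the example, and `k` is assumed to be at least 1.

```rust
/// Splits `seq` on `N`/`n` bases and returns each N-free segment of length
/// >= k together with its start offset in the original sequence.
/// Segments shorter than k cannot contribute a k-mer and are skipped.
fn n_free_segments(seq: &[u8], k: usize) -> Vec<(usize, &[u8])> {
    let mut out = Vec::new();
    let mut start = 0; // start of the current N-free run
    for (i, &b) in seq.iter().enumerate() {
        if b == b'N' || b == b'n' {
            if i - start >= k {
                out.push((start, &seq[start..i]));
            }
            start = i + 1; // next run begins after this N
        }
    }
    // Trailing run after the last N (or the whole sequence if no N).
    if seq.len() - start >= k {
        out.push((start, &seq[start..]));
    }
    out
}
```

Because k-mers are enumerated per segment, no k-mer ever spans a gap, whereas replacing `N` with random bases would create k-mers that exist in no real genome.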
157+
158+ ## `duplicate_clusters.tsv` under `--keepDuplicates`
159+
160+ salmon detects exact-sequence-duplicate transcripts and lists them in
161+ `duplicate_clusters.tsv` even when `--keepDuplicates` retains them (downstream
162+ tooling relies on the file). The Rust port had gated both the detection and the
163+ file emission behind *not* keeping duplicates, so a `--keepDuplicates` index had no
164+ `duplicate_clusters.tsv` at all. Duplicates are now detected in both modes (the
165+ cluster list is populated regardless), only *collapsed* when not keeping
166+ duplicates, and the file is always written. (#1015)
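
The corrected behaviour (detect always, collapse only when not keeping duplicates) might look like the sketch below. It is a simplified illustration: the function name, the `(name, sequence)` input, and the returned cluster shape are hypothetical, and the real indexer presumably hashes sequences rather than keying a map on the raw string.

```rust
use std::collections::HashMap;

/// Returns (names kept in the index, duplicate clusters).
/// Each cluster is (retained name, names of its exact duplicates).
/// Detection runs in both modes; `keep_duplicates` only controls collapsing.
fn detect_duplicates(
    records: &[(&str, &str)],
    keep_duplicates: bool,
) -> (Vec<String>, Vec<(String, Vec<String>)>) {
    let mut cluster_of: HashMap<&str, usize> = HashMap::new(); // sequence -> cluster index
    let mut clusters: Vec<(String, Vec<String>)> = Vec::new();
    let mut kept: Vec<String> = Vec::new();
    for &(name, seq) in records {
        let existing = cluster_of.get(seq).copied();
        match existing {
            Some(ci) => {
                // Always record the duplicate in its cluster...
                clusters[ci].1.push(name.to_string());
                // ...but only retain it in the index when asked to.
                if keep_duplicates {
                    kept.push(name.to_string());
                }
            }
            None => {
                cluster_of.insert(seq, clusters.len());
                clusters.push((name.to_string(), Vec::new()));
                kept.push(name.to_string());
            }
        }
    }
    // Only sequences that actually have duplicates form a reportable cluster.
    clusters.retain(|(_, dups)| !dups.is_empty());
    (kept, clusters)
}
```

The cluster list is identical in both modes, so the file written from it is the same regardless of `--keepDuplicates`; only the set of kept references differs.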
167+
168+ ## Decoy handling in `--sketch` (pseudoalignment) mode
169+
170+ Decoys are now handled in sketch mode, where they previously **leaked**. Sketch
171+ mappings are built directly from piscem's accepted hits and bypassed the
172+ selective-alignment finalize where all decoy logic lived, so on a decoy-aware
173+ index decoy references entered the equivalence classes as if they were
174+ transcripts: decoy-only fragments were counted as mapped, decoys stole EM mass,
175+ and `num_decoy_fragments` was never recorded. On SRR1039508 (3 M-read subset,
176+ GRCh38 + 194 decoys) this inflated the sketch mapping rate to **97.5 %** with
177+ **0** decoy fragments reported.
178+
179+ Sketch mode now applies the same decoy policy as selective alignment: decoy-only
180+ fragments are dropped and counted as decoy, and decoy targets are removed from the
181+ equivalence class. The corrected rate is **92.6 %** with **147,190** decoy
182+ fragments, in line with selective alignment (93.7 %), and removing the decoy
183+ mass also recovers real-transcript abundances (nonzero transcripts 49.5 k → 51.1 k,
184+ toward SA's 51.8 k).
185+
186+ - **`--decoyThreshold` is a no-op in sketch mode** (warned if set): pseudoalignment
187+ returns only equally-best mappings, so the `bestTxp < decoyThreshold * bestDecoy`
188+ comparison never triggers. A fragment is decoy-dominated only when it maps to
189+ decoys *and no transcript*.
190+ - **`--allowDecoyOrphans` works in sketch mode.** Because sketch pairing is
191+ same-tid only, a fragment with one mate on a transcript and the other on a decoy
192+ forms no concordant pair and would otherwise be dropped as unmapped. With the
193+ flag, the transcript mate is recovered as an orphan (only when the other mate's
194+ hits are entirely decoys). On the subset above this recovers +8,270 fragments
195+ (92.6 % → 92.9 %), matching SA mode's `--allowDecoyOrphans` effect.
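
The basic sketch-mode decoy policy described in this section (drop decoy-only fragments, strip decoy targets from mixed ones) can be illustrated as below. The enum, function name, and the use of a trailing decoy block starting at `first_decoy` are assumptions for the example; orphan recovery under `--allowDecoyOrphans` is deliberately left out.

```rust
/// Outcome of applying the decoy policy to one fragment's hits.
#[derive(Debug, PartialEq)]
enum SketchOutcome {
    /// Transcript targets that stay in the equivalence class.
    Mapped(Vec<usize>),
    /// Every hit was a decoy: dropped and counted as a decoy fragment.
    Decoy,
    /// No hits at all.
    Unmapped,
}

/// `hits` are reference ids; ids >= `first_decoy` are decoys.
fn apply_decoy_policy(hits: &[usize], first_decoy: usize) -> SketchOutcome {
    if hits.is_empty() {
        return SketchOutcome::Unmapped;
    }
    // Keep only transcript targets; decoys never enter the equivalence class.
    let txps: Vec<usize> = hits.iter().copied().filter(|&t| t < first_decoy).collect();
    if txps.is_empty() {
        SketchOutcome::Decoy
    } else {
        SketchOutcome::Mapped(txps)
    }
}
```

No score comparison appears here because, as noted above, all sketch-mode hits are equally good, so a threshold like `--decoyThreshold` has nothing to act on.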
196+
197+ ## Mapper allocation/perf
198+
199+ The alignment/chaining hot path reuses per-thread ksw2 scratch buffers (a reusable
200+ aligner plus DNA5-encoded query/target buffers) instead of allocating per call, and
201+ `chain_mems` gains exact single-MEM and two-MEM fast paths for the common small
202+ cases. These are score/result-preserving; the two-MEM fast path carries the same
203+ contained-anchor guard as the general DP. (#1015)
204+
121205## Duplicate-transcript symmetry: consistent fragment-length reads
122206
123207For sets of exact-duplicate transcripts the per-member read split is statistically