Commit a198954

rob-p and claude committed
docs(2.1.0): document sketch-mode decoy handling + flag behavior
Add a "Decoy handling in --sketch mode" section: the leak fix (97.5% -> 92.6%, 0 -> 147,190 decoy fragments, abundance recovery), --decoyThreshold being a no-op in sketch, and --allowDecoyOrphans recovering transcript orphans (+8,270 frags).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01B7JMur5DmDpECddErpi2JS
1 parent ed5ae6a commit a198954

1 file changed

Lines changed: 92 additions & 8 deletions

docs/release-notes-2.1.0.md

@@ -1,10 +1,27 @@
-# salmon 2.0.2 (draft — in progress)
+# salmon 2.1.0 (draft — in progress)

-A correctness-focused patch that closes the remaining selective-alignment
-mapping/quantification gaps against C++ salmon (1.12.1), plus the uni-MEM default
-seeding change. **No output-format changes**`quant.sf` and inferential
-replicates are produced as before. Indices built with 2.0.0/2.0.1 still load,
-but rebuilding is recommended to pick up the uni-MEM seeding.
+A correctness-focused release that closes the remaining selective-alignment
+mapping/quantification gaps against C++ salmon (1.12.1), adds N-aware decoy
+indexing, and introduces an explicit salmon index *format* version. Released as a
+**minor** (not patch) version because of the index-format bump and the breadth of
+the decoy / short-transcript / duplicate-symmetry correctness changes.
+
+**Index rebuild required.** salmon 2.1.0 writes (and requires) index format
+**v1**, recorded as `index_version` in `info.json`. Indices built by salmon
+2.0.0/2.0.1 carry no format version and are **rejected on load** with a rebuild
+message — they predate the decoy / short-transcript contiguity guarantee below
+and could mis-classify references. `quant.sf` and the inferential-replicate
+formats are otherwise unchanged.
+
+## Index format version (`index_version`)
+
+salmon now records its own on-disk index **format version** in `info.json`,
+independent of the software-release string (`salmon_version`) and of the
+underlying piscem/cf1-rs versions. `SalmonIndex::load` refuses to open an index
+older than the minimum it supports (currently v1) and prints an actionable
+`salmon index …` rebuild command instead of risking a silent mis-load. The field
+is bumped only when a layout/semantics change makes an older index unsafe to
+read, so version going forward is explicit rather than inferred.
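
The load-time check described in the added section can be pictured with a short Python sketch. It is illustrative only and is not salmon's Rust code: `check_index_version`, `IndexFormatError`, and `MIN_SUPPORTED_INDEX_VERSION` are hypothetical names, and the only details taken from the notes are the `index_version` and `salmon_version` fields in `info.json` and the minimum supported format (v1).

```python
import json

# Assumed constant name; the notes only state that the minimum is currently v1.
MIN_SUPPORTED_INDEX_VERSION = 1


class IndexFormatError(Exception):
    """Raised when an index is unversioned or too old to load safely."""


def check_index_version(info: dict, index_dir: str = "<index_dir>") -> int:
    """Return the index format version, or raise with a rebuild hint."""
    version = info.get("index_version")  # absent in indices built by 2.0.0/2.0.1
    if version is None or version < MIN_SUPPORTED_INDEX_VERSION:
        found = "none" if version is None else f"v{version}"
        raise IndexFormatError(
            f"index format {found} is older than the minimum supported "
            f"v{MIN_SUPPORTED_INDEX_VERSION}; rebuild it with `salmon index ...` "
            f"(index directory: {index_dir})"
        )
    return version


# Two hand-written info.json payloads: one versioned, one from an older build.
info_new = json.loads('{"index_version": 1, "salmon_version": "2.1.0"}')
info_old = json.loads('{"salmon_version": "2.0.1"}')
```

Keeping the format version separate from `salmon_version` means a new software release does not by itself invalidate an index; only a bump of the format field does.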

 These changes were found and validated by a head-to-head parity study against
 C++ salmon on simulated human data (polyester, 193,759 transcripts); see
@@ -94,8 +111,8 @@ defects are fixed:
 correct. On SRR1039508 (3 M-read subset, `--seqBias --gcBias`) the abundance
 phase now completes in seconds instead of stalling, and the reported mapping
 rate drops from an inflated 98.8 % to **93.7 %**, matching C++ salmon's 94.1 %.
-Indices built before 2.0.2 **must be rebuilt** to pick up the corrected
-metadata (older indices fall back to the previous suffix interpretation).
+Indices built before 2.1.0 **must be rebuilt** (they are now rejected on load by
+the index-format-version check above, rather than silently mis-interpreted).

 - **Short transcripts in `quant.sf`.** The sub-`k` short transcripts are recorded
 by name and length and reported in `quant.sf` with 0 reads / 0 TPM (rather than
@@ -118,6 +135,73 @@ PR #1020 is also folded in.
 and reported at 0 reads. A build-time guard verifies decoy contiguity and aborts
 with a clear error rather than risk a silent mis-classification.

+- **Deterministic, input-order index build.** The cDBG builder (cf1-rs) now emits
+its reference tiling in input order via a bounded reorder buffer
+(`synchronize_output`), so the built reference numbering is deterministic and
+decoys — always last in the input — stay one contiguous block by construction
+(the prior task-completion order could scatter a small decoy among transcripts).
+This makes the O(1) decoy range check and the contiguity guard correct by design.
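
The ordering idea in this bullet can be sketched in Python. This is not cf1-rs code: `emit_in_input_order`, `first_decoy_index`, and `is_decoy_ref` are hypothetical names, and the sketch omits the bound (a real bounded buffer would also apply back-pressure to workers that run too far ahead). It shows results arriving in completion order being re-emitted in input order, plus the contiguity guard and the constant-time range check that this ordering makes valid.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def emit_in_input_order(completed: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Re-emit (input_index, name) results in input order.

    Results may arrive in any completion order; out-of-order items wait in a
    min-heap until every earlier index has been emitted.
    """
    pending: List[Tuple[int, str]] = []
    next_index = 0
    for item in completed:
        heapq.heappush(pending, item)
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)[1]
            next_index += 1


def first_decoy_index(is_decoy: List[bool]) -> int:
    """Contiguity guard: decoys must form one trailing block.

    Returns the index of the first decoy (len(is_decoy) if there are none)
    and raises ValueError if a non-decoy follows a decoy.
    """
    first = next((i for i, d in enumerate(is_decoy) if d), len(is_decoy))
    if not all(is_decoy[first:]):
        raise ValueError("decoys are not one contiguous block at the end")
    return first


def is_decoy_ref(ref_id: int, first_decoy: int) -> bool:
    """O(1) range check, valid only because decoys are a trailing block."""
    return ref_id >= first_decoy


# A small decoy finishes first, but is still emitted last.
completed = [(2, "decoy_chr1"), (0, "tx1"), (1, "tx2")]
ordered = list(emit_in_input_order(completed))
```

With completion-order emission the decoy in this example would have been numbered first, which is exactly the scattering the guard is meant to catch.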
+
+## N-aware decoy indexing (cf1-rs 0.5)
+
+Decoy sequences now **retain their ambiguous (`N`) bases** instead of having them
+replaced with pseudo-random ACGT. cf1-rs splits the de Bruijn graph on `N` runs
+natively (recording the gaps in the tiling), which avoids seeding spurious k-mers
+across assembly gaps and yields a less tangled, smaller, faster-to-build graph;
+the reference store keeps the raw bytes and the aligner encodes `N` as a mismatch
+(dna5 code 4). Transcripts are still `N`-replaced (matching salmon `FixFasta`). On
+the full GRCh38 gentrome (≈5 % N) this removes ~151 M spurious k-mers (−5.7 %),
+shrinks the index ~1.3 %, and trims build time/peak-RSS a couple percent; the
+savings scale with N content. (`cf1-rs` ≥ 0.5 also adds `--poly-N-stretch` gating:
+without it an `N`-containing input now fails loudly rather than corrupting.)
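
A toy Python sketch of why splitting on `N` runs removes k-mers, under the assumption that each maximal run of A/C/G/T is k-merized on its own. The function names are hypothetical, and the real graph construction in cf1-rs is far more involved than this.

```python
import re
from typing import List, Tuple


def split_on_n(seq: str) -> List[Tuple[int, str]]:
    """Return (offset, segment) for each maximal run of A/C/G/T.

    Everything between two segments is an N run, i.e. a gap in the tiling.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"[ACGT]+", seq.upper())]


def kmers(seq: str, k: int) -> List[str]:
    """All length-k windows of seq, in order (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def n_aware_kmers(seq: str, k: int) -> List[str]:
    """k-mers taken within each N-free segment, never across a gap."""
    out: List[str] = []
    for _, segment in split_on_n(seq):
        out.extend(kmers(segment, k))
    return out


seq = "ACGTNNNNGGCA"
segments = split_on_n(seq)          # two segments separated by a 4-base gap
clean = n_aware_kmers(seq, 3)       # only k-mers that lie inside a segment
windows_if_replaced = len(kmers(seq, 3))  # windows if every N became a base
```

In this example 10 windows exist once every `N` is replaced by a base, but only 4 of them lie entirely inside real sequence; the other 6 overlap the gap and are the kind of spurious k-mer the notes describe.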
+
+## `duplicate_clusters.tsv` under `--keepDuplicates`
+
+salmon detects exact-sequence-duplicate transcripts and lists them in
+`duplicate_clusters.tsv` even when `--keepDuplicates` retains them (downstream
+tooling relies on the file). The Rust port had gated both the detection and the
+file emission behind *not* keeping duplicates, so a `--keepDuplicates` index had no
+`duplicate_clusters.tsv` at all. Duplicates are now detected in both modes (the
+cluster list is populated regardless), only *collapsed* when not keeping
+duplicates, and the file is always written. (#1015)
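
The corrected behaviour separates detection from collapsing, which a small Python sketch can make concrete. `find_duplicates` and `cluster_rows` are hypothetical names, and the two-column tab-separated row layout (retained name, duplicate name) is an assumption rather than something stated in the notes.

```python
from typing import Dict, List, Tuple


def find_duplicates(
    records: List[Tuple[str, str]], keep_duplicates: bool
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Detect exact-sequence duplicates among (name, sequence) records.

    Detection always runs; keep_duplicates only decides whether the later
    copies stay in the retained list or are collapsed onto the first copy.
    """
    first_seen: Dict[str, str] = {}       # sequence -> first name seen
    clusters: Dict[str, List[str]] = {}   # first name -> later duplicate names
    retained: List[str] = []
    for name, seq in records:
        if seq not in first_seen:
            first_seen[seq] = name
            retained.append(name)
        else:
            clusters.setdefault(first_seen[seq], []).append(name)
            if keep_duplicates:
                retained.append(name)
    return retained, clusters


def cluster_rows(clusters: Dict[str, List[str]]) -> List[str]:
    """Rows for the cluster file (assumed layout: retained<TAB>duplicate)."""
    return [f"{rep}\t{dup}" for rep, dups in clusters.items() for dup in dups]


records = [("t1", "ACGT"), ("t2", "GGGG"), ("t3", "ACGT")]
collapsed, clusters_a = find_duplicates(records, keep_duplicates=False)
kept, clusters_b = find_duplicates(records, keep_duplicates=True)
```

Both calls report the same cluster list, so the rows written to the file are identical whether or not duplicates are kept; only the retained reference list differs.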
+
+## Decoy handling in `--sketch` (pseudoalignment) mode
+
+Decoys are now handled in sketch mode, where they previously **leaked**. Sketch
+mappings are built directly from piscem's accepted hits and bypassed the
+selective-alignment finalize where all decoy logic lived, so on a decoy-aware
+index decoy references entered the equivalence classes as if they were
+transcripts: decoy-only fragments were counted as mapped, decoys stole EM mass,
+and `num_decoy_fragments` was never recorded. On SRR1039508 (3 M-read subset,
+GRCh38 + 194 decoys) this inflated the sketch mapping rate to **97.5 %** with
+**0** decoy fragments reported.
+
+Sketch mode now applies the same decoy policy as selective alignment: decoy-only
+fragments are dropped and counted as decoy, and decoy targets are removed from the
+equivalence class. The corrected rate is **92.6 %** with **147,190** decoy
+fragments — in line with selective alignment (93.7 %) — and removing the decoy
+mass also recovers real-transcript abundances (nonzero transcripts 49.5 k → 51.1 k,
+toward SA's 51.8 k).
+
+- **`--decoyThreshold` is a no-op in sketch mode** (warned if set): pseudoalignment
+returns only equally-best mappings, so the `bestTxp < decoyThreshold * bestDecoy`
+comparison never triggers. A fragment is decoy-dominated only when it maps to
+decoys *and no transcript*.
+- **`--allowDecoyOrphans` works in sketch mode.** Because sketch pairing is
+same-tid only, a fragment with one mate on a transcript and the other on a decoy
+forms no concordant pair and would otherwise be dropped as unmapped. With the
+flag, the transcript mate is recovered as an orphan (only when the other mate's
+hits are entirely decoys). On the subset above this recovers +8,270 fragments
+(92.6 % → 92.9 %), matching SA mode's `--allowDecoyOrphans` effect.
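
The policy in this section can be summarised as a small decision function. The Python below is a simplified sketch, not salmon's implementation: `classify_fragment` and `classify_pair` are hypothetical names, reference ids at or above `first_decoy` are assumed to be decoys (the trailing-block layout described earlier), and only the cases spelled out in the notes are modelled.

```python
from typing import List, Set, Tuple


def classify_fragment(hits: Set[int], first_decoy: int) -> Tuple[str, List[int]]:
    """Apply the decoy policy to one set of equally-best hits.

    Decoy targets are removed from the equivalence class; a fragment counts
    as decoy only when it hits decoys and no transcript. No score threshold
    is involved, which is why a decoy threshold has nothing to compare.
    """
    transcripts = sorted(h for h in hits if h < first_decoy)
    if transcripts:
        return ("mapped", transcripts)
    if hits:
        return ("decoy", [])
    return ("unmapped", [])


def classify_pair(
    left: Set[int],
    right: Set[int],
    first_decoy: int,
    allow_decoy_orphans: bool = False,
) -> Tuple[str, List[int]]:
    """Same-reference pairing: a concordant pair needs a shared reference id."""
    shared = left & right
    if shared:
        return classify_fragment(shared, first_decoy)
    if allow_decoy_orphans:
        for mate, other in ((left, right), (right, left)):
            transcripts = sorted(h for h in mate if h < first_decoy)
            # Recover the transcript mate only if the other mate hit decoys only.
            if transcripts and other and all(h >= first_decoy for h in other):
                return ("orphan", transcripts)
    # Simplification: every remaining case is treated as unmapped here.
    return ("unmapped", [])


FIRST_DECOY = 100  # hypothetical boundary: ids 0..99 transcripts, 100+ decoys
mixed = classify_fragment({3, 150}, FIRST_DECOY)
decoy_only = classify_fragment({150, 151}, FIRST_DECOY)
split_default = classify_pair({3}, {150}, FIRST_DECOY)
split_rescued = classify_pair({3}, {150}, FIRST_DECOY, allow_decoy_orphans=True)
```

A fragment hitting both a transcript and a decoy stays mapped with the decoy stripped from its class, while the transcript/decoy split pair is rescued only when the flag is set.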
+
+## Mapper allocation/perf
+
+The alignment/chaining hot path reuses per-thread ksw2 scratch buffers (a reusable
+aligner plus DNA5-encoded query/target buffers) instead of allocating per call, and
+`chain_mems` gains exact single-MEM and two-MEM fast paths for the common small
+cases. These are score/result-preserving; the two-MEM fast path carries the same
+contained-anchor guard as the general DP. (#1015)
+
 ## Duplicate-transcript symmetry: consistent fragment-length reads

 For sets of exact-duplicate transcripts the per-member read split is statistically