Skip to content

Add rna command: RNA-seq QC metrics (CollectRnaSeqMetrics + transcript insert size)#40

Open
tfenne wants to merge 20 commits into
mainfrom
tf_rna
Open

Add rna command: RNA-seq QC metrics (CollectRnaSeqMetrics + transcript insert size)#40
tfenne wants to merge 20 commits into
mainfrom
tf_rna

Conversation

@tfenne

@tfenne tfenne commented Jun 27, 2026

Copy link
Copy Markdown
Member

Summary

Adds the rna command: a single-pass, single-gene-model-load RNA-seq QC tool. It started as a port of Picard CollectRnaSeqMetrics (base composition, strand specificity, transcript 5'/3' coverage bias and CV) + fgbio EstimateRnaSeqInsertSize (transcript-space insert size, introns collapsed), and has grown into a superset that also covers the RSeQC / RNA-SeQC / Qualimap families: read accounting and genomic origin, gene detection, splice-junction annotation, transcript integrity (TIN), and per-biotype read counts. The collector is wired into multi.

riker rna -i sample.bam -o out_prefix --gene-model gencode.gtf.gz

Outputs (house conventions — lowercase headers, frac_, no metadata lines):
out.rna-metrics.txt (one wide row), out.rna-biotype.txt(+.pdf), out.rna-gene-expression.pdf, out.rna-coverage.pdf, out.rna-insert-size.txt / -histogram.txt(+.pdf).

Metrics (one rna-metrics.txt row)

  • Read accounting & genomic origin — total / mapped / unmapped / ignored / duplicate-rate; exonic / intronic / intergenic / ambiguous / ribosomal, with assigned_reads and fractions.
  • Gene detectiongenes_detected (genes with ≥ --genes-detected-min-reads).
  • Splice junctions — known / partial / novel observations and distinct junctions vs. the annotation, plus frac_known_juncs.
  • Base composition — coding / UTR / intronic / intergenic / ribosomal bases and fractions, frac_mrna_bases, frac_usable_bases.
  • Strandedness — detected strand, correct-strand fraction, R1/R2 transcription-strand counts.
  • Coverage uniformity — median CV, 5'/3'/5'→3' bias, and TIN (median / mean / stddev / transcript count).

Highlights

  • Multiple gene-model formats, auto-detected from content: UCSC refFlat, GTF, GFF3 (GENCODE / Ensembl / RefSeq), with contig-name reconciliation (chr add/strip, MTchrM, RefSeq NC_* → common name). Gzip handled transparently.
  • Strandedness auto-detection (--strand auto, default) from observed R1/R2 transcription-strand counts; unstranded/forward/reverse force a model.
  • Ribosomal regions = union of biotype-derived rRNA/Mt_rRNA genes and an optional --ribosomal-intervals file (BED or IntervalList).
  • Transcript-space insert size per pair orientation (FR/RF/tandem). Uses the MC (mate CIGAR) tag when present (exact, any orientation); when MC is absent it estimates FR pairs from a spec-compliant TLEN at the forward read rather than dropping them — see below.
  • Plots always produced, Fulcrum-branded: per-biotype bar chart, per-gene RPK expression distribution (all genes vs. protein-coding), normalized transcript coverage, and the insert-size histogram (reusing the shared isize helper so the two match).

TIN scale & validation

riker's TIN is the largest deliberate numeric difference from the reference tools: it scores one well-covered transcript per gene, so its median runs ~14–16 points higher than RSeQC/RustQC, which score every isoform. The two report the same shape on a different scale. This is validated in docs/rna-integrity.md on a controlled RNA-degradation ladder (Sigurgeirsson et al. 2014, RIN 10→2) plus a depth-downsampling sweep: both implementations track degradation in lockstep, and riker is markedly more depth-robust (≈0.5% vs ≈4% TIN change across a 5× depth reduction). Indexing and alignment were both done with rustar-aligner against GENCODE v50, for a single-tool story.

Intentional differences vs. Picard/fgbio (documented in ERRATA.md)

Deliberate corrections/choices, not regressions; on identical inputs read/base accounting matches Picard to the base and composition agrees within ~0.5%:

  • Coverage/bias fixes: top-1000 transcripts (Picard keeps 1001 via an off-by-one); short transcripts (< 2×end_bias_bases) excluded from the CV/bias set; every aligned base counted (Picard drops the last base of each block); non-overlapping normalized-position bins.
  • Ribosomal fragment overlap: union of merged overlaps (Picard takes the single largest, Math.max); fragment end from the MC tag, falling back to TLEN, never read length.
  • Insert-size mate span — MC, else a TLEN best guess. With MC the mate's exact blocks are used (any orientation). Without it, FR pairs are kept by deriving the mate's right end from start + |TLEN| - 1 at the forward read: the 5'-to-5' transcript-space size is then exact, and the unknown mate splicing is gated by requiring both mate ends in an exon plus a ≥95% collapsed-exonic footprint. Validated against the MC path on a real BAM (9.7M FR pairs; mean/median agree within ~0.5%). A BAM with neither MC nor a valid TLEN yields no/incorrect insert sizes — samtools fixmate -m remains the way to guarantee exactness.
  • Insert-size acceptance: a pair is used when exactly one gene encloses the whole pair span (plus per-transcript overlap and agreement checks), rather than fgbio's "exactly one gene overlaps the span." A non-enclosing bystander gene near the pair no longer disqualifies it (~7% more pairs, <1–2 bp shift).
  • Gene → locus → transcript model: a gene's transcripts are grouped into loci (same contig/strand, overlapping spans), so a gene at disjoint locations becomes independent loci.
  • Input must be coordinate-sorted (@HD SO:coordinate); validated at startup, enabling the allocation-free locus sweep.

Dropped Picard's per-position *.rna-coverage.txt table — Picard emits it only because its plotting engine is external; riker plots internally, so only the .pdf is kept.

Performance & memory

Profiled and tuned after the correctness-complete implementation (composition + coverage verified byte-identical at every step on a 75.2M-read PE BAM, GENCODE v29):

  • Throughput ~250k → ~2M reads/s (≈7×) — linear-time per-base work, classify-once-per-locus, block×interval-arithmetic classification (no per-base array), and a coordinate sweep replacing the interval tree.
  • Peak RSS 2.16 GB → ~0.8 GB (≈2.75×) — stream the gene model line-by-line (fgoxide) instead of slurping ~1.2 GB into a String, and store coverage once per locus over its exon union (reconstructing per-transcript coverage at finish).

The metrics-expansion (read origin, junctions, biotype, TIN) added ~1.5% runtime. riker stays ~10× faster single-threaded than RustQC and far lighter. BGZF decompression (~11%) is the remaining single-threaded floor.

Testing & review

  • Unit + integration tests for the command and the three parsers (all data built programmatically; no checked-in fixtures), plus the standard suite — 639 tests pass, fmt / clippy --all-targets (pedantic) / nextest all clean.
  • Validated against Picard CollectRnaSeqMetrics (read accounting exact; composition ~0.5%; coverage/bias ~1.3%, in the corrected direction), cross-checked vs. RustQC (junctions, TIN, biotype), and the TIN dose-response validated end-to-end on real degradation-ladder data.

Out of scope (tracked separately)

A curated GRCh38 rDNA BED / rRNA interval list with methodology, plus per-biotype/expression refinements — to be their own changes.

tfenne added 13 commits June 26, 2026 18:40
…ipt insert size)

Ports Picard CollectRnaSeqMetrics (base composition, strand specificity,
5'/3' coverage bias) and fgbio EstimateRnaSeqInsertSize (transcript-space
insert size) into a single `riker rna` command that makes one BAM pass over
a single gene-model load. Also wired into `multi`.

Why a new gene_model module: both analyses need transcript/exon/CDS structure,
and we want to accept the formats users actually have rather than forcing a
refFlat conversion. `src/gene_model/` parses UCSC refFlat, GTF, and GFF3
(GENCODE/Ensembl/RefSeq) with content-based auto-detection and provider-agnostic
attribute handling, exposing a queryable GeneModel (Gene -> GeneLocus ->
Transcript; multi-chromosome genes become independent loci). Coordinates are
1-based inclusive internally to match both reference tools.

Several latent bugs in the reference tools are fixed and documented in the
rna module docs: top-1001 -> top-1000 transcript selection; ribosomal overlap
union (not Math.max); short-transcript exclusion from bias; per-base coverage
undercount; MC-or-TLEN fragment end; non-overlapping coverage-vs-position bins.

Filtering is unified and Picard-biased: duplicates included by default
(--exclude-duplicates to drop), --min-mapq default 0, strand auto-detected by
default. Ribosomal territory is the union of biotype-derived rRNA/Mt_rRNA loci
and an optional explicit BED/IntervalList. Transcript-space insert size
requires the MC tag (pairs lacking it are skipped, never approximated).

Shared work: the insert-size histogram chart is extracted to
plotting::write_insert_size_histogram_pdf and reused by `isize` and `rna`;
the gzip-aware text reader is shared as intervals::read_file_contents.

Tests: 49 new programmatic tests (gene-model parsing/auto-detection, base
classification across all three formats, ribosomal, strand auto-detect,
transcript-space insert size, coverage/bias edge cases, filtering).
NCBI RefSeq GFF3 files refer to contigs by accession (e.g. NC_000001.11)
rather than common names (chr1 / 1), so against a chr-named BAM every locus
was dropped by contig reconciliation. Read the accession->name mapping from
the `region`/`chromosome` feature attributes (`chromosome=`/`Name=`, as fgbio
does) and translate each transcript's contig through it; the existing chr/MT
reconciliation then resolves the common name to the BAM. Verified on the real
GCF_000001405.40_GRCh38.p14 GFF3 (0 -> 57,641 loci loaded against a GRCh38
chr-named RNA-seq BAM).
Every command derives its output paths by appending a suffix to the -o/--output
prefix, and only attempts the write in finish(). A missing or unwritable output
directory therefore surfaced only after the entire input had been processed —
e.g. a full RNA-seq pass would run for minutes and then fail at the end.

Add common::validate_output_prefix, which checks the prefix's parent directory
exists, is a directory, and is writable (by creating and removing a probe file),
and call it at the start of execute() for every command (alignment, basic,
error, gcbias, hybcap, isize, rna, wgs, multi). Unwritable destinations now fail
in <1s before any reading or gene-model parsing.
…d/fgbio

Add rna to the tools table and usage examples, and a 'Differences in rna vs.
CollectRnaSeqMetrics (and fgbio EstimateRnaSeqInsertSize)' section covering the
gene-model formats + contig reconciliation, biotype+explicit ribosomal regions,
strand auto-detection, transcript-space insert size, the coverage/bias off-by-one
fixes (top-1000, short-transcript exclusion, per-base coverage, non-overlapping
position bins), the ribosomal union/MC fragment-end fixes, the gene->locus->transcript
model, and the duplicate/MAPQ/accumulation-level behavior.
classify_bases dominated runtime (~80% of samples). assign_locus_function
and add_coverage_counts each rescanned the full exon list for every aligned
base — O(positions x exons) per transcript per read. Rewrite both as single
passes (a moving exon pointer for classification, contiguous range fills for
coverage), turning them into O(positions + exons). Also reuse a per-collector
scratch buffer for the locus-function array instead of allocating one per
alignment block.

Outputs are byte-identical on the PE test set; throughput on a 16M-read BAM
with the GENCODE v29 GTF goes from ~175k to ~570k reads/s.
Per-base functional classification ran assign_locus_function for every
transcript of every overlapping locus, so its cost scaled with the number of
transcripts per gene (~45% of runtime after the previous fix). The result is
just the max function across transcripts, so precompute each locus's merged
exon and coding (exon ∩ CDS) interval unions at load time and classify each
block once per locus via GeneLocus::classify_block. Per-transcript coverage
accumulation is unchanged (it genuinely needs per-transcript detail).

Outputs are byte-identical on the PE test set; throughput on the 16M-read
GENCODE v29 run goes from ~570k to ~900k reads/s.
Replace the per-base locus-function array (classify_block + the per-block funcs
scratch buffer) with interval arithmetic: intersect each alignment block with
the merged coding / exon / span interval sets and derive the per-class base
counts directly (coding = block∩coding; utr = block∩exon − coding; intronic =
block∩span − exon; intergenic = block − span).

For the common 0/1-locus read the locus's precomputed unions are used directly;
multi-locus reads merge the loci's unions first, so overlapping genes compose by
union — exactly reproducing the cross-locus max the per-base path computed.
Output is byte-identical.

Full PE dataset (75.2M reads, GENCODE v29 GTF): 60s -> 49s (~1.25M -> ~1.57M
reads/s). Adds direct unit tests for classify_blocks/overlap_with_intervals.
Replace the gene-model interval tree with a coordinate sweep over the
(now-required) coordinate-sorted reads. Loci are indexed per contig sorted by
start; the collector opens loci as their start passes the read end and drops
them once their end falls behind the read start, filtering the retained set to
those actually overlapping the read (a long spliced read can open downstream
loci). initialize() now requires @hd SO:coordinate and errors otherwise, with a
runtime monotonicity guard as backstop.

Insert size now derives its gene from the first read's swept overlaps and
accepts a pair when exactly one of them *encloses* the pair span (plus the
existing per-transcript exon-overlap and agreement checks), instead of fgbio's
'exactly one gene overlaps the pair span'. A non-enclosing bystander gene near
the pair no longer disqualifies it: on the full PE set this uses 6.7% more pairs
(12.74M -> 13.60M) with mean +1% (208.9 -> 210.9), median 195 -> 197.
Composition and coverage outputs are byte-identical.

Full PE dataset (75.2M reads): 49s -> 43s (~1.57M -> ~1.75M reads/s). Drops the
per-read locus allocation and the separate insert-size overlap query.
…insert size

The perf work added a coordinate-sorted input requirement (the locus sweep) and
changed the insert-size acceptance rule to 'exactly one gene encloses the pair'
(vs fgbio's 'one gene overlaps the pair span'). Document both in the rna
differences section.
read_file_contents read the entire gene model into one String (~1.2 GB
uncompressed for GENCODE v29) before parsing. Open it with fgoxide's
Io::new_reader (transparent gzip by extension) and stream line by line: the
three parsers now take impl BufRead and iterate .lines(); format detection reads
a throwaway reader and parsing re-opens a fresh one. Eliminates the multi-
hundred-MB-to-GB init-time allocation; output unchanged.
Coverage was stored as a Vec<u32> per (locus, transcript), so a base shared by N
transcripts of a gene was accumulated N times and stored N times — 316.7M u32
(1.27 GB) for GENCODE v29. Store one coverage array per locus over its
concatenated exon union (introns removed, shared exons stored once: 140.2M u32,
0.56 GB) and accumulate into it once per overlapping locus. Per-transcript
coverage for the bias/CV/histogram metrics is reconstructed at finish by mapping
each transcript's exons back to their exon-union slices.

This is exact: today every read overlapping a transcript exon already increments
that transcript's coverage, so per-transcript cov[t] equals the total genomic
coverage at that position — which is what the union array stores. Outputs are
byte-identical (composition, coverage, insert size all unchanged).

Combined with the streaming gene-model load, peak RSS on the full PE dataset
(75.2M reads) drops 2.16 GB -> 788 MB, and runtime 43s -> 38s (accumulating once
per locus rather than once per shared transcript).
Adversarial review pass (3 reviewers) over the rna command and gene-model code.
All safe, non-controversial fixes:

- Harden against malformed input (no panic/underflow on adversarial files):
  skip zero-length CIGAR M/=/X ops in rec_alignment_blocks/mate_alignment_blocks
  (block_start + block_len - 1 would underflow); reject exons/features whose end
  precedes start in all three parsers (would invert Exon::length()).
- Fix the standalone `rna --help` examples (were showing the multi-prefixed
  --rna::gene-model flags, which the standalone command does not accept).
- Clearer error when a gzipped gene model lacks a .gz/.bgz extension (fgoxide
  detects compression by extension): sniff the gzip magic and say so.
- Standards compliance: move `impl ParsedTranscript` next to its struct; document
  ParsedTranscript and CoverageMetrics fields; use f64::total_cmp instead of
  partial_cmp().unwrap() in the coverage/median sorts (no NaN panic).
- CHANGELOG: document the rna command and the startup output-writability check.

Output is byte-identical on the PE test set; 620 tests pass, clippy/fmt clean.
The Q2 refactor replaced per-base classification (assign_locus_function over a
LocusFunction array) with interval arithmetic (GeneModel::classify_blocks). That
left Transcript::assign_locus_function, is_utr, transcript_coordinate, is_coding,
and the LocusFunction enum used only by their own unit tests — dead in production
and a drift risk (is_utr handled malformed CDS differently than the live
coding_union path). Remove them and their tests; keep transcribed_length (still
used) with a focused unit test. Output is byte-identical.
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new gene-model module for refFlat, GTF, and GFF3 loading and block classification, wires a new rna CLI subcommand and multi support, and documents the command in the README, changelog, and ERRATA. It also adds startup output-prefix writability checks across commands, updates insert-size histogram handling to use a shared PDF writer, and expands tests and metric docs.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding the new rna command for RNA-seq QC metrics and transcript insert size.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description matches the changeset, covering the new rna command, docs updates, output files, and related behavior changes.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tf_rna

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
tests/test_rna.rs (1)

247-247: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use the shared float assertion macro in these tests.

These ad-hoc abs() < eps checks are inconsistent with the test suite’s standard helper and give worse failure output. Replace them with assert_float_eq!(a, b, eps). As per coding guidelines, tests/**/*.rs: "Use assert_float_eq!(a, b, eps) macro for float comparisons in tests".

Also applies to: 335-339, 411-424, 438-439, 455-455

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_rna.rs` at line 247, Replace the ad-hoc float comparison in the
RNA tests with the shared float assertion macro. In the affected test cases
around the frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/commands/common.rs`:
- Around line 22-25: Update the directory resolution logic in `common::` so root
prefixes like `/` or Windows drive roots are treated as the filesystem root
instead of falling back to `.`. The current `prefix.parent()` handling in the
`dir` selection should preserve root paths when `Path::parent()` returns `None`
or an empty parent. Add a regression test around the startup check path to cover
a root prefix case and verify it does not incorrectly pass as the current
directory.

In `@src/commands/rna.rs`:
- Around line 671-678: The top-1000 selection in the RNA command can include
more than 1000 transcripts when multiple loci share the cutoff mean, so tighten
the filtering in the selection logic around best_per_locus/means to
deterministically keep exactly 1000 entries. Update the threshold-based check
and the later loop that builds the retained transcript set so ties at the
boundary are broken in a stable way or truncated after sorting, ensuring n_tx
cannot exceed 1000 and the downstream bias/CV medians reflect the intended
top-1000 behavior.
- Around line 1230-1240: Update mate_alignment_end to follow the command’s
fragment-end contract: when mate_cigar/mate_alignment_blocks cannot derive the
mate end from MC, fall back to TLEN before using the read length. Locate the
logic in mate_alignment_end and replace the current sequence-length-based
fallback with TLEN-derived end coordinates, keeping the existing MC path intact
so pair-span bounds used by ribosomal classification and strand-template
inference remain correct.

In `@src/gene_model/gff.rs`:
- Around line 109-114: The tx_order loop in gff.rs is still indexing features
with features[&id], which will panic when an exon/CDS references an orphan
Parent ID that never created a feature record. Update this lookup in the parsing
path to use a safe get-based check in the same block that already handles
exons_by_parent, and either return a parse error or skip the orphan explicitly.
Keep the rest of the transcript/gene resolution logic intact by guarding the
feature access before using feature.parents.first() and the subsequent gene
lookup.

In `@src/gene_model/mod.rs`:
- Around line 447-449: `detect_format` is currently swallowing read/sniffing
failures by using `line.ok()?`, which turns I/O or decompression problems into
an unknown format result. Update `detect_format` in `gene_model::mod` to return
`Result<Option<GeneModelFormat>>` instead of `Option`, and propagate the
`BufRead::lines()` read error upward rather than converting it to `None`. Make
sure callers that rely on `detect_format` handle the error case explicitly,
while preserving the existing format-detection logic for successfully read
lines.
- Around line 514-518: The transcript clustering logic in the
`clusters.iter_mut().find(...)` block is comparing `other.contig` and
`tx.contig` as raw strings, which can split the same locus when contig aliases
differ. Update the comparison to use the resolved BAM contig identity instead of
the original names, reusing the existing contig-resolution logic used elsewhere
in `src/gene_model/mod.rs` before calling `spans_overlap`. Make sure transcripts
like `1`/`chr1` or `MT`/`chrM` are grouped into the same cluster when they map
to the same underlying contig.

In `@src/plotting.rs`:
- Around line 129-130: The early return in plotting logic leaves an old PDF
behind when no series qualify, so update the empty-case handling in the plotting
routine in `src/plotting.rs` to clear any previously generated output at `path`
before returning. Make the fix in the same branch that checks `plots.is_empty()`
so stale histogram files are removed whenever the current run produces no plot.

---

Nitpick comments:
In `@tests/test_rna.rs`:
- Line 247: Replace the ad-hoc float comparison in the RNA tests with the shared
float assertion macro. In the affected test cases around the
frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 185780c6-cb77-421d-986b-79d5f06a8684

📥 Commits

Reviewing files that changed from the base of the PR and between 3763b9a and 72afd23.

📒 Files selected for processing (25)
  • CHANGELOG.md
  • README.md
  • src/commands/alignment.rs
  • src/commands/basic.rs
  • src/commands/common.rs
  • src/commands/docs.rs
  • src/commands/error.rs
  • src/commands/gcbias.rs
  • src/commands/hybcap.rs
  • src/commands/isize.rs
  • src/commands/mod.rs
  • src/commands/multi.rs
  • src/commands/rna.rs
  • src/commands/wgs.rs
  • src/gene_model/gff.rs
  • src/gene_model/gtf.rs
  • src/gene_model/mod.rs
  • src/gene_model/refflat.rs
  • src/intervals.rs
  • src/lib.rs
  • src/main.rs
  • src/plotting.rs
  • tests/helpers/mod.rs
  • tests/test_multi.rs
  • tests/test_rna.rs

Comment thread src/commands/common.rs
Comment thread src/commands/rna.rs
Comment thread src/commands/rna.rs Outdated
Comment thread src/gene_model/gff.rs
Comment thread src/gene_model/mod.rs Outdated
Comment thread src/gene_model/mod.rs
Comment thread src/plotting.rs
tfenne added 2 commits June 27, 2026 05:54
…o --ignore-chrom

Expand the command-level long help with dedicated sections on the supported
gene-model formats (refFlat/GTF/GFF3, plain or gzipped, content-detected), how the
ribosomal interval set is derived (union of biotype-derived rRNA genes — GTF/GFF3
contribute, refFlat does not — plus an explicit intervals file), transcript-space
insert-size estimation (MC-tag requirement, orientations, acceptance rule), and
strand auto-detection.

Per-option clarifications:
- --minimum-length: name the exact coverage/bias columns/outputs it gates and note
  it doesn't affect composition/strand/insert size.
- --end-bias-bases: name the bias columns it sizes.
- --isize-minimum-overlap: reword to 'fraction of read+mate bases that must overlap a
  transcript's exons for the pair's insert size to be trusted and counted'.
- --strand / --rrna-fragment-fraction / --isize-deviations: tie each to its outputs.

Rename --ignore-sequence to --ignore-chrom (it takes chromosome/contig names, mirroring
Picard's IGNORE_SEQUENCE) and document it clearly. No metric logic changed; output is
byte-identical; 618 tests pass.
…t prefix)

Three fixes from the PR #40 review:

- mate_alignment_end: fall back to the TLEN-derived fragment end
  (min(read_start, mate_start) + |TLEN| - 1, exactly Picard's
  CoordMath.getEnd(fragmentStart, abs(insertSize))) when the MC tag is absent,
  instead of the read's own length. This matches the documented contract (the
  module doc and README both say 'falling back to TLEN, never read length') and
  the values fed to the ribosomal and strand-template fragment-span checks. Only
  affects pairs lacking MC; output on MC-tagged data is byte-identical. Extracted
  tlen_fragment_end as a pure, unit-tested helper.

- detect_format: return Result<Option<GeneModelFormat>> and propagate line read /
  decompression errors instead of collapsing them to 'could not detect format'
  via line.ok()?. The gzip-without-extension hint now runs up front so that case
  still gets a targeted message.

- validate_output_prefix: treat a root prefix ('/', 'C:\') as the directory to
  check rather than falling back to '.'.

Declined (with rationale): tie-handling >1000 transcripts (f64 means don't tie in
practice), gff features[&id] 'panic' (false positive — every tx_order id is
inserted into features), cluster-on-resolved-contigs (real models use one contig
convention per file), and remove-stale-PDF (deleting a user's file to avoid a
rare stale-output case is the wrong trade-off).

619 tests pass; output byte-identical on the MC dataset.
tfenne added 3 commits June 27, 2026 06:44
The gene-model / ribosomal / insert-size detail was in after_long_help, so it
rendered after the options list. Move it into the command description (long_about)
so the layout is description -> options -> examples, matching our other commands.
Write each block as a single paragraph and let the terminal wrap (no hand-wrapping).
Drop the STRAND block, hoist the coordinate-sorted note above the titled blocks,
condense INSERT SIZE, and fix the per-option 'see ... below' refs to 'above'.
…ype output

Extends the rna command (still single-pass) with the RNA-specific QC metrics
the field expects from RSeQC/RNA-SeQC/Qualimap, folded into a wider
rna-metrics.txt plus a new per-biotype file:

- Read accounting + read-level genomic origin (exonic/intronic/intergenic/
  ambiguous/ribosomal reads; assigned_reads), genes_detected.
- Splice-junction annotation: known/partial-novel/novel, at observation and
  distinct-junction level, reusing the existing CIGAR walk + an annotated
  junction set built at init.
- Transcript Integrity Number (median/mean/stddev): a finish-time reduction
  over the per-transcript coverage already built for the bias/CV metrics, so
  no extra pass or indexed pileups. Gated on mean coverage > --tin-min-coverage
  and length >= --minimum-length, one representative (most-expressed) transcript
  per gene -- the isoform-robust choice (avoids gappy unexpressed isoforms).
- Per-biotype read counts in <prefix>.rna-biotype.txt; refFlat (no biotype
  column) infers protein_coding vs noncoding from CDS presence.

Field names settled before merge (rN_tx_strand_reads, dropped
incorrect_strand_reads). Throughput +1.5% over baseline (well within budget),
peak RSS +3%; composition/coverage/insert-size outputs unchanged. 631 tests
pass; fmt/clippy clean.
The README 'Key Differences vs. Picard' section had grown large. Move the
per-tool detail into a standalone ERRATA.md and replace it with a short
'Differences vs. Picard and Other Tools' section that gives a one-line summary
per tool linking into ERRATA.

Substantially expand the rna ERRATA section: a dedicated Transcript Integrity
Number subsection (formula, KL/scale-invariance interpretation, robustness to
spikes vs. gaps, multi-isoform behavior, gating rationale, and divergence from
RSeQC), plus the new read-origin / gene-detection / splice-junction / biotype
metrics. Also enrich the RnaSeqMetric field doc comments (what each means, how
it's computed, and where it differs from expectation: total vs mapped reads,
exonic = assigned + ambiguous, base- vs read-level origin, frac_usable_bases
denominator, TIN scale-invariance) -- these also feed 'riker docs'.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
ERRATA.md (2)

228-230: 📐 Maintainability & Code Quality | 🔵 Trivial

Add language specifier to fenced code block.

- ```
+ ```text
  TIN = 100 · e^H / n,   where  H = −Σ pᵢ·ln pᵢ,   pᵢ = cᵢ / Σⱼ cⱼ

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @ERRATA.md around lines 228 - 230, Add a language specifier to the fenced
code block in the TIN formula snippet so the markdown renders as a text block;
update the existing fence around the equation to use a text code fence while
keeping the formula content unchanged.


</details>

<!-- cr-comment:v1:4267d3d4f914df7a3731cdf0 -->

_Source: Linters/SAST tools_

---

`237-239`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_

**Add language specifier to fenced code block.**

```diff
- ```
+ ```text
  TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) )

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @ERRATA.md around lines 237 - 239, Add a language specifier to the fenced
code block in the ERRATA documentation snippet so it is marked as text (for
example, the block containing TIN = 100 · exp( −D_KL(normalized coverage ‖
uniform) ) should use a text fence). Update the markdown fence itself and keep
the equation content unchanged.


</details>

<!-- cr-comment:v1:d74abdec73475219503a0387 -->

_Source: Linters/SAST tools_

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @tests/test_rna.rs:

  • Around line 653-654: The strand-count check in the RNA test is too loose and
    no longer detects double-counting; update the assertion on m.r1_tx_strand_reads
    to require the exact expected value using the existing metric contract in this
    test, and keep the companion m.r2_tx_strand_reads assertion unchanged. Use the m
    variable in the test case with 10 FR pairs so the assertion verifies the precise
    strand count rather than a minimum threshold.

Nitpick comments:
In @ERRATA.md:

  • Around line 228-230: Add a language specifier to the fenced code block in the
    TIN formula snippet so the markdown renders as a text block; update the existing
    fence around the equation to use a text code fence while keeping the formula
    content unchanged.
  • Around line 237-239: Add a language specifier to the fenced code block in the
    ERRATA documentation snippet so it is marked as text (for example, the block
    containing TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) ) should use a
    text fence). Update the markdown fence itself and keep the equation content
    unchanged.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `7d97d606-38d7-4417-8a19-db82feb1d955`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between d425634d41ea6a16065f1f06723db39786ed6382 and 37e9993e504b51f396d1dc5573c782eb14082a7c.

</details>

<details>
<summary>📒 Files selected for processing (8)</summary>

* `CHANGELOG.md`
* `ERRATA.md`
* `README.md`
* `src/commands/docs.rs`
* `src/commands/rna.rs`
* `src/gene_model/mod.rs`
* `src/gene_model/refflat.rs`
* `tests/test_rna.rs`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (1)</summary>

* CHANGELOG.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* src/commands/docs.rs
* src/gene_model/mod.rs
* src/commands/rna.rs

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment thread tests/test_rna.rs Outdated
CodeRabbit (PR #40): the auto_detects_forward_strandedness test builds exactly
10 FR pairs, so r1_tx_strand_reads is deterministically 10. Use assert_eq!(.,10)
instead of >= 10 so the test catches template double-counting.
…GENCODE v50

New plots (always emitted, FG-branded), both reusing the single-pass
aggregates already collected so there is no extra read pass:
- per-biotype assigned-read bar chart (log y-axis, scientific ticks,
  prettified labels, per-bar percentage annotation)
- per-gene RPK expression distribution overlaying all genes vs the
  protein-coding subset (log-x, semi-transparent fills, legend, and a
  zero-read-gene annotation)

Insert size: when the MC tag is absent, estimate FR pairs from a
spec-compliant TLEN at the forward read instead of skipping them. The
mate's 5' positions -- hence the 5'-to-5' transcript-space size -- are
exact from TLEN; the unknown mate splicing is gated by requiring both
mate ends to fall in an exon and a >=95% collapsed-exonic footprint.
Validated against the MC path on real data (9.7M FR pairs; mean/median
agree within ~0.5%). MC stays the exact, recommended input.

Drop the .rna-coverage.txt table -- Picard emits it only because its
plotting engine is external; riker plots internally, so the per-position
table added little. The .rna-coverage.pdf is kept and RnaCoverageEntry is
now an internal struct.

Soften the "MC required" language to "MC or spec-compliant TLEN" across
the tool docs, ERRATA, and README.

Add docs/rna-integrity.md: the TIN validation (degradation ladder + depth
downsampling sweep) redone on GENCODE v50 with a rustar-built index, for a
single-tool indexing/alignment story.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docs/rna-integrity.md (1)

26-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add language specifier to fenced code block.

The formula block at Line 26 lacks a language tag, triggering markdownlint-cli2 MD040. Add text or math to silence the lint and improve renderer consistency.

- ```
+ ```text
  TIN = 100 · e^H / n,   H = −Σ pᵢ·ln pᵢ,   pᵢ = cᵢ / Σⱼ cⱼ
      = 100 · exp( −D_KL(coverage ‖ uniform) )

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/rna-integrity.md around lines 26 - 29, The fenced formula block in the
RNA integrity doc is missing a language specifier, which triggers markdownlint
MD040. Update the fenced block around the TIN formula in the markdown so it uses
a language tag such as text or math, keeping the existing content unchanged to
satisfy the linter and improve rendering consistency.


</details>

<!-- cr-comment:v1:a75e9f760e29d064033f042a -->

_Source: Linters/SAST tools_

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @docs/rna-integrity.md:

  • Around line 26-29: The fenced formula block in the RNA integrity doc is
    missing a language specifier, which triggers markdownlint MD040. Update the
    fenced block around the TIN formula in the markdown so it uses a language tag
    such as text or math, keeping the existing content unchanged to satisfy the
    linter and improve rendering consistency.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `08c3800f-4c63-4fe7-a1b1-ef843981e7dd`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between a22cec3e6a11eb627511a5a63108b693efa84912 and a3ff63523bb514924f398be200c7d27a9f61d8a1.

</details>

<details>
<summary>⛔ Files ignored due to path filters (2)</summary>

* `docs/images/rna-integrity-rin-vs-tin.png` is excluded by `!**/*.png`
* `docs/images/rna-integrity-tin-vs-depth.png` is excluded by `!**/*.png`

</details>

<details>
<summary>📒 Files selected for processing (7)</summary>

* `ERRATA.md`
* `README.md`
* `docs/rna-integrity.md`
* `src/commands/docs.rs`
* `src/commands/rna.rs`
* `src/plotting.rs`
* `tests/test_rna.rs`

</details>

<details>
<summary>✅ Files skipped from review due to trivial changes (2)</summary>

* README.md
* ERRATA.md

</details>

<details>
<summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary>

* src/commands/docs.rs
* src/plotting.rs
* tests/test_rna.rs

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant