Conversation
…ipt insert size) Ports Picard CollectRnaSeqMetrics (base composition, strand specificity, 5'/3' coverage bias) and fgbio EstimateRnaSeqInsertSize (transcript-space insert size) into a single `riker rna` command that makes one BAM pass over a single gene-model load. Also wired into `multi`. Why a new gene_model module: both analyses need transcript/exon/CDS structure, and we want to accept the formats users actually have rather than forcing a refFlat conversion. `src/gene_model/` parses UCSC refFlat, GTF, and GFF3 (GENCODE/Ensembl/RefSeq) with content-based auto-detection and provider-agnostic attribute handling, exposing a queryable GeneModel (Gene -> GeneLocus -> Transcript; multi-chromosome genes become independent loci). Coordinates are 1-based inclusive internally to match both reference tools. Several latent bugs in the reference tools are fixed and documented in the rna module docs: top-1001 -> top-1000 transcript selection; ribosomal overlap union (not Math.max); short-transcript exclusion from bias; per-base coverage undercount; MC-or-TLEN fragment end; non-overlapping coverage-vs-position bins. Filtering is unified and Picard-biased: duplicates included by default (--exclude-duplicates to drop), --min-mapq default 0, strand auto-detected by default. Ribosomal territory is the union of biotype-derived rRNA/Mt_rRNA loci and an optional explicit BED/IntervalList. Transcript-space insert size requires the MC tag (pairs lacking it are skipped, never approximated). Shared work: the insert-size histogram chart is extracted to plotting::write_insert_size_histogram_pdf and reused by `isize` and `rna`; the gzip-aware text reader is shared as intervals::read_file_contents. Tests: 49 new programmatic tests (gene-model parsing/auto-detection, base classification across all three formats, ribosomal, strand auto-detect, transcript-space insert size, coverage/bias edge cases, filtering).
NCBI RefSeq GFF3 files refer to contigs by accession (e.g. NC_000001.11) rather than common names (chr1 / 1), so against a chr-named BAM every locus was dropped by contig reconciliation. Read the accession->name mapping from the `region`/`chromosome` feature attributes (`chromosome=`/`Name=`, as fgbio does) and translate each transcript's contig through it; the existing chr/MT reconciliation then resolves the common name to the BAM. Verified on the real GCF_000001405.40_GRCh38.p14 GFF3 (0 -> 57,641 loci loaded against a GRCh38 chr-named RNA-seq BAM).
Every command derives its output paths by appending a suffix to the -o/--output prefix, and only attempts the write in finish(). A missing or unwritable output directory therefore surfaced only after the entire input had been processed — e.g. a full RNA-seq pass would run for minutes and then fail at the end. Add common::validate_output_prefix, which checks the prefix's parent directory exists, is a directory, and is writable (by creating and removing a probe file), and call it at the start of execute() for every command (alignment, basic, error, gcbias, hybcap, isize, rna, wgs, multi). Unwritable destinations now fail in <1s before any reading or gene-model parsing.
…d/fgbio Add rna to the tools table and usage examples, and a 'Differences in rna vs. CollectRnaSeqMetrics (and fgbio EstimateRnaSeqInsertSize)' section covering the gene-model formats + contig reconciliation, biotype+explicit ribosomal regions, strand auto-detection, transcript-space insert size, the coverage/bias off-by-one fixes (top-1000, short-transcript exclusion, per-base coverage, non-overlapping position bins), the ribosomal union/MC fragment-end fixes, the gene->locus->transcript model, and the duplicate/MAPQ/accumulation-level behavior.
classify_bases dominated runtime (~80% of samples). assign_locus_function and add_coverage_counts each rescanned the full exon list for every aligned base — O(positions x exons) per transcript per read. Rewrite both as single passes (a moving exon pointer for classification, contiguous range fills for coverage), turning them into O(positions + exons). Also reuse a per-collector scratch buffer for the locus-function array instead of allocating one per alignment block. Outputs are byte-identical on the PE test set; throughput on a 16M-read BAM with the GENCODE v29 GTF goes from ~175k to ~570k reads/s.
Per-base functional classification ran assign_locus_function for every transcript of every overlapping locus, so its cost scaled with the number of transcripts per gene (~45% of runtime after the previous fix). The result is just the max function across transcripts, so precompute each locus's merged exon and coding (exon ∩ CDS) interval unions at load time and classify each block once per locus via GeneLocus::classify_block. Per-transcript coverage accumulation is unchanged (it genuinely needs per-transcript detail). Outputs are byte-identical on the PE test set; throughput on the 16M-read GENCODE v29 run goes from ~570k to ~900k reads/s.
Replace the per-base locus-function array (classify_block + the per-block funcs scratch buffer) with interval arithmetic: intersect each alignment block with the merged coding / exon / span interval sets and derive the per-class base counts directly (coding = block∩coding; utr = block∩exon − coding; intronic = block∩span − exon; intergenic = block − span). For the common 0/1-locus read the locus's precomputed unions are used directly; multi-locus reads merge the loci's unions first, so overlapping genes compose by union — exactly reproducing the cross-locus max the per-base path computed. Output is byte-identical. Full PE dataset (75.2M reads, GENCODE v29 GTF): 60s -> 49s (~1.25M -> ~1.57M reads/s). Adds direct unit tests for classify_blocks/overlap_with_intervals.
Replace the gene-model interval tree with a coordinate sweep over the (now-required) coordinate-sorted reads. Loci are indexed per contig sorted by start; the collector opens loci as their start passes the read end and drops them once their end falls behind the read start, filtering the retained set to those actually overlapping the read (a long spliced read can open downstream loci). initialize() now requires @hd SO:coordinate and errors otherwise, with a runtime monotonicity guard as backstop. Insert size now derives its gene from the first read's swept overlaps and accepts a pair when exactly one of them *encloses* the pair span (plus the existing per-transcript exon-overlap and agreement checks), instead of fgbio's 'exactly one gene overlaps the pair span'. A non-enclosing bystander gene near the pair no longer disqualifies it: on the full PE set this uses 6.7% more pairs (12.74M -> 13.60M) with mean +1% (208.9 -> 210.9), median 195 -> 197. Composition and coverage outputs are byte-identical. Full PE dataset (75.2M reads): 49s -> 43s (~1.57M -> ~1.75M reads/s). Drops the per-read locus allocation and the separate insert-size overlap query.
…insert size The perf work added a coordinate-sorted input requirement (the locus sweep) and changed the insert-size acceptance rule to 'exactly one gene encloses the pair' (vs fgbio's 'one gene overlaps the pair span'). Document both in the rna differences section.
read_file_contents read the entire gene model into one String (~1.2 GB uncompressed for GENCODE v29) before parsing. Open it with fgoxide's Io::new_reader (transparent gzip by extension) and stream line by line: the three parsers now take impl BufRead and iterate .lines(); format detection reads a throwaway reader and parsing re-opens a fresh one. Eliminates the multi- hundred-MB-to-GB init-time allocation; output unchanged.
Coverage was stored as a Vec<u32> per (locus, transcript), so a base shared by N transcripts of a gene was accumulated N times and stored N times — 316.7M u32 (1.27 GB) for GENCODE v29. Store one coverage array per locus over its concatenated exon union (introns removed, shared exons stored once: 140.2M u32, 0.56 GB) and accumulate into it once per overlapping locus. Per-transcript coverage for the bias/CV/histogram metrics is reconstructed at finish by mapping each transcript's exons back to their exon-union slices. This is exact: today every read overlapping a transcript exon already increments that transcript's coverage, so per-transcript cov[t] equals the total genomic coverage at that position — which is what the union array stores. Outputs are byte-identical (composition, coverage, insert size all unchanged). Combined with the streaming gene-model load, peak RSS on the full PE dataset (75.2M reads) drops 2.16 GB -> 788 MB, and runtime 43s -> 38s (accumulating once per locus rather than once per shared transcript).
Adversarial review pass (3 reviewers) over the rna command and gene-model code. All safe, non-controversial fixes: - Harden against malformed input (no panic/underflow on adversarial files): skip zero-length CIGAR M/=/X ops in rec_alignment_blocks/mate_alignment_blocks (block_start + block_len - 1 would underflow); reject exons/features whose end precedes start in all three parsers (would invert Exon::length()). - Fix the standalone `rna --help` examples (were showing the multi-prefixed --rna::gene-model flags, which the standalone command does not accept). - Clearer error when a gzipped gene model lacks a .gz/.bgz extension (fgoxide detects compression by extension): sniff the gzip magic and say so. - Standards compliance: move `impl ParsedTranscript` next to its struct; document ParsedTranscript and CoverageMetrics fields; use f64::total_cmp instead of partial_cmp().unwrap() in the coverage/median sorts (no NaN panic). - CHANGELOG: document the rna command and the startup output-writability check. Output is byte-identical on the PE test set; 620 tests pass, clippy/fmt clean.
The Q2 refactor replaced per-base classification (assign_locus_function over a LocusFunction array) with interval arithmetic (GeneModel::classify_blocks). That left Transcript::assign_locus_function, is_utr, transcript_coordinate, is_coding, and the LocusFunction enum used only by their own unit tests — dead in production and a drift risk (is_utr handled malformed CDS differently than the live coding_union path). Remove them and their tests; keep transcribed_length (still used) with a focused unit test. Output is byte-identical.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new gene-model module for refFlat, GTF, and GFF3 loading and block classification, wires a new 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
tests/test_rna.rs (1)
247-247: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse the shared float assertion macro in these tests.
These ad-hoc
abs() < epschecks are inconsistent with the test suite’s standard helper and give worse failure output. Replace them withassert_float_eq!(a, b, eps). As per coding guidelines,tests/**/*.rs: "Useassert_float_eq!(a, b, eps)macro for float comparisons in tests".Also applies to: 335-339, 411-424, 438-439, 455-455
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_rna.rs` at line 247, Replace the ad-hoc float comparison in the RNA tests with the shared float assertion macro. In the affected test cases around the frac_ribosomal_bases and other float checks in tests/test_rna.rs, use assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests follow the suite’s standard pattern and produce consistent failure output.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/commands/common.rs`:
- Around line 22-25: Update the directory resolution logic in `common::` so root
prefixes like `/` or Windows drive roots are treated as the filesystem root
instead of falling back to `.`. The current `prefix.parent()` handling in the
`dir` selection should preserve root paths when `Path::parent()` returns `None`
or an empty parent. Add a regression test around the startup check path to cover
a root prefix case and verify it does not incorrectly pass as the current
directory.
In `@src/commands/rna.rs`:
- Around line 671-678: The top-1000 selection in the RNA command can include
more than 1000 transcripts when multiple loci share the cutoff mean, so tighten
the filtering in the selection logic around best_per_locus/means to
deterministically keep exactly 1000 entries. Update the threshold-based check
and the later loop that builds the retained transcript set so ties at the
boundary are broken in a stable way or truncated after sorting, ensuring n_tx
cannot exceed 1000 and the downstream bias/CV medians reflect the intended
top-1000 behavior.
- Around line 1230-1240: Update mate_alignment_end to follow the command’s
fragment-end contract: when mate_cigar/mate_alignment_blocks cannot derive the
mate end from MC, fall back to TLEN before using the read length. Locate the
logic in mate_alignment_end and replace the current sequence-length-based
fallback with TLEN-derived end coordinates, keeping the existing MC path intact
so pair-span bounds used by ribosomal classification and strand-template
inference remain correct.
In `@src/gene_model/gff.rs`:
- Around line 109-114: The tx_order loop in gff.rs is still indexing features
with features[&id], which will panic when an exon/CDS references an orphan
Parent ID that never created a feature record. Update this lookup in the parsing
path to use a safe get-based check in the same block that already handles
exons_by_parent, and either return a parse error or skip the orphan explicitly.
Keep the rest of the transcript/gene resolution logic intact by guarding the
feature access before using feature.parents.first() and the subsequent gene
lookup.
In `@src/gene_model/mod.rs`:
- Around line 447-449: `detect_format` is currently swallowing read/sniffing
failures by using `line.ok()?`, which turns I/O or decompression problems into
an unknown format result. Update `detect_format` in `gene_model::mod` to return
`Result<Option<GeneModelFormat>>` instead of `Option`, and propagate the
`BufRead::lines()` read error upward rather than converting it to `None`. Make
sure callers that rely on `detect_format` handle the error case explicitly,
while preserving the existing format-detection logic for successfully read
lines.
- Around line 514-518: The transcript clustering logic in the
`clusters.iter_mut().find(...)` block is comparing `other.contig` and
`tx.contig` as raw strings, which can split the same locus when contig aliases
differ. Update the comparison to use the resolved BAM contig identity instead of
the original names, reusing the existing contig-resolution logic used elsewhere
in `src/gene_model/mod.rs` before calling `spans_overlap`. Make sure transcripts
like `1`/`chr1` or `MT`/`chrM` are grouped into the same cluster when they map
to the same underlying contig.
In `@src/plotting.rs`:
- Around line 129-130: The early return in plotting logic leaves an old PDF
behind when no series qualify, so update the empty-case handling in the plotting
routine in `src/plotting.rs` to clear any previously generated output at `path`
before returning. Make the fix in the same branch that checks `plots.is_empty()`
so stale histogram files are removed whenever the current run produces no plot.
---
Nitpick comments:
In `@tests/test_rna.rs`:
- Line 247: Replace the ad-hoc float comparison in the RNA tests with the shared
float assertion macro. In the affected test cases around the
frac_ribosomal_bases and other float checks in tests/test_rna.rs, use
assert_float_eq!(a, b, eps) instead of manual abs() < eps logic so the tests
follow the suite’s standard pattern and produce consistent failure output.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 185780c6-cb77-421d-986b-79d5f06a8684
📒 Files selected for processing (25)
CHANGELOG.mdREADME.mdsrc/commands/alignment.rssrc/commands/basic.rssrc/commands/common.rssrc/commands/docs.rssrc/commands/error.rssrc/commands/gcbias.rssrc/commands/hybcap.rssrc/commands/isize.rssrc/commands/mod.rssrc/commands/multi.rssrc/commands/rna.rssrc/commands/wgs.rssrc/gene_model/gff.rssrc/gene_model/gtf.rssrc/gene_model/mod.rssrc/gene_model/refflat.rssrc/intervals.rssrc/lib.rssrc/main.rssrc/plotting.rstests/helpers/mod.rstests/test_multi.rstests/test_rna.rs
…o --ignore-chrom Expand the command-level long help with dedicated sections on the supported gene-model formats (refFlat/GTF/GFF3, plain or gzipped, content-detected), how the ribosomal interval set is derived (union of biotype-derived rRNA genes — GTF/GFF3 contribute, refFlat does not — plus an explicit intervals file), transcript-space insert-size estimation (MC-tag requirement, orientations, acceptance rule), and strand auto-detection. Per-option clarifications: - --minimum-length: name the exact coverage/bias columns/outputs it gates and note it doesn't affect composition/strand/insert size. - --end-bias-bases: name the bias columns it sizes. - --isize-minimum-overlap: reword to 'fraction of read+mate bases that must overlap a transcript's exons for the pair's insert size to be trusted and counted'. - --strand / --rrna-fragment-fraction / --isize-deviations: tie each to its outputs. Rename --ignore-sequence to --ignore-chrom (it takes chromosome/contig names, mirroring Picard's IGNORE_SEQUENCE) and document it clearly. No metric logic changed; output is byte-identical; 618 tests pass.
…t prefix) Three fixes from the PR #40 review: - mate_alignment_end: fall back to the TLEN-derived fragment end (min(read_start, mate_start) + |TLEN| - 1, exactly Picard's CoordMath.getEnd(fragmentStart, abs(insertSize))) when the MC tag is absent, instead of the read's own length. This matches the documented contract (the module doc and README both say 'falling back to TLEN, never read length') and the values fed to the ribosomal and strand-template fragment-span checks. Only affects pairs lacking MC; output on MC-tagged data is byte-identical. Extracted tlen_fragment_end as a pure, unit-tested helper. - detect_format: return Result<Option<GeneModelFormat>> and propagate line read / decompression errors instead of collapsing them to 'could not detect format' via line.ok()?. The gzip-without-extension hint now runs up front so that case still gets a targeted message. - validate_output_prefix: treat a root prefix ('/', 'C:\') as the directory to check rather than falling back to '.'. Declined (with rationale): tie-handling >1000 transcripts (f64 means don't tie in practice), gff features[&id] 'panic' (false positive — every tx_order id is inserted into features), cluster-on-resolved-contigs (real models use one contig convention per file), and remove-stale-PDF (deleting a user's file to avoid a rare stale-output case is the wrong trade-off). 619 tests pass; output byte-identical on the MC dataset.
The gene-model / ribosomal / insert-size detail was in after_long_help, so it rendered after the options list. Move it into the command description (long_about) so the layout is description -> options -> examples, matching our other commands. Write each block as a single paragraph and let the terminal wrap (no hand-wrapping). Drop the STRAND block, hoist the coordinate-sorted note above the titled blocks, condense INSERT SIZE, and fix the per-option 'see ... below' refs to 'above'.
…ype output Extends the rna command (still single-pass) with the RNA-specific QC metrics the field expects from RSeQC/RNA-SeQC/Qualimap, folded into a wider rna-metrics.txt plus a new per-biotype file: - Read accounting + read-level genomic origin (exonic/intronic/intergenic/ ambiguous/ribosomal reads; assigned_reads), genes_detected. - Splice-junction annotation: known/partial-novel/novel, at observation and distinct-junction level, reusing the existing CIGAR walk + an annotated junction set built at init. - Transcript Integrity Number (median/mean/stddev): a finish-time reduction over the per-transcript coverage already built for the bias/CV metrics, so no extra pass or indexed pileups. Gated on mean coverage > --tin-min-coverage and length >= --minimum-length, one representative (most-expressed) transcript per gene -- the isoform-robust choice (avoids gappy unexpressed isoforms). - Per-biotype read counts in <prefix>.rna-biotype.txt; refFlat (no biotype column) infers protein_coding vs noncoding from CDS presence. Field names settled before merge (rN_tx_strand_reads, dropped incorrect_strand_reads). Throughput +1.5% over baseline (well within budget), peak RSS +3%; composition/coverage/insert-size outputs unchanged. 631 tests pass; fmt/clippy clean.
The README 'Key Differences vs. Picard' section had grown large. Move the per-tool detail into a standalone ERRATA.md and replace it with a short 'Differences vs. Picard and Other Tools' section that gives a one-line summary per tool linking into ERRATA. Substantially expand the rna ERRATA section: a dedicated Transcript Integrity Number subsection (formula, KL/scale-invariance interpretation, robustness to spikes vs. gaps, multi-isoform behavior, gating rationale, and divergence from RSeQC), plus the new read-origin / gene-detection / splice-junction / biotype metrics. Also enrich the RnaSeqMetric field doc comments (what each means, how it's computed, and where it differs from expectation: total vs mapped reads, exonic = assigned + ambiguous, base- vs read-level origin, frac_usable_bases denominator, TIN scale-invariance) -- these also feed 'riker docs'.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
ERRATA.md (2)
228-230: 📐 Maintainability & Code Quality | 🔵 TrivialAdd language specifier to fenced code block.
- ``` + ```text TIN = 100 · e^H / n, where H = −Σ pᵢ·ln pᵢ, pᵢ = cᵢ / Σⱼ cⱼ<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@ERRATA.mdaround lines 228 - 230, Add a language specifier to the fenced
code block in the TIN formula snippet so the markdown renders as a text block;
update the existing fence around the equation to use a text code fence while
keeping the formula content unchanged.</details> <!-- cr-comment:v1:4267d3d4f914df7a3731cdf0 --> _Source: Linters/SAST tools_ --- `237-239`: _📐 Maintainability & Code Quality_ | _🔵 Trivial_ **Add language specifier to fenced code block.** ```diff - ``` + ```text TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) )<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@ERRATA.mdaround lines 237 - 239, Add a language specifier to the fenced
code block in the ERRATA documentation snippet so it is marked as text (for
example, the block containing TIN = 100 · exp( −D_KL(normalized coverage ‖
uniform) ) should use a text fence). Update the markdown fence itself and keep
the equation content unchanged.</details> <!-- cr-comment:v1:d74abdec73475219503a0387 --> _Source: Linters/SAST tools_ </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Inline comments:
In@tests/test_rna.rs:
- Around line 653-654: The strand-count check in the RNA test is too loose and
no longer detects double-counting; update the assertion on m.r1_tx_strand_reads
to require the exact expected value using the existing metric contract in this
test, and keep the companion m.r2_tx_strand_reads assertion unchanged. Use the m
variable in the test case with 10 FR pairs so the assertion verifies the precise
strand count rather than a minimum threshold.
Nitpick comments:
In@ERRATA.md:
- Around line 228-230: Add a language specifier to the fenced code block in the
TIN formula snippet so the markdown renders as a text block; update the existing
fence around the equation to use a text code fence while keeping the formula
content unchanged.- Around line 237-239: Add a language specifier to the fenced code block in the
ERRATA documentation snippet so it is marked as text (for example, the block
containing TIN = 100 · exp( −D_KL(normalized coverage ‖ uniform) ) should use a
text fence). Update the markdown fence itself and keep the equation content
unchanged.</details> <details> <summary>🪄 Autofix (Beta)</summary> Fix all unresolved CodeRabbit comments on this PR: - [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended) - [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes </details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Pro **Run ID**: `7d97d606-38d7-4417-8a19-db82feb1d955` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between d425634d41ea6a16065f1f06723db39786ed6382 and 37e9993e504b51f396d1dc5573c782eb14082a7c. </details> <details> <summary>📒 Files selected for processing (8)</summary> * `CHANGELOG.md` * `ERRATA.md` * `README.md` * `src/commands/docs.rs` * `src/commands/rna.rs` * `src/gene_model/mod.rs` * `src/gene_model/refflat.rs` * `tests/test_rna.rs` </details> <details> <summary>✅ Files skipped from review due to trivial changes (1)</summary> * CHANGELOG.md </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary> * src/commands/docs.rs * src/gene_model/mod.rs * src/commands/rna.rs </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
CodeRabbit (PR #40): the auto_detects_forward_strandedness test builds exactly 10 FR pairs, so r1_tx_strand_reads is deterministically 10. Use assert_eq!(.,10) instead of >= 10 so the test catches template double-counting.
…GENCODE v50 New plots (always emitted, FG-branded), both reusing the single-pass aggregates already collected so there is no extra read pass: - per-biotype assigned-read bar chart (log y-axis, scientific ticks, prettified labels, per-bar percentage annotation) - per-gene RPK expression distribution overlaying all genes vs the protein-coding subset (log-x, semi-transparent fills, legend, and a zero-read-gene annotation) Insert size: when the MC tag is absent, estimate FR pairs from a spec-compliant TLEN at the forward read instead of skipping them. The mate's 5' positions -- hence the 5'-to-5' transcript-space size -- are exact from TLEN; the unknown mate splicing is gated by requiring both mate ends to fall in an exon and a >=95% collapsed-exonic footprint. Validated against the MC path on real data (9.7M FR pairs; mean/median agree within ~0.5%). MC stays the exact, recommended input. Drop the .rna-coverage.txt table -- Picard emits it only because its plotting engine is external; riker plots internally, so the per-position table added little. The .rna-coverage.pdf is kept and RnaCoverageEntry is now an internal struct. Soften the "MC required" language to "MC or spec-compliant TLEN" across the tool docs, ERRATA, and README. Add docs/rna-integrity.md: the TIN validation (degradation ladder + depth downsampling sweep) redone on GENCODE v50 with a rustar-built index, for a single-tool indexing/alignment story.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
docs/rna-integrity.md (1)
26-29: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd language specifier to fenced code block.
The formula block at Line 26 lacks a language tag, triggering
markdownlint-cli2MD040. Addtextormathto silence the lint and improve renderer consistency.- ``` + ```text TIN = 100 · e^H / n, H = −Σ pᵢ·ln pᵢ, pᵢ = cᵢ / Σⱼ cⱼ = 100 · exp( −D_KL(coverage ‖ uniform) )<details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@docs/rna-integrity.mdaround lines 26 - 29, The fenced formula block in the
RNA integrity doc is missing a language specifier, which triggers markdownlint
MD040. Update the fenced block around the TIN formula in the markdown so it uses
a language tag such as text or math, keeping the existing content unchanged to
satisfy the linter and improve rendering consistency.</details> <!-- cr-comment:v1:a75e9f760e29d064033f042a --> _Source: Linters/SAST tools_ </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.Nitpick comments:
In@docs/rna-integrity.md:
- Around line 26-29: The fenced formula block in the RNA integrity doc is
missing a language specifier, which triggers markdownlint MD040. Update the
fenced block around the TIN formula in the markdown so it uses a language tag
such as text or math, keeping the existing content unchanged to satisfy the
linter and improve rendering consistency.</details> --- <details> <summary>ℹ️ Review info</summary> <details> <summary>⚙️ Run configuration</summary> **Configuration used**: Organization UI **Review profile**: CHILL **Plan**: Pro **Run ID**: `08c3800f-4c63-4fe7-a1b1-ef843981e7dd` </details> <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between a22cec3e6a11eb627511a5a63108b693efa84912 and a3ff63523bb514924f398be200c7d27a9f61d8a1. </details> <details> <summary>⛔ Files ignored due to path filters (2)</summary> * `docs/images/rna-integrity-rin-vs-tin.png` is excluded by `!**/*.png` * `docs/images/rna-integrity-tin-vs-depth.png` is excluded by `!**/*.png` </details> <details> <summary>📒 Files selected for processing (7)</summary> * `ERRATA.md` * `README.md` * `docs/rna-integrity.md` * `src/commands/docs.rs` * `src/commands/rna.rs` * `src/plotting.rs` * `tests/test_rna.rs` </details> <details> <summary>✅ Files skipped from review due to trivial changes (2)</summary> * README.md * ERRATA.md </details> <details> <summary>🚧 Files skipped from review as they are similar to previous changes (3)</summary> * src/commands/docs.rs * src/plotting.rs * tests/test_rna.rs </details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
Summary
Adds the
rnacommand: a single-pass, single-gene-model-load RNA-seq QC tool. It started as a port of PicardCollectRnaSeqMetrics(base composition, strand specificity, transcript 5'/3' coverage bias and CV) + fgbioEstimateRnaSeqInsertSize(transcript-space insert size, introns collapsed), and has grown into a superset that also covers the RSeQC / RNA-SeQC / Qualimap families: read accounting and genomic origin, gene detection, splice-junction annotation, transcript integrity (TIN), and per-biotype read counts. The collector is wired intomulti.Outputs (house conventions — lowercase headers,
frac_, no metadata lines):out.rna-metrics.txt(one wide row),out.rna-biotype.txt(+.pdf),out.rna-gene-expression.pdf,out.rna-coverage.pdf,out.rna-insert-size.txt/-histogram.txt(+.pdf).Metrics (one
rna-metrics.txtrow)assigned_readsand fractions.genes_detected(genes with ≥--genes-detected-min-reads).frac_known_juncs.frac_mrna_bases,frac_usable_bases.Highlights
chradd/strip,MT↔chrM, RefSeqNC_*→ common name). Gzip handled transparently.--strand auto, default) from observed R1/R2 transcription-strand counts;unstranded/forward/reverseforce a model.--ribosomal-intervalsfile (BED or IntervalList).MC(mate CIGAR) tag when present (exact, any orientation); whenMCis absent it estimates FR pairs from a spec-compliantTLENat the forward read rather than dropping them — see below.isizehelper so the two match).TIN scale & validation
riker's TIN is the largest deliberate numeric difference from the reference tools: it scores one well-covered transcript per gene, so its median runs ~14–16 points higher than RSeQC/RustQC, which score every isoform. The two report the same shape on a different scale. This is validated in
docs/rna-integrity.mdon a controlled RNA-degradation ladder (Sigurgeirsson et al. 2014, RIN 10→2) plus a depth-downsampling sweep: both implementations track degradation in lockstep, and riker is markedly more depth-robust (≈0.5% vs ≈4% TIN change across a 5× depth reduction). Indexing and alignment were both done with rustar-aligner against GENCODE v50, for a single-tool story.Intentional differences vs. Picard/fgbio (documented in ERRATA.md)
Deliberate corrections/choices, not regressions; on identical inputs read/base accounting matches Picard to the base and composition agrees within ~0.5%:
2×end_bias_bases) excluded from the CV/bias set; every aligned base counted (Picard drops the last base of each block); non-overlapping normalized-position bins.Math.max); fragment end from theMCtag, falling back toTLEN, never read length.MCthe mate's exact blocks are used (any orientation). Without it, FR pairs are kept by deriving the mate's right end fromstart + |TLEN| - 1at the forward read: the 5'-to-5' transcript-space size is then exact, and the unknown mate splicing is gated by requiring both mate ends in an exon plus a ≥95% collapsed-exonic footprint. Validated against the MC path on a real BAM (9.7M FR pairs; mean/median agree within ~0.5%). A BAM with neitherMCnor a validTLENyields no/incorrect insert sizes —samtools fixmate -mremains the way to guarantee exactness.@HD SO:coordinate); validated at startup, enabling the allocation-free locus sweep.Dropped Picard's per-position
*.rna-coverage.txttable — Picard emits it only because its plotting engine is external; riker plots internally, so only the.pdfis kept.Performance & memory
Profiled and tuned after the correctness-complete implementation (composition + coverage verified byte-identical at every step on a 75.2M-read PE BAM, GENCODE v29):
The metrics-expansion (read origin, junctions, biotype, TIN) added ~1.5% runtime. riker stays ~10× faster single-threaded than RustQC and far lighter. BGZF decompression (~11%) is the remaining single-threaded floor.
Testing & review
fmt/clippy --all-targets(pedantic) /nextestall clean.CollectRnaSeqMetrics(read accounting exact; composition ~0.5%; coverage/bias ~1.3%, in the corrected direction), cross-checked vs. RustQC (junctions, TIN, biotype), and the TIN dose-response validated end-to-end on real degradation-ladder data.Out of scope (tracked separately)
A curated GRCh38 rDNA BED / rRNA interval list with methodology, plus per-biotype/expression refinements — to be their own changes.