Great job @Psy-Fer!
Here’s my review of the code base and what I think we could do:
-
Code reuse prevents drift/partial fixes and improves maintainability: some ifexpressions have almost the same code on both sides, some blocks are near-verbatim duplicates. We can use loops or helper functions to centralize some of that code. E.g. I added this:
|
impl CigarOpExt for cigar::Op { |
|
fn add_len(&self, len: usize) -> Self { |
|
cigar::Op::new(self.kind(), self.len() + len) |
|
} |
|
} |
and used it in a little helper I added for a pattern I saw in finalize_transcript:
|
fn append_match(ops: &mut Vec<Op>, len: usize) { |
|
if let Some(op) = ops.last_mut() |
|
&& op.kind() == Kind::Match |
|
{ |
|
*op = op.add_len(len); |
|
} else { |
|
ops.push(Op::new(Kind::Match, len)); |
|
} |
|
} |
-
Clear use of numeric types prevents overflow bugs like
|
## Bug #1: Integer Overflow from Unsafe Casts |
|
|
|
### Problem |
|
- **Location:** `src/align/stitch.rs`, lines 423-424 |
|
- **Symptom:** CIGAR strings with massive values near 2³² (e.g., `10M4294953882D12M`) |
|
- **Root Cause:** Negative gap values cast directly to `u32` without validation |
We currently use u32, i32, u64, i64, and usize with no clear guidence of why we use what where. Would be amazing if we could eventually enable all the clippy lints for unsafe casts after cleaning all that up, e.g. changing if some_i32 > 0 { thing = some_i32 as u32 } to thing = u32::try_from(some_i32)... or so
-
Rust has a lot of ways to abstract things (e.g. the extension traits from the first point), and I thing we should use them, e.g. code like this encapsulating a single task should be part of a function or trait instead of just appearing in the business logic:
|
// Deduplicate transcripts with identical genomic coordinates AND CIGAR. |
|
transcripts.sort_by(|a, b| { |
|
(a.chr_idx, a.genome_start, a.genome_end, a.is_reverse) |
|
.cmp(&(b.chr_idx, b.genome_start, b.genome_end, b.is_reverse)) |
|
.then_with(|| cmp_cigar(&a.cigar, &b.cigar)) |
|
.then_with(|| b.score.cmp(&a.score)) |
|
}); |
|
transcripts.dedup_by(|a, b| { |
|
a.chr_idx == b.chr_idx |
|
&& a.genome_start == b.genome_start |
|
&& a.genome_end == b.genome_end |
|
&& a.is_reverse == b.is_reverse |
|
&& a.cigar == b.cigar |
|
}); |
Maybe using itertools could help in some cases? IDK if e.g. using its hash-based uniq here would be more efficient or less than our very ad-hoc looking sort-then-dedup.
-
I personally really like splitting up huge multi-step functions like the above-mentioned finalize_transcript into individual steps that are clear in what they receive and what they emit. Every function that has // 1. do first step: … could instead call another function for this step. custom structs can help holding common context for this.
Great job @Psy-Fer!
Here’s my review of the code base and what I think we could do:
Code reuse prevents drift/partial fixes and improves maintainability: some
ifexpressions have almost the same code on both sides, some blocks are near-verbatim duplicates. We can use loops or helper functions to centralize some of that code. E.g. I added this:rustar-aligner/src/align/transcript.rs
Lines 61 to 65 in 199290d
finalize_transcript:rustar-aligner/src/align/stitch.rs
Lines 1756 to 1764 in 199290d
Clear use of numeric types prevents overflow bugs like
rustar-aligner/docs-old/dev/BUGFIX_2026-02-09.md
Lines 21 to 26 in 199290d
We currently use
u32,i32,u64,i64, andusizewith no clear guidence of why we use what where. Would be amazing if we could eventually enable all the clippy lints for unsafe casts after cleaning all that up, e.g. changingif some_i32 > 0 { thing = some_i32 as u32 }tothing = u32::try_from(some_i32)...or soRust has a lot of ways to abstract things (e.g. the extension traits from the first point), and I thing we should use them, e.g. code like this encapsulating a single task should be part of a function or trait instead of just appearing in the business logic:
rustar-aligner/src/align/read_align.rs
Lines 353 to 366 in 199290d
Maybe using
itertoolscould help in some cases? IDK if e.g. using its hash-baseduniqhere would be more efficient or less than our very ad-hoc lookingsort-then-dedup.I personally really like splitting up huge multi-step functions like the above-mentioned
finalize_transcriptinto individual steps that are clear in what they receive and what they emit. Every function that has// 1. do first step: …could instead call another function for this step. custom structs can help holding common context for this.