Skip to content

Code organization and style #63

@flying-sheep

Description

@flying-sheep

Great job @Psy-Fer!

Here’s my review of the code base and what I think we could do:

  • Code reuse prevents drift/partial fixes and improves maintainability: some ifexpressions have almost the same code on both sides, some blocks are near-verbatim duplicates. We can use loops or helper functions to centralize some of that code. E.g. I added this:

    impl CigarOpExt for cigar::Op {
    fn add_len(&self, len: usize) -> Self {
    cigar::Op::new(self.kind(), self.len() + len)
    }
    }
    and used it in a little helper I added for a pattern I saw in finalize_transcript:
    fn append_match(ops: &mut Vec<Op>, len: usize) {
    if let Some(op) = ops.last_mut()
    && op.kind() == Kind::Match
    {
    *op = op.add_len(len);
    } else {
    ops.push(Op::new(Kind::Match, len));
    }
    }

  • Clear use of numeric types prevents overflow bugs like

    ## Bug #1: Integer Overflow from Unsafe Casts
    ### Problem
    - **Location:** `src/align/stitch.rs`, lines 423-424
    - **Symptom:** CIGAR strings with massive values near 2³² (e.g., `10M4294953882D12M`)
    - **Root Cause:** Negative gap values cast directly to `u32` without validation

    We currently use u32, i32, u64, i64, and usize with no clear guidence of why we use what where. Would be amazing if we could eventually enable all the clippy lints for unsafe casts after cleaning all that up, e.g. changing if some_i32 > 0 { thing = some_i32 as u32 } to thing = u32::try_from(some_i32)... or so

  • Rust has a lot of ways to abstract things (e.g. the extension traits from the first point), and I thing we should use them, e.g. code like this encapsulating a single task should be part of a function or trait instead of just appearing in the business logic:

    // Deduplicate transcripts with identical genomic coordinates AND CIGAR.
    transcripts.sort_by(|a, b| {
    (a.chr_idx, a.genome_start, a.genome_end, a.is_reverse)
    .cmp(&(b.chr_idx, b.genome_start, b.genome_end, b.is_reverse))
    .then_with(|| cmp_cigar(&a.cigar, &b.cigar))
    .then_with(|| b.score.cmp(&a.score))
    });
    transcripts.dedup_by(|a, b| {
    a.chr_idx == b.chr_idx
    && a.genome_start == b.genome_start
    && a.genome_end == b.genome_end
    && a.is_reverse == b.is_reverse
    && a.cigar == b.cigar
    });

    Maybe using itertools could help in some cases? IDK if e.g. using its hash-based uniq here would be more efficient or less than our very ad-hoc looking sort-then-dedup.

  • I personally really like splitting up huge multi-step functions like the above-mentioned finalize_transcript into individual steps that are clear in what they receive and what they emit. Every function that has // 1. do first step: … could instead call another function for this step. custom structs can help holding common context for this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions