feature: support IUPAC bases - attempt 2 #54

Merged Mar 14, 2025 (5 commits)

@nh13 (Member) commented Feb 10, 2025

Alternative to #53

I found that bio-seq was significantly slower than bitenc from rust-bio when a lot of barcodes didn't match. Rather than pulling in all of rust-bio, I copied its bitenc module, then added a hamming method (slightly faster than our current method) and associated tests.

@nh13 requested a review from tfenne as a code owner February 10, 2025 23:25
@nh13 mentioned this pull request Feb 10, 2025
@@ -0,0 +1,571 @@
// Copyright 2014-2016 Johannes Köster.
Member Author

Pulled directly from rust-bio so we don't have to have it as a full dependency.

self.len == 0
}

/// Calculate the Hamming distance between this and another bitencoded sequence.
Member Author

This is new!

}

#[test]
fn test_hamming() {
Member Author

This test is new!

@jdidion left a comment

I did not review bitenc.rs. Please let me know if you want me to come back and look at that as well.

@nh13 force-pushed the feature/iupac-support-bio branch 2 times, most recently from 4dc2274 to 877f21c March 8, 2025 00:25
@nh13 requested a review from jdidion March 8, 2025 00:26
src/lib/mod.rs Outdated
pub mod samples;

use crate::bitenc::BitEnc;
use lazy_static::lazy_static;

It is no longer necessary to use lazy_static now that std::sync::LazyLock is available. If this were legacy code I'd say just add a TODO, but since it's new I'd do it in this PR to avoid introducing an unnecessary dependency.
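
For readers unfamiliar with the suggestion, here is a minimal sketch of a lazily initialized static built with std::sync::LazyLock (stable since Rust 1.80) instead of the lazy_static macro. The table name IUPAC_MASKS, the helper iupac_mask, and the bit assignment (A=1, C=2, G=4, T=8) are assumptions made for illustration and are not taken from the PR.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Hypothetical lookup table: maps an IUPAC code to the set of bases it
// allows, packed as bits (assumed layout: A=1, C=2, G=4, T=8).
// The closure runs once, on first access, and the result is cached.
static IUPAC_MASKS: LazyLock<HashMap<u8, u8>> = LazyLock::new(|| {
    HashMap::from([
        (b'A', 0b0001),
        (b'C', 0b0010),
        (b'G', 0b0100),
        (b'T', 0b1000),
        (b'R', 0b0101), // A or G
        (b'N', 0b1111), // any base
    ])
});

/// Hypothetical helper: returns the bitmask for a code, if it is in the table.
fn iupac_mask(code: u8) -> Option<u8> {
    IUPAC_MASKS.get(&code).copied()
}
```

The static is used like any other value (for example `iupac_mask(b'R')`), and no extra crate is needed.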

@@ -1,14 +1,97 @@
pub mod barcode_matching;
pub mod bitenc;

Is there a reason this needs to be exposed externally?

/// N, it will not match anything but an N, and if the other base is an R, it
/// will match R, V, D, and N, since the latter IUPAC codes allow both A and G.
pub fn hamming(&self, other: &BitEnc, max_mismatches: u32) -> u32 {
assert!(self.len == other.len, "Both bitenc sequences must have the same length");

If this module were internal only I wouldn't care as much about this, but you're exporting it, so panicking here might lead to unpleasant surprises. At a minimum, you should document this potential panic. However, I would strongly recommend returning a Result with a custom error type instead.
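
A rough sketch of what the Result-based alternative could look like, covering both the length and the width check. Everything here is hypothetical: the HammingError type, its variants, and the simplified Encoded struct (one symbol per byte) stand in for the real BitEnc and are not the code that was merged.

```rust
use std::fmt;

/// Hypothetical error type; names are illustrative, not from the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum HammingError {
    LengthMismatch { left: usize, right: usize },
    WidthMismatch { left: usize, right: usize },
}

impl fmt::Display for HammingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            HammingError::LengthMismatch { left, right } => {
                write!(f, "sequence lengths differ: {left} vs {right}")
            }
            HammingError::WidthMismatch { left, right } => {
                write!(f, "symbol widths differ: {left} vs {right}")
            }
        }
    }
}

impl std::error::Error for HammingError {}

/// Simplified stand-in for a bit-encoded sequence: one symbol per byte.
pub struct Encoded {
    pub width: usize,
    pub symbols: Vec<u8>,
}

impl Encoded {
    /// Counts positions where `self` has a bit that `other` lacks.
    /// Returns an error instead of panicking on incompatible inputs.
    pub fn hamming(&self, other: &Encoded) -> Result<u32, HammingError> {
        if self.symbols.len() != other.symbols.len() {
            return Err(HammingError::LengthMismatch {
                left: self.symbols.len(),
                right: other.symbols.len(),
            });
        }
        if self.width != other.width {
            return Err(HammingError::WidthMismatch {
                left: self.width,
                right: other.width,
            });
        }
        let mismatches = self
            .symbols
            .iter()
            .zip(&other.symbols)
            .filter(|(a, b)| (**a & **b) != **a)
            .count();
        Ok(mismatches as u32)
    }
}
```

Callers can then decide whether a mismatch in length or width is fatal, rather than having the library abort for them.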

/// will match R, V, D, and N, since the latter IUPAC codes allow both A and G.
pub fn hamming(&self, other: &BitEnc, max_mismatches: u32) -> u32 {
assert!(self.len == other.len, "Both bitenc sequences must have the same length");
assert!(self.width == other.width, "Both bitenc sequences must have the same width");

See previous comment.

let values_per_block = self.usable_bits_per_block / self.width;
for block_index in 0..self.nr_blocks() {
let intersection = self.storage[block_index] & other.storage[block_index];
if intersection != self.storage[block_index] {

This seems overly complicated. I think what you want is self.storage[block_index] & !other.storage[block_index].
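
To make the two forms concrete, the sketch below compares the check from the diff (intersection equals self) with the suggested and-not form, and shows how the and-not result can be used to count mismatching symbols in a block. The one-bit-per-base encoding with four bits per symbol (A=1, C=2, G=4, T=8) and all function names are assumptions for illustration; the actual width and layout come from BitEnc.

```rust
// Assumed encoding for illustration: one bit per base, four bits per symbol.
pub const A: u32 = 0b0001;
pub const C: u32 = 0b0010;
pub const G: u32 = 0b0100;
pub const T: u32 = 0b1000;
pub const R: u32 = A | G; // A or G
pub const V: u32 = A | C | G; // not T
pub const D: u32 = A | G | T; // not C
pub const N: u32 = A | C | G | T; // any base

/// Form used in the diff: every bit set in `lhs` is also set in `rhs`.
pub fn subset_via_intersection(lhs: u32, rhs: u32) -> bool {
    (lhs & rhs) == lhs
}

/// Form suggested in review: no bit of `lhs` falls outside `rhs`.
pub fn subset_via_and_not(lhs: u32, rhs: u32) -> bool {
    (lhs & !rhs) == 0
}

/// Counts the 4-bit symbols in one block where `lhs` has a base that
/// `rhs` does not allow. The and-not result marks exactly those bits.
pub fn mismatches_in_block(lhs: u32, rhs: u32) -> u32 {
    let mut leftover = lhs & !rhs;
    let mut count = 0;
    while leftover != 0 {
        if leftover & 0xF != 0 {
            count += 1;
        }
        leftover >>= 4;
    }
    count
}
```

Under this encoding an R on the left is a subset of R, V, D, and N on the right, which lines up with the doc comment quoted earlier in the thread. The two predicates agree for every input; the and-not form has the added benefit that its result can be reused directly for counting.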

assert!(self.width == other.width, "Both bitenc sequences must have the same width");
let mut count: u32 = 0;
let values_per_block = self.usable_bits_per_block / self.width;
for block_index in 0..self.nr_blocks() {

Nit: this could be written using (0..self.nr_blocks()).into_iter().scan(..) to avoid the need for let mut.
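
One possible shape for the scan-based version is sketched below. It assumes that max_mismatches is used to stop early once the running total exceeds the limit, which the excerpt does not show, and it works on plain u32 slices with an assumed four bits per symbol rather than on BitEnc. The names bounded_mismatches and mismatches_in_block are hypothetical. As a side note, a range such as 0..n already implements Iterator, so the extra .into_iter() call is not required.

```rust
/// Counts 4-bit symbols in one block where `lhs` has a bit `rhs` lacks.
/// (Assumed symbol width of four bits, for illustration only.)
fn mismatches_in_block(lhs: u32, rhs: u32) -> u32 {
    let mut leftover = lhs & !rhs;
    let mut count = 0;
    while leftover != 0 {
        if leftover & 0xF != 0 {
            count += 1;
        }
        leftover >>= 4;
    }
    count
}

/// Sums mismatches block by block without a `let mut` counter in the
/// function body. `scan` carries the running total as its state and
/// ends the iteration once a previous block pushed it past the limit.
fn bounded_mismatches(lhs: &[u32], rhs: &[u32], max_mismatches: u32) -> u32 {
    lhs.iter()
        .zip(rhs)
        .scan(0u32, |running, (l, r)| {
            if *running > max_mismatches {
                return None; // early exit: limit already exceeded
            }
            *running += mismatches_in_block(*l, *r);
            Some(*running)
        })
        .last()
        .unwrap_or(0) // no blocks means no mismatches
}
```

The state still mutates inside the closure, so this is mostly a stylistic choice; whether it reads better than a plain loop is a judgment call.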

@nh13 requested a review from jdidion March 13, 2025 17:50
@nh13 force-pushed the feature/iupac-support-bio branch from baefa86 to eb401f2 March 13, 2025 17:52
@nh13 force-pushed the feature/iupac-support-bio branch from eb401f2 to 4c0de2a March 13, 2025 17:57
@nh13 merged commit c0de006 into main Mar 14, 2025
6 checks passed
@nh13 deleted the feature/iupac-support-bio branch March 14, 2025 03:33