-
Notifications
You must be signed in to change notification settings - Fork 2
Actually use PackedSeq in FlatGFA #241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
johnpalsberg
wants to merge
62
commits into
main
Choose a base branch
from
john-performance
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
62 commits
Select commit
Hold shift + click to select a range
f8dadda
Add zerocopy interfacing
johnpalsberg fe3972b
Attempt to resolve merge conflicts
johnpalsberg 137ed25
Add a rust and a bash test case for writing and reading compressed da…
johnpalsberg f394aad
Make progress on adding the command line interface
johnpalsberg 06fef05
Clean up code for push
johnpalsberg 2ffc343
Make rand a dev-dependency
johnpalsberg 1f7a76e
Add a subslice method to PackedSeqView
johnpalsberg ef0b96f
Add SeqSpan to flatgfa.rs
johnpalsberg 293a296
Add a from_pool method to PackedSeqView
johnpalsberg 247847e
Continue progressing with integration
johnpalsberg e8a2b41
Add a turnt test directory to flatgfa
johnpalsberg 1a876fc
Complete the compressed Sequence implementation
johnpalsberg 28e6ab2
Add turnt test case for the cli
johnpalsberg 7f91260
Compress sequences when adding them to a GFA file
johnpalsberg f954e34
Fix remaining compile errors across codebase and make all test cases …
johnpalsberg 502d5ae
Address Clippy warnings
johnpalsberg d3f0266
Add changes from compress-cli-refine
johnpalsberg 4d2cef1
Merge remote-tracking branch 'origin/main' into john-zerocopy
johnpalsberg d8cd462
Clean up after merge
johnpalsberg 7341185
Update Turnt tests
johnpalsberg 9dc3aa5
progress on debugging
johnpalsberg 5931f1a
Convert ASCII to internal nucleotide encoding when adding segments
johnpalsberg de3acc4
Fix Clippy warnings
johnpalsberg e7dbb94
Merge branch 'main' into john-zerocopy
johnpalsberg b7b7822
Add a 'N' variant to the Nucleotide enum
johnpalsberg d46499e
Fix python CI failures attempt #1
johnpalsberg 15c11ef
Fix index issue with slice() function
johnpalsberg 185a1fb
More indexing fixes
johnpalsberg c74d916
Fix indexing errors attempt #2
johnpalsberg c0ca7fb
Fix indexing errors attempt #3
johnpalsberg e03ab24
Fix Python dependencies
johnpalsberg 6d0994c
Change SeqSpan to be half-open (exclude end byte)
johnpalsberg 8ca33dd
Fix the from_range() method in SeqSpan
johnpalsberg 466f5a6
Fix turnt test in flatgfa
johnpalsberg 2cbc7a3
Fix Python ruff formatting
johnpalsberg 0cb4e6b
Fix the indices in chop
johnpalsberg 1e8f0dc
Use slow_odgi instead of uv run slow_odgi
johnpalsberg b7a3e3d
Clean up code for review
johnpalsberg 4083d99
Add performance and file size tests for compression
johnpalsberg 5b27e63
Use the hyperfine() function, and implement a couple more tests
johnpalsberg 2be9d63
sync
johnpalsberg edff087
Test-only commit
johnpalsberg c9febd5
Test commit #2
johnpalsberg 1da2970
Change SeqSpan index types from usize to u32
johnpalsberg 13324af
Add logical indexing for SeqSpan
johnpalsberg 764b90e
Add base + offset indexing
johnpalsberg 59d7b1b
Add new cli command for printing file size statistics
johnpalsberg 96131f6
Enable CI tests for all pushes to john-performance
johnpalsberg cc0ed9c
Fix an import bug
johnpalsberg 3c7c82c
Byte fix (in progress)
johnpalsberg 38cb1d9
Checkpoint for base + offset version
johnpalsberg 7e22822
Fix bugs for base + end version
johnpalsberg b853ffa
Merge branch 'main' into john-performance
johnpalsberg b72e8af
Fix additional inconsistencies with main
johnpalsberg 576837c
additional clippy fixes
johnpalsberg b9b8867
delete unnecessary files
johnpalsberg d40db39
add base + offset
johnpalsberg 5aa1abc
Fix Clippy warnings
johnpalsberg eeb964a
Address most PR comments (except extra-copy)
johnpalsberg a3a2b44
Resolve more PR change requests
johnpalsberg 7e94285
Make small clean up
johnpalsberg ac0e40c
Fix a bug in the compression
johnpalsberg File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| name: performance | ||
|
|
||
|
|
||
| on: | ||
| push: | ||
| branches: | ||
| - john-performance | ||
|
|
||
| jobs: | ||
| test-py: | ||
| name: test Python tools | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.12" | ||
|
|
||
| # Set up and use uv. | ||
| - uses: actions/cache@v4 | ||
| id: cache-uv | ||
| with: | ||
| path: ~/.cache/uv | ||
| key: ${{ runner.os }}-python-${{ matrix.python-version }}-uv | ||
| - name: uv sync and activate | ||
| run: | | ||
| curl -LsSf https://astral.sh/uv/install.sh | sh | ||
| uv sync | ||
| echo "VIRTUAL_ENV=.venv" >> $GITHUB_ENV | ||
| echo "$PWD/.venv/bin" >> $GITHUB_PATH | ||
|
|
||
| # Set up for tests. | ||
| - name: Problem matcher | ||
| run: echo '::add-matcher::.github/tap-matcher.json' | ||
| - name: Fetch test data | ||
| run: make fetch SMALL=1 | ||
|
|
||
| - name: Pull odgi container | ||
| run: | | ||
| docker pull quay.io/biocontainers/odgi:0.8.6--py310hdf79db3_1 | ||
| docker tag quay.io/biocontainers/odgi:0.8.6--py310hdf79db3_1 odgi | ||
| - name: Install odgi alias | ||
| run: | | ||
| mkdir -p $HOME/.local/bin | ||
| cp .github/odgi.sh $HOME/.local/bin/odgi | ||
| chmod a+x $HOME/.local/bin/odgi | ||
|
|
||
| # Test slow_odgi. | ||
| - name: Set up for slow_odgi tests | ||
| run: make -C slow_odgi setup oracles SMALL=1 | ||
| - name: Test slow_odgi | ||
| run: make -C slow_odgi test SMALL=1 | ||
|
|
||
| test-flatgfa: | ||
| name: test FlatGFA | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - run: rustup toolchain install stable --no-self-update | ||
|
|
||
| # Install slow-odgi. | ||
| - uses: actions/cache@v4 | ||
| id: cache-uv | ||
| with: | ||
| path: ~/.cache/uv | ||
| key: ${{ runner.os }}-python-${{ matrix.python-version }}-uv | ||
| - name: uv sync and activate | ||
| run: | | ||
| curl -LsSf https://astral.sh/uv/install.sh | sh | ||
| uv sync | ||
| echo "VIRTUAL_ENV=.venv" >> $GITHUB_ENV | ||
| echo "$PWD/.venv/bin" >> $GITHUB_PATH | ||
|
|
||
| # Install odgi | ||
| - name: Pull odgi container | ||
| run: | | ||
| docker pull quay.io/biocontainers/odgi:0.8.6--py310hdf79db3_1 | ||
| docker tag quay.io/biocontainers/odgi:0.8.6--py310hdf79db3_1 odgi | ||
| - name: Install odgi alias | ||
| run: | | ||
| mkdir -p $HOME/.local/bin | ||
| cp .github/odgi.sh $HOME/.local/bin/odgi | ||
| chmod a+x $HOME/.local/bin/odgi | ||
|
|
||
| # Install Turnt. | ||
| - uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.12" | ||
| - name: Install Turnt | ||
| run: pip install turnt | ||
| - name: Problem matcher | ||
| run: echo '::add-matcher::.github/tap-matcher.json' | ||
|
|
||
| # We need the test data. | ||
| - name: Fetch test data | ||
| run: make fetch SMALL=1 | ||
|
|
||
| # Build and test. | ||
| - run: cargo build | ||
| working-directory: ./flatgfa | ||
| - run: cargo test | ||
| working-directory: ./flatgfa | ||
| - run: make test-flatgfa |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like diff noise? |
Empty file.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -3,7 +3,10 @@ | |
| use std::ops::Range; | ||
| use std::str::FromStr; | ||
|
|
||
| use crate::pool::{self, Id, Pool, Span, Store}; | ||
| use crate::{ | ||
| packedseq::{PackedSeqView, SeqSpan}, | ||
| pool::{self, Id, Pool, Span, Store}, | ||
| }; | ||
| use bstr::BStr; | ||
| use num_enum::{IntoPrimitive, TryFromPrimitive}; | ||
| use zerocopy::{FromBytes, Immutable, IntoBytes}; | ||
|
|
@@ -75,7 +78,7 @@ pub struct Segment { | |
| pub name: usize, | ||
|
|
||
| /// The base-pair sequence for the segment. This is a range in the `seq_data` pool. | ||
| pub seq: Span<u8>, | ||
| pub seq: SeqSpan, | ||
|
|
||
| /// Segments can have optional fields. This is a range in the `optional_data` pool. | ||
| pub optional: Span<u8>, | ||
|
|
@@ -274,7 +277,7 @@ pub enum LineKind { | |
| /// This is mostly a `&[u8]`, but it also has a flag to indicate that we're | ||
| /// representing the reverse-complement of the underlying sequence data. | ||
| pub struct Sequence<'a> { | ||
| data: &'a [u8], | ||
| data: PackedSeqView<'a>, | ||
| revcmp: bool, | ||
| } | ||
|
|
||
|
|
@@ -284,7 +287,7 @@ impl<'a> Sequence<'a> { | |
| /// `data` should be the "forward" version of the sequence. Use `ori` to | ||
| /// indicate whether this `Sequence` represents the forward or backward | ||
| /// (reverse complement) of that data. | ||
| pub fn new(data: &'a [u8], ori: Orientation) -> Self { | ||
| pub fn new(data: PackedSeqView<'a>, ori: Orientation) -> Self { | ||
| Self { | ||
| data, | ||
| revcmp: ori == Orientation::Backward, | ||
|
|
@@ -294,9 +297,9 @@ impl<'a> Sequence<'a> { | |
| /// Look up a single base pair in the sequence. | ||
| pub fn index(&self, idx: usize) -> u8 { | ||
| if self.revcmp { | ||
| nucleotide_complement(self.data[self.data.len() - idx - 1]) | ||
| self.data.get(self.data.len() - idx - 1).complement().into() | ||
| } else { | ||
| self.data[idx] | ||
| self.data.get(idx).into() | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -305,9 +308,10 @@ impl<'a> Sequence<'a> { | |
| let data = if self.revcmp { | ||
| // The range starts at the end of the buffer: | ||
| // [-----<end<******<start<------] | ||
| &self.data[(self.data.len() - range.end)..(self.data.len() - range.start)] | ||
| self.data | ||
| .range_slice((self.data.len() - range.end)..(self.data.len() - range.start)) | ||
| } else { | ||
| &self.data[range] | ||
| self.data.range_slice(range) | ||
| }; | ||
| Self { | ||
| data, | ||
|
|
@@ -321,29 +325,14 @@ impl<'a> Sequence<'a> { | |
| self.data | ||
| .iter() | ||
| .rev() | ||
| .map(|&c| nucleotide_complement(c)) | ||
| .map(|c| c.complement().into()) | ||
| .collect() | ||
| } else { | ||
| self.data.to_vec() | ||
| self.data.iter().map(|c| c.into()).collect() | ||
| } | ||
| } | ||
| } | ||
|
|
||
| /// Given an ASCII character for a nucleotide, get its complement. | ||
| fn nucleotide_complement(c: u8) -> u8 { | ||
| match c { | ||
| b'A' => b'T', | ||
| b'T' => b'A', | ||
| b'C' => b'G', | ||
| b'G' => b'C', | ||
| b'a' => b't', | ||
| b't' => b'a', | ||
| b'c' => b'g', | ||
| b'g' => b'c', | ||
| x => x, | ||
| } | ||
| } | ||
|
|
||
| impl std::fmt::Display for Sequence<'_> { | ||
| fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
| if self.revcmp { | ||
|
|
@@ -354,16 +343,16 @@ impl std::fmt::Display for Sequence<'_> { | |
| let bytes = self.to_vec(); | ||
| write!(f, "{}", BStr::new(&bytes))?; | ||
| } else { | ||
| write!(f, "{}", BStr::new(self.data))?; | ||
| write!(f, "{}", self.data)?; | ||
| } | ||
| Ok(()) | ||
| } | ||
| } | ||
|
|
||
| impl<'a> FlatGFA<'a> { | ||
| /// Get the base-pair sequence for a segment. | ||
| pub fn get_seq(&self, seg: &Segment) -> &BStr { | ||
| self.seq_data[seg.seq].as_ref() | ||
| pub fn get_seq(&self, seg: &Segment) -> PackedSeqView<'_> { | ||
| PackedSeqView::from_pool(self.seq_data, seg.seq) | ||
| } | ||
|
|
||
| /// Get the sequence that a *handle* refers to. | ||
|
|
@@ -372,7 +361,7 @@ impl<'a> FlatGFA<'a> { | |
| /// gets the sequence in the orientation specified by the handle. | ||
| pub fn get_seq_oriented(&self, handle: Handle) -> Sequence<'_> { | ||
| let seg = self.get_handle_seg(handle); | ||
| let seq_data = self.seq_data[seg.seq].as_ref(); | ||
| let seq_data = PackedSeqView::from_pool(self.seq_data, seg.seq); | ||
| Sequence::new(seq_data, handle.orient()) | ||
| } | ||
|
|
||
|
|
@@ -446,11 +435,62 @@ impl<'a, P: StoreFamily<'a>> GFAStore<'a, P> { | |
| self.header.add_slice(version); | ||
| } | ||
|
|
||
| /// Add a new segment to the GFA file. | ||
| pub fn add_seg(&mut self, name: usize, seq: &[u8], optional: &[u8]) -> Id<Segment> { | ||
| /// Add a new segment to the GFA file, compressing the data in `seq` | ||
| pub fn compress_and_add_seg( | ||
| &mut self, | ||
| name: usize, | ||
| seq: &[u8], | ||
| optional: &[u8], | ||
| ) -> Id<Segment> { | ||
| self.seq_data.reserve(seq.len()); | ||
| let mut high_nibble_end = true; | ||
| let mut combined_item = 0; | ||
| let start_id = self.seq_data.next_id(); | ||
| for i in 0..seq.len() { | ||
| let item = seq[i]; | ||
| let converted: u8 = match item { | ||
| 65 => 0, | ||
| 67 => 1, | ||
| 84 => 2, | ||
| 71 => 3, | ||
| 78 => 4, | ||
| _ => panic!("Not a Nucleotide!"), | ||
| }; | ||
| if high_nibble_end { | ||
| high_nibble_end = false; | ||
| if i == seq.len() - 1 { | ||
| self.seq_data.add(converted); | ||
| break; | ||
| } | ||
| combined_item = converted; | ||
| } else { | ||
| combined_item |= converted << 4; | ||
| self.seq_data.add(combined_item); | ||
| high_nibble_end = true; | ||
| } | ||
| } | ||
| let end_id = self.seq_data.next_id(); | ||
| let byte_span = Span::new(start_id, end_id); | ||
| let start = SeqSpan::to_logical(byte_span.start.index(), false); | ||
| let end = SeqSpan::to_logical(byte_span.end.index() - 1, high_nibble_end) + 1; | ||
| self.segs.add(Segment { | ||
| name, | ||
| seq: SeqSpan { | ||
| start, | ||
| len: (end - start) as u16, | ||
| }, | ||
|
Comment on lines
+478
to
+481
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would probably be a good place for using that |
||
| optional: self.optional_data.add_slice(optional), | ||
| }) | ||
| } | ||
|
|
||
| /// Add a new segment with already compressed data | ||
| pub fn add_seg(&mut self, name: usize, seq: PackedSeqView, optional: &[u8]) -> Id<Segment> { | ||
| let byte_span = self.seq_data.add_slice(seq.data); | ||
| let start = SeqSpan::to_logical(byte_span.start.index(), seq.high_nibble_begin); | ||
| let end = SeqSpan::to_logical(byte_span.end.index() - 1, seq.high_nibble_end) + 1; | ||
| self.segs.add(Segment { | ||
| name, | ||
| seq: self.seq_data.add_slice(seq), | ||
| seq: (start as usize..end as usize).into(), | ||
| optional: self.optional_data.add_slice(optional), | ||
| }) | ||
| } | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this belongs in a separate benchmarking PR.