sampsyo
left a comment
Hi, @ashwinsawant17! To bring this to the next step, can you please do a couple of minor "homework" tasks?
- Split this into a separate file, probably index.rs or something. It would be great to make sure this stays distinct from the core FlatGFA data structures.
- Run cargo fmt. Doing this is always a good idea when opening/updating a PR so your reviewers don't get distracted by formatting details.
sampsyo
left a comment
Oops, I hit "submit" before including granular code-level comments. Here are some suggestions on the work so far!
flatgfa/src/flatgfa.rs
Outdated
    /// helper to extract the segment index from the stepref
    fn segment_of_step(fgfa: &FlatGFA, step: &StepRef) -> usize {
You can move this function to the top level, because it doesn't seem to reference the context.
Moved to the top level in index.rs in the latest commit.
flatgfa/src/flatgfa.rs
Outdated
    /// helper to extract the segment index from the stepref
    fn segment_of_step(fgfa: &FlatGFA, step: &StepRef) -> usize {
Maybe this would be a little clearer/more type-safe if it returned an Id<Segment> instead of a plain usize?
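To make the type-safety point concrete, here is a minimal sketch of a typed index. The `Id` and `Segment` definitions below are hypothetical stand-ins written for this example, not FlatGFA's actual types; only the `new`/`index` method names are borrowed from the code quoted in this review.

```rust
use std::marker::PhantomData;

/// Hypothetical marker type standing in for FlatGFA's segment record.
struct Segment;

/// Hypothetical typed index: the type parameter records what the index
/// points into, so an id for one pool cannot be passed where another
/// pool's id is expected.
struct Id<T> {
    index: u32,
    _marker: PhantomData<T>,
}

impl<T> Id<T> {
    fn new(index: usize) -> Self {
        Id { index: index as u32, _marker: PhantomData }
    }

    fn index(&self) -> usize {
        self.index as usize
    }
}

/// Accepts only segment ids; a bare integer or an id for another pool
/// would be rejected at compile time.
fn segment_name<'a>(names: &'a [&'a str], seg: Id<Segment>) -> &'a str {
    names[seg.index()]
}
```

With a signature like this, calling `segment_name(&names, 1usize)` does not compile, which is the kind of mistake a plain integer return type would let through.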
flatgfa/src/flatgfa.rs
Outdated
    // organize by the index of the segment in the segment pool
    all_steps.sort_by_key(|a| {
        segment_of_step(fgfa, a)
It occurs to me that we could maybe make this a little more efficient by preserving the segment ID that we already had available in the previous stanza. That is, when we do the .enumerate() iteration, we know the segment ID at that point, so we could simply store that in the array. The array would then store (Id<Segment>, StepRef) pairs, which we could then sort conventionally without needing a custom sort key (that entails another lookup per element).
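A rough sketch of that suggestion, under stated assumptions: `SegId` and `StepRef` here are simplified stand-ins invented for the example (the real FlatGFA types differ), and paths are modeled as plain vectors of segment ids.

```rust
/// Hypothetical stand-in for `Id<Segment>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SegId(u32);

/// Hypothetical stand-in for a reference to one step of one path.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct StepRef {
    path: u32,
    offset: u32,
}

/// Collect `(segment, step)` pairs while walking the paths, then sort them.
/// `paths[p][i]` is the segment visited by step `i` of path `p`.
fn sorted_steps(paths: &[Vec<SegId>]) -> Vec<(SegId, StepRef)> {
    let mut all_steps: Vec<(SegId, StepRef)> = paths
        .iter()
        .enumerate()
        .flat_map(|(path, steps)| {
            steps.iter().enumerate().map(move |(offset, &seg)| {
                // The segment id is already in hand here, so store it
                // next to the step instead of looking it up again later.
                (seg, StepRef { path: path as u32, offset: offset as u32 })
            })
        })
        .collect();
    // Tuples compare lexicographically, so a plain sort orders by segment
    // first, with no custom key and no extra lookup per element.
    all_steps.sort();
    all_steps
}
```

The trade-off is a slightly larger element in the array in exchange for skipping the per-comparison lookup; whether that wins in practice is something to measure.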
flatgfa/src/flatgfa.rs
Outdated
    all_steps.sort_by_key(|a| {
        segment_of_step(fgfa, a)
    });
This could perhaps use a longer comment here describing the strategy for the rest of the function. The idea is that, now that we've sorted everything, we need to identify the "runs" of StepRefs that belong to the same segment; those "runs" become the spans that go in segment_steps. A little explanation of how that works would go a long way…
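The run-finding strategy described above could look roughly like this. It is a simplified sketch, not the PR's code: segments are plain `usize` indices, and `std::ops::Range` stands in for FlatGFA's `Span`.

```rust
use std::ops::Range;

/// Given segment indices sorted in ascending order, return one range per
/// segment covering the positions in the sorted array that belong to it.
/// Segments that never appear keep an empty range.
fn spans_by_segment(sorted_segs: &[usize], num_segs: usize) -> Vec<Range<usize>> {
    let mut spans = vec![0..0; num_segs];
    let mut start = 0;
    while start < sorted_segs.len() {
        let seg = sorted_segs[start];
        // Extend the run while the next element has the same segment.
        let mut end = start + 1;
        while end < sorted_segs.len() && sorted_segs[end] == seg {
            end += 1;
        }
        // The run [start, end) is exactly this segment's span.
        spans[seg] = start..end;
        start = end;
    }
    spans
}
```

Because the input is sorted, each segment's steps are contiguous, so a single left-to-right pass is enough to recover every span.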
…get pushed earlier
sampsyo
left a comment
Here are a few initial comments!
    // The first traversal of this path over this segment.
    uniq_depths[seg_id] += 1;
    seen.set(seg_id, true);
    if use_index {
Because these two routes have such completely different implementations, I say let's just put them in separate functions. It will make each one easier to read.
    // sort the steprefs by the index of the segment in the segment pool
    // by extracting the actual numeric index from the Id<Segment>
    all_steps.sort_by_key(|a| a.0.index());
You mentioned you weren't sure whether this addressed my comment about sorting stuff. It does! This is exactly what I was thinking.
    impl StepsBySegIndex {
        pub fn new(fgfa: &FlatGFA) -> Self {
            // will be our `steps` vector that contains all steprefs
            let mut all_steps = Vec::new();
This would be a bit clearer and more efficient using an iterator chain ending in .collect() to avoid the mutability and the pushes in a loop. Here's how that would look:
    let all_steps: Vec<_> = fgfa.paths.items().map(|(path_id, path)| {
        // your loop body here
        (seg, step)
    }).collect();
sampsyo
left a comment
And a couple more comments about building up the vector of spans.
    for _ in 0..fgfa.segs.len() {
        segment_steps.push(Span::new_empty());
    }
If you do want to initialize a big array, there is a short syntax for that too: vec![initial; n]. So something like vec![Span::new_empty(); fgfa.segs.len()].
This was my initial thought! But I ran into errors with Spans not being cloneable (or something similar, I'll update this comment with the exact error I got).
On further inspection, I think I would just need to make StepRef cloneable. I think that's the reason I was running into issues earlier.
Yeah, we could make it cloneable!
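As a sketch of why the element type matters here: the `vec![value; count]` form clones `value`, so it needs `Clone`, and a derived `Clone` on a generic struct also requires its type parameter to be `Clone`. The `Span` and `StepRef` below are hypothetical stand-ins for illustration; the real FlatGFA definitions may differ.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in; deriving `Clone` here is what unblocks `vec!`.
#[derive(Debug, Clone, PartialEq)]
struct StepRef;

/// Hypothetical stand-in for `Span<T>`. The derived `Clone` impl only
/// applies when `T: Clone`, which is why `StepRef` needs it too.
#[derive(Debug, Clone, PartialEq)]
struct Span<T> {
    start: u32,
    end: u32,
    _marker: PhantomData<T>,
}

impl<T> Span<T> {
    fn new_empty() -> Self {
        Span { start: 0, end: 0, _marker: PhantomData }
    }

    fn is_empty(&self) -> bool {
        self.start == self.end
    }
}

/// Value first, then length; requires `Span<StepRef>: Clone`.
fn empty_spans(n: usize) -> Vec<Span<StepRef>> {
    vec![Span::new_empty(); n]
}

/// Alternative that builds each element fresh and needs no `Clone`.
fn empty_spans_without_clone(n: usize) -> Vec<Span<StepRef>> {
    (0..n).map(|_| Span::new_empty()).collect()
}
```

Either form replaces the explicit push loop; the second is an option if adding `Clone` to the element type turns out to be undesirable.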
    // TODO: we definitely don't need to do another iteration to fill this with empty spans
    // It's likely more efficient to push empty spans as needed
I think it would be worth measuring the cost. You could do a little benchmarking with Hyperfine to see what happens if you make these uninitialized.
    let new_span: Span<StepRef> = Span::new(Id::new(span_start), Id::new(i));

    // assign the span to the index in segment_steps that maps to the index of the segment in the FlatGFA segment pool
    segment_steps[seg_ind.index()] = new_span;
Because you are starting with an initialized array of spans, you could just mutate it in place instead of replacing the old one. It is also worth measuring whether this makes a difference…
I created the initialization of the index, and some basic get functions for the slice of StepRefs and its length (given by the len() function for Spans).