feat: TemplateIterator #27
| /// Errors that can occur when iterating over templates. | ||
| #[derive(Error, Debug)] | ||
| pub enum TemplateIteratorError { | ||
| /// Error building a template from records. | ||
| #[error("Failed to build template: {0}")] | ||
| TemplateBuildError(#[from] TemplateError), | ||
|
|
||
| /// Error reading a BAM record. | ||
| #[error("Failed to read BAM record: {0}")] | ||
| BamReadError(#[from] rust_htslib::errors::Error), | ||
| } | ||
|
|
I don't have a strong preference, and fwiw, I would have done what you've done
| /// for template in TemplateIterator::new(reader.records()) { | ||
| /// let template = template?; | ||
| /// println!("Template: {:?}", template.name()); | ||
| /// } | ||
| /// ``` | ||
| pub struct TemplateIterator<I> | ||
| where | ||
| I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>, |
I wasn't convinced that this should be an iterator over Result<Record, E> instead of an iterator over Record, because the return type of Reader.records() is Records, but claude pointed out that where Records implements Iterator, it's over Result<Record, E>.
https://docs.rs/rust-htslib/latest/rust_htslib/bam/struct.Records.html
| //! fgbio (Scala) and fgpyo (Python). | ||
|
|
||
| use rust_htslib::bam::Record; | ||
| use std::iter::Peekable; |
It's nice that this is part of the std lib!
| /// Extension trait for creating a [`TemplateIterator`] from BAM record iterators. | ||
| /// | ||
| /// This trait provides an ergonomic way to convert a BAM record iterator into | ||
| /// a template iterator using method chaining. | ||
| /// | ||
| /// # Example | ||
| /// | ||
| /// ```ignore | ||
| /// use fgoxide::bam::IntoTemplateIterator; | ||
| /// use rust_htslib::bam::Reader; | ||
| /// | ||
| /// let reader = Reader::from_path("queryname_sorted.bam")?; | ||
| /// for template in reader.records().templates() { | ||
| /// let template = template?; | ||
| /// // process template... | ||
| /// } | ||
| /// ``` |
This is super cute and I would not have discovered it on my own for a while. Thanks Claude!
| } | ||
|
|
||
| mod template_iterator_tests { | ||
| use super::*; |
I'd feel better if we had one test that read from an actual BAM file. @nh13 @tfenne I know the general preference is to synthesize test data on the fly, but would you be open to having one test case to read a small BAM? Maybe two read pairs? Otherwise we're not directly testing that TemplateIterator can wrap rust-htslib's Reader::from_path
My 2cents is that a small SAM is good enough for now, as long as we log an issue that we should replace this with a Sam Builder later.
|
|
||
| // Collect all records with the same query name | ||
| let mut recs = Vec::new(); | ||
| while let Some(Ok(rec)) = self.inner.peek() { |
question: what happens if a read error occurs after the first record? Do we get a partial template? Do we want to return the error immediately?
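For reference, a minimal std-only sketch of how the `while let Some(Ok(..)) = peek()` pattern behaves when an error sits in the middle of a group. `Grouper` and its `(String, u32)` items are hypothetical stand-ins, not the PR's types. The point is only that the loop stops at the `Err`, so the records gathered so far come out as a partial group and the error surfaces on the following call.

```rust
use std::iter::Peekable;

/// Hypothetical stand-in for the real iterator: groups adjacent
/// `Ok((name, value))` items that share a name.
struct Grouper<I: Iterator<Item = Result<(String, u32), String>>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = Result<(String, u32), String>>> Iterator for Grouper<I> {
    type Item = Result<Vec<u32>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // The first item fixes the group name, or surfaces an error right away.
        let (name, first) = match self.inner.next()? {
            Ok(pair) => pair,
            Err(e) => return Some(Err(e)),
        };
        let mut group = vec![first];
        // The loop ends on `None`, on `Err`, or on a different name. An `Err`
        // stays buffered in the `Peekable` and is returned by the *next* call.
        while let Some(Ok((next_name, _))) = self.inner.peek() {
            if *next_name != name {
                break;
            }
            if let Some(Ok((_, value))) = self.inner.next() {
                group.push(value);
            }
        }
        Some(Ok(group))
    }
}

/// Feeds an error between two records that share a name.
fn deferred_error_demo() -> Vec<Result<Vec<u32>, String>> {
    let input: Vec<Result<(String, u32), String>> = vec![
        Ok(("q1".to_string(), 1)),
        Err("truncated file".to_string()),
        Ok(("q1".to_string(), 2)),
    ];
    Grouper { inner: input.into_iter().peekable() }.collect()
}
```

With this input the sketch yields a one-record group, then the error, then a second one-record group for the same name, which is the partial-template situation the question is about.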
|
|
||
| // Build the template from collected records | ||
| Some(Template::build(recs).map_err(TemplateIteratorError::TemplateBuildError)) | ||
| } |
suggestion:
We could add size_hint(), since it helps collect() pre-allocate capacity (see https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.size_hint):
size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.
| } | |
| } | |
| fn size_hint(&self) -> (usize, Option<usize>) { | |
| // Lower bound is 0 (could be all same qname) | |
| // Upper bound is inner's upper bound (each record could be its own template) | |
| let (_, upper) = self.inner.size_hint(); | |
| (0, upper) | |
| } |
Claude suggested I do this as well, but I didn't know enough to determine if it was worthwhile or common practice!
Is this a good habit to be getting into? cc @theJasonFan
This is a good idea if the hint is easy to reason about. Here, we can read the size_hint implementation of self.inner: Peekable<T> and reason about its correctness.
The advice from the docs:
That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait’s protocol.
FWIW: there's also https://doc.rust-lang.org/std/iter/trait.ExactSizeIterator.html when you know the exact size of an iterator.
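A small std-only sketch of the same reasoning on a toy grouping iterator. `Runs` and `runs_of` are hypothetical names, not part of the PR. Zero is always a valid lower bound, and the upper bound is inherited from the inner `Peekable`, since there can be at most one group per remaining item.

```rust
use std::iter::Peekable;

/// Hypothetical run-length grouper: yields the length of each run of equal
/// adjacent items.
struct Runs<I: Iterator> {
    inner: Peekable<I>,
}

impl<I> Iterator for Runs<I>
where
    I: Iterator,
    I::Item: PartialEq,
{
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        let first = self.inner.next()?;
        let mut len = 1;
        // Consume following items only while they equal the first one.
        while self.inner.next_if(|item| *item == first).is_some() {
            len += 1;
        }
        Some(len)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Zero is a safe lower bound; at most one run per remaining item.
        let (_, upper) = self.inner.size_hint();
        (0, upper)
    }
}

/// Convenience constructor used in the usage example.
fn runs_of(items: Vec<&'static str>) -> Runs<std::vec::IntoIter<&'static str>> {
    Runs { inner: items.into_iter().peekable() }
}
```

For a four-element `Vec` the hint is `(0, Some(4))`, so `collect()` cannot use it to reserve space up front (it relies on the lower bound), but the hint stays within the contract quoted above.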
| /// # Requirements | ||
| /// | ||
| /// The input BAM must be **query-name sorted or grouped** (i.e., all records with the | ||
| /// same query name must be adjacent). The iterator does NOT sort records internally. |
suggestion:
Can we be more explicit about what happens with unsorted input?
/// # Requirements
///
/// The input BAM must be **query-name sorted or grouped** (i.e., all records with the
/// same query name must be adjacent). The iterator does NOT sort records internally.
///
/// **Warning:** If records are not properly grouped, templates may be split across
/// multiple `Template` instances, or worse, different templates' records may be
/// incorrectly combined.
Sure! Is the updated documentation sufficient, or would you also like to add a strict flag to raise an error if R1 and R2 do not both appear? (maybe paired_strict?)
"or worse, different templates' records may be incorrectly combined."
I don't think this is true, fwiw - I think the only consequence would be a Template containing a partial subset of the alignments associated with a template. I don't think there's a code path that would permit multiple alignments with different querynames being assigned to the same Template.
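To illustrate that last point with a std-only sketch: grouping strictly by adjacency can split one query name across several groups, but a single group never mixes names. `group_adjacent` is a hypothetical helper on plain strings, not code from this PR.

```rust
/// Groups adjacent equal names, the way a streaming grouper sees its input.
fn group_adjacent<'a>(names: &[&'a str]) -> Vec<Vec<&'a str>> {
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for &name in names {
        match groups.last_mut() {
            // Same name as the current group: extend it.
            Some(group) if group[0] == name => group.push(name),
            // Different name, or the very first item: start a new group.
            _ => groups.push(vec![name]),
        }
    }
    groups
}
```

With the ungrouped input `["q1", "q2", "q1"]` this returns three single-name groups, so `q1` is split in two, while the grouped input `["q1", "q1", "q2"]` returns two groups.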
| /// | ||
| /// # Example | ||
| /// | ||
| /// ```ignore |
question: shall we use no_run instead to at least compile this?
| @@ -1,9 +1,12 @@ | |||
| //! Types and utilities for working with BAM/SAM alignment data. | |||
question: Does this need to be re-exported in src/lib.rs?
d2b624c to e9b4489
0b85509 to 5916297
FusedIterator is a marker trait that guarantees an iterator will always return None after returning None the first time. Most iterators behave this way naturally, but Rust doesn't assume it.
Benefits of implementing FusedIterator:
- Iterator::fuse() becomes a no-op, allowing the compiler to optimize away redundant fuse() calls
- Enables downstream optimizations for code consuming the iterator
- Documents the iterator's behavior to API users
TemplateIterator qualifies because once the inner iterator is exhausted, there's no way to produce more templates.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
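A minimal illustration of the marker trait, using a hypothetical `Countdown` iterator rather than the PR's type. The impl block is empty: the trait adds no methods and only records the promise that `next()` keeps returning `None` once exhausted.

```rust
use std::iter::FusedIterator;

/// Hypothetical countdown iterator; once it reaches zero it returns `None`
/// on every later call.
struct Countdown(u32);

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            return None;
        }
        self.0 -= 1;
        Some(self.0 + 1)
    }
}

// Marker impl: no methods, just the "stays exhausted" guarantee.
impl FusedIterator for Countdown {}

/// Drains the iterator, then polls it once more.
fn drain_then_poll() -> (Vec<u32>, Option<u32>) {
    let mut it = Countdown(3);
    let items: Vec<u32> = it.by_ref().collect();
    (items, it.next())
}
```

For a wrapper around another iterator, the promise usually has to be conditional on the inner iterator, the way std's `Peekable<I>` implements `FusedIterator` only when `I: FusedIterator`.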
Move the specific Item type bound to impl blocks, keeping only the minimal `I: Iterator` bound required by Peekable<I> on the struct definition. This follows the Rust convention of placing bounds on impl blocks rather than struct definitions when possible, making the type more flexible and improving compile-time error messages.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
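A sketch of that layout on a hypothetical `Doubler` wrapper (not the PR's type): the struct carries only the `I: Iterator` bound that `Peekable<I>` needs, and the item-type constraint appears on the impl that actually depends on it.

```rust
use std::iter::Peekable;

/// The struct needs only `I: Iterator`, which `Peekable<I>` itself requires.
struct Doubler<I: Iterator> {
    inner: Peekable<I>,
}

/// Construction does not care about the item type, so no extra bound here.
impl<I: Iterator> Doubler<I> {
    fn new(iter: I) -> Self {
        Doubler { inner: iter.peekable() }
    }
}

/// The specific item type is constrained only where it is used.
impl<I> Iterator for Doubler<I>
where
    I: Iterator<Item = Result<u32, String>>,
{
    type Item = Result<u32, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Double successful values and pass errors through unchanged.
        self.inner.next().map(|item| item.map(|x| x * 2))
    }
}
```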
Replace verbose match/loop pattern with idiomatic `while let` for collecting records with the same query name. The `while let Some(Ok(rec)) = self.inner.peek()` pattern is cleaner and handles the None and Err cases implicitly by exiting the loop, which is the desired behavior (errors are deferred to the next iteration).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[must_use] attribute warns when the return value of new() is discarded. Since TemplateIterator::new() creates an iterator that must be consumed to have any effect, discarding it is almost certainly a bug. This follows the Rust API guidelines for constructors that return values which should not be ignored.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
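A short sketch of the attribute on a hypothetical lazy `Counter` (not the PR's constructor). Writing `Counter::new(3);` as a bare statement would trigger the `unused_must_use` lint, while consuming the value compiles quietly.

```rust
/// Hypothetical lazy counter; constructing it does no work by itself.
struct Counter {
    cursor: u32,
    end: u32,
}

impl Counter {
    /// Discarding the returned value produces an `unused_must_use` warning.
    #[must_use]
    fn new(end: u32) -> Self {
        Counter { cursor: 0, end }
    }
}

impl Iterator for Counter {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.cursor >= self.end {
            return None;
        }
        let current = self.cursor;
        self.cursor += 1;
        Some(current)
    }
}
```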
- TemplateIterator::next propagates read errors immediately rather than yielding a partial Template followed by the error on the next call
- Implement Iterator::size_hint
- Drop per-template Vec<u8> qname allocation
- Doctests: ignore -> no_run with rust_htslib::bam::Read trait imported
- Sharpen the unsorted-input warning (records of different qnames cannot be combined into a single Template)
- Add on-disk BAM round-trip test using bam::Writer + TempDir
- Add tests covering the fail-fast read-error path
- Switch Template::build call to Template::new (renamed in rebased base)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
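One possible shape for the fail-fast behaviour in the first bullet, sketched with std's `Peekable::next_if`. The names `FailFast`, `Rec`, and `fail_fast_demo` are hypothetical, and this is not necessarily how the commit implements it: the next item is consumed if it continues the group or if it is an error, and an error discards the partial group and is returned at once.

```rust
use std::iter::Peekable;

/// Hypothetical record type: `(query name, payload)` or a read error.
type Rec = Result<(String, u32), String>;

struct FailFast<I: Iterator<Item = Rec>> {
    inner: Peekable<I>,
}

impl<I: Iterator<Item = Rec>> Iterator for FailFast<I> {
    type Item = Result<Vec<u32>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let (name, first) = match self.inner.next()? {
            Ok(pair) => pair,
            Err(e) => return Some(Err(e)),
        };
        let mut group = vec![first];
        // Take the next item if it continues this group OR if it is an error.
        while let Some(item) = self
            .inner
            .next_if(|r| matches!(r, Ok((n, _)) if *n == name) || r.is_err())
        {
            match item {
                Ok((_, value)) => group.push(value),
                // Fail fast: drop the partial group and report the error now.
                Err(e) => return Some(Err(e)),
            }
        }
        Some(Ok(group))
    }
}

/// Feeds an error between two records that share a name.
fn fail_fast_demo() -> Vec<Result<Vec<u32>, String>> {
    let input: Vec<Rec> = vec![
        Ok(("q1".to_string(), 1)),
        Err("truncated file".to_string()),
        Ok(("q1".to_string(), 2)),
    ];
    FailFast { inner: input.into_iter().peekable() }.collect()
}
```

In this sketch the first call returns the error instead of a one-record group. Whether iteration should continue after an error, as it does here, is a separate design choice.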
5916297 to d13c457
| /// Converts this iterator into a [`TemplateIterator`]. | ||
| /// | ||
| /// The input must be query-name sorted or grouped. | ||
| fn templates(self) -> TemplateIterator<Self::InnerIter>; |
nit: I'm snooping. It would be more idiomatic to name this into_template_iter(). It's useful to know that self is moved and consumed. See the ad-hoc conventions for as_, to_, and into_.
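A std-only sketch of the extension-trait pattern with an `into_`-prefixed method. `Deduped` and `IntoDeduped` are hypothetical, and the blanket impl here is simpler than the PR's trait, which uses an associated `InnerIter` type. Taking `self` by value is what the `into_` prefix advertises.

```rust
use std::iter::Peekable;

/// Hypothetical adapter that collapses runs of equal adjacent items.
struct Deduped<I: Iterator> {
    inner: Peekable<I>,
}

impl<I> Iterator for Deduped<I>
where
    I: Iterator,
    I::Item: PartialEq,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        let first = self.inner.next()?;
        // Skip every following item equal to the one just taken.
        while self.inner.next_if_eq(&first).is_some() {}
        Some(first)
    }
}

/// Extension trait: adds `into_deduped()` to every iterator.
trait IntoDeduped: Iterator + Sized {
    /// `into_` signals that `self` is moved into the adapter.
    fn into_deduped(self) -> Deduped<Self> {
        Deduped { inner: self.peekable() }
    }
}

impl<I: Iterator> IntoDeduped for I {}
```

With the trait in scope, `vec![1, 1, 2].into_iter().into_deduped()` chains like any built-in adapter.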
Closes #20.
Adds TemplateIterator, a streaming iterator that groups consecutive query-name-grouped BAM records into Template instances. Inspired by the equivalent constructs in fgbio (Scala) and fgpyo (Python); like #26, this was mostly Claude.
API
- TemplateIterator<I> over any Iterator<Item = Result<Record, htslib::errors::Error>>, yielding Result<Template, TemplateIteratorError>.
- IntoTemplateIterator extension trait so callers can write reader.records().templates() directly on a bam::Reader.
- Iterator::size_hint and FusedIterator implemented.
Behavior worth knowing
- Input that is not query-name grouped (e.g. coordinate-sorted) yields multiple Templates for the same query name. The iterator does not sort or deduplicate. The rustdoc spells this out and points at upstream remedies (samtools sort -n, @HD SO:queryname).
- Read errors are propagated immediately rather than producing a partial Template.
Test coverage
- secondaries and supplementaries.
- On-disk BAM round trip, bam::Writer → bam::Reader::from_path, to exercise the real htslib record stream. (Will be replaced with a SamBuilder once that lands; tracked in "Replace synthesized BAM records in TemplateIterator round-trip test with SamBuilder", #32.)