Skip to content

feat: TemplateIterator#27

Open
msto wants to merge 6 commits into
ms_20_templatefrom
ms_20_template-iterator
Open

feat: TemplateIterator#27
msto wants to merge 6 commits into
ms_20_templatefrom
ms_20_template-iterator

Conversation

@msto

@msto msto commented Dec 7, 2025

Copy link
Copy Markdown

Closes #20.

Adds TemplateIterator, a streaming iterator that groups consecutive
query-name-grouped BAM records into Template instances. Inspired by the
equivalent constructs in fgbio (Scala) and fgpyo (Python); like #26,
this was mostly Claude.

API

  • TemplateIterator<I> over any Iterator<Item = Result<Record, htslib::errors::Error>>,
    yielding Result<Template, TemplateIteratorError>.
  • IntoTemplateIterator extension trait so callers can write
    reader.records().templates() directly on a bam::Reader.
  • Iterator::size_hint and FusedIterator implemented.

Behavior worth knowing

  • Streaming-only, contiguity-based grouping: a non-grouped input (e.g.
    coordinate-sorted) yields multiple Templates for the same query name.
    The iterator does not sort or deduplicate. The rustdoc spells this out
    and points at upstream remedies (samtools sort -n, @HD SO:queryname).
  • A read error during a template's collection propagates immediately
    rather than producing a partial Template.

Test coverage

@msto msto self-assigned this Dec 7, 2025
Comment thread src/bam/mod.rs
Comment on lines +39 to +50
/// Errors that can occur when iterating over templates.
#[derive(Error, Debug)]
pub enum TemplateIteratorError {
/// Error building a template from records.
#[error("Failed to build template: {0}")]
TemplateBuildError(#[from] TemplateError),

/// Error reading a BAM record.
#[error("Failed to read BAM record: {0}")]
BamReadError(#[from] rust_htslib::errors::Error),
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 @tfenne I am not sure as to your preference for Error specificity in Rust - is this okay, or would you prefer they be wrapped into FgError?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, and fwiw, I would have done what you've done

Comment thread src/bam/mod.rs Outdated
Comment on lines +264 to +271
/// for template in TemplateIterator::new(reader.records()) {
/// let template = template?;
/// println!("Template: {:?}", template.name());
/// }
/// ```
pub struct TemplateIterator<I>
where
I: Iterator<Item = Result<Record, rust_htslib::errors::Error>>,

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't convinced that this should be an iterator over Result<Record, E> instead of an iterator over Record, because the return type of Reader.records() is Records, but claude pointed out that where Records implements Iterator, it's over Result<Record, E>.

https://docs.rs/rust-htslib/latest/rust_htslib/bam/struct.Records.html

https://github.com/rust-bio/rust-htslib/blob/8f1cdd75c301cb1e0ae54125a33de065c2550225/src/tbx/mod.rs#L328-L346

Comment thread src/bam/mod.rs Outdated
//! fgbio (Scala) and fgpyo (Python).

use rust_htslib::bam::Record;
use std::iter::Peekable;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice that this is part of the std lib!

Comment thread src/bam/mod.rs
Comment on lines +350 to +366
/// Extension trait for creating a [`TemplateIterator`] from BAM record iterators.
///
/// This trait provides an ergonomic way to convert a BAM record iterator into
/// a template iterator using method chaining.
///
/// # Example
///
/// ```ignore
/// use fgoxide::bam::IntoTemplateIterator;
/// use rust_htslib::bam::Reader;
///
/// let reader = Reader::from_path("queryname_sorted.bam")?;
/// for template in reader.records().templates() {
/// let template = template?;
/// // process template...
/// }
/// ```

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super cute and I would not have discovered it on my own for a while. Thanks Claude!

Comment thread src/bam/mod.rs
}

mod template_iterator_tests {
use super::*;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel better if we had one test that read from an actual BAM file. @nh13 @tfenne I know the general preference is to synthesize test data on the fly, but would you be open to having one test case to read a small BAM? Maybe two read pairs? Otherwise we're not directly testing that TemplateIterator can wrap rust-htslib's Reader::from_path

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2cents is that a small SAM is good enough for now, as long as we log an issue that we should replace this with a Sam Builder later.

@msto msto marked this pull request as ready for review December 7, 2025 00:50
@msto msto requested review from nh13 and tfenne as code owners December 7, 2025 00:50
@msto msto requested a review from theJasonFan December 7, 2025 00:50
@msto msto assigned nh13 and tfenne and unassigned msto Dec 7, 2025
Comment thread src/bam/mod.rs
Comment on lines +39 to +50
/// Errors that can occur when iterating over templates.
#[derive(Error, Debug)]
pub enum TemplateIteratorError {
/// Error building a template from records.
#[error("Failed to build template: {0}")]
TemplateBuildError(#[from] TemplateError),

/// Error reading a BAM record.
#[error("Failed to read BAM record: {0}")]
BamReadError(#[from] rust_htslib::errors::Error),
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, and fwiw, I would have done what you've done

Comment thread src/bam/mod.rs Outdated

// Collect all records with the same query name
let mut recs = Vec::new();
while let Some(Ok(rec)) = self.inner.peek() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what happens if an error reading occurs reading after the first record? Do we get a partial template? Do we want to return the error immediately?

Comment thread src/bam/mod.rs
}

mod template_iterator_tests {
use super::*;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2cents is that a small SAM is good enough for now, as long as we log an issue that we should replace this with a Sam Builder later.

Comment thread src/bam/mod.rs

// Build the template from collected records
Some(Template::build(recs).map_err(TemplateIteratorError::TemplateBuildError))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:
We could add size_hint() as it helps collect() pre-allocate capacity (eg. https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.size_hint)

size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.

Suggested change
}
}
fn size_hint(&self) -> (usize, Option<usize>) {
// Lower bound is 0 (could be all same qname)
// Upper bound is inner's upper bound (each record could be its own template)
let (_, upper) = self.inner.size_hint();
(0, upper)
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude suggested I do this as well, but I didn't know enough to determine if it was worthwhile or common practice!

Is this a good habit to be getting in? cc @theJasonFan

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea if the hint is easy to reason about. Here, we can read self.inner: Peekable<T>'s implementation of size hint and reason about it's correctness.

The advice from the docs:

That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait’s protocol.

FWIW: there's also https://doc.rust-lang.org/std/iter/trait.ExactSizeIterator.html when you know the exact size of an iterator.

Comment thread src/bam/mod.rs
/// # Requirements
///
/// The input BAM must be **query-name sorted or grouped** (i.e., all records with the
/// same query name must be adjacent). The iterator does NOT sort records internally.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Can we be more explicit about what happens with unsorted input?

/// # Requirements
///
/// The input BAM must be **query-name sorted or grouped** (i.e., all records with the
/// same query name must be adjacent). The iterator does NOT sort records internally.
///
/// **Warning:** If records are not properly grouped, templates may be split across
/// multiple `Template` instances, or worse, different templates' records may be
/// incorrectly combined.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Is the updated documentation sufficient, or would you also like to add a strict flag to raise an error if R1 and R2 do not both appear? (maybe paired_strict?)

or worse, different templates' records may be incorrectly combined.

I don't think this is true, fwiw - I think the only consequence would be a Template containing a partial subset of the alignments associated with a template. I don't think there's a code path that would permit multiple alignments with different querynames being assigned to the same Template.

Comment thread src/bam/mod.rs Outdated
///
/// # Example
///
/// ```ignore

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: shall we use no_run instead to at least compile this?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make this runnable too.

Comment thread src/bam/mod.rs
@@ -1,9 +1,12 @@
//! Types and utilities for working with BAM/SAM alignment data.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Does this need to be re-exported in src/lib.rs?

@msto msto assigned msto and unassigned nh13 and tfenne Apr 15, 2026
@msto msto force-pushed the ms_20_template branch 2 times, most recently from d2b624c to e9b4489 Compare May 7, 2026 20:51
@msto msto force-pushed the ms_20_template-iterator branch 2 times, most recently from 0b85509 to 5916297 Compare May 7, 2026 21:01
msto and others added 6 commits May 7, 2026 17:02
FusedIterator is a marker trait that guarantees an iterator will always
return None after returning None the first time. Most iterators behave
this way naturally, but Rust doesn't assume it.

Benefits of implementing FusedIterator:
- Iterator::fuse() becomes a no-op, allowing the compiler to optimize
  away redundant fuse() calls
- Enables downstream optimizations for code consuming the iterator
- Documents the iterator's behavior to API users

TemplateIterator qualifies because once the inner iterator is exhausted,
there's no way to produce more templates.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move the specific Item type bound to impl blocks, keeping only the
minimal `I: Iterator` bound required by Peekable<I> on the struct
definition.

This follows the Rust convention of placing bounds on impl blocks
rather than struct definitions when possible, making the type more
flexible and improving compile-time error messages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace verbose match/loop pattern with idiomatic `while let` for
collecting records with the same query name.

The `while let Some(Ok(rec)) = self.inner.peek()` pattern is cleaner
and handles the None and Err cases implicitly by exiting the loop,
which is the desired behavior (errors are deferred to the next
iteration).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The #[must_use] attribute warns when the return value of new() is
discarded. Since TemplateIterator::new() creates an iterator that
must be consumed to have any effect, discarding it is almost certainly
a bug.

This follows the Rust API guidelines for constructors that return
values which should not be ignored.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- TemplateIterator::next propagates read errors immediately rather than
  yielding a partial Template followed by the error on the next call
- Implement Iterator::size_hint
- Drop per-template Vec<u8> qname allocation
- Doctests: ignore -> no_run with rust_htslib::bam::Read trait imported
- Sharpen the unsorted-input warning (records of different qnames cannot
  be combined into a single Template)
- Add on-disk BAM round-trip test using bam::Writer + TempDir
- Add tests covering the fail-fast read-error path
- Switch Template::build call to Template::new (renamed in rebased base)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@msto msto force-pushed the ms_20_template-iterator branch from 5916297 to d13c457 Compare May 7, 2026 21:04
@msto msto assigned nh13 and unassigned msto May 7, 2026
@msto msto requested a review from nh13 May 7, 2026 21:07
Comment thread src/bam/mod.rs
/// Converts this iterator into a [`TemplateIterator`].
///
/// The input must be query-name sorted or grouped.
fn templates(self) -> TemplateIterator<Self::InnerIter>;

@theJasonFan theJasonFan May 7, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm snooping. It would be more idiomatic to name this into_template_iter(). It's useful to know that self is moved and consumed. See ad-hoc conventions for as_, to_ into_

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants