Skip to content

feat: add --outSAMattrRGline with @RG header and RG:Z tags#3

Closed
ewels wants to merge 6 commits into
mainfrom
feat/sam-rg-line
Closed

feat: add --outSAMattrRGline with @RG header and RG:Z tags#3
ewels wants to merge 6 commits into
mainfrom
feat/sam-rg-line

Conversation

@ewels
Copy link
Copy Markdown
Owner

@ewels ewels commented Apr 17, 2026

Summary

Adds the --outSAMattrRGline CLI flag, emits @RG lines in the SAM header, and tags every record with RG:Z:<ID>. Matches STAR's semantics (Parameters_readFilesInit.cpp:65-93, samHeaders.cpp, ReadAlign_alignBAM.cpp:333).

Part 2 of 3 branches unblocking nf-core/rnaseq's STAR_ALIGN step. Standalone PR; does not depend on the others.

Changes

  • src/params.rs: out_sam_attr_rg_line: Vec<String> arg + parsed_rg_lines() / rg_ids() / rg_line_set() helpers. STAR's comma-separated multi-RG syntax supported. Auto-appends RG to sam_attribute_set() when a line is set; validate() errors on RG in attrs without a line (matches STAR).
  • src/io/sam.rs: build_sam_header() now emits @RG lines (deduplicated by ID). New maybe_insert_rg_tag() helper inserts RG:Z:<id> on every record (main, mate2, half-mapped, unmapped).
  • src/io/bam.rs: call-site updates for the new signatures.
  • src/lib.rs: computes RG ID once per run; passes through SE and PE paths.
  • +9 tests (total 287 unit + 4 integration = 291 passing).

Design note

The RG tag is inserted post-build via maybe_insert_rg_tag() rather than threaded through every builder signature. Same output, less churn.

Follow-up commits

  • 4 simplify commits: drop dead validate call + redundant attrs gate on maybe_insert_rg_tag; add Parameters::primary_rg_id helper to replace a recurring allocation pattern; rewrite parsed_rg_lines() using Vec::split; remove dead SamWriter::write_unmapped method.
  • 1 cargo fmt commit to fix pre-existing drift inherited from main.

Test plan

  • cargo test — 286 unit + 4 integration (−1 from removing the self-test of the deleted dead code)
  • cargo clippy --lib -- -D warnings — clean
  • cargo fmt --check — clean
  • End-to-end against nf-core/rnaseq (verify @RG + RG:Z: via samtools view -H and samtools view | grep RG:Z:)

Notes

  • Commits are unsigned (1Password SSH agent unavailable during session).
  • parsed_rg_lines() now errors on "empty block after trailing comma" instead of silently skipping. Stricter, matches STAR.

🤖 Generated with Claude Code

ewels and others added 6 commits April 17, 2026 14:21
Parses STAR's comma-separated multi-RG syntax (first field must be ID:
on each block, `,` separates blocks, tokens are tab-joined into the SAM
@rg body). Replicates a single RG line across N input files like STAR
does, and errors on count mismatch or RG-in-outSAMattributes without a
line. Auto-appends RG to sam_attribute_set() when a line is set (matches
Parameters_samAttributes.cpp:201). SAM/BAM headers now emit one @rg
per unique ID, and every record (mapped, unmapped, half-mapped, both-
mates) carries RG:Z:<id> when attrs contain RG.

Unblocks nf-core/rnaseq, which invokes STAR with
`--outSAMattrRGline 'ID:X' 'SM:X'` and previously errored on the unknown
flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove the `let _ = params.rg_ids()?;` early-validation line in
  `align_reads_paired_end`. `validate()` already calls `rg_ids()` at
  startup, so the second call was a no-op.
- Simplify `maybe_insert_rg_tag` to `(record, rg_id)` by dropping the
  `attrs: &HashSet<String>` parameter and its `attrs.contains("RG")`
  gate. `sam_attribute_set()` auto-inserts "RG" whenever an RG line is
  configured, so `rg_id.is_some()` already implies the tag is wanted.
- Replace the inline `if let Some(id) = rg_id { ... }` block in
  `build_unmapped_record` with a call to the simplified helper, so every
  writer-side RG insertion goes through the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `Parameters::primary_rg_id() -> Result<Option<String>, Error>` that
returns the first RG line's `ID:` value directly, and use it to replace
the `let rg_ids = params.rg_ids()?; let rg_id = rg_ids.first()...`
two-step pattern at every SAM/BAM writer callsite (5 in sam.rs, 1 in
lib.rs).

`rg_ids()` is retained because `validate()` still uses it for the
1-vs-N count check against `--readFilesIn`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the manual `while i < len` index-walk with a
`out_sam_attr_rg_line.split(|tok| tok == ",")` that maps each block
to its tab-joined body and propagates any `ID:`-prefix error via
`collect::<Result<_,_>>()`. Same public signature and return value as
before, but with half the local state.

Behavior note: an empty block (adjacent `,`s or a trailing `,`) now
returns a `Parameter` error instead of being silently skipped. This
matches STAR's `Parameters_readFilesInit.cpp` which requires every RG
block to begin with `ID:`. Existing tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`SamWriter::write_unmapped` had no production callers — the SE/PE
write paths build unmapped records via `build_unmapped_record` and
batch them through `write_batch`. The only caller was the method's
own self-test. Delete both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes formatting drift inherited from main. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels ewels changed the title feat: add --outSAMattrRGline with @RG header and RG:Z tags feat: add --outSAMattrRGline with @RG header and RG:Z tags Apr 17, 2026
@ewels ewels closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant