Skip to content

detect: add chelae detect subcommand for adapter identification#8

Open
tfenne wants to merge 4 commits into
mainfrom
tf_detect_cmd
Open

detect: add chelae detect subcommand for adapter identification#8
tfenne wants to merge 4 commits into
mainfrom
tf_detect_cmd

Conversation

@tfenne

@tfenne tfenne commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

Adds a second subcommand, chelae detect, that identifies which 3' adapter sequence(s) are present in one or two FASTQ files.

  • PE mode discovers adapters via R1/R2 overlap (no kit knowledge required), builds a position-by-position consensus across reads in each near-identical k-mer cluster, and reports the consensus extending while per-position coverage stays above an adaptive floor.
  • SE mode scores each read against every built-in kit adapter plus any user --adapter-sequence / --adapter-fasta candidate via the same scanner trim uses.
  • 3'-end cleanups (poly-G, poly-X, cut-right quality) are applied to every read before the probe, with aggressive defaults reflecting detect's specificity bias (vs trim's sensitivity bias).
  • Console output is two sections: matched kit(s) with their published adapter sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking any per-sample extension (typically the i7/i5 barcode tail).
  • Optional FASTA output is cross-sample-portable: when a kit match is detected (within 1 mm over the first 16 bp), the kit's published adapter is emitted in place of the in-sample consensus, so the FASTA round-trips cleanly through chelae trim --adapter-fasta on every sample in a batch.

Also fixes chelae trim --expected-insert-size so the hint actually takes effect on the first pair, and lifts shared FASTQ-reader infrastructure into commands/utils.rs.

Notable design decisions

  • PE empty adapter-evidence library: detect calls detect_pe_overlap with an empty OverlapAdapterLibrary so probe acceptance discovers novel adapters. Low-complexity inserts (poly-A cfDNA, microsatellites) can produce spurious "discovered" tails, but the kit/library noise floor surfaces in the report rather than hiding the real adapter when present.
  • Kit-match Hamming threshold = 1 (not 2). A kit match triggers FASTA substitution; 14/16-bp similarity was too loose a bar for that substitution.
  • u64 counters in TailAccumulator::base_counts (defensive against pathological --num-detections).
  • Aggressive 3'-end trim defaults in detect (poly-X min run 5, quality trim 4:20) vs trim's conservative ones — detect can afford to scrub more because it's looking for the dominant adapter signal across many reads.

Review history

This branch went through five review rounds (one independent reviewer plus an opus/sonnet/haiku adversarial pass after the major changes). Findings acted on are described in the individual commit messages; findings ruled out (with rationale) are documented inline:

  1. SE FASTA emitting user-supplied --adapter-sequence verbatim (not substituting the kit canonical) — intentional, that's what the user explicitly asked us to score against.
  2. PE insert >= r1_seq.len() || insert >= r2_seq.len() skipping asymmetric-read pairs — matches the docstring contract; sensitivity miss, not correctness bug.
  3. O(N²) aggregation in aggregate_kmers — sub-second at default settings (N ≤ 5000).

Test plan

  • bash ci/check.sh (fmt + clippy + tests) green on the branch
  • 67 detect tests passing (28 unit + 7 E2E + 32 validation/edge-case)
  • 267 total tests passing
  • Smoke-tested against synthetic TruSeq PE data with polyG dropoff + Q0 tail
  • Smoke-tested against synthetic TruSeq PE with a fixed per-sample i7 barcode
  • Try it on a real FASTQ file from a known kit and confirm the kit match
  • Try it on a real FASTQ file from an unknown chemistry and confirm a sensible novel-adapter consensus
  • Round-trip: chelae detect -o adapters.fachelae trim --adapter-fasta adapters.fa on the same library, confirm trim metrics look right

tfenne added 3 commits June 26, 2026 05:10
`chelae detect` samples reads from one or two FASTQ files and identifies
which 3' adapter sequence(s) are present. Two modes:

- Paired-end: runs the same overlap probe as `chelae trim` with an empty
  adapter-evidence library so every probe-accepted shift becomes a
  candidate detection, harvests the post-template tails on each mate,
  and builds a position-by-position consensus across reads that landed in
  the same 16-bp k-mer cluster. The consensus extends as long as per-
  position coverage stays above an adaptive floor.
- Single-end: scores each read against every built-in kit adapter plus
  any user `--adapter-sequence` / `--adapter-fasta` candidate via the
  same scanner trim uses.

Reports are split into two sections: matched kit(s) with their published
sequences, then the full-length discovered consensus with uppercase
marking the kit-stable region and lowercase marking the per-sample
extension (typically the i7/i5 barcode).

Optional `--output-fasta` writes a cross-sample-portable artifact:
kit-matched hits emit the kit's published adapter sequence (not the
sample-specific consensus), so the FASTA can be fed back into
`chelae trim --adapter-fasta` and applied across an entire batch
without barcode bleed-through breaking the per-sample match.

Shared infrastructure with trim: BUFFER_SIZE and open_fastq_inputs
lifted into commands/utils.rs; trim helpers (Adapter, OverlapStats,
detect_pe_overlap, find_adapter_3prime, etc.) made pub(crate) for
reuse; new load_adapter_fasta_with_names so user-curated FASTA record
names survive into the detect output.

Defaults are tuned for the discovery use case rather than trim's:
--num-detections=5000, --min-detections-for-report=20 floor to refuse
reporting on too-small samples, --adapter-min-length=10 (bumped from
trim's 6 because detect scores 8+ candidates per read, raising the
per-position random-match floor).

README updated with examples + options table for the new subcommand.
Three follow-ups from the opus/sonnet/haiku adversarial pass:

- Validate `--min-detections-for-report <= --max-reads` (previously
  validated only against `--num-detections`; a config like
  `--num-detections 200000 --max-reads 10000 --min-detections-for-report
  50000` would pass validation and then guaranteed-fail mid-run).
- `KIT_NEAR_MATCH_HAMMING` 2 -> 1. The FASTA writer substitutes the kit's
  published sequence in for any kit match (exact OR fuzzy), and 2 mm
  over 16 bp was too loose a bar for that substitution. A novel adapter
  sharing 14/16 bp with a kit prefix should not be silently relabeled
  as that kit; real sequencing-error fuzzy matches almost always sit at
  1 mm given the per-base error rate. The ALL_KITS members are pairwise
  far enough that this doesn't cost legitimate matches.
- `TailAccumulator::base_counts` widened to `[u64; 5]`. Defensive
  against pathological `--num-detections` values; theoretical-only at
  the default 5000.

Plus a new substantive feature in response to the same review: apply
the 3'-end cleanups from `chelae trim` (poly-G / poly-X / cut-right
quality) to each read before it reaches the overlap probe or the
candidate scan. detect's specificity bias justifies more aggressive
defaults than trim's sensitivity-leaning ones:

- `--trim-polyg N` default 10 (matches trim); `0` disables.
- `--trim-polyx N` default 5 (down from trim's opt-in default of 10);
  `0` disables.
- `--quality-trim W:Q` default `4:20` (smaller window than trim's
  `8:20` -> more aggressive); pass `off`/`none` to disable. Cut-right
  semantics — truncates at the start of the first failing window.

Reuses the existing SIMD kernels via two new `pub(crate)` helpers:
`find_polyx_tail_len` (already pure) and a freshly-extracted
`cut_right_quality_position` (split out of `trim_quality_sliding_5prime`
so detect doesn't need an `OwnedRecord`).

Two of sonnet's findings were ruled out as non-bugs: the PE
`insert >= r1_seq.len() || insert >= r2_seq.len()` skip on asymmetric
reads is a sensitivity miss not a correctness bug (and matches the
docstring contract); SE FASTA emitting the user's `--adapter-sequence`
verbatim instead of substituting the kit canonical is the right
behavior because the user explicitly asked us to score against that
sequence.

README options table updated. Smoke test against synthetic TruSeq +
polyG dropoff + Q0 tail confirms the trims fire and the FASTA
artifact stays kit-canonical.
- detect.rs: align with CLAUDE.md module-ordering convention. Move
  `QualityTrimSetting` (a helper enum for `Detect.quality_trim`) below
  `impl Command for Detect` since "higher-to-lower level" puts the CLI
  struct first. Move `BASES` from a body-level location up with the
  other top-of-file constants. Move `AnnotatedHit` up next to the other
  helper structs (Hit/Candidate/KitMatch/KitMate). Restore a docstring
  on `push_tail_kmer` that was clipped during a prior reorder.

- CHANGELOG: split the previously-stale `[Unreleased]` block into a
  proper `[0.1.0] - 2026-05-13` entry (the v0.1.0 release), and rewrite
  `[Unreleased]` as a final-state diff vs v0.1.0 — not a chronological
  log of intermediate commits. The new entry covers the `chelae detect`
  subcommand, the `chelae trim --expected-insert-size` fix, and the
  shared `commands/utils.rs` refactor.

No behavior change.
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@tfenne, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 17 minutes and 5 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 037be60d-18fc-4fec-80c4-f1ba37454b21

📥 Commits

Reviewing files that changed from the base of the PR and between 5016330 and 7efc8f1.

📒 Files selected for processing (2)
  • src/bin/commands/detect.rs
  • src/bin/commands/utils.rs
📝 Walkthrough

Walkthrough

The PR adds a new chelae detect subcommand, wires it into CLI dispatch, and updates the project docs and changelog. It also introduces shared FASTQ input utilities and makes several trimming helpers reusable across commands. The detect implementation covers paired-end and single-end adapter discovery, reporting, optional FASTA output, validation, and tests.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding the chelae detect subcommand for adapter identification.
Description check ✅ Passed The description is directly related to the changeset and accurately describes the new detect command and related fixes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch tf_detect_cmd

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/bin/commands/detect.rs (1)

792-815: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep TailAccumulator’s impl adjacent to the struct.

base_to_slot splits the struct from its inherent impl. Move the helper below the impl, or make it an associated helper.

As per coding guidelines, command modules require “structs/enums with immediately-following impl blocks.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bin/commands/detect.rs` around lines 792 - 815, `TailAccumulator` is
separated from its inherent impl by the `base_to_slot` helper, violating the
command-module rule that structs/enums must be immediately followed by their
impl blocks. Move `base_to_slot` below `impl TailAccumulator`, or convert it
into an associated helper on `TailAccumulator`, so the struct and its impl stay
adjacent.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bin/commands/detect.rs`:
- Around line 862-875: Remove the stale rustdoc about push_tail_kmer from the
effective_trimmed_len documentation block in detect.rs. Keep only the comments
that describe effective_trimmed_len’s 3' cleanup order and returned effective
length, and delete the bucket/k-mer insertion text so the docs match the
function name and behavior.
- Around line 1123-1163: The final human-readable report emitted by
`emit_pe_report` and `emit_se_report` is still going through `info!`, so it is
landing on stderr instead of stdout. Update the report path in these functions
to write explicitly to stdout (for example via a stdout writer/print-style
output), while leaving non-report progress logging on `info!`. Use the existing
symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and
`emit_full_length_rows` to keep the report assembly unchanged apart from the
output target.
- Around line 355-364: The paired-read loop in detect::process_paired_reads
handles R2 EOF as an error but lets R1 EOF break silently, so mismatched inputs
can be accepted. Update the reader1/reader2 matching logic so EOF on reader1
also checks whether reader2 still has data and returns an out-of-sync error when
R2 is longer, mirroring the existing R2 exhausted before R1 path. Use the
reader1.next(), reader2.next(), and process_paired_reads match block to locate
the fix.

In `@src/bin/commands/utils.rs`:
- Around line 17-34: The `open_fastq_inputs` helper is using
`Io::new(paths.len().max(1) as u32, BUFFER_SIZE)` as if the first argument
controlled reader count, but it actually sets the compression level. Update
`open_fastq_inputs` in `utils.rs` to pass a fixed compression level consistent
with the rest of the codebase (as seen in `load_adapter_fasta_with_names`), and
revise the surrounding doc comment to describe the real meaning of
`Io::new(level, buffer_size)` instead of mentioning a thread pool or input
count.

---

Nitpick comments:
In `@src/bin/commands/detect.rs`:
- Around line 792-815: `TailAccumulator` is separated from its inherent impl by
the `base_to_slot` helper, violating the command-module rule that structs/enums
must be immediately followed by their impl blocks. Move `base_to_slot` below
`impl TailAccumulator`, or convert it into an associated helper on
`TailAccumulator`, so the struct and its impl stay adjacent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5694ef6d-25c7-4fa1-b0dc-d8b0289c434c

📥 Commits

Reviewing files that changed from the base of the PR and between df015cf and 5016330.

📒 Files selected for processing (8)
  • CHANGELOG.md
  • CLAUDE.md
  • README.md
  • src/bin/commands/detect.rs
  • src/bin/commands/mod.rs
  • src/bin/commands/trim.rs
  • src/bin/commands/utils.rs
  • src/bin/main.rs

Comment thread src/bin/commands/detect.rs
Comment thread src/bin/commands/detect.rs Outdated
Comment on lines +1123 to +1163
/// Emits the human-readable PE report on stdout via `info!`. Two sections:
/// 1. Matched kit(s): one line per mate per kit match, with the kit's full
/// published adapter sequence.
/// 2. Full-length discovered consensus: per-mate, the consensus as observed
/// in the data, with uppercase marking the kit-matched prefix and
/// lowercase marking the extension past it (typically the i7 barcode).
fn emit_pe_report(
detections: u64,
reads_scanned: u64,
overlap_hits: u64,
r1: &[AnnotatedHit<'_>],
r2: &[AnnotatedHit<'_>],
) {
info!("chelae detect complete (paired-end):");
info!(
" scanned: {} pair(s) ({} overlap hits, {} usable detections)",
fmt_count(reads_scanned),
fmt_count(overlap_hits),
fmt_count(detections),
);
emit_matched_kits_section(r1.iter().chain(r2.iter()));
info!(" Full-length discovered consensus (uppercase = matches kit, lowercase = extension):");
emit_full_length_rows(r1, detections);
emit_full_length_rows(r2, detections);
}

/// Emits the human-readable SE report on stdout via `info!`. Same two-section
/// layout as PE, with the section-2 sequences rendered through the same
/// uppercase/lowercase formatter (no-op for SE because the candidate's
/// sequence and the matched kit's sequence are typically identical).
fn emit_se_report(detections: u64, reads_scanned: u64, reported: &[AnnotatedHit<'_>]) {
info!("chelae detect complete (single-end):");
info!(
" scanned: {} read(s); {} matched a candidate adapter",
fmt_count(reads_scanned),
fmt_count(detections),
);
emit_matched_kits_section(reported.iter());
info!(" Candidate adapter(s) above --min-fraction:");
emit_full_length_rows(reported, detections);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'env_logger::Builder|Target::Stdout|target\(|info!\(|println!|writeln!\(' src/bin/main.rs src/bin/commands/detect.rs

Repository: fulcrumgenomics/chelae

Length of output: 1843


Route the final report to stdout explicitly.
Current info! calls in emit_pe_report and emit_se_report write to stderr due to the env_logger default target. Redirecting stdout misses the report. Use println! or a writeln!(std::io::stdout()) wrapper for the report sections; keep progress logs on info!.

[docs: src/bin/main.rs:49, src/bin/commands/detect.rs:1136-1163]

Current Report Emitter
fn emit_pe_report(...) {
    info!("chelae detect complete (paired-end):");
    // ...
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bin/commands/detect.rs` around lines 1123 - 1163, The final
human-readable report emitted by `emit_pe_report` and `emit_se_report` is still
going through `info!`, so it is landing on stderr instead of stdout. Update the
report path in these functions to write explicitly to stdout (for example via a
stdout writer/print-style output), while leaving non-report progress logging on
`info!`. Use the existing symbols `emit_pe_report`, `emit_se_report`,
`emit_matched_kits_section`, and `emit_full_length_rows` to keep the report
assembly unchanged apart from the output target.

Comment thread src/bin/commands/utils.rs
- (#1) PE loop: when R1 hits EOF, check whether R2 still has records.
  If it does, surface an out-of-sync error mirroring the existing
  "R2 exhausted before R1" path, instead of silently truncating to
  the shorter file. Added a `pe_r1_short_produces_clear_error` test.

- (#3) detect.rs: remove stale `push_tail_kmer` rustdoc that got
  pasted onto the top of `effective_trimmed_len`'s doc block during
  the housekeeping pass.

- (#4) utils.rs: `Io::new`'s first argument is the gzip compression
  level, not a thread-pool size. The original trim code threaded
  `paths.len()` there in the mistaken belief it sized a decompression
  pool; it doesn't (fgoxide's readers are synchronous and the level
  is only consulted by writers). Switched to `Io::new(5, BUFFER_SIZE)`
  matching `chelae trim`'s other use sites, and rewrote the doc
  comment to describe the real meaning.

- (#5) detect.rs: `base_to_slot` now lives inside `impl
  TailAccumulator`, restoring the "struct immediately followed by its
  impl block" adjacency rule that the housekeeping pass had broken.
  Call sites updated to `TailAccumulator::base_to_slot`.

Skipped (#2): the `info!`-vs-stdout report-routing finding. The
project-wide pattern is that end-of-run summaries go through `info!`
(matches `chelae trim`'s summary path); changing detect alone would
introduce an inconsistency. Left for a separate decision.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant