detect: add chelae detect subcommand for adapter identification by tfenne · Pull Request #8 · fulcrumgenomics/chelae

tfenne · 2026-06-26T12:55:46Z

Summary

Adds a second subcommand, chelae detect, that identifies which 3' adapter sequence(s) are present in one or two FASTQ files.

PE mode discovers adapters via R1/R2 overlap (no kit knowledge required), builds a position-by-position consensus across reads in each near-identical k-mer cluster, and reports the consensus extending while per-position coverage stays above an adaptive floor.
SE mode scores each read against every built-in kit adapter plus any user --adapter-sequence / --adapter-fasta candidate via the same scanner trim uses.
3'-end cleanups (poly-G, poly-X, cut-right quality) are applied to every read before the probe, with aggressive defaults reflecting detect's specificity bias (vs trim's sensitivity bias).
Console output is two sections: matched kit(s) with their published adapter sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking any per-sample extension (typically the i7/i5 barcode tail).
Optional FASTA output is cross-sample-portable: when a kit match is detected (within 1 mm over the first 16 bp), the kit's published adapter is emitted in place of the in-sample consensus, so the FASTA round-trips cleanly through chelae trim --adapter-fasta on every sample in a batch.

Also fixes chelae trim --expected-insert-size so the hint actually takes effect on the first pair, and lifts shared FASTQ-reader infrastructure into commands/utils.rs.

Notable design decisions

PE empty adapter-evidence library: detect calls detect_pe_overlap with an empty OverlapAdapterLibrary so probe acceptance discovers novel adapters. Low-complexity inserts (poly-A cfDNA, microsatellites) can produce spurious "discovered" tails, but the kit/library noise floor surfaces in the report rather than hiding the real adapter when present.
Kit-match Hamming threshold = 1 (not 2). A kit match triggers FASTA substitution; 14/16-bp similarity was too loose a bar for that substitution.
u64 counters in TailAccumulator::base_counts (defensive against pathological --num-detections).
Aggressive 3'-end trim defaults in detect (poly-X min run 5, quality trim 4:20) vs trim's conservative ones — detect can afford to scrub more because it's looking for the dominant adapter signal across many reads.

Review history

This branch went through five review rounds (one independent reviewer plus an opus/sonnet/haiku adversarial pass after the major changes). Findings acted on are described in the individual commit messages; findings ruled out (with rationale) are documented inline:

SE FASTA emitting user-supplied --adapter-sequence verbatim (not substituting the kit canonical) — intentional, that's what the user explicitly asked us to score against.
PE insert >= r1_seq.len() || insert >= r2_seq.len() skipping asymmetric-read pairs — matches the docstring contract; sensitivity miss, not correctness bug.
O(N²) aggregation in aggregate_kmers — sub-second at default settings (N ≤ 5000).

Test plan

bash ci/check.sh (fmt + clippy + tests) green on the branch
67 detect tests passing (28 unit + 7 E2E + 32 validation/edge-case)
267 total tests passing
Smoke-tested against synthetic TruSeq PE data with polyG dropoff + Q0 tail
Smoke-tested against synthetic TruSeq PE with a fixed per-sample i7 barcode
Try it on a real FASTQ file from a known kit and confirm the kit match
Try it on a real FASTQ file from an unknown chemistry and confirm a sensible novel-adapter consensus
Round-trip: chelae detect -o adapters.fa → chelae trim --adapter-fasta adapters.fa on the same library, confirm trim metrics look right

`chelae detect` samples reads from one or two FASTQ files and identifies which 3' adapter sequence(s) are present. Two modes: - Paired-end: runs the same overlap probe as `chelae trim` with an empty adapter-evidence library so every probe-accepted shift becomes a candidate detection, harvests the post-template tails on each mate, and builds a position-by-position consensus across reads that landed in the same 16-bp k-mer cluster. The consensus extends as long as per- position coverage stays above an adaptive floor. - Single-end: scores each read against every built-in kit adapter plus any user `--adapter-sequence` / `--adapter-fasta` candidate via the same scanner trim uses. Reports are split into two sections: matched kit(s) with their published sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking the per-sample extension (typically the i7/i5 barcode). Optional `--output-fasta` writes a cross-sample-portable artifact: kit-matched hits emit the kit's published adapter sequence (not the sample-specific consensus), so the FASTA can be fed back into `chelae trim --adapter-fasta` and applied across an entire batch without barcode bleed-through breaking the per-sample match. Shared infrastructure with trim: BUFFER_SIZE and open_fastq_inputs lifted into commands/utils.rs; trim helpers (Adapter, OverlapStats, detect_pe_overlap, find_adapter_3prime, etc.) made pub(crate) for reuse; new load_adapter_fasta_with_names so user-curated FASTA record names survive into the detect output. Defaults are tuned for the discovery use case rather than trim's: --num-detections=5000, --min-detections-for-report=20 floor to refuse reporting on too-small samples, --adapter-min-length=10 (bumped from trim's 6 because detect scores 8+ candidates per read, raising the per-position random-match floor). README updated with examples + options table for the new subcommand.

Three follow-ups from the opus/sonnet/haiku adversarial pass: - Validate `--min-detections-for-report <= --max-reads` (previously validated only against `--num-detections`; a config like `--num-detections 200000 --max-reads 10000 --min-detections-for-report 50000` would pass validation and then guaranteed-fail mid-run). - `KIT_NEAR_MATCH_HAMMING` 2 -> 1. The FASTA writer substitutes the kit's published sequence in for any kit match (exact OR fuzzy), and 2 mm over 16 bp was too loose a bar for that substitution. A novel adapter sharing 14/16 bp with a kit prefix should not be silently relabeled as that kit; real sequencing-error fuzzy matches almost always sit at 1 mm given the per-base error rate. The ALL_KITS members are pairwise far enough that this doesn't cost legitimate matches. - `TailAccumulator::base_counts` widened to `[u64; 5]`. Defensive against pathological `--num-detections` values; theoretical-only at the default 5000. Plus a new substantive feature in response to the same review: apply the 3'-end cleanups from `chelae trim` (poly-G / poly-X / cut-right quality) to each read before it reaches the overlap probe or the candidate scan. detect's specificity bias justifies more aggressive defaults than trim's sensitivity-leaning ones: - `--trim-polyg N` default 10 (matches trim); `0` disables. - `--trim-polyx N` default 5 (down from trim's opt-in default of 10); `0` disables. - `--quality-trim W:Q` default `4:20` (smaller window than trim's `8:20` -> more aggressive); pass `off`/`none` to disable. Cut-right semantics — truncates at the start of the first failing window. Reuses the existing SIMD kernels via two new `pub(crate)` helpers: `find_polyx_tail_len` (already pure) and a freshly-extracted `cut_right_quality_position` (split out of `trim_quality_sliding_5prime` so detect doesn't need an `OwnedRecord`). Two of sonnet's findings were ruled out as non-bugs: the PE `insert >= r1_seq.len() || insert >= r2_seq.len()` skip on asymmetric reads is a sensitivity miss not a correctness bug (and matches the docstring contract); SE FASTA emitting the user's `--adapter-sequence` verbatim instead of substituting the kit canonical is the right behavior because the user explicitly asked us to score against that sequence. README options table updated. Smoke test against synthetic TruSeq + polyG dropoff + Q0 tail confirms the trims fire and the FASTA artifact stays kit-canonical.

- detect.rs: align with CLAUDE.md module-ordering convention. Move `QualityTrimSetting` (a helper enum for `Detect.quality_trim`) below `impl Command for Detect` since "higher-to-lower level" puts the CLI struct first. Move `BASES` from a body-level location up with the other top-of-file constants. Move `AnnotatedHit` up next to the other helper structs (Hit/Candidate/KitMatch/KitMate). Restore a docstring on `push_tail_kmer` that was clipped during a prior reorder. - CHANGELOG: split the previously-stale `[Unreleased]` block into a proper `[0.1.0] - 2026-05-13` entry (the v0.1.0 release), and rewrite `[Unreleased]` as a final-state diff vs v0.1.0 — not a chronological log of intermediate commits. The new entry covers the `chelae detect` subcommand, the `chelae trim --expected-insert-size` fix, and the shared `commands/utils.rs` refactor. No behavior change.

coderabbitai · 2026-06-26T13:09:33Z

Warning

Review limit reached

@tfenne, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 17 minutes and 5 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 037be60d-18fc-4fec-80c4-f1ba37454b21

📥 Commits

Reviewing files that changed from the base of the PR and between 5016330 and 7efc8f1.

📒 Files selected for processing (2)

src/bin/commands/detect.rs
src/bin/commands/utils.rs

📝 Walkthrough

Walkthrough

The PR adds a new chelae detect subcommand, wires it into CLI dispatch, and updates the project docs and changelog. It also introduces shared FASTQ input utilities and makes several trimming helpers reusable across commands. The detect implementation covers paired-end and single-end adapter discovery, reporting, optional FASTA output, validation, and tests.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding the chelae detect subcommand for adapter identification.
Description check	✅ Passed	The description is directly related to the changeset and accurately describes the new detect command and related fixes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch tf_detect_cmd

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

src/bin/commands/detect.rs (1)
792-815: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Keep TailAccumulator’s impl adjacent to the struct.

base_to_slot splits the struct from its inherent impl. Move the helper below the impl, or make it an associated helper.

As per coding guidelines, command modules require “structs/enums with immediately-following impl blocks.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bin/commands/detect.rs` around lines 792 - 815, `TailAccumulator` is
separated from its inherent impl by the `base_to_slot` helper, violating the
command-module rule that structs/enums must be immediately followed by their
impl blocks. Move `base_to_slot` below `impl TailAccumulator`, or convert it
into an associated helper on `TailAccumulator`, so the struct and its impl stay
adjacent.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bin/commands/detect.rs`:
- Around line 862-875: Remove the stale rustdoc about push_tail_kmer from the
effective_trimmed_len documentation block in detect.rs. Keep only the comments
that describe effective_trimmed_len’s 3' cleanup order and returned effective
length, and delete the bucket/k-mer insertion text so the docs match the
function name and behavior.
- Around line 1123-1163: The final human-readable report emitted by
`emit_pe_report` and `emit_se_report` is still going through `info!`, so it is
landing on stderr instead of stdout. Update the report path in these functions
to write explicitly to stdout (for example via a stdout writer/print-style
output), while leaving non-report progress logging on `info!`. Use the existing
symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and
`emit_full_length_rows` to keep the report assembly unchanged apart from the
output target.
- Around line 355-364: The paired-read loop in detect::process_paired_reads
handles R2 EOF as an error but lets R1 EOF break silently, so mismatched inputs
can be accepted. Update the reader1/reader2 matching logic so EOF on reader1
also checks whether reader2 still has data and returns an out-of-sync error when
R2 is longer, mirroring the existing R2 exhausted before R1 path. Use the
reader1.next(), reader2.next(), and process_paired_reads match block to locate
the fix.

In `@src/bin/commands/utils.rs`:
- Around line 17-34: The `open_fastq_inputs` helper is using
`Io::new(paths.len().max(1) as u32, BUFFER_SIZE)` as if the first argument
controlled reader count, but it actually sets the compression level. Update
`open_fastq_inputs` in `utils.rs` to pass a fixed compression level consistent
with the rest of the codebase (as seen in `load_adapter_fasta_with_names`), and
revise the surrounding doc comment to describe the real meaning of
`Io::new(level, buffer_size)` instead of mentioning a thread pool or input
count.

---

Nitpick comments:
In `@src/bin/commands/detect.rs`:
- Around line 792-815: `TailAccumulator` is separated from its inherent impl by
the `base_to_slot` helper, violating the command-module rule that structs/enums
must be immediately followed by their impl blocks. Move `base_to_slot` below
`impl TailAccumulator`, or convert it into an associated helper on
`TailAccumulator`, so the struct and its impl stay adjacent.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5694ef6d-25c7-4fa1-b0dc-d8b0289c434c

📥 Commits

Reviewing files that changed from the base of the PR and between df015cf and 5016330.

📒 Files selected for processing (8)

CHANGELOG.md
CLAUDE.md
README.md
src/bin/commands/detect.rs
src/bin/commands/mod.rs
src/bin/commands/trim.rs
src/bin/commands/utils.rs
src/bin/main.rs

coderabbitai · 2026-06-26T13:09:36Z

+/// Emits the human-readable PE report on stdout via `info!`. Two sections:
+///   1. Matched kit(s): one line per mate per kit match, with the kit's full
+///      published adapter sequence.
+///   2. Full-length discovered consensus: per-mate, the consensus as observed
+///      in the data, with uppercase marking the kit-matched prefix and
+///      lowercase marking the extension past it (typically the i7 barcode).
+fn emit_pe_report(
+    detections: u64,
+    reads_scanned: u64,
+    overlap_hits: u64,
+    r1: &[AnnotatedHit<'_>],
+    r2: &[AnnotatedHit<'_>],
+) {
+    info!("chelae detect complete (paired-end):");
+    info!(
+        "  scanned:    {} pair(s) ({} overlap hits, {} usable detections)",
+        fmt_count(reads_scanned),
+        fmt_count(overlap_hits),
+        fmt_count(detections),
+    );
+    emit_matched_kits_section(r1.iter().chain(r2.iter()));
+    info!("  Full-length discovered consensus (uppercase = matches kit, lowercase = extension):");
+    emit_full_length_rows(r1, detections);
+    emit_full_length_rows(r2, detections);
+}
+
+/// Emits the human-readable SE report on stdout via `info!`. Same two-section
+/// layout as PE, with the section-2 sequences rendered through the same
+/// uppercase/lowercase formatter (no-op for SE because the candidate's
+/// sequence and the matched kit's sequence are typically identical).
+fn emit_se_report(detections: u64, reads_scanned: u64, reported: &[AnnotatedHit<'_>]) {
+    info!("chelae detect complete (single-end):");
+    info!(
+        "  scanned:    {} read(s); {} matched a candidate adapter",
+        fmt_count(reads_scanned),
+        fmt_count(detections),
+    );
+    emit_matched_kits_section(reported.iter());
+    info!("  Candidate adapter(s) above --min-fraction:");
+    emit_full_length_rows(reported, detections);
+}


🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash rg -n 'env_logger::Builder|Target::Stdout|target\(|info!\(|println!|writeln!\(' src/bin/main.rs src/bin/commands/detect.rs

Repository: fulcrumgenomics/chelae

Length of output: 1843

Route the final report to stdout explicitly.
Current info! calls in emit_pe_report and emit_se_report write to stderr due to the env_logger default target. Redirecting stdout misses the report. Use println! or a writeln!(std::io::stdout()) wrapper for the report sections; keep progress logs on info!.

[docs: src/bin/main.rs:49, src/bin/commands/detect.rs:1136-1163]

Current Report Emitter

fn emit_pe_report(...) { info!("chelae detect complete (paired-end):"); // ... }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bin/commands/detect.rs` around lines 1123 - 1163, The final human-readable report emitted by `emit_pe_report` and `emit_se_report` is still going through `info!`, so it is landing on stderr instead of stdout. Update the report path in these functions to write explicitly to stdout (for example via a stdout writer/print-style output), while leaving non-report progress logging on `info!`. Use the existing symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and `emit_full_length_rows` to keep the report assembly unchanged apart from the output target.

- (#1) PE loop: when R1 hits EOF, check whether R2 still has records. If it does, surface an out-of-sync error mirroring the existing "R2 exhausted before R1" path, instead of silently truncating to the shorter file. Added a `pe_r1_short_produces_clear_error` test. - (#3) detect.rs: remove stale `push_tail_kmer` rustdoc that got pasted onto the top of `effective_trimmed_len`'s doc block during the housekeeping pass. - (#4) utils.rs: `Io::new`'s first argument is the gzip compression level, not a thread-pool size. The original trim code threaded `paths.len()` there in the mistaken belief it sized a decompression pool; it doesn't (fgoxide's readers are synchronous and the level is only consulted by writers). Switched to `Io::new(5, BUFFER_SIZE)` matching `chelae trim`'s other use sites, and rewrote the doc comment to describe the real meaning. - (#5) detect.rs: `base_to_slot` now lives inside `impl TailAccumulator`, restoring the "struct immediately followed by its impl block" adjacency rule that the housekeeping pass had broken. Call sites updated to `TailAccumulator::base_to_slot`. Skipped (#2): the `info!`-vs-stdout report-routing finding. The project-wide pattern is that end-of-run summaries go through `info!` (matches `chelae trim`'s summary path); changing detect alone would introduce an inconsistency. Left for a separate decision.

tfenne added 3 commits June 26, 2026 05:10

coderabbitai Bot requested changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

detect: add chelae detect subcommand for adapter identification#8

detect: add chelae detect subcommand for adapter identification#8
tfenne wants to merge 4 commits into
mainfrom
tf_detect_cmd

tfenne commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Review limit reached

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tfenne commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notable design decisions

Review history

Test plan

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tfenne commented Jun 26, 2026 •

edited

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading