detect: add chelae detect subcommand for adapter identification#8
detect: add chelae detect subcommand for adapter identification#8tfenne wants to merge 4 commits into
Conversation
`chelae detect` samples reads from one or two FASTQ files and identifies which 3' adapter sequence(s) are present. Two modes: - Paired-end: runs the same overlap probe as `chelae trim` with an empty adapter-evidence library so every probe-accepted shift becomes a candidate detection, harvests the post-template tails on each mate, and builds a position-by-position consensus across reads that landed in the same 16-bp k-mer cluster. The consensus extends as long as per- position coverage stays above an adaptive floor. - Single-end: scores each read against every built-in kit adapter plus any user `--adapter-sequence` / `--adapter-fasta` candidate via the same scanner trim uses. Reports are split into two sections: matched kit(s) with their published sequences, then the full-length discovered consensus with uppercase marking the kit-stable region and lowercase marking the per-sample extension (typically the i7/i5 barcode). Optional `--output-fasta` writes a cross-sample-portable artifact: kit-matched hits emit the kit's published adapter sequence (not the sample-specific consensus), so the FASTA can be fed back into `chelae trim --adapter-fasta` and applied across an entire batch without barcode bleed-through breaking the per-sample match. Shared infrastructure with trim: BUFFER_SIZE and open_fastq_inputs lifted into commands/utils.rs; trim helpers (Adapter, OverlapStats, detect_pe_overlap, find_adapter_3prime, etc.) made pub(crate) for reuse; new load_adapter_fasta_with_names so user-curated FASTA record names survive into the detect output. Defaults are tuned for the discovery use case rather than trim's: --num-detections=5000, --min-detections-for-report=20 floor to refuse reporting on too-small samples, --adapter-min-length=10 (bumped from trim's 6 because detect scores 8+ candidates per read, raising the per-position random-match floor). README updated with examples + options table for the new subcommand.
Three follow-ups from the opus/sonnet/haiku adversarial pass: - Validate `--min-detections-for-report <= --max-reads` (previously validated only against `--num-detections`; a config like `--num-detections 200000 --max-reads 10000 --min-detections-for-report 50000` would pass validation and then guaranteed-fail mid-run). - `KIT_NEAR_MATCH_HAMMING` 2 -> 1. The FASTA writer substitutes the kit's published sequence in for any kit match (exact OR fuzzy), and 2 mm over 16 bp was too loose a bar for that substitution. A novel adapter sharing 14/16 bp with a kit prefix should not be silently relabeled as that kit; real sequencing-error fuzzy matches almost always sit at 1 mm given the per-base error rate. The ALL_KITS members are pairwise far enough that this doesn't cost legitimate matches. - `TailAccumulator::base_counts` widened to `[u64; 5]`. Defensive against pathological `--num-detections` values; theoretical-only at the default 5000. Plus a new substantive feature in response to the same review: apply the 3'-end cleanups from `chelae trim` (poly-G / poly-X / cut-right quality) to each read before it reaches the overlap probe or the candidate scan. detect's specificity bias justifies more aggressive defaults than trim's sensitivity-leaning ones: - `--trim-polyg N` default 10 (matches trim); `0` disables. - `--trim-polyx N` default 5 (down from trim's opt-in default of 10); `0` disables. - `--quality-trim W:Q` default `4:20` (smaller window than trim's `8:20` -> more aggressive); pass `off`/`none` to disable. Cut-right semantics — truncates at the start of the first failing window. Reuses the existing SIMD kernels via two new `pub(crate)` helpers: `find_polyx_tail_len` (already pure) and a freshly-extracted `cut_right_quality_position` (split out of `trim_quality_sliding_5prime` so detect doesn't need an `OwnedRecord`). Two of sonnet's findings were ruled out as non-bugs: the PE `insert >= r1_seq.len() || insert >= r2_seq.len()` skip on asymmetric reads is a sensitivity miss not a correctness bug (and matches the docstring contract); SE FASTA emitting the user's `--adapter-sequence` verbatim instead of substituting the kit canonical is the right behavior because the user explicitly asked us to score against that sequence. README options table updated. Smoke test against synthetic TruSeq + polyG dropoff + Q0 tail confirms the trims fire and the FASTA artifact stays kit-canonical.
- detect.rs: align with CLAUDE.md module-ordering convention. Move `QualityTrimSetting` (a helper enum for `Detect.quality_trim`) below `impl Command for Detect` since "higher-to-lower level" puts the CLI struct first. Move `BASES` from a body-level location up with the other top-of-file constants. Move `AnnotatedHit` up next to the other helper structs (Hit/Candidate/KitMatch/KitMate). Restore a docstring on `push_tail_kmer` that was clipped during a prior reorder. - CHANGELOG: split the previously-stale `[Unreleased]` block into a proper `[0.1.0] - 2026-05-13` entry (the v0.1.0 release), and rewrite `[Unreleased]` as a final-state diff vs v0.1.0 — not a chronological log of intermediate commits. The new entry covers the `chelae detect` subcommand, the `chelae trim --expected-insert-size` fix, and the shared `commands/utils.rs` refactor. No behavior change.
|
Warning Review limit reached
More reviews will be available in 17 minutes and 5 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR adds a new 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/bin/commands/detect.rs (1)
792-815: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winKeep
TailAccumulator’s impl adjacent to the struct.
base_to_slotsplits the struct from its inherent impl. Move the helper below the impl, or make it an associated helper.As per coding guidelines, command modules require “structs/enums with immediately-following impl blocks.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bin/commands/detect.rs` around lines 792 - 815, `TailAccumulator` is separated from its inherent impl by the `base_to_slot` helper, violating the command-module rule that structs/enums must be immediately followed by their impl blocks. Move `base_to_slot` below `impl TailAccumulator`, or convert it into an associated helper on `TailAccumulator`, so the struct and its impl stay adjacent.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bin/commands/detect.rs`:
- Around line 862-875: Remove the stale rustdoc about push_tail_kmer from the
effective_trimmed_len documentation block in detect.rs. Keep only the comments
that describe effective_trimmed_len’s 3' cleanup order and returned effective
length, and delete the bucket/k-mer insertion text so the docs match the
function name and behavior.
- Around line 1123-1163: The final human-readable report emitted by
`emit_pe_report` and `emit_se_report` is still going through `info!`, so it is
landing on stderr instead of stdout. Update the report path in these functions
to write explicitly to stdout (for example via a stdout writer/print-style
output), while leaving non-report progress logging on `info!`. Use the existing
symbols `emit_pe_report`, `emit_se_report`, `emit_matched_kits_section`, and
`emit_full_length_rows` to keep the report assembly unchanged apart from the
output target.
- Around line 355-364: The paired-read loop in detect::process_paired_reads
handles R2 EOF as an error but lets R1 EOF break silently, so mismatched inputs
can be accepted. Update the reader1/reader2 matching logic so EOF on reader1
also checks whether reader2 still has data and returns an out-of-sync error when
R2 is longer, mirroring the existing R2 exhausted before R1 path. Use the
reader1.next(), reader2.next(), and process_paired_reads match block to locate
the fix.
In `@src/bin/commands/utils.rs`:
- Around line 17-34: The `open_fastq_inputs` helper is using
`Io::new(paths.len().max(1) as u32, BUFFER_SIZE)` as if the first argument
controlled reader count, but it actually sets the compression level. Update
`open_fastq_inputs` in `utils.rs` to pass a fixed compression level consistent
with the rest of the codebase (as seen in `load_adapter_fasta_with_names`), and
revise the surrounding doc comment to describe the real meaning of
`Io::new(level, buffer_size)` instead of mentioning a thread pool or input
count.
---
Nitpick comments:
In `@src/bin/commands/detect.rs`:
- Around line 792-815: `TailAccumulator` is separated from its inherent impl by
the `base_to_slot` helper, violating the command-module rule that structs/enums
must be immediately followed by their impl blocks. Move `base_to_slot` below
`impl TailAccumulator`, or convert it into an associated helper on
`TailAccumulator`, so the struct and its impl stay adjacent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5694ef6d-25c7-4fa1-b0dc-d8b0289c434c
📒 Files selected for processing (8)
CHANGELOG.mdCLAUDE.mdREADME.mdsrc/bin/commands/detect.rssrc/bin/commands/mod.rssrc/bin/commands/trim.rssrc/bin/commands/utils.rssrc/bin/main.rs
| /// Emits the human-readable PE report on stdout via `info!`. Two sections: | ||
| /// 1. Matched kit(s): one line per mate per kit match, with the kit's full | ||
| /// published adapter sequence. | ||
| /// 2. Full-length discovered consensus: per-mate, the consensus as observed | ||
| /// in the data, with uppercase marking the kit-matched prefix and | ||
| /// lowercase marking the extension past it (typically the i7 barcode). | ||
| fn emit_pe_report( | ||
| detections: u64, | ||
| reads_scanned: u64, | ||
| overlap_hits: u64, | ||
| r1: &[AnnotatedHit<'_>], | ||
| r2: &[AnnotatedHit<'_>], | ||
| ) { | ||
| info!("chelae detect complete (paired-end):"); | ||
| info!( | ||
| " scanned: {} pair(s) ({} overlap hits, {} usable detections)", | ||
| fmt_count(reads_scanned), | ||
| fmt_count(overlap_hits), | ||
| fmt_count(detections), | ||
| ); | ||
| emit_matched_kits_section(r1.iter().chain(r2.iter())); | ||
| info!(" Full-length discovered consensus (uppercase = matches kit, lowercase = extension):"); | ||
| emit_full_length_rows(r1, detections); | ||
| emit_full_length_rows(r2, detections); | ||
| } | ||
|
|
||
| /// Emits the human-readable SE report on stdout via `info!`. Same two-section | ||
| /// layout as PE, with the section-2 sequences rendered through the same | ||
| /// uppercase/lowercase formatter (no-op for SE because the candidate's | ||
| /// sequence and the matched kit's sequence are typically identical). | ||
| fn emit_se_report(detections: u64, reads_scanned: u64, reported: &[AnnotatedHit<'_>]) { | ||
| info!("chelae detect complete (single-end):"); | ||
| info!( | ||
| " scanned: {} read(s); {} matched a candidate adapter", | ||
| fmt_count(reads_scanned), | ||
| fmt_count(detections), | ||
| ); | ||
| emit_matched_kits_section(reported.iter()); | ||
| info!(" Candidate adapter(s) above --min-fraction:"); | ||
| emit_full_length_rows(reported, detections); | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n 'env_logger::Builder|Target::Stdout|target\(|info!\(|println!|writeln!\(' src/bin/main.rs src/bin/commands/detect.rsRepository: fulcrumgenomics/chelae
Length of output: 1843
Route the final report to stdout explicitly.
Current info! calls in emit_pe_report and emit_se_report write to stderr due to the env_logger default target. Redirecting stdout misses the report. Use println! or a writeln!(std::io::stdout()) wrapper for the report sections; keep progress logs on info!.
[docs: src/bin/main.rs:49, src/bin/commands/detect.rs:1136-1163]
Current Report Emitter
fn emit_pe_report(...) {
info!("chelae detect complete (paired-end):");
// ...
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/bin/commands/detect.rs` around lines 1123 - 1163, The final
human-readable report emitted by `emit_pe_report` and `emit_se_report` is still
going through `info!`, so it is landing on stderr instead of stdout. Update the
report path in these functions to write explicitly to stdout (for example via a
stdout writer/print-style output), while leaving non-report progress logging on
`info!`. Use the existing symbols `emit_pe_report`, `emit_se_report`,
`emit_matched_kits_section`, and `emit_full_length_rows` to keep the report
assembly unchanged apart from the output target.
- (#1) PE loop: when R1 hits EOF, check whether R2 still has records. If it does, surface an out-of-sync error mirroring the existing "R2 exhausted before R1" path, instead of silently truncating to the shorter file. Added a `pe_r1_short_produces_clear_error` test. - (#3) detect.rs: remove stale `push_tail_kmer` rustdoc that got pasted onto the top of `effective_trimmed_len`'s doc block during the housekeeping pass. - (#4) utils.rs: `Io::new`'s first argument is the gzip compression level, not a thread-pool size. The original trim code threaded `paths.len()` there in the mistaken belief it sized a decompression pool; it doesn't (fgoxide's readers are synchronous and the level is only consulted by writers). Switched to `Io::new(5, BUFFER_SIZE)` matching `chelae trim`'s other use sites, and rewrote the doc comment to describe the real meaning. - (#5) detect.rs: `base_to_slot` now lives inside `impl TailAccumulator`, restoring the "struct immediately followed by its impl block" adjacency rule that the housekeeping pass had broken. Call sites updated to `TailAccumulator::base_to_slot`. Skipped (#2): the `info!`-vs-stdout report-routing finding. The project-wide pattern is that end-of-run summaries go through `info!` (matches `chelae trim`'s summary path); changing detect alone would introduce an inconsistency. Left for a separate decision.
Summary
Adds a second subcommand,
chelae detect, that identifies which 3' adapter sequence(s) are present in one or two FASTQ files.--adapter-sequence/--adapter-fastacandidate via the same scanner trim uses.chelae trim --adapter-fastaon every sample in a batch.Also fixes
chelae trim --expected-insert-sizeso the hint actually takes effect on the first pair, and lifts shared FASTQ-reader infrastructure intocommands/utils.rs.Notable design decisions
detect_pe_overlapwith an emptyOverlapAdapterLibraryso probe acceptance discovers novel adapters. Low-complexity inserts (poly-A cfDNA, microsatellites) can produce spurious "discovered" tails, but the kit/library noise floor surfaces in the report rather than hiding the real adapter when present.u64counters inTailAccumulator::base_counts(defensive against pathological--num-detections).4:20) vs trim's conservative ones — detect can afford to scrub more because it's looking for the dominant adapter signal across many reads.Review history
This branch went through five review rounds (one independent reviewer plus an opus/sonnet/haiku adversarial pass after the major changes). Findings acted on are described in the individual commit messages; findings ruled out (with rationale) are documented inline:
--adapter-sequenceverbatim (not substituting the kit canonical) — intentional, that's what the user explicitly asked us to score against.insert >= r1_seq.len() || insert >= r2_seq.len()skipping asymmetric-read pairs — matches the docstring contract; sensitivity miss, not correctness bug.aggregate_kmers— sub-second at default settings (N ≤ 5000).Test plan
bash ci/check.sh(fmt + clippy + tests) green on the branchchelae detect -o adapters.fa→chelae trim --adapter-fasta adapters.faon the same library, confirm trim metrics look right