fulcrumgenomics · tfenne · Jul 2, 2026 · Jun 26, 2026
@@ -13,6 +13,60 @@ versioned entry stamped with the release date; new entries should go under
 
 ### Added
 
+- `chelae detect` subcommand: identifies the 3' adapter sequence(s) present
+  in one or two FASTQ files by sampling a modest number of records.
+  - Paired-end input discovers adapters via R1/R2 overlap detection (no kit
+    knowledge required) and builds a position-by-position consensus across
+    reads that landed in the same near-identical k-mer cluster.
+  - Single-end input scores each read against every built-in kit adapter
+    plus any user `--adapter-sequence` / `--adapter-fasta` candidate.
+  - 3'-end cleanups (poly-G default-on at min run 10; poly-X A/C/T default-on
+    at min run 5; cut-right quality trim default-on at `4:20`) are applied
+    to every read before the overlap probe / candidate scan, so 2-color
+    chemistry artifacts and quality-degraded tails don't corrupt detection.
+    All three can be tuned or disabled (`0` for the homopolymer flags,
+    `off`/`none`/`no` for the quality flag).
+  - Sampling stops once `--num-detections` usable detections (default 5000)
+    or `--max-reads` records (default 1M) are reached. A
+    `--min-detections-for-report` floor (default 20) refuses to report on
+    samples too small to be informative.
+  - The console report has two sections: matched kit(s) with their published
+    adapter sequences, and the full-length discovered consensus per mate
+    with **uppercase** marking the kit-stable region and **lowercase**
+    marking any per-sample extension (typically an i7/i5 barcode tail).
+  - Per-position consensus uses a discontinuity-aware cut: a running-min
+    baseline of the per-column majority fraction stops the consensus at
+    the first sharp drop, at an absolute floor of 50% majority, or at
+    the column coverage floor — whichever comes first. This prevents
+    plurality-noise bases (e.g. an imbalanced sample-index pool where one
+    index dominates at 25%) from being emitted as if they were real.
+  - Optional `--output-fasta` writes a cross-sample-portable FASTA: when
+    a discovered sequence matches a known kit (within 1 mismatch over the
+    first 16 bp), the kit's full published adapter is emitted in place of
+    the sample-specific consensus, so the FASTA round-trips cleanly through
+    `chelae trim --adapter-fasta` on every sample in a batch. Output paths
+    ending in `.gz` are transparently gzip-compressed. If any
+    mate (PE) or candidate (SE) fails to clear `--min-fraction`, detect
+    hard-fails with an actionable error rather than writing a silently
+    incomplete FASTA.
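The discontinuity-aware cut described in the list above can be sketched as follows. This is a minimal illustration, not the crate's code: the `Column` type, the `CutParams` struct, and the `max_drop` and `min_coverage` values are hypothetical, and only the 50% majority floor is taken from the changelog text.

```rust
/// Per-column base counts in A, C, G, T order (hypothetical layout).
type Column = [u32; 4];

/// Illustrative tuning knobs. Only `min_majority` (0.50) comes from the
/// changelog; `max_drop` and `min_coverage` are assumed for this sketch.
struct CutParams {
    min_majority: f64, // absolute majority floor
    max_drop: f64,     // largest tolerated fall below the running minimum
    min_coverage: u32, // fewest reads a column needs to be trusted
}

/// Walks the columns 5' to 3' and stops at the first column that fails
/// the coverage floor, the majority floor, or the sharp-drop test.
fn consensus(columns: &[Column], p: &CutParams) -> String {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    let mut out = String::new();
    // Running minimum of the majority fractions accepted so far.
    let mut baseline = 1.0_f64;

    for col in columns {
        let coverage: u32 = col.iter().sum();
        if coverage == 0 || coverage < p.min_coverage {
            break; // column coverage floor
        }
        // Index and count of the most frequent base in this column.
        let (best, &count) = col
            .iter()
            .enumerate()
            .max_by_key(|&(_, c)| *c)
            .unwrap();
        let majority = f64::from(count) / f64::from(coverage);
        if majority < p.min_majority {
            break; // absolute majority floor
        }
        if majority < baseline - p.max_drop {
            break; // sharp drop relative to the running-min baseline
        }
        baseline = baseline.min(majority);
        out.push(BASES[best]);
    }
    out
}
```

With these assumed parameters, a column whose top base holds only 25 to 30% of the reads (the imbalanced index pool case mentioned above) fails the majority floor and ends the consensus instead of being emitted.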
+
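The kit-substitution rule in the `--output-fasta` entry above (at most 1 mismatch over the first 16 bp) might look roughly like the sketch below. The function names, the `(name, sequence)` kit tuple, and the choices to compare case-insensitively and to reject sequences shorter than 16 bp are assumptions made for illustration; the changelog does not specify them.

```rust
/// Values quoted in the changelog entry.
const PREFIX_LEN: usize = 16;
const MAX_MISMATCHES: usize = 1;

/// True when `discovered` agrees with `kit_adapter` over the first
/// `prefix_len` bases with at most `max_mismatches` substitutions.
/// Assumption: case is ignored (the report lowercases per-sample tails)
/// and sequences shorter than the prefix never match.
fn matches_kit(
    discovered: &str,
    kit_adapter: &str,
    prefix_len: usize,
    max_mismatches: usize,
) -> bool {
    let d = discovered.as_bytes();
    let k = kit_adapter.as_bytes();
    if d.len() < prefix_len || k.len() < prefix_len {
        return false;
    }
    let mismatches = d[..prefix_len]
        .iter()
        .zip(&k[..prefix_len])
        .filter(|&(a, b)| !a.eq_ignore_ascii_case(b))
        .count();
    mismatches <= max_mismatches
}

/// Picks the sequence to write to the FASTA: the kit's full published
/// adapter when a kit matches, otherwise the sample-specific consensus.
/// `kits` is a hypothetical list of (name, adapter) pairs.
fn portable_sequence<'a>(discovered: &'a str, kits: &[(&str, &'a str)]) -> &'a str {
    kits.iter()
        .find(|(_, seq)| matches_kit(discovered, seq, PREFIX_LEN, MAX_MISMATCHES))
        .map(|&(_, seq)| seq)
        .unwrap_or(discovered)
}
```

Emitting the kit sequence rather than the consensus is what keeps the FASTA free of per-sample barcode tails, so one file can be reused across a batch.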
+### Fixed
+
+- `chelae trim --expected-insert-size` is now honored. The hint is stored
+  in I-space (insert size) rather than shift-space, so it takes effect on
+  the first pair regardless of variable read length, and `--insert-size-stats`
+  reports anchor against the same I-space estimate.
+
+### Internal
+
+- `BUFFER_SIZE` and the FASTQ-reader opening helper are now shared between
+  `chelae trim` and `chelae detect` via `commands/utils.rs`.
+- The benchmark pipeline gained a `trim-galore-rs` tool wrapper.
+
+## [0.1.0] - 2026-05-13
+
+### Added
+
 - Initial public release of `chelae`.
 - `chelae trim` subcommand: single-pass short-read FASTQ trimming and
   filtering for SE and PE input.
@@ -47,4 +101,5 @@ on 2026-04-21; the entire `chelae trim` implementation was developed as
 The pre-split incremental history (design decisions, performance work,
 benchmarks) lives in the `fqtk` repo.
 
-[Unreleased]: https://github.com/fulcrumgenomics/chelae/compare/HEAD...HEAD
+[Unreleased]: https://github.com/fulcrumgenomics/chelae/compare/v0.1.0...HEAD
+[0.1.0]: https://github.com/fulcrumgenomics/chelae/releases/tag/v0.1.0
@@ -6,11 +6,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 `chelae` is a FASTQ trimming and filtering toolkit written in Rust. The name is the plural of *chela*, the pincer-like claws of crustaceans — apt for a tool that clips reads.
 
-Currently the crate ships a single subcommand, `chelae trim`, which runs the common short-read preprocessing pipeline (poly-G trim → adapter trim → read-structure hard-trim + UMI extraction → poly-X trim → sliding-window quality trim → length/N-base/mean-quality/low-qual-fraction filters) on SE or PE FASTQ data. Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report for MultiQC consumption.
+The crate ships two subcommands:
 
-Read-structure runs after adapter trim so tail-skip segments (like the `10S` in `+T10S`) drop bases from the post-adapter template, not from the raw read where the adapter step would usually have removed them already. If a read is shorter than the read-structure's fixed-length segments require (either originally or after adapter trim), the pair is dropped and counted under `reads_filtered_length`.
+- **`chelae trim`** runs the common short-read preprocessing pipeline (poly-G trim → adapter trim → read-structure hard-trim + UMI extraction → poly-X trim → sliding-window quality trim → length/N-base/mean-quality/low-qual-fraction filters) on SE or PE FASTQ data. Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report for MultiQC consumption.
+- **`chelae detect`** identifies the adapter sequence(s) present in a FASTQ. PE input discovers adapters by harvesting the post-template tails from overlap-detected pairs; SE input scores each read against every built-in kit (plus optional user `--adapter-sequence` / `--adapter-fasta` entries). Reports the discovered sequences and the kit (if any) they match; optional FASTA output is designed to feed straight back into `chelae trim --adapter-fasta`.
 
-The crate structure (multi-subcommand CLI with `enum_dispatch`) is deliberately kept open-ended so additional trimming/filtering utilities can be added as siblings of `trim` without a CLI shape change.
+Read-structure (in `trim`) runs after adapter trim so tail-skip segments (like the `10S` in `+T10S`) drop bases from the post-adapter template, not from the raw read where the adapter step would usually have removed them already. If a read is shorter than the read-structure's fixed-length segments require (either originally or after adapter trim), the pair is dropped and counted under `reads_filtered_length`.
+
+The crate structure (multi-subcommand CLI with `enum_dispatch`) was deliberately built open-ended so additional trimming/filtering/QC utilities can land as siblings of `trim` and `detect` without a CLI shape change.
 
 ## Origin — very important
 

@@ -32,15 +32,22 @@ The name *chelae* is the plural of [*chela*](https://en.wikipedia.org/wiki/Chela
 ## Contents
 
 - [Overview](#overview)
-- [Examples](#examples)
-- [Options](#options)
+- [`chelae trim` — examples](#chelae-trim--examples)
+- [`chelae trim` — options](#chelae-trim--options)
+- [`chelae detect` — examples](#chelae-detect--examples)
+- [`chelae detect` — options](#chelae-detect--options)
 - [Performance](#performance)
 - [Installing](#installing)
 - [Build Targeting and Portability](#build-targeting-and-portability)
 
 ## Overview
 
-`chelae` exposes a single subcommand, `chelae trim`, which performs common short-read preprocessing tasks in one pass, in the following order:
+`chelae` ships two subcommands:
+
+- **`chelae trim`** — short-read FASTQ trim and filter pipeline (see below).
+- **`chelae detect`** — identify the adapter sequence(s) present in a FASTQ. PE input discovers adapters via R1/R2 overlap; SE input scores reads against every built-in kit plus optional user candidates. Optional FASTA output is designed to feed straight back into `chelae trim --adapter-fasta`.
+
+`chelae trim` performs common short-read preprocessing tasks in one pass, in the following order:
 
 1. Poly-G 3' trim (on by default)
 2. Adapter trimming — PE-overlap evidence mode, plus `--kit`, `--adapter-sequence`, and `--adapter-fasta` for SE or deep trimming
@@ -54,7 +61,7 @@ Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report suitable f
 
 `chelae` uses paired-read overlap detection combined with adapter-sequence confirmation to rapidly and confidently identify adapter sequence in paired-end reads.  This repository includes a benchmark suite in `benchmark-pipeline/`; `chelae` is the **fastest** tool tested across all experimental setups, while also providing the **highest accuracy** trimming. See the [Performance](#performance) section below.
 
-## Examples
+## `chelae trim` — examples
 
 PE-overlap adapter detection is on by default; for best accuracy it is also recommended to supply your adapter sequences via `--kit` or `--adapter-sequence`.
 
@@ -84,7 +91,7 @@ chelae trim \
     --read-structures 8M4S+T +T
 ```
 
-## Options
+## `chelae trim` — options
 
 The tables below summarize every option accepted by `chelae trim`. For longer
 explanations (rationale, units, edge cases) run `chelae trim --help`.
@@ -156,6 +163,69 @@ aggressive (cuts at the first bad window encountered from the 5' end).
 | `--filter-mean-qual <Q>`          | Drop reads/pairs whose post-trim mean Phred quality is below `Q` (runs last, after every trim stage)     | off     |
 | `--filter-low-qual <Q:F>`         | Drop reads/pairs where the fraction of bases below quality `Q` exceeds `F` (e.g. `15:0.4`)               | off     |
 
+## `chelae detect` — examples
+
+`chelae detect` samples reads from one or two FASTQ files and reports the
+adapter sequence(s) present. PE input discovers adapters via R1/R2 overlap (no
+kit knowledge required); SE input scores reads against every built-in kit plus
+any user-supplied candidate. Discovered/winning adapter sequences can be
+written as FASTA for direct re-use by `chelae trim --adapter-fasta`.
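To make the PE mode concrete, here is a rough sketch of how overlap detection can expose adapter tails without any kit knowledge. It is an illustration of the general technique, not `chelae`'s implementation: the function names and the largest-insert-first search order are assumptions, and the probe-length bound (`--overlap-diagnostic-length`) is not modeled.

```rust
/// Reverse-complements a DNA sequence; non-ACGT bytes become N.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// Tries each candidate insert size, largest first. For the right insert
/// size, the first `insert` bases of R1 equal the reverse complement of
/// the first `insert` bases of R2; whatever follows in each read is the
/// post-template tail, i.e. adapter.
fn adapter_tails<'a>(
    r1: &'a [u8],
    r2: &'a [u8],
    min_overlap: usize,
    max_mismatch_rate: f64,
    min_tail: usize,
) -> Option<(&'a [u8], &'a [u8])> {
    let rc2 = revcomp(r2);
    let longest = r1.len().min(r2.len());
    for insert in (min_overlap..=longest).rev() {
        let from_r1 = &r1[..insert];
        // The template part of R2 sits at the end of its reverse complement.
        let from_r2 = &rc2[rc2.len() - insert..];
        let mismatches = from_r1.iter().zip(from_r2).filter(|(x, y)| x != y).count();
        if mismatches as f64 <= max_mismatch_rate * insert as f64 {
            let (tail1, tail2) = (&r1[insert..], &r2[insert..]);
            // Only report tails long enough to count as adapter evidence.
            let long_enough = tail1.len() >= min_tail && tail2.len() >= min_tail;
            return long_enough.then_some((tail1, tail2));
        }
    }
    None
}
```

Calling `adapter_tails(r1, r2, 30, 0.10, 8)` mirrors the documented defaults for `--overlap-min-length`, `--overlap-max-mismatch-rate`, and `--min-tail-length`; pairs whose insert spans the full read yield no tails and return `None`.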
+
+#### Discover the adapters in a paired-end library
+
+```bash
+chelae detect \
+    -i sample.r1.fq.gz sample.r2.fq.gz \
+    -o adapters.fa
+```
+
+#### Identify which built-in kit a single-end library is using
+
+```bash
+chelae detect -i sample.fq.gz
+```
+
+#### Score a single-end library against extra user-supplied candidates
+
+```bash
+chelae detect \
+    -i sample.fq.gz \
+    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
+    -f custom-adapters.fa
+```
+
+#### Detect adapters, then trim using the discovered FASTA
+
+```bash
+chelae detect -i r1.fq.gz r2.fq.gz -o adapters.fa
+chelae trim -i r1.fq.gz r2.fq.gz -o t.r1.fq.gz t.r2.fq.gz --adapter-fasta adapters.fa
+```
+
+## `chelae detect` — options
+
+Several flags apply only to PE or only to SE input; the parenthesized mode tag
+at the start of a row's description says which. Run `chelae detect --help` for
+the full rationale.
+
+| Option                                | Description                                                                                                | Default      |
+|---------------------------------------|------------------------------------------------------------------------------------------------------------|--------------|
+| `-i, --inputs <PATHS>...`             | One (SE) or two (PE) FASTQ files; plain, gzip, or bgzf (auto-detected)                                     | —            |
+| `-o, --output-fasta <PATH>`           | Optional FASTA output of discovered/winning adapter(s); ready to feed back into `chelae trim --adapter-fasta` | —            |
+| `-a, --adapter-sequence <SEQ>...`     | (SE only) Extra adapter candidate(s) to score against, in addition to every built-in kit                    | —            |
+| `-f, --adapter-fasta <PATH>`          | (SE only) FASTA of extra adapter candidates; record names are preserved in the report                       | —            |
+| `-n, --num-detections <N>`            | Target number of usable detections before stopping. Higher = more confident composition estimate            | `5000`       |
+| `--max-reads <N>`                     | Hard cap on records scanned even if `--num-detections` isn't reached                                        | `1000000`    |
+| `--min-detections-for-report <N>`     | Refuse to report if final detection count falls below this floor (avoids confident-looking tiny samples)    | `20`         |
+| `--min-fraction <0..1>`               | Minimum share of detections an adapter must account for to be reported                                      | `0.05`       |
+| `--min-tail-length <N>`               | Minimum length of adapter evidence (bp) per detection (PE post-template tail; SE matched alignment)         | `8`          |
+| `--overlap-min-length <N>`            | (PE) Minimum overlap (bp) required for PE-overlap detection                                                 | `30`         |
+| `--overlap-max-mismatch-rate <0..1>`  | (PE) Max fraction of mismatches in the overlap probe                                                        | `0.10`       |
+| `--overlap-diagnostic-length <N>`     | (PE) Upper bound on the probe length per overlap-length candidate (bp)                                      | `64`         |
+| `--adapter-min-length <N>`            | (SE) Minimum match length (bp) when scoring a candidate against a read's 3' end                             | `10`         |
+| `--adapter-mismatch-rate <0..1>`      | (SE) Max fraction of mismatches when matching a candidate against a read's 3' end                           | `0.125`      |
+| `--trim-polyg <N>`                    | 3' poly-G trim min run length applied before the probe (cleans 2-color "no signal" tails); `0` disables     | `10`         |
+| `--trim-polyx <N>`                    | 3' poly-X (A/C/T) trim min run length applied before the probe (lower, hence more aggressive, than the `chelae trim` default of `10`); `0` disables | `5`          |
+| `--quality-trim <W:Q>`                | 3' cut-right quality trim applied before the probe; pass `off`/`none`/`no` to disable                       | `4:20`       |
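As an illustration of the `--quality-trim <W:Q>` row, the sketch below parses a spec such as `4:20` (or one of the disabling keywords) and applies a cut-right scan. The type and function names are hypothetical, the keywords are matched exactly as written in the table, and the handling of reads shorter than one window is an assumption rather than documented behavior.

```rust
/// Parsed `W:Q` spec: window width and minimum mean Phred quality.
#[derive(Debug, PartialEq)]
struct QualityTrim {
    window: usize,
    min_mean_q: u8,
}

/// Parses `W:Q`, returning `Ok(None)` for `off`, `none`, or `no`.
fn parse_quality_trim(spec: &str) -> Result<Option<QualityTrim>, String> {
    if matches!(spec, "off" | "none" | "no") {
        return Ok(None);
    }
    let (w, q) = spec
        .split_once(':')
        .ok_or_else(|| format!("expected W:Q, got `{spec}`"))?;
    let window = w.parse::<usize>().map_err(|e| e.to_string())?;
    let min_mean_q = q.parse::<u8>().map_err(|e| e.to_string())?;
    if window == 0 {
        return Err("window must be at least 1".to_string());
    }
    Ok(Some(QualityTrim { window, min_mean_q }))
}

/// Returns how many bases to keep. `quals` holds numeric Phred scores
/// (already decoded from ASCII). Windows are scanned from the 5' end and
/// the read is cut at the start of the first window whose mean quality
/// falls below the threshold. Reads shorter than one window are kept
/// whole (an assumption of this sketch).
fn cut_right(quals: &[u8], t: &QualityTrim) -> usize {
    for (start, win) in quals.windows(t.window).enumerate() {
        let sum: u32 = win.iter().map(|&q| u32::from(q)).sum();
        // Compare sums instead of means to stay in integer arithmetic.
        if sum < u32::from(t.min_mean_q) * t.window as u32 {
            return start;
        }
    }
    quals.len()
}
```

For a read with five Q30 bases followed by four Q10 bases, the default `4:20` spec keeps the first four bases under this sketch, because the window starting at index 4 is the first whose mean drops below 20.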
+
 ## Performance
 
 A Snakemake pipeline in [`benchmark-pipeline/`](benchmark-pipeline/) runs