57 changes: 56 additions & 1 deletion CHANGELOG.md
@@ -13,6 +13,60 @@ versioned entry stamped with the release date; new entries should go under

### Added

- `chelae detect` subcommand: identifies the 3' adapter sequence(s) present
in one or two FASTQ files by sampling a modest number of records.
- Paired-end input discovers adapters via R1/R2 overlap detection (no kit
knowledge required) and builds a position-by-position consensus across
reads that landed in the same near-identical k-mer cluster.
- Single-end input scores each read against every built-in kit adapter
plus any user `--adapter-sequence` / `--adapter-fasta` candidate.
- 3'-end cleanups (poly-G default-on at min run 10; poly-X A/C/T default-on
at min run 5; cut-right quality trim default-on at `4:20`) are applied
to every read before the overlap probe / candidate scan, so 2-color
chemistry artifacts and quality-degraded tails don't corrupt detection.
All three can be tuned or disabled (`0` for the homopolymer flags,
`off`/`none`/`no` for the quality flag).
- Sampling stops once `--num-detections` usable detections (default 5000)
or `--max-reads` records (default 1M) are reached. A
`--min-detections-for-report` floor (default 20) refuses to report on
samples too small to be informative.
- The console report has two sections: matched kit(s) with their published
adapter sequences, and the full-length discovered consensus per mate
with **uppercase** marking the kit-stable region and **lowercase**
marking any per-sample extension (typically an i7/i5 barcode tail).
- Per-position consensus uses a discontinuity-aware cut: a running-min
baseline of the per-column majority fraction stops the consensus at
the first sharp drop, at an absolute floor of 50% majority, or at
the column coverage floor — whichever comes first. This prevents
plurality-noise bases (e.g. an imbalanced sample-index pool where one
index dominates at 25%) from being emitted as if they were real.
- Optional `--output-fasta` writes a cross-sample-portable FASTA: when
a discovered sequence matches a known kit (within 1 mismatch over the
first 16 bp), the kit's full published adapter is emitted in place of
the sample-specific consensus, so the FASTA round-trips cleanly through
`chelae trim --adapter-fasta` on every sample in a batch. Output paths
ending in `.gz` are transparently gzip-compressed. If any
mate (PE) or candidate (SE) fails to clear `--min-fraction`, detect
hard-fails with an actionable error rather than writing a silently
incomplete FASTA.
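
The discontinuity-aware consensus cut described in this list can be illustrated with a minimal Rust sketch. This is not the code from this PR: the `Column` type, the `CutParams` struct, and the `max_drop` and `min_coverage` values are hypothetical names and thresholds chosen for illustration. Only the 50% majority floor is taken from the entry above.

```rust
/// Per-column base counts in A, C, G, T order (hypothetical representation).
type Column = [u32; 4];

/// Illustrative thresholds; only `min_majority = 0.5` comes from the changelog.
struct CutParams {
    /// Absolute floor on the majority fraction of a column.
    min_majority: f64,
    /// Assumed: a fall of more than this below the running minimum is "sharp".
    max_drop: f64,
    /// Assumed: minimum number of reads that must cover a column.
    min_coverage: u32,
}

/// Emit the majority base of each column until the first stop condition fires.
fn consensus(columns: &[Column], p: &CutParams) -> String {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    let mut out = String::new();
    // Running minimum of the majority fractions accepted so far.
    let mut baseline = 1.0_f64;
    for col in columns {
        let coverage: u32 = col.iter().sum();
        // Stop at the column coverage floor (and never divide by zero).
        if coverage == 0 || coverage < p.min_coverage {
            break;
        }
        let (best, &count) = col
            .iter()
            .enumerate()
            .max_by_key(|&(_, &c)| c)
            .unwrap();
        let frac = f64::from(count) / f64::from(coverage);
        // Stop at the absolute majority floor or at a sharp drop below the baseline.
        if frac < p.min_majority || frac < baseline - p.max_drop {
            break;
        }
        baseline = baseline.min(frac);
        out.push(BASES[best]);
    }
    out
}
```

Under these assumptions, a column where the leading base holds only 25% of the reads ends the consensus instead of being emitted as a real base.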

### Fixed

- `chelae trim --expected-insert-size` is now honored. The hint is stored
in I-space (insert size) rather than shift-space, so it takes effect on
the first pair regardless of variable read length, and `--insert-size-stats`
reports anchor against the same I-space estimate.

### Internal

- `BUFFER_SIZE` and the FASTQ-reader opening helper are now shared between
`chelae trim` and `chelae detect` via `commands/utils.rs`.
- The benchmark pipeline gained a `trim-galore-rs` tool wrapper.

## [0.1.0] - 2026-05-13

### Added

- Initial public release of `chelae`.
- `chelae trim` subcommand: single-pass short-read FASTQ trimming and
filtering for SE and PE input.
@@ -47,4 +101,5 @@ on 2026-04-21; the entire `chelae trim` implementation was developed as
The pre-split incremental history (design decisions, performance work,
benchmarks) lives in the `fqtk` repo.

[Unreleased]: https://github.com/fulcrumgenomics/chelae/compare/HEAD...HEAD
[Unreleased]: https://github.com/fulcrumgenomics/chelae/compare/v0.1.0...HEAD
[0.1.0]: https://github.com/fulcrumgenomics/chelae/releases/tag/v0.1.0
9 changes: 6 additions & 3 deletions CLAUDE.md
@@ -6,11 +6,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

`chelae` is a FASTQ trimming and filtering toolkit written in Rust. The name is the plural of *chela*, the pincer-like claws of crustaceans — apt for a tool that clips reads.

Currently the crate ships a single subcommand, `chelae trim`, which runs the common short-read preprocessing pipeline (poly-G trim → adapter trim → read-structure hard-trim + UMI extraction → poly-X trim → sliding-window quality trim → length/N-base/mean-quality/low-qual-fraction filters) on SE or PE FASTQ data. Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report for MultiQC consumption.
The crate ships two subcommands:

Read-structure runs after adapter trim so tail-skip segments (like the `10S` in `+T10S`) drop bases from the post-adapter template, not from the raw read where the adapter step would usually have removed them already. If a read is shorter than the read-structure's fixed-length segments require (either originally or after adapter trim), the pair is dropped and counted under `reads_filtered_length`.
- **`chelae trim`** runs the common short-read preprocessing pipeline (poly-G trim → adapter trim → read-structure hard-trim + UMI extraction → poly-X trim → sliding-window quality trim → length/N-base/mean-quality/low-qual-fraction filters) on SE or PE FASTQ data. Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report for MultiQC consumption.
- **`chelae detect`** identifies the adapter sequence(s) present in a FASTQ. PE input discovers adapters by harvesting the post-template tails from overlap-detected pairs; SE input scores each read against every built-in kit (plus optional user `--adapter-sequence` / `--adapter-fasta` entries). Reports the discovered sequences and the kit (if any) they match; optional FASTA output is designed to feed straight back into `chelae trim --adapter-fasta`.

The crate structure (multi-subcommand CLI with `enum_dispatch`) is deliberately kept open-ended so additional trimming/filtering utilities can be added as siblings of `trim` without a CLI shape change.
Read-structure (in `trim`) runs after adapter trim so tail-skip segments (like the `10S` in `+T10S`) drop bases from the post-adapter template, not from the raw read where the adapter step would usually have removed them already. If a read is shorter than the read-structure's fixed-length segments require (either originally or after adapter trim), the pair is dropped and counted under `reads_filtered_length`.

The crate structure (multi-subcommand CLI with `enum_dispatch`) was deliberately built open-ended so additional trimming/filtering/QC utilities can land as siblings of `trim` and `detect` without a CLI shape change.

## Origin — very important

80 changes: 75 additions & 5 deletions README.md
@@ -32,15 +32,22 @@ The name *chelae* is the plural of [*chela*](https://en.wikipedia.org/wiki/Chela
## Contents

- [Overview](#overview)
- [Examples](#examples)
- [Options](#options)
- [`chelae trim` — examples](#chelae-trim--examples)
- [`chelae trim` — options](#chelae-trim--options)
- [`chelae detect` — examples](#chelae-detect--examples)
- [`chelae detect` — options](#chelae-detect--options)
- [Performance](#performance)
- [Installing](#installing)
- [Build Targeting and Portability](#build-targeting-and-portability)

## Overview

`chelae` exposes a single subcommand, `chelae trim`, which performs common short-read preprocessing tasks in one pass, in the following order:
`chelae` ships two subcommands:

- **`chelae trim`** — short-read FASTQ trim and filter pipeline (see below).
- **`chelae detect`** — identify the adapter sequence(s) present in a FASTQ. PE input discovers adapters via R1/R2 overlap; SE input scores reads against every built-in kit plus optional user candidates. Optional FASTA output is designed to feed straight back into `chelae trim --adapter-fasta`.

`chelae trim` performs common short-read preprocessing tasks in one pass, in the following order:

1. Poly-G 3' trim (on by default)
2. Adapter trimming — PE-overlap evidence mode, plus `--kit`, `--adapter-sequence`, and `--adapter-fasta` for SE or deep trimming
@@ -54,7 +61,7 @@ Outputs are BGZF-compressed FASTQ plus a fastp-compatible JSON report suitable f

`chelae` uses paired-read overlap detection combined with adapter-sequence confirmation to rapidly and confidently identify adapter sequence in paired-end reads. This repository includes a benchmark suite in `benchmark-pipeline/`; `chelae` is the **fastest** tool tested across all experimental setups, while also providing the **highest accuracy** trimming. See the [Performance](#performance) section below.

## Examples
## `chelae trim` — examples

PE-overlap adapter detection is on by default; for best accuracy it is also recommended to supply your adapter sequences via `--kit` or `--adapter-sequence`.

@@ -84,7 +91,7 @@ chelae trim \
--read-structures 8M4S+T +T
```

## Options
## `chelae trim` — options

The tables below summarize every option accepted by `chelae trim`. For longer
explanations (rationale, units, edge cases) run `chelae trim --help`.
@@ -156,6 +163,69 @@ aggressive (cuts at the first bad window encountered from the 5' end).
| `--filter-mean-qual <Q>` | Drop reads/pairs whose post-trim mean Phred quality is below `Q` (runs last, after every trim stage) | off |
| `--filter-low-qual <Q:F>` | Drop reads/pairs where the fraction of bases below quality `Q` exceeds `F` (e.g. `15:0.4`) | off |

## `chelae detect` — examples

`chelae detect` samples reads from one or two FASTQ files and reports the
adapter sequence(s) present. PE input discovers adapters via R1/R2 overlap (no
kit knowledge required); SE input scores reads against every built-in kit plus
any user-supplied candidate. Discovered/winning adapter sequences can be
written as FASTA for direct re-use by `chelae trim --adapter-fasta`.
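
As a rough illustration of how R1/R2 overlap can expose adapters without kit knowledge, the sketch below looks for an insert shorter than the reads and returns whatever follows it on each mate. It is a simplified assumption of the idea, not the crate's implementation: `revcomp` and `adapter_tails` are hypothetical names, matching is ungapped, and the shortest passing insert wins.

```rust
/// Reverse-complement a DNA sequence; any non-ACGT byte becomes `N`.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            _ => b'N',
        })
        .collect()
}

/// If the insert is shorter than both reads, return the post-template tails
/// `(r1_tail, r2_tail)`; these tails are the adapter evidence.
fn adapter_tails<'a>(
    r1: &'a [u8],
    r2: &'a [u8],
    min_overlap: usize,
    max_mismatch_rate: f64,
) -> Option<(&'a [u8], &'a [u8])> {
    let max_insert = r1.len().min(r2.len());
    // Only inserts strictly shorter than the reads leave a non-empty tail.
    for insert in min_overlap..max_insert {
        // For a true insert of this size, R1's prefix equals the reverse
        // complement of R2's prefix of the same length.
        let mate = revcomp(&r2[..insert]);
        let mismatches = r1[..insert]
            .iter()
            .zip(&mate)
            .filter(|(a, b)| a != b)
            .count();
        if mismatches as f64 <= max_mismatch_rate * insert as f64 {
            return Some((&r1[insert..], &r2[insert..]));
        }
    }
    None
}
```

The two parameters mirror the roles of `--overlap-min-length` and `--overlap-max-mismatch-rate`; how the real probe orders and bounds its candidates may differ.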

#### Discover the adapters in a paired-end library

```bash
chelae detect \
-i sample.r1.fq.gz sample.r2.fq.gz \
-o adapters.fa
```

#### Identify which built-in kit a single-end library is using

```bash
chelae detect -i sample.fq.gz
```

#### Score a single-end library against extra user-supplied candidates

```bash
chelae detect \
-i sample.fq.gz \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-f custom-adapters.fa
```

#### Detect adapters, then trim using the discovered FASTA

```bash
chelae detect -i r1.fq.gz r2.fq.gz -o adapters.fa
chelae trim -i r1.fq.gz r2.fq.gz -o t.r1.fq.gz t.r2.fq.gz --adapter-fasta adapters.fa
```

## `chelae detect` — options

Some flags are PE- or SE-only; the `(PE)` or `(SE)` annotation at the start of
a row's description says which mode the flag applies to, and unannotated rows
apply to both. Run `chelae detect --help` for the full rationale.

| Option | Description | Default |
|---------------------------------------|------------------------------------------------------------------------------------------------------------|--------------|
| `-i, --inputs <PATHS>...` | One (SE) or two (PE) FASTQ files; plain, gzip, or bgzf (auto-detected) | — |
| `-o, --output-fasta <PATH>` | Optional FASTA output of discovered/winning adapter(s); ready to feed back into `chelae trim --adapter-fasta` | — |
| `-a, --adapter-sequence <SEQ>...` | (SE only) Extra adapter candidate(s) to score against, in addition to every built-in kit | — |
| `-f, --adapter-fasta <PATH>` | (SE only) FASTA of extra adapter candidates; record names are preserved in the report | — |
| `-n, --num-detections <N>` | Target number of usable detections before stopping. Higher = more confident composition estimate | `5000` |
| `--max-reads <N>` | Hard cap on records scanned even if `--num-detections` isn't reached | `1000000` |
| `--min-detections-for-report <N>` | Refuse to report if final detection count falls below this floor (avoids confident-looking tiny samples) | `20` |
| `--min-fraction <0..1>` | Minimum share of detections an adapter must account for to be reported | `0.05` |
| `--min-tail-length <N>` | Minimum length of adapter evidence (bp) per detection (PE post-template tail; SE matched alignment) | `8` |
| `--overlap-min-length <N>` | (PE) Minimum overlap (bp) required for PE-overlap detection | `30` |
| `--overlap-max-mismatch-rate <0..1>` | (PE) Max fraction of mismatches in the overlap probe | `0.10` |
| `--overlap-diagnostic-length <N>` | (PE) Upper bound on the probe length per overlap-length candidate (bp) | `64` |
| `--adapter-min-length <N>` | (SE) Minimum match length (bp) when scoring a candidate against a read's 3' end | `10` |
| `--adapter-mismatch-rate <0..1>` | (SE) Max fraction of mismatches when matching a candidate against a read's 3' end | `0.125` |
| `--trim-polyg <N>` | 3' poly-G trim min run length applied before the probe (cleans 2-color "no signal" tails); `0` disables | `10` |
| `--trim-polyx <N>` | 3' poly-X (A/C/T) trim min run length applied before the probe (more aggressive than trim's `10`); `0` disables | `5` |
| `--quality-trim <W:Q>` | 3' cut-right quality trim applied before the probe; pass `off`/`none`/`no` to disable | `4:20` |
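
To make the SE thresholds concrete, here is a hedged sketch of scoring one candidate against a read's 3' end. It assumes ungapped matching and a leftmost-hit rule, which may not match what `chelae detect` does internally; `match_three_prime` is a hypothetical name, and `min_len` and `max_rate` stand in for `--adapter-min-length` and `--adapter-mismatch-rate`.

```rust
/// Return the leftmost read offset at which `adapter` matches through to the
/// read's 3' end (or through the adapter's full length), if any.
fn match_three_prime(
    read: &[u8],
    adapter: &[u8],
    min_len: usize,
    max_rate: f64,
) -> Option<usize> {
    for start in 0..read.len() {
        // Number of bases compared at this offset.
        let overlap = (read.len() - start).min(adapter.len());
        // The overlap only shrinks as `start` grows, so we can stop here.
        if overlap < min_len {
            break;
        }
        let mismatches = read[start..start + overlap]
            .iter()
            .zip(adapter)
            .filter(|(a, b)| a != b)
            .count();
        if mismatches as f64 <= max_rate * overlap as f64 {
            return Some(start);
        }
    }
    None
}
```

With the defaults from the table (`10` and `0.125`), a 13 bp adapter prefix at the end of a read tolerates one mismatch, while anything shorter than 10 bp is never counted as a detection.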

## Performance

A Snakemake pipeline in [`benchmark-pipeline/`](benchmark-pipeline/) runs