Commit 2a76e4f

Initial import: chelae — FASTQ trim/filter toolkit extracted from fqtk
Brings the `trim` subcommand (and its associated build / CI / documentation scaffolding) from `fqtk` on branch `tf_trim` into a standalone crate. Source of truth for the pre-split design history is `fqtk`'s `tf_trim_backup` branch (30+ individual commits); the upstream `tf_trim` branch carries the same content as three squashed commits.

Pipeline stages (in order):

1. Read-structure hard-trim and UMI extraction
2. Poly-G 3' trim (on by default; SIMD kernel is ~free)
3. Optional poly-X 3' trim (`--trim-polyx`)
4. Adapter trim: PE-overlap evidence mode, plus `--kit`, `--adapter-sequence`, `--adapter-fasta` for SE or deep trimming
5. Optional 5'→3' then 3'→5' sliding-window quality trim
6. Length filter (`--filter-length MIN[:MAX]`)
7. Optional N-base, mean-quality, and low-quality-fraction filters

Architecture:

* Worker-pool threading: a per-input reader thread (decompress + parse), N worker threads running the full trim → filter → serialize → compress pipeline on pooled batches, and per-output writer threads that stream bytes in input order via oneshot-per-batch ordering.
* Portable SIMD (`wide`) kernels: poly-G/X tail, sliding-window 3' quality trim, Q20/Q30/N counting, low-quality-fraction counting, 3' adapter matching (ACGT fast path; IUPAC falls back to scalar), PE-overlap evidence scans.
* x86_64 release builds target `x86-64-v3` (AVX2+); a pre-main CPUID guard prevents `SIGILL` on older hardware.

Outputs:

* BGZF-compressed FASTQ.
* fastp-compatible JSON report with per-mate Q20/Q30 stats, filtering breakdown, adapter cutting stats, and detected adapter sequences.
* MultiQC consumes the JSON directly via its `fastp` module.

Library (`chelae_lib`) exposes just the pieces trim needs:

* `IUPAC_MASKS`: 256-byte ASCII-indexed LUT for IUPAC base compatibility.
* `adapter_db`: built-in kit presets (truseq, nextera, small-rna, aviti, mgi/dnbseq) and the `all` union selectable via `--kit`.

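To illustrate what a 256-byte ASCII-indexed IUPAC lookup table can look like, here is a minimal sketch. It is not the crate's actual `IUPAC_MASKS`: the 4-bit encoding (A=1, C=2, G=4, T=8), the lowercase handling, and the names `build_masks`, `MASKS`, and `compatible` are all assumptions made for illustration.

```rust
/// Hypothetical encoding: one bit per concrete base (A=1, C=2, G=4, T=8).
/// An ambiguity code is the OR of the bases it can stand for.
const fn build_masks() -> [u8; 256] {
    let mut table = [0u8; 256];
    let codes: [(u8, u8); 15] = [
        (b'A', 0b0001),
        (b'C', 0b0010),
        (b'G', 0b0100),
        (b'T', 0b1000),
        (b'R', 0b0101), // A or G
        (b'Y', 0b1010), // C or T
        (b'S', 0b0110), // C or G
        (b'W', 0b1001), // A or T
        (b'K', 0b1100), // G or T
        (b'M', 0b0011), // A or C
        (b'B', 0b1110), // not A
        (b'D', 0b1101), // not C
        (b'H', 0b1011), // not G
        (b'V', 0b0111), // not T
        (b'N', 0b1111), // any base
    ];
    let mut i = 0;
    while i < codes.len() {
        let (code, mask) = codes[i];
        table[code as usize] = mask;
        // Assumption: lowercase letters are treated like uppercase.
        table[code.to_ascii_lowercase() as usize] = mask;
        i += 1;
    }
    table
}

/// ASCII-indexed table; every byte that is not an IUPAC code maps to 0.
const MASKS: [u8; 256] = build_masks();

/// Two symbols are compatible when they share at least one concrete base.
fn compatible(a: u8, b: u8) -> bool {
    MASKS[a as usize] & MASKS[b as usize] != 0
}

fn main() {
    println!("R vs A: {}", compatible(b'R', b'A'));
    println!("R vs C: {}", compatible(b'R', b'C'));
}
```

Indexing by the raw byte keeps the per-base check to one load and one AND, with no branching on the character, which is the usual reason to prefer a table over a `match`.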
Retained from fqtk: `rust-toolchain.toml` (Rust 1.95), `rustfmt.toml`, `ci/check.sh`, pre-commit script, GitHub Actions workflow (README sync + format / clippy / test matrix), Fulcrum logos, CODEOWNERS, CLAUDE.md project conventions.
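The "oneshot-per-batch ordering" mentioned under Architecture can be sketched with the standard library alone. This is an illustration of the general technique, not the crate's implementation: `process` is a stand-in for the trim → filter → serialize → compress work, `run_ordered` is a hypothetical name, and a capacity-1 `sync_channel` plays the role of a oneshot channel.

```rust
use std::sync::mpsc::{self, Receiver, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the real per-batch work (hypothetical).
fn process(batch: Vec<u8>) -> Vec<u8> {
    batch.to_ascii_uppercase()
}

/// Runs `process` on worker threads, but emits results in input order.
fn run_ordered(batches: Vec<Vec<u8>>, n_workers: usize) -> Vec<u8> {
    // Work queue: every job carries its own single-use reply sender.
    let (work_tx, work_rx) = mpsc::channel::<(Vec<u8>, SyncSender<Vec<u8>>)>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    // Ordering queue: reply receivers, enqueued in input order.
    let (order_tx, order_rx) = mpsc::channel::<Receiver<Vec<u8>>>();

    let workers: Vec<_> = (0..n_workers.max(1))
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // so workers process their jobs concurrently.
                let job = work_rx.lock().unwrap().recv();
                match job {
                    Ok((batch, reply)) => {
                        let _ = reply.send(process(batch));
                    }
                    Err(_) => break, // queue closed: no more work
                }
            })
        })
        .collect();

    // Writer: waits on each batch's reply in the order batches were read,
    // regardless of which worker finishes first.
    let writer = thread::spawn(move || {
        let mut out = Vec::new();
        for reply_rx in order_rx {
            out.extend(reply_rx.recv().expect("worker dropped a batch"));
        }
        out
    });

    // Reader: hands out work and records the expected output order.
    for batch in batches {
        let (reply_tx, reply_rx) = mpsc::sync_channel(1);
        work_tx.send((batch, reply_tx)).unwrap();
        order_tx.send(reply_rx).unwrap();
    }
    drop(work_tx);
    drop(order_tx);

    for worker in workers {
        worker.join().unwrap();
    }
    writer.join().unwrap()
}

fn main() {
    let out = run_ordered(vec![b"ab".to_vec(), b"cd".to_vec(), b"ef".to_vec()], 2);
    println!("{}", String::from_utf8_lossy(&out));
}
```

Because the writer blocks on the receivers in submission order, no sequence numbers or reorder buffer are needed; a slow batch simply delays the batches queued behind it.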
0 parents  commit 2a76e4f

24 files changed

Lines changed: 8332 additions & 0 deletions

.cargo/config.toml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Baseline x86_64 release builds to x86-64-v3 (AVX2 + BMI1/BMI2 + FMA).
+#
+# AVX2 is present on effectively every x86_64 CPU shipped since 2013 (Intel
+# Haswell) / 2017 (AMD Zen) / 2021 (Intel Atom Gracemont+), and on essentially
+# 100% of cloud compute available today. Building to x86-64-v3 recovers ~6%
+# wall-time vs. the default x86-64 (SSE2) baseline on AVX2 hosts — per our
+# Granite Rapids EC2 benchmark — and costs nothing on them.
+#
+# Hosts without AVX2 (pre-Haswell Intel, pre-Zen AMD, pre-Gracemont Atom) will
+# SIGILL on illegal instruction. `src/bin/main.rs` runs a CPUID check at the
+# top of `main()` and prints a friendly error instead, though that check is
+# best-effort — if the Rust runtime emits AVX2 ops before `main()` runs, the
+# SIGILL comes first. Users on old hardware should rebuild with the default
+# portable baseline via `RUSTFLAGS="-C target-cpu=x86-64" cargo build --release`.
+#
+# aarch64 targets (Apple Silicon, AWS Graviton) are untouched and keep their
+# native NEON baseline.
+
+[target.x86_64-unknown-linux-gnu]
+rustflags = ["-C", "target-cpu=x86-64-v3"]
+
+[target.x86_64-unknown-linux-musl]
+rustflags = ["-C", "target-cpu=x86-64-v3"]
+
+[target.x86_64-apple-darwin]
+rustflags = ["-C", "target-cpu=x86-64-v3"]
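The comments above describe a CPUID check at the top of `main()`. A minimal sketch of such a guard, using the standard library's runtime feature detection, might look like the following. The function names `cpu_supports_v3` and `check_cpu`, the error wording, and the exact feature set tested are assumptions; the actual check in `src/bin/main.rs` is not shown in this diff, and x86-64-v3 includes more features than the four tested here.

```rust
/// Best-effort check for the headline x86-64-v3 features named in the
/// config comment (AVX2, BMI1, BMI2, FMA). Hypothetical helper.
#[cfg(target_arch = "x86_64")]
fn cpu_supports_v3() -> bool {
    std::is_x86_feature_detected!("avx2")
        && std::is_x86_feature_detected!("bmi1")
        && std::is_x86_feature_detected!("bmi2")
        && std::is_x86_feature_detected!("fma")
}

/// Non-x86_64 targets (e.g. aarch64) are not affected by the v3 baseline.
#[cfg(not(target_arch = "x86_64"))]
fn cpu_supports_v3() -> bool {
    true
}

/// Returns a readable error instead of letting the process die with SIGILL.
fn check_cpu() -> Result<(), String> {
    if cpu_supports_v3() {
        Ok(())
    } else {
        Err(String::from(
            "this binary was built for x86-64-v3 (AVX2) and this CPU lacks it; \
             rebuild with RUSTFLAGS=\"-C target-cpu=x86-64\" cargo build --release",
        ))
    }
}

fn main() {
    // Run the guard before any other work.
    if let Err(msg) = check_cpu() {
        eprintln!("error: {msg}");
        std::process::exit(1);
    }
    println!("CPU check passed");
}
```

As the config comment notes, a guard like this is only best-effort: if AVX2 instructions execute before `main()` is reached, the `SIGILL` arrives before the message can be printed.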

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+# Default owners: Nils and Tim
+* @nh13 @tfenne

.github/scripts/update-docs.sh

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
#!/bin/bash
2+
3+
set -euo pipefail
4+
5+
echo -e "<!-- start usage -->\n\`\`\`\`console\n" > usage.txt;
6+
./target/debug/chelae trim --help | sed -e 's_^[ ]*$__g' >> usage.txt
7+
echo -e "\`\`\`\`\n<!-- end usage -->" >> usage.txt;
8+
sed -e '/<!-- start usage -->/,/<!-- end usage -->/!b' -e '/<!-- end usage -->/!d;r usage.txt' -e 'd' README.md > README.md.new;
9+
mv README.md.new README.md;
10+
rm usage.txt;
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+name: Check
+
+on: [push, pull_request]
+
+env:
+  CARGO_TERM_COLOR: always
+
+jobs:
+  check:
+    name: Check
+    runs-on: ubuntu-24.04
+    steps:
+      - name: Checkout sources
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+
+      - name: Install stable toolchain
+        uses: codota/toolchain@00a8bf2bdcfe93aefd70422d3dec07337959d3a4 # v1
+        with:
+          profile: minimal
+          toolchain: stable
+
+      - name: Cache dependencies
+        uses: Swatinem/rust-cache@e18b497796c12c097a38f9edb9d0641fb99eee32 # v2
+
+      - name: Run cargo check
+        uses: actions-rs/cargo@844f36862e911db73fe0815f00a4a2602c279505 # v1
+        with:
+          command: check
+
+      - name: Run cargo build
+        uses: actions-rs/cargo@844f36862e911db73fe0815f00a4a2602c279505 # v1
+        with:
+          command: build
+
+      - name: Append usage to the README.md
+        shell: bash
+        run: .github/scripts/update-docs.sh
+
+      - name: Verify no unstaged changes
+        shell: bash
+        run: |
+          if [[ "$(git status --porcelain)" != "" ]]; then
+            echo ----------------------------------------
+            echo git status
+            echo ----------------------------------------
+            git status
+            echo ----------------------------------------
+            echo git diff
+            echo ----------------------------------------
+            git diff
+            echo ----------------------------------------
+            echo Troubleshooting
+            echo ----------------------------------------
+            echo '::error::Unstaged changes detected. You probably need to update the usage in the README.md. Use `.github/scripts/update-docs.sh` to update the README.'
+            exit 1
+          fi
+  precommit:
+    name: Pre-commit
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-24.04, macOS-latest]
+    steps:
+      - name: Checkout sources
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+
+      - name: Install stable toolchain
+        uses: codota/toolchain@00a8bf2bdcfe93aefd70422d3dec07337959d3a4 # v1
+        with:
+          profile: minimal
+          toolchain: stable
+
+      - name: Cache dependencies
+        uses: Swatinem/rust-cache@e18b497796c12c097a38f9edb9d0641fb99eee32 # v2
+
+      - name: Run tests
+        run: bash src/scripts/precommit.sh
+

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+/target
+.vscode
+.idea
+.claude
