Skip to content

Commit e96aabf

Browse files
committed
methylsieve: streaming per-template detection of unconverted bisulfite/EM-seq reads
methylsieve is a new tool that reads query-grouped SAM/BAM straight from the aligner and makes a single per-template decision — from cytosines outside the CpG context, where methylation is rare — to flag, QC-fail, or remove reads that escaped bisulfite or enzymatic conversion. It runs inline in the alignment pipe with negligible overhead, offers count/proportion/adaptive decision modes, handles overlapping pairs, single-end reads, and spike-in controls, and emits per-context conversion-rate and per-template decision-matrix metrics. The repo also ships a Snakemake/pixi benchmark pipeline that measures accuracy and throughput against peer tools (NEB mark-nonconverted-reads, biscuit, bismark).
0 parents  commit e96aabf

63 files changed

Lines changed: 13950 additions & 0 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.cargo/config.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
[alias]
2+
ci-test = "nextest run --workspace --locked --no-tests=pass"
3+
ci-fmt = "fmt --all -- --check"
4+
ci-lint = "clippy --workspace --all-targets -- -D warnings"
Lines changed: 42 additions & 0 deletions
Loading
Lines changed: 42 additions & 0 deletions
Loading

.github/workflows/check.yml

Lines changed: 56 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,56 @@
1+
name: Check and Test
2+
3+
on:
4+
push:
5+
branches:
6+
- main
7+
pull_request:
8+
9+
env:
10+
CARGO_TERM_COLOR: always
11+
CARGO_INCREMENTAL: 0
12+
SCCACHE_GHA_ENABLED: "true"
13+
RUSTC_WRAPPER: "sccache"
14+
15+
jobs:
16+
test:
17+
runs-on: ubuntu-latest
18+
steps:
19+
- name: Checkout code
20+
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
21+
- name: Install Rust toolchain
22+
uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
23+
- name: Set up compilation cache (sccache)
24+
uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
25+
- name: Install nextest
26+
uses: taiki-e/install-action@e4b21e1a4de672bc0de96fc2afab9d8081b1d622 # nextest
27+
- name: Unit tests
28+
run: cargo ci-test
29+
30+
lint:
31+
runs-on: ubuntu-latest
32+
steps:
33+
- name: Checkout code
34+
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
35+
- name: Install Rust toolchain
36+
uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
37+
with:
38+
components: clippy
39+
- name: Set up compilation cache (sccache)
40+
uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
41+
- name: Clippy check
42+
run: cargo ci-lint
43+
44+
format:
45+
runs-on: ubuntu-latest
46+
steps:
47+
- name: Checkout code
48+
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
49+
- name: Install Rust toolchain
50+
uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
51+
with:
52+
components: rustfmt
53+
- name: Set up compilation cache (sccache)
54+
uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
55+
- name: Rustfmt check
56+
run: cargo ci-fmt

.github/workflows/deny.yml

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
name: Supply chain (cargo-deny)
2+
3+
# Runs on every PR (so a PR that adds a problematic dep fails review) and
4+
# on a weekly cron (so a newly published advisory against an existing dep
5+
# surfaces even on a quiet repo). Decoupled from the main test/lint
6+
# workflow so an advisory drop on Friday doesn't red-flag unrelated PRs.
7+
8+
on:
9+
pull_request:
10+
push:
11+
branches:
12+
- main
13+
schedule:
14+
- cron: "0 12 * * 1" # Mondays 12:00 UTC
15+
workflow_dispatch:
16+
17+
jobs:
18+
cargo-deny:
19+
runs-on: ubuntu-latest
20+
steps:
21+
- name: Checkout code
22+
uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
23+
- name: Run cargo-deny
24+
uses: EmbarkStudios/cargo-deny-action@bb137d7af7e4fb67e5f82a49c4fce4fad40782fe # v2.0.20
25+
with:
26+
command: check advisories licenses bans sources

.gitignore

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
/target
2+
*.swp
3+
.DS_Store
4+
.claude/
5+
/plans/

CHANGELOG.md

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
# Changelog
2+
3+
All notable changes to this project will be documented in this file.
4+
5+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
6+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7+
8+
## [Unreleased]
9+
10+
### Added
11+
- Initial implementation: streaming, query-grouped SAM/BAM in, BAM out.
12+
- Per-template unconverted-read decision using all primary + supplementary
13+
records of a QNAME, with the decision (an `XX:Z:UC` aux tag and/or the
14+
`0x200` QC-fail flag) propagated to every record of the template.
15+
- Per-record strand determination (`monitor_C = (R1 or unpaired) XOR reverse`),
16+
matching MethylDackel's `getStrand()` and correctly handling reverse-mapped
17+
supplementaries.
18+
- Reference-based context determination for CpA/CpC/CpT/CpG.
19+
- `--mode` for combining the count and proportion tests: `count`, `proportion`,
20+
`either`, or `adaptive` (the default — proportion at/above `--min-sites`,
21+
count below it, so low-site templates are still evaluated while high-site
22+
templates are judged on rate rather than an absolute count that over-penalizes
23+
long reads). `--min-sites` is the proportion floor (below it the proportion is
24+
unestimable and abstains) and the count↔proportion switch point in `adaptive`.
25+
Default thresholds: count 3, fraction 0.05, min-sites 40 (the smallest floor
26+
that keeps the adaptive switch continuous at those values). In `proportion`
27+
mode a stderr warning reports how many templates fell below the floor and
28+
passed through unevaluated; the `below_min_sites_templates` diagnostic in
29+
`--stats` exposes that population in every mode.
30+
- `--ignore-template-ends N`: ignore the outermost N bases at each end of the
31+
template (fragment) when tallying — the end-repair fill-in / A-tailing–prone
32+
positions. Trimmed by genomic position at the 5' sequenced ends of R1 and R2
33+
(the fragment termini), so an overlapped end is dropped in both mates while
34+
interior read ends are kept; single-end / orphan reads fall back to trimming
35+
both ends of the read. Replaces the per-read `--ignore-5p` / `--ignore-3p`.
36+
- Per-record count annotation `ch:Z:x/y` (on by default): x is the unconverted
37+
count and y the total monitored sites in the `--contexts` subset — the exact
38+
numerator/denominator of the decision — as a per-template aggregate stamped on
39+
every record, so any read carries the evidence behind its call. Rename with
40+
`--count-tag <NAME>` or disable with `--no-count-tag`.
41+
- `--min-base-quality` filtering (default 20), and
42+
`--ignore-supplementary-evidence`.
43+
- Spike-in `--control-contig` scopes and a multi-row per-context conversion-rate
44+
`--stats` TSV.
45+
- Verified concordant with NEB `mark-nonconverted-reads` on the shared
46+
(proper-pair) code path.
47+
- `--ref-encoding {bytes,nibble,twobit}`: pack the in-memory reference to trade a
48+
little throughput for a lot of memory. **Default is `twobit`** (2-bit, ~¼ the
49+
resident genome) — chosen because in an input-rate-limited pipe its small CPU
50+
cost is hidden while the memory saving multiplies across parallel sample
51+
pipelines. `twobit` folds non-ACGT bases to A, which never changes a conversion
52+
call and only relabels the CpH/CpG context of a monitored C/G adjacent to a
53+
former-N (gap edges; below measurement noise). `nibble` (4-bit, ~½ RAM) is
54+
bit-identical; `bytes` (1 byte/base) is fastest for a single non-rate-limited
55+
stream. The tally hot path is generic over a `RefCodes` trait, monomorphized
56+
per encoding.
57+
- PE-overlap deduplication: reference positions covered by both mates of an
58+
overlapping proper pair are counted once. The overlap is split at its midpoint
59+
and each mate keeps the half nearer its own 5' end (where base quality is
60+
highest), so neither read's calls dominate the whole overlap. Improves accuracy
61+
for short-insert / cfDNA libraries and avoids redundant work.
62+
63+
[Unreleased]: https://github.com/fulcrumgenomics/methylsieve/commits/main

CONTRIBUTING.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
# Contributing to methylsieve
2+
3+
Thanks for your interest in improving methylsieve.
4+
5+
## Development
6+
7+
methylsieve targets the stable Rust toolchain pinned in `rust-toolchain.toml`
8+
(edition 2024). The crate clones the IO architecture of its sibling project
9+
`dupblaster` (dedicated `ringbuf`-backed IO threads, `fgumi-raw-bam` zero-copy
10+
records); the bisulfite/EM-seq logic lives in `src/sieve.rs` and
11+
`src/reference.rs`.
12+
13+
### Verification suite
14+
15+
Run all three before sending a change (these mirror CI):
16+
17+
```bash
18+
cargo ci-fmt # rustfmt --check
19+
cargo ci-lint # clippy -D warnings
20+
cargo ci-test # nextest (or `cargo test`)
21+
```
22+
23+
### NEB concordance
24+
25+
`dev/neb_concordance.py` cross-checks methylsieve against NEB's
26+
`mark-nonconverted-reads` on synthetic data. It needs `pysam` and `samtools`:
27+
28+
```bash
29+
python3 -m venv venv && venv/bin/pip install pysam
30+
cargo build --release
31+
venv/bin/python dev/neb_concordance.py
32+
```
33+
34+
## Style & testing
35+
36+
- Idiomatic Rust; meaningful names; doc comments on public items; comments
37+
explain *why*, not *what*.
38+
- Generate test data programmatically — never commit fixture files. Build SAM
39+
inputs and tiny indexed FASTAs inline via the helpers in `tests/helpers`.
40+
- Prefer many small focused tests over table-driven ones. Cover expected
41+
results, error conditions, and boundary cases.
42+
- When matching the behavior of an existing tool, match correctness but feel
43+
free to improve the interface and output.

0 commit comments

Comments
 (0)