fulcrumgenomics
diff --git a/.cargo/config.toml b/.cargo/config.toml
Lines changed: 4 additions & 0 deletions
diff --git a/.github/logos/fulcrumgenomics-dark.svg b/.github/logos/fulcrumgenomics-dark.svg
Lines changed: 42 additions & 0 deletions
diff --git a/.github/logos/fulcrumgenomics-light.svg b/.github/logos/fulcrumgenomics-light.svg
Lines changed: 42 additions & 0 deletions
diff --git a/.github/workflows/check.yml b/.github/workflows/check.yml
Lines changed: 56 additions & 0 deletions
diff --git a/.github/workflows/deny.yml b/.github/workflows/deny.yml
Lines changed: 26 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 63 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,4 @@
+[alias]
+ci-test = "nextest run --workspace --locked --no-tests=pass"
+ci-fmt = "fmt --all -- --check"
+ci-lint = "clippy --workspace --all-targets -- -D warnings"
@@ -0,0 +1,56 @@
+name: Check and Test
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+env:
+  CARGO_TERM_COLOR: always
+  CARGO_INCREMENTAL: 0
+  SCCACHE_GHA_ENABLED: "true"
+  RUSTC_WRAPPER: "sccache"
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
+      - name: Set up compilation cache (sccache)
+        uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
+      - name: Install nextest
+        uses: taiki-e/install-action@e4b21e1a4de672bc0de96fc2afab9d8081b1d622 # nextest
+      - name: Unit tests
+        run: cargo ci-test
+
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
+        with:
+          components: clippy
+      - name: Set up compilation cache (sccache)
+        uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
+      - name: Clippy check
+        run: cargo ci-lint
+
+  format:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 # stable
+        with:
+          components: rustfmt
+      - name: Set up compilation cache (sccache)
+        uses: mozilla-actions/sccache-action@7d986dd989559c6ecdb630a3fd2557667be217ad # v0.0.9
+      - name: Rustfmt check
+        run: cargo ci-fmt
@@ -0,0 +1,26 @@
+name: Supply chain (cargo-deny)
+
+# Runs on every PR (so a PR that adds a problematic dep fails review), on
+# pushes to main, on a weekly cron (so a new advisory against an existing dep
+# surfaces even on a quiet repo), and on manual dispatch. Kept separate from the
+# test/lint workflow so a fresh advisory fails only this check on unrelated PRs.
+
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+  schedule:
+    - cron: "0 12 * * 1" # Mondays 12:00 UTC
+  workflow_dispatch:
+
+jobs:
+  cargo-deny:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+      - name: Run cargo-deny
+        uses: EmbarkStudios/cargo-deny-action@bb137d7af7e4fb67e5f82a49c4fce4fad40782fe # v2.0.20
+        with:
+          command: check advisories licenses bans sources
@@ -0,0 +1,5 @@
+/target
+*.swp
+.DS_Store
+.claude/
+/plans/
@@ -0,0 +1,63 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### Added
+- Initial implementation: streaming, query-grouped SAM/BAM in, BAM out.
+- Per-template unconverted-read decision using all primary + supplementary
+  records of a QNAME, with the decision (an `XX:Z:UC` aux tag and/or the
+  `0x200` QC-fail flag) propagated to every record of the template.
+- Per-record strand determination (`monitor_C = (R1 or unpaired) XOR reverse`),
+  matching MethylDackel's `getStrand()` and correctly handling reverse-mapped
+  supplementaries.
+- Reference-based context determination for CpA/CpC/CpT/CpG.
+- `--mode` for combining the count and proportion tests: `count`, `proportion`,
+  `either`, or `adaptive` (the default: proportion at/above `--min-sites`,
+  count below it, so low-site templates are still evaluated while high-site
+  templates are judged on rate rather than an absolute count that over-penalizes
+  long reads). `--min-sites` is both the proportion floor (below it the rate is
+  unestimable, so the proportion test abstains) and the count↔proportion switch
+  point in `adaptive`. Default thresholds: count 3, fraction 0.05, min-sites 40
+  (the smallest floor that keeps the adaptive switch continuous at those
+  values). In `proportion` mode a stderr warning reports how many templates fell
+  below the floor and passed through unevaluated; the `below_min_sites_templates`
+  diagnostic in `--stats` exposes that population in every mode.
+- `--ignore-template-ends N`: ignore the outermost N bases at each end of the
+  template (fragment) when tallying — the end-repair fill-in / A-tailing–prone
+  positions. Trimmed by genomic position at the 5' sequenced ends of R1 and R2
+  (the fragment termini), so an overlapped end is dropped in both mates while
+  interior read ends are kept; single-end / orphan reads fall back to trimming
+  both ends of the read. Replaces the per-read `--ignore-5p` / `--ignore-3p`.
+- Per-record count annotation `ch:Z:x/y` (on by default): x is the unconverted
+  count and y the total monitored sites in the `--contexts` subset — the exact
+  numerator/denominator of the decision — as a per-template aggregate stamped on
+  every record, so any read carries the evidence behind its call. Rename with
+  `--count-tag <NAME>` or disable with `--no-count-tag`.
+- `--min-base-quality` filtering (default 20), and
+  `--ignore-supplementary-evidence`.
+- Spike-in `--control-contig` scopes and a multi-row per-context conversion-rate
+  `--stats` TSV.
+- Verified concordant with NEB `mark-nonconverted-reads` on the shared
+  (proper-pair) code path.
+- `--ref-encoding {bytes,nibble,twobit}`: pack the in-memory reference to trade a
+  little throughput for a lot of memory. **Default is `twobit`** (2-bit, ~¼ the
+  resident genome) — chosen because in an input-rate-limited pipe its small CPU
+  cost is hidden while the memory saving multiplies across parallel sample
+  pipelines. `twobit` folds non-ACGT bases to A, which never changes a conversion
+  call and only relabels the CpH/CpG context of a monitored C/G adjacent to a
+  former-N (gap edges; below measurement noise). `nibble` (4-bit, ~½ RAM) is
+  bit-identical; `bytes` (1 byte/base) is fastest for a single non-rate-limited
+  stream. The tally hot path is generic over a `RefCodes` trait, monomorphized
+  per encoding.
+- PE-overlap deduplication: reference positions covered by both mates of an
+  overlapping proper pair are counted once. The overlap is split at its midpoint
+  and each mate keeps the half nearer its own 5' end (where base quality is
+  highest), so neither read's calls dominate the whole overlap. Improves accuracy
+  for short-insert / cfDNA libraries and avoids redundant work.
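The midpoint split in the PE-overlap entry can be illustrated with a small sketch. This is not the crate's code: `split_overlap`, the half-open coordinate convention, and handing the extra base of an odd-length overlap to the reverse mate are all assumptions made for the example. It relies on the fact that a forward-strand mate has its 5' end at its leftmost reference position and a reverse-strand mate at its rightmost.

```rust
/// Half-open reference interval [start, end). (Hypothetical type for this sketch.)
type Span = (u64, u64);

/// Split the overlap between a forward-strand mate and a reverse-strand mate
/// at its midpoint. Returns `(kept_by_forward, kept_by_reverse)`, or `None`
/// when the two alignments do not overlap.
fn split_overlap(fwd: Span, rev: Span) -> Option<(Span, Span)> {
    let lo = fwd.0.max(rev.0);
    let hi = fwd.1.min(rev.1);
    if lo >= hi {
        return None;
    }
    // Assumption: integer division, so an odd overlap gives the extra base
    // to the reverse mate.
    let mid = lo + (hi - lo) / 2;
    // The forward mate keeps the left half (nearer its 5' end);
    // the reverse mate keeps the right half (nearer its 5' end).
    Some(((lo, mid), (mid, hi)))
}

fn main() {
    // Mates covering [100, 200) and [150, 250) overlap on [150, 200).
    let (fwd_part, rev_part) = split_overlap((100, 200), (150, 250)).unwrap();
    assert_eq!(fwd_part, (150, 175));
    assert_eq!(rev_part, (175, 200));
    println!("forward keeps {:?}, reverse keeps {:?}", fwd_part, rev_part);
}
```

Each overlapped reference position lands in exactly one of the two returned spans, which is what makes the "counted once" property hold in this sketch.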
+
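As a rough picture of what the `twobit` encoding in the `--ref-encoding` entry involves, here is a minimal 2-bit packer. The changelog only states that the encoding uses 2 bits per base and folds non-ACGT bases to A; the `TwoBitSeq` type, the bit layout (lowest bits first), and treating lowercase bases like uppercase are assumptions for illustration, and this is not the crate's `RefCodes` trait.

```rust
/// Hypothetical 2-bit packed sequence: four bases per byte.
struct TwoBitSeq {
    packed: Vec<u8>,
    len: usize,
}

/// Map a base to its 2-bit code. Anything that is not C, G or T becomes 0,
/// so A and every non-ACGT base (for example N) fold to A.
fn code(base: u8) -> u8 {
    match base.to_ascii_uppercase() {
        b'C' => 1,
        b'G' => 2,
        b'T' => 3,
        _ => 0,
    }
}

impl TwoBitSeq {
    fn from_bytes(seq: &[u8]) -> Self {
        let mut packed = vec![0u8; (seq.len() + 3) / 4];
        for (i, &b) in seq.iter().enumerate() {
            // Assumed layout: base i lives in bits (i % 4) * 2 of byte i / 4.
            packed[i / 4] |= code(b) << ((i % 4) * 2);
        }
        TwoBitSeq { packed, len: seq.len() }
    }

    /// Decode the base at position `i` back to an uppercase ASCII letter.
    fn get(&self, i: usize) -> u8 {
        assert!(i < self.len, "position out of range");
        let bits = (self.packed[i / 4] >> ((i % 4) * 2)) & 0b11;
        b"ACGT"[bits as usize]
    }
}

fn main() {
    let seq = TwoBitSeq::from_bytes(b"ACGTNACGT");
    let decoded: Vec<u8> = (0..seq.len).map(|i| seq.get(i)).collect();
    // The N at position 4 comes back as A; nine bases fit in three bytes.
    assert_eq!(decoded, b"ACGTAACGT".to_vec());
    assert_eq!(seq.packed.len(), 3);
}
```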
+[Unreleased]: https://github.com/fulcrumgenomics/methylsieve/commits/main
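The strand rule quoted in the changelog, `monitor_C = (R1 or unpaired) XOR reverse`, maps directly onto SAM FLAG bits. The sketch below uses the standard bit values from the SAM specification; the function name `monitors_c` is made up for the example, and reading a `false` result as "monitor G instead" is my interpretation of the "monitored C/G" wording in the changelog.

```rust
// Standard SAM FLAG bits.
const PAIRED: u16 = 0x1;
const REVERSE: u16 = 0x10;
const READ1: u16 = 0x40;

/// Hypothetical helper: true when a record with this FLAG monitors C,
/// following `(R1 or unpaired) XOR reverse`.
fn monitors_c(flag: u16) -> bool {
    let r1_or_unpaired = (flag & PAIRED) == 0 || (flag & READ1) != 0;
    let reverse = (flag & REVERSE) != 0;
    r1_or_unpaired ^ reverse
}

fn main() {
    assert!(monitors_c(99)); // R1, forward
    assert!(monitors_c(147)); // R2, reverse
    assert!(!monitors_c(83)); // R1, reverse
    assert!(!monitors_c(163)); // R2, forward
    assert!(monitors_c(0)); // unpaired, forward
    // A reverse-mapped supplementary of R1 (0x800 | 0x40 | 0x10 | 0x1) is
    // decided by its own reverse bit, per record.
    assert!(!monitors_c(0x800 | READ1 | REVERSE | PAIRED));
}
```

Because the rule is evaluated per record, a supplementary alignment that maps to the opposite strand from its primary gets the opposite answer, which is the case the changelog entry calls out.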
@@ -0,0 +1,43 @@
+# Contributing to methylsieve
+
+Thanks for your interest in improving methylsieve.
+
+## Development
+
+methylsieve targets the stable Rust toolchain pinned in `rust-toolchain.toml`
+(edition 2024). The crate clones the IO architecture of its sibling project
+`dupblaster` (dedicated `ringbuf`-backed IO threads, `fgumi-raw-bam` zero-copy
+records); the bisulfite/EM-seq logic lives in `src/sieve.rs` and
+`src/reference.rs`.
+
+### Verification suite
+
+Run all three before sending a change (these mirror CI):
+
+```bash
+cargo ci-fmt    # rustfmt --check
+cargo ci-lint   # clippy -D warnings
+cargo ci-test   # nextest (or `cargo test`)
+```
+
+### NEB concordance
+
+`dev/neb_concordance.py` cross-checks methylsieve against NEB's
+`mark-nonconverted-reads` on synthetic data. It needs `pysam` and a separately installed `samtools`:
+
+```bash
+python3 -m venv venv && venv/bin/pip install pysam
+cargo build --release
+venv/bin/python dev/neb_concordance.py
+```
+
+## Style & testing
+
+- Idiomatic Rust; meaningful names; doc comments on public items; comments
+  explain *why*, not *what*.
+- Generate test data programmatically — never commit fixture files. Build SAM
+  inputs and tiny indexed FASTAs inline via the helpers in `tests/helpers`.
+- Prefer many small focused tests over table-driven ones. Cover expected
+  results, error conditions, and boundary cases.
+- When matching the behavior of an existing tool, match correctness but feel
+  free to improve the interface and output.
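To make the "generate test data programmatically" guideline concrete, here is one way a tiny indexed FASTA could be built inline using only the standard library. The actual helpers in `tests/helpers` are not shown in this diff, so `write_indexed_fasta` and the `ref.fa` file name are hypothetical; the `.fai` layout (name, length, offset, line bases, line width) follows the `samtools faidx` format.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: write a single-contig FASTA plus a matching `.fai`
/// index into `dir`. The sequence goes on one line, so LINEBASES equals the
/// sequence length and LINEWIDTH is one more (the trailing newline).
fn write_indexed_fasta(dir: &Path, name: &str, seq: &str) -> io::Result<PathBuf> {
    let fasta = dir.join("ref.fa");
    let header = format!(">{}\n", name);
    fs::write(&fasta, format!("{}{}\n", header, seq))?;
    // .fai columns: NAME, LENGTH, OFFSET of first base, LINEBASES, LINEWIDTH.
    let fai = format!(
        "{}\t{}\t{}\t{}\t{}\n",
        name,
        seq.len(),
        header.len(),
        seq.len(),
        seq.len() + 1
    );
    fs::write(dir.join("ref.fa.fai"), fai)?;
    Ok(fasta)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("tiny_fasta_{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let fasta = write_indexed_fasta(&dir, "chr1", "ACGTACGTCCGG")?;
    print!("{}", fs::read_to_string(&fasta)?);
    print!("{}", fs::read_to_string(dir.join("ref.fa.fai"))?);
    fs::remove_dir_all(&dir)
}
```

Keeping the whole reference in a string literal next to the assertions means a test's expected coordinates can be checked by eye, which is the main payoff of not committing fixture files.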