Release gtars v0.8.0#240

Open
nsheff wants to merge 99 commits into master from dev

@nsheff nsheff commented Mar 6, 2026

No description provided.

sanghoonio and others added 30 commits February 18, 2026 19:24
Implements trim, promoters, reduce, setdiff, and pintersect as a trait
on RegionSet in gtars-genomicdist. Uses natural chromosome sort order,
preserves zero-width intervals, and saturates at 0 for promoters.
Includes 26 unit tests, demo binary, and benchmark example.
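
To illustrate the semantics described in this commit, here is a minimal single-chromosome sketch of `reduce` and a promoter window that saturates at 0. The tuple-based `Interval` type and the free functions are assumptions made for illustration; they are not the gtars-genomicdist API, and merging of adjacent intervals is assumed here.

```rust
/// Hypothetical half-open interval (start, end) on one chromosome.
type Interval = (u32, u32);

/// Merge overlapping or touching intervals after sorting by start.
fn reduce(mut regions: Vec<Interval>) -> Vec<Interval> {
    regions.sort_unstable();
    let mut merged: Vec<Interval> = Vec::with_capacity(regions.len());
    for (start, end) in regions {
        match merged.last_mut() {
            // Overlaps or touches the previous interval: extend it.
            Some(last) if start <= last.1 => last.1 = last.1.max(end),
            // Otherwise start a new merged interval.
            _ => merged.push((start, end)),
        }
    }
    merged
}

/// Promoter window around `start`; the upstream edge saturates at 0
/// instead of underflowing.
fn promoter(start: u32, upstream: u32, downstream: u32) -> Interval {
    (start.saturating_sub(upstream), start.saturating_add(downstream))
}
```

Calling `promoter(100, 2000, 200)` under these assumptions yields a window starting at 0 rather than wrapping around.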

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ports GenomicDistributions calc functions to Rust:

Partition system:
- genome_partition_list, calc_partitions, calc_expected_partitions
- GTF/BED gene model loading with GENCODE UTR classification
- Chi-square expected partition analysis with regularized incomplete gamma

Statistics:
- calc_nearest_neighbors (min upstream/downstream per region)
- calc_widths (region end - start)
- calc_feature_distances (signed distance to nearest feature, matching R convention)

Performance:
- is_sorted flag on RegionSet: reduce() checks this flag and skips the
  clone+sort when input is known-sorted (e.g. after BED loading or sort()).
  Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on
  already-sorted intermediate RegionSets.

PR review feedback addressed:
- Lexicographic chromosome sort in reduce() (matches BED convention)
- setdiff/pintersect docstring examples
- Document rest field not preserved, zero-width region behavior
- pintersect truncation behavior documented
- &str spacing fixes

All 57 tests pass. Cross-validated against R on 4 ENCODE BED files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the is_sorted field from RegionSet in gtars-core to avoid a
breaking change to the shared struct. Instead, adds SortedRegionSet
newtype in gtars-genomicdist that takes ownership and sorts in place
(move, not clone). reduce() uses this internally.

This keeps the optimization local to the crate that needs it without
modifying the core type that all other crates depend on.
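
The newtype approach described above can be sketched as follows. The `RegionSet` struct here is a simplified stand-in (an assumption, not the gtars-core definition), and the method names are hypothetical; the point is that ownership is moved in and the sort happens in place, so no clone is required.

```rust
/// Simplified stand-in for the shared core type (assumption).
#[derive(Debug, Clone, PartialEq)]
struct RegionSet {
    /// (chromosome, start, end)
    regions: Vec<(String, u32, u32)>,
}

/// Newtype wrapper: holding one is proof that the inner set is sorted.
struct SortedRegionSet(RegionSet);

impl SortedRegionSet {
    /// Takes ownership and sorts in place (move, not clone).
    fn new(mut rs: RegionSet) -> Self {
        // Tuples compare lexicographically: chromosome, then start, then end.
        rs.regions.sort();
        SortedRegionSet(rs)
    }

    /// Read-only view of the sorted regions.
    fn regions(&self) -> &[(String, u32, u32)] {
        &self.0.regions
    }

    /// Give the (now sorted) set back to the caller.
    fn into_inner(self) -> RegionSet {
        self.0
    }
}
```

Because the wrapper lives in the crate that needs it, the shared struct keeps its existing fields and other crates are unaffected.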

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
benchmark_interval_ranges.rs is now gitignored along with other
benchmark files. Kept locally for development use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port R's calcSummarySignal from GenomicDistributions. Overlaps query
regions with a signal matrix (TSV of region × condition values) using
per-chromosome AIList indexes with row indices in the val field, takes
MAX signal per condition across overlapping rows, and computes Tukey
boxplot statistics matching R's fivenum/boxplot.stats.

New module: signal.rs with SignalMatrix, calc_summary_signal,
ConditionStats, and 8 unit tests covering TSV parsing, malformed row
skipping, boxplot stats (odd/even/outlier cases), end-to-end overlap
aggregation, and no-overlap edge case.
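
For reference, the Tukey statistics mentioned above can be sketched like this. The functions follow the published definitions of R's `fivenum` and `boxplot.stats` (with the default coefficient of 1.5); the function names and the NaN-dropping policy are assumptions, not the contents of signal.rs.

```rust
/// Tukey five-number summary of an already sorted, non-empty slice,
/// using the same depth formula as R's `fivenum`.
fn fivenum(sorted: &[f64]) -> [f64; 5] {
    let n = sorted.len() as f64;
    let n4 = ((n + 3.0) / 2.0).floor() / 2.0;
    let depths = [1.0, n4, (n + 1.0) / 2.0, n + 1.0 - n4, n];
    let mut out = [0.0; 5];
    for (i, d) in depths.iter().enumerate() {
        // Depths are 1-based and may fall halfway between two elements.
        let lo = sorted[d.floor() as usize - 1];
        let hi = sorted[d.ceil() as usize - 1];
        out[i] = 0.5 * (lo + hi);
    }
    out
}

/// Boxplot statistics: hinges and median from `fivenum`, whiskers pulled in
/// to the most extreme values within 1.5 * IQR of the hinges.
/// Returns None when no non-NaN values remain (assumed policy).
fn boxplot_stats(values: &[f64]) -> Option<[f64; 5]> {
    let mut v: Vec<f64> = values.iter().copied().filter(|x| !x.is_nan()).collect();
    if v.is_empty() {
        return None;
    }
    v.sort_by(|a, b| a.total_cmp(b));
    let mut stats = fivenum(&v);
    let iqr = stats[3] - stats[1];
    let (lo_fence, hi_fence) = (stats[1] - 1.5 * iqr, stats[3] + 1.5 * iqr);
    let inliers: Vec<f64> = v
        .iter()
        .copied()
        .filter(|x| *x >= lo_fence && *x <= hi_fence)
        .collect();
    if let (Some(&lo), Some(&hi)) = (inliers.first(), inliers.last()) {
        stats[0] = lo;
        stats[4] = hi;
    }
    Some(stats)
}
```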

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…vals, distances

Complete extendr-based R binding layer for gtars-genomicdist with
drop-in compatible API matching GenomicDistributions. Includes:
- Load/convert between RegionSet pointers and GRanges/data.frame/BED
- Statistics (widths, neighbor distances, nearest neighbors, chrom stats, region distribution)
- GC content and dinucleotide frequency via GenomeAssembly pointer
- Interval ranges (trim, promoters, reduce, setdiff, pintersect)
- Partition system with strand-aware and GTF-based gene model builders
- Summary signal matrix overlap with boxplot statistics
- TSS/feature distances with proper NA sentinel handling

Also updates gitignore to exclude R test/benchmark files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose statistics (widths, neighbor distances, nearest neighbors),
interval ranges (trim, promoters, reduce, setdiff, pintersect),
partitions (calcPartitions, calcExpectedPartitions with GeneModel),
signal (calcSummarySignal with SignalMatrix), and TSS/feature
distance calculations through gtars-wasm for use in bedbase-ui.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement four set-theoretic operations for comparing and combining
genomic interval sets, enabling replicate concordance analysis and
multi-BED summarization in BEDbase.

Core Rust (gtars-genomicdist):
- Add concat, union, jaccard to IntervalRanges trait and RegionSet impl.
  Jaccard uses inclusion-exclusion on reduced sets (no new intersection
  algorithm needed). Union delegates to concat + reduce.
- New consensus module: given N region sets, computes the union of all
  regions and annotates each with the count of input sets overlapping it.
  Uses MultiChromOverlapper (AIList) per input set for O(N*M*log n) queries.
- Tests for all new functions including edge cases (identical, disjoint,
  empty sets).
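
The inclusion-exclusion idea behind Jaccard can be sketched on a single chromosome as follows. This assumes a base-pair Jaccard over half-open intervals and returns 0.0 for two empty sets; both choices, and all names, are assumptions for illustration rather than the crate's actual behavior.

```rust
type Interval = (u32, u32);

/// Merge overlapping or touching intervals.
fn reduce(mut regions: Vec<Interval>) -> Vec<Interval> {
    regions.sort_unstable();
    let mut merged: Vec<Interval> = Vec::new();
    for (start, end) in regions {
        match merged.last_mut() {
            Some(last) if start <= last.1 => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
}

/// Total base pairs covered by a reduced (non-overlapping) set.
fn covered_bp(reduced: &[Interval]) -> u64 {
    reduced.iter().map(|&(s, e)| u64::from(e - s)).sum()
}

/// Jaccard = |A ∩ B| / |A ∪ B|, with |A ∩ B| = |A| + |B| - |A ∪ B|,
/// so no dedicated intersection routine is needed.
fn jaccard(a: &[Interval], b: &[Interval]) -> f64 {
    let ra = reduce(a.to_vec());
    let rb = reduce(b.to_vec());
    // Union is concat followed by reduce.
    let mut both = ra.clone();
    both.extend_from_slice(&rb);
    let union = reduce(both);
    let (na, nb, nu) = (covered_bp(&ra), covered_bp(&rb), covered_bp(&union));
    if nu == 0 {
        return 0.0; // assumed convention for two empty sets
    }
    (na + nb - nu) as f64 / nu as f64
}
```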

WASM bindings (gtars-wasm):
- concat, union, jaccard as methods on JsRegionSet.
- ConsensusBuilder class with add()/compute() pattern to work around
  wasm_bindgen limitations on passing arrays of user-defined types.

R bindings (gtars-r):
- gtars_concat, gtars_union, gtars_jaccard, gtars_consensus R wrappers
  with auto-conversion from GRanges/paths/data.frames via .ensure_regionset().
- Rust extendr functions: r_concat, r_union, r_jaccard, r_consensus.
- Generated man pages via rextendr::document().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Covers edge cases: empty sets, disjoint/adjacent/overlapping regions,
multi-chromosome inputs, symmetry, containment, and duplicate handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose gtars-genomicdist library functions as CLI commands behind
the `genomicdist` feature flag:

- `gtars genomicdist` — compute genomic distribution statistics
  (widths, partitions, TSS distances, etc.) and output JSON
- `gtars ranges` — interval set algebra (reduce, trim, promoters,
  setdiff, pintersect, concat, union, jaccard) with BED output
- `gtars consensus` — consensus peak calling across multiple BED
  files with min-count filtering

Also adds serde Serialize/Deserialize derives to library types
(ChromosomeStatistics, RegionBin, PartitionResult, etc.) so the
CLI can serialize them directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in genomicdist, set operations, partitions, and signal
functionality for CLI, WASM, and R bindings. Resolved conflicts
by keeping dev's newer crate versions while adding new genomicdist
dependencies and R wrapper exports. Bumps gtars-wasm to 0.7.1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Node 20.x OIDC publishing broke (npm/cli#8730). Add Node 24.x,
NPM_CONFIG_PROVENANCE env var, and publishConfig in package.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
actions/setup-node with registry-url creates .npmrc with a
${NODE_AUTH_TOKEN} placeholder that prevents npm from falling
through to OIDC trusted publishing. Remove it and add debug
logging to inspect npm config on failure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
registry-url is needed so npm knows the registry endpoint for OIDC
token exchange, but the _authToken placeholder it creates blocks
OIDC fallback. Strip the token line from .npmrc before publishing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node doesn't create ~/.npmrc on current runners. Write it
manually with just the registry URL (no _authToken placeholder)
so npm knows the endpoint for OIDC exchange without a stale token
blocking it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without registry-url npm gives ENEEDAUTH (doesn't try auth at all).
With it, npm at least enters the auth path. Adding debug for OIDC
env vars and NODE_AUTH_TOKEN to understand why the token is rejected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node injects a short-lived NODE_AUTH_TOKEN at step time that
expires during the ~5min WASM build. npm uses this stale token
instead of doing a fresh OIDC exchange. Fix by unsetting the token
and stripping _authToken from .npmrc in the same shell as publish.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node obtains a short-lived OIDC token that expires during the
~5min WASM compilation. Move setup-node to after the build so the
token is seconds old at publish time. wasm-pack only needs Rust,
not Node.js. Also removed debug logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	gtars-r/src/rust/Cargo.toml
CLI additions:
- Add --signal-matrix flag to `gtars genomicdist` with automatic
  format detection (.bin = packed binary, .gz/.txt = TSV)
- Add `gtars prep` subcommand to pre-serialize GTF gene models and
  signal matrices into binary cache files for fast repeated loading
- Add serde derives to Region, RegionSet, Strand, StrandedRegionSet
  to support binary serialization of gene models and signal matrices

Packed binary format for SignalMatrix:
- Flatten values from Vec<Vec<f64>> to row-major Vec<f64>, eliminating
  2.6M individual Vec heap allocations during deserialization (one per
  row in the signal matrix). The flat layout enables a single memcpy
  of the entire 1.5GB f64 array instead of 2.6M separate allocations.
- Use a string intern table (~25 entries) for chromosome names, read
  back as u16 IDs and resolved to Strings, replacing 2.6M individual
  String deserializations with 2.6M cheap clone-from-intern-table ops.
- Column-oriented region storage (chr_ids[], starts[], ends[]) for
  sequential memory access during deserialization.
- Magic number validation (0x5349474D "SIGM") rejects old-format files
  with a clear "regenerate with gtars prep" error message.
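
The two layout ideas above, a flat row-major value buffer and a small string intern table, can be sketched in memory as follows. The type and function names are hypothetical and no on-disk format is implied.

```rust
/// Hypothetical flattened matrix: one allocation holds every value.
struct FlatMatrix {
    n_cols: usize,
    /// Row-major: row r occupies values[r * n_cols .. (r + 1) * n_cols].
    values: Vec<f64>,
}

impl FlatMatrix {
    /// Flatten a Vec-of-rows into a single contiguous buffer.
    fn from_rows(rows: &[Vec<f64>]) -> Self {
        let n_cols = rows.first().map_or(0, |r| r.len());
        let mut values = Vec::with_capacity(rows.len() * n_cols);
        for row in rows {
            assert_eq!(row.len(), n_cols, "ragged rows are not supported");
            values.extend_from_slice(row);
        }
        FlatMatrix { n_cols, values }
    }

    /// Borrow one row as a slice without any per-row allocation.
    fn row(&self, r: usize) -> &[f64] {
        &self.values[r * self.n_cols..(r + 1) * self.n_cols]
    }
}

/// Intern chromosome names: each distinct name is stored once and every
/// region refers to it by a u16 id.
fn intern(names: &[&str]) -> (Vec<String>, Vec<u16>) {
    let mut table: Vec<String> = Vec::new();
    let mut ids = Vec::with_capacity(names.len());
    for name in names {
        // Linear search is adequate for a table of a few dozen entries.
        let id = match table.iter().position(|t| t.as_str() == *name) {
            Some(i) => i,
            None => {
                table.push(name.to_string());
                table.len() - 1
            }
        };
        ids.push(id as u16);
    }
    (table, ids)
}
```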

Packed binary format for GeneModel:
- Same intern table + column-oriented pattern for each StrandedRegionSet
  component (genes, exons, three_utr, five_utr).
- Strand encoded as single byte (0=Plus, 1=Minus, 2=Unstranded).
- Flags field tracks presence of optional UTR components.
- Magic number 0x474D444C ("GMDL") for format validation.
- File size reduced from 9.7MB to 4.2MB (57% smaller).
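
A small sketch of the strand byte encoding and the magic constant listed above. The enum and helper names are assumptions; the byte values (0, 1, 2) and the constant come from the commit message, and the byte order used on disk is not stated there, so only the big-endian reading of the ASCII tag is shown.

```rust
/// Hypothetical mirror of the strand type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Strand {
    Plus,
    Minus,
    Unstranded,
}

impl Strand {
    /// Encode as a single byte: 0 = Plus, 1 = Minus, 2 = Unstranded.
    fn to_byte(self) -> u8 {
        match self {
            Strand::Plus => 0,
            Strand::Minus => 1,
            Strand::Unstranded => 2,
        }
    }

    /// Decode, rejecting unknown byte values instead of guessing.
    fn from_byte(b: u8) -> Option<Strand> {
        match b {
            0 => Some(Strand::Plus),
            1 => Some(Strand::Minus),
            2 => Some(Strand::Unstranded),
            _ => None,
        }
    }
}

/// The ASCII bytes "GMDL" read as a big-endian u32.
const GMDL_MAGIC: u32 = 0x474D_444C;

/// Returns true when a 4-byte header spells the expected tag.
fn has_gmdl_magic(header: [u8; 4]) -> bool {
    u32::from_be_bytes(header) == GMDL_MAGIC
}
```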

Performance (signal matrix deserialization):
- Before (bincode): 2.6M Vec<f64> allocs + 2.6M String allocs = 1.08s
- After (packed): 1 Vec<f64> alloc (memcpy) + intern table = 0.66s

Full pipeline wall time (encode_303, 5751 regions): 1.87s -> 1.42s
Full pipeline wall time (encode_4, 105K regions): 2.50s -> 1.81s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pretty-print remains the default for interactive use. Pipelines like
bedboss can pass --compact to halve intermediate file size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use calc_feature_distances (signed i64) instead of calc_tss_distances
  (unsigned u32) for proper upstream/downstream TSS distance reporting
- Extract actual TSS positions from gene model using strand info
  (Plus → gene start, Minus → gene end) instead of gene body midpoints
- Add --promoter-upstream (default 200) and --promoter-downstream
  (default 2000) CLI params for partition definitions
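
The strand-aware TSS rule and the signed distance can be sketched as below. The sign convention (feature minus query) is an assumption made for illustration; the commit only states that the result is signed and distinguishes upstream from downstream.

```rust
/// Hypothetical strand type for this sketch.
#[derive(Debug, Clone, Copy)]
enum Strand {
    Plus,
    Minus,
}

/// TSS position from a gene body: gene start on Plus, gene end on Minus.
fn tss(gene_start: u32, gene_end: u32, strand: Strand) -> u32 {
    match strand {
        Strand::Plus => gene_start,
        Strand::Minus => gene_end,
    }
}

/// Signed distance from a query position to a feature position.
/// i64 holds any difference of two u32 values, so the sign is preserved
/// where an unsigned result would lose it.
fn signed_distance(query: u32, feature: u32) -> i64 {
    i64::from(feature) - i64::from(query)
}
```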

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sanghoonio and others added 30 commits March 10, 2026 22:32
…sums

Three bugs in calcExpectedPartitions that caused expected value mismatches:

1. Strand-aware setdiff for proximal promoters: into_regionset() was called
   before setdiff(), dropping strand info. R's setdiff is strand-aware, so
   +/- strand promoters at the same position were incorrectly subtracted.

2. trim() silently dropped regions on chromosomes not in chrom_sizes (e.g.,
   chrMT vs chrM naming mismatch). Now keeps such regions unchanged, matching
   R's trim() behavior.

3. R binding pre-reduced genes before computing promoters, collapsing
   overlapping genes and losing their individual promoters. R computes
   promoters(rawGenes) first, then reduces. Removed the pre-reduce.
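
The fix in item 2 can be sketched as follows: regions on chromosomes missing from the size table pass through unchanged instead of being dropped. The signature and the tuple representation are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Clip regions to chromosome bounds. Regions whose chromosome is absent
/// from `chrom_sizes` (for example chrMT when the table lists chrM) are
/// kept as-is rather than silently discarded.
fn trim(
    regions: Vec<(String, u32, u32)>,
    chrom_sizes: &HashMap<String, u32>,
) -> Vec<(String, u32, u32)> {
    regions
        .into_iter()
        .map(|(chrom, start, end)| match chrom_sizes.get(&chrom) {
            Some(&size) => {
                let clipped_end = end.min(size);
                // Keep start <= end even if the region lies past the chromosome end.
                (chrom, start.min(clipped_end), clipped_end)
            }
            None => (chrom, start, end),
        })
        .collect()
}
```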

Also adds promoters_stranded() to preserve strand through the promoter
pipeline, documents the R neighbordt() bug in an Rmd, and updates the
dispatch to pass chromSizes for calcExpectedPartitions.

Results: 5/8 tourney functions exact match, 3 intentional divergences
(2 where gtars is more correct than R, 1 deliberate design choice).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- IGD: 3 new tests comparing old vs new API (single-file, multi-file,
  disk format compatibility)
- LOLA: 10 new tests for enrichment edge cases, FDR correction, and
  database loading
- gtars-py: disambiguate count_overlaps/find_overlaps where both
  IntervalRanges and RegionSetOverlaps traits provide them

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… RegionSetOverlaps

- Remove count_overlaps/find_overlaps from IntervalRanges trait (genomicdist)
- Remove gtars-igd dependency from gtars-genomicdist
- Add min_overlap: Option<i32> to all 4 RegionSetOverlaps methods (overlaprs)
- Rewrite universe.rs to accept pre-built &Igd instead of rebuilding per call
- Switch R bindings to RegionSetOverlaps with min_overlap for countOverlaps/findOverlaps
- Update Python, WASM bindings for new signatures
- Fix loadRegionDB description fallback to use collection folder name (matches R)
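
How an optional `min_overlap` threshold might filter hits is sketched below. Treating `None` (and any value below 1) as a 1 bp minimum is an assumption, and these free functions are not the `RegionSetOverlaps` trait methods.

```rust
/// Overlap in base pairs between two half-open intervals (0 if disjoint).
fn overlap_bp(a: (u32, u32), b: (u32, u32)) -> u32 {
    a.1.min(b.1).saturating_sub(a.0.max(b.0))
}

/// True when the two intervals share at least `min_overlap` base pairs.
fn overlaps(a: (u32, u32), b: (u32, u32), min_overlap: Option<i32>) -> bool {
    // Assumed default: any overlap of at least 1 bp counts.
    let min_bp = min_overlap.unwrap_or(1).max(1) as u32;
    overlap_bp(a, b) >= min_bp
}
```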

Resolves E0034 trait ambiguity in gtars-py. Tourney: 24/27 pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ings

- Fix iter_chroms() to preserve insertion order instead of HashSet
  non-deterministic order (fixes GC content correlation bug)
- Fix S4 dispatch: register methods for character/data.frame instead of
  ANY for Bioconductor generics (narrow, shift, etc.) to avoid hijacking
  GRanges dispatch
- Switch LOLA odds ratio from sample OR (a*d)/(b*c) to Fisher conditional
  MLE matching R's fisher.test()$estimate, using Brent root-finding on the
  noncentral hypergeometric distribution
- Add missing IntervalRanges wasm bindings (shift, flank, resize, narrow,
  disjoin, gaps, intersect) to match R bindings
- Update compare_lola.Rmd for LOLACore compatibility and oddsRatio check
- Update tutorial_regionset.Rmd with tabset comparisons and known diffs
- Gitignore LOLACore database and ext/ test data
- Remove test_dropin.Rmd
- Regenerate roxygen docs via rextendr::document()
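
To make the odds-ratio change concrete, the sketch below computes the conditional maximum-likelihood estimate for a 2x2 table by solving E[X | psi] = a under Fisher's noncentral hypergeometric distribution. It uses plain bisection on log(psi) rather than Brent's method, and every name in it is an assumption; it is meant to show why the result differs from the sample odds ratio (a*d)/(b*c), not to reproduce the gtars-lola code.

```rust
/// ln C(n, k) as a sum of logs (adequate for small illustrative counts).
fn ln_choose(n: u64, k: u64) -> f64 {
    (1..=k)
        .map(|i| ((n - k + i) as f64).ln() - (i as f64).ln())
        .sum()
}

/// Mean of the noncentral hypergeometric distribution with row totals
/// `m`, `n`, first-column total `k`, at log odds ratio `log_psi`.
fn conditional_mean(m: u64, n: u64, k: u64, log_psi: f64) -> f64 {
    let lo = k.saturating_sub(n);
    let hi = k.min(m);
    let logw: Vec<f64> = (lo..=hi)
        .map(|x| ln_choose(m, x) + ln_choose(n, k - x) + x as f64 * log_psi)
        .collect();
    // Subtract the maximum before exponentiating to avoid overflow.
    let max = logw.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let (mut num, mut den) = (0.0, 0.0);
    for (i, lw) in logw.iter().enumerate() {
        let w = (lw - max).exp();
        num += (lo + i as u64) as f64 * w;
        den += w;
    }
    num / den
}

/// Conditional MLE of the odds ratio for the table [[a, b], [c, d]].
fn conditional_mle(a: u64, b: u64, c: u64, d: u64) -> f64 {
    let (m, n, k) = (a + b, c + d, a + c);
    let lo = k.saturating_sub(n);
    let hi = k.min(m);
    if a == lo {
        return 0.0; // observed count at the lower bound of the support
    }
    if a == hi {
        return f64::INFINITY; // observed count at the upper bound
    }
    // The mean is increasing in log_psi, so bisection converges.
    let (mut l, mut h) = (-50.0_f64, 50.0_f64);
    for _ in 0..200 {
        let mid = 0.5 * (l + h);
        if conditional_mean(m, n, k, mid) < a as f64 {
            l = mid
        } else {
            h = mid
        }
    }
    (0.5 * (l + h)).exp()
}
```

For the table [[3, 1], [1, 3]] the sample odds ratio is 9, while the conditional estimate is about 6.41, which is the value R's `fisher.test()` reports for the same table.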

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Igd::from_named_region_sets (takes &RegionSet references) instead of
from_region_sets (takes copied (String, i32, i32) tuples), avoiding a
full allocation pass over all regions during database loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add shift, flank, resize, narrow, disjoin, gaps, and intersect
subcommands to match R binding coverage of IntervalRanges methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, R casts

- Fix closest() nearest-neighbor search to use bidirectional expanding scan
  instead of fixed ±2 window that missed true nearest neighbors (5 new tests)
- Replace unsafe native-endian f64 serialization in signal.rs with safe
  explicit little-endian encoding, bump SIGM_VERSION to 2 (3 new tests)
- Fix WASM classify_bed_js to return Result instead of panicking on error
- Fix R binding integer casts: add checked_u32 helper for i32→u32 inputs,
  change outputs from Vec<i32> to Vec<f64> to prevent truncation (10 new tests)
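
Explicit little-endian encoding of `f64` values, as opposed to reinterpreting memory in native byte order, can be done entirely in safe Rust. The sketch below shows only the value payload; the function names are hypothetical and the header, magic number, and version field are omitted.

```rust
use std::convert::TryInto;

/// Encode each f64 as 8 little-endian bytes, independent of host endianness.
fn encode_f64s(values: &[f64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 8);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Decode little-endian f64 values; returns None if the byte length is
/// not a multiple of 8.
fn decode_f64s(bytes: &[u8]) -> Option<Vec<f64>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(8)
            // chunks_exact guarantees 8-byte chunks, so the conversion cannot fail.
            .map(|chunk| f64::from_le_bytes(chunk.try_into().unwrap()))
            .collect(),
    )
}
```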
Resolve conflicts from gdist-edits PR (#244) merge:

- regionset_ops: Keep min_overlap on RegionSetOverlaps trait (needed by
  R bindings), delegate to MCO batch methods for fast path (min_bp <= 1),
  fall back to in-place iteration with overlap-size filtering for min_bp > 1.
  Add 7 new tests for min_overlap edge cases.

- genomicdist.rs: Combine gtars-lola's chrom_sizes params and no-reduce
  partition fix with dev's checked_u32 input validation.

- multi_chrom_overlapper.rs: Fix MCO consistency test to pass min_overlap.

- region_set.rs (Python): Update callers for min_overlap signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…API safety

Batch 1 (correctness):
- LOLA: Replace saturating_sub with validated subtraction + NegativeContingency error
- IGD: Guard from_region_sets/from_named_region_sets against end < start intervals
- LOLA: Use from_named_region_sets in merge() to avoid tuple copy overhead
- LOLA: Warn on skipped BED files instead of silent continue
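
The difference between `saturating_sub` and a validated subtraction is that the latter reports an impossible table instead of quietly clamping to zero. A minimal sketch, assuming the usual LOLA cell definitions (a = support, b = database set size minus a, c = user set size minus a, d = universe size minus a, b, and c); the enum name, its field, and the function signature are hypothetical, while the `NegativeContingency` variant name comes from the commit message.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ContingencyError {
    /// A cell of the 2x2 table would have been negative.
    NegativeContingency { cell: &'static str },
}

/// Build the [a, b, c, d] contingency cells, failing on any negative cell.
fn contingency(
    support: u64,
    user_size: u64,
    db_size: u64,
    universe_size: u64,
) -> Result<[u64; 4], ContingencyError> {
    let a = support;
    let b = db_size
        .checked_sub(a)
        .ok_or(ContingencyError::NegativeContingency { cell: "b" })?;
    let c = user_size
        .checked_sub(a)
        .ok_or(ContingencyError::NegativeContingency { cell: "c" })?;
    let d = universe_size
        .checked_sub(a + b + c)
        .ok_or(ContingencyError::NegativeContingency { cell: "d" })?;
    Ok([a, b, c, d])
}
```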

Batch 2 (API safety):
- IGD: Guard add() against negative coordinates
- IGD: Clamp negative query coords in count_overlaps()
- IGD: Remove chr-prefix filter from parse_bed_line (support non-UCSC assemblies)
- IGD: Remove 321M coordinate cap from from_bed_dir

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c comments

Batch 3 (code quality):
- IGD: Merge validation into parse pass, eliminating double file read
- LOLA: Remove unused EmptyUserSet error variant
- IGD: Fix TSV format (remove extra spaces, write avg_region_width as f64)
- genomicdist: Document bp-mode double-counting behavior in partitions
- IGD: Add # Panics doc comments for add(), save(), count_overlaps(),
  find_overlaps_regionset(), count_overlaps_per_query()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Factor out the tile-walk + binary-search + overlap-check pattern that was
copy-pasted across count_overlaps, find_overlaps_regionset, and
count_overlaps_per_query into a single walk_tile_overlaps method that
accepts a closure for per-hit processing.
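
The refactoring pattern, one traversal routine parameterized by a per-hit closure, can be sketched as below. This is a deliberately simplified stand-in: it walks a single start-sorted slice rather than IGD tiles, and all names are hypothetical.

```rust
/// Visit every interval in `sorted` (sorted by start) that overlaps
/// `query`, calling `on_hit` with its index.
fn walk_overlaps<F: FnMut(usize)>(sorted: &[(u32, u32)], query: (u32, u32), mut on_hit: F) {
    // Binary search: intervals starting at or after query.1 cannot overlap.
    let upper = sorted.partition_point(|iv| iv.0 < query.1);
    for (i, iv) in sorted[..upper].iter().enumerate() {
        if iv.1 > query.0 {
            on_hit(i);
        }
    }
}

/// Counting is the walk plus a counter.
fn count_overlaps(sorted: &[(u32, u32)], query: (u32, u32)) -> usize {
    let mut n = 0;
    walk_overlaps(sorted, query, |_| n += 1);
    n
}

/// Finding is the same walk plus a collector.
fn find_overlaps(sorted: &[(u32, u32)], query: (u32, u32)) -> Vec<usize> {
    let mut hits = Vec::new();
    walk_overlaps(sorted, query, |i| hits.push(i));
    hits
}
```

Each public function then differs only in what its closure does with a hit, so the traversal logic has a single place to be fixed or optimized.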

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs

- Fix resize("center") midpoint overflow and chromosome_statistics width
  sum overflow (u32→u64 accumulator)
- Fix boxplot_stats NaN panic and narrow() underflow with saturating_sub
- Replace disjoin() O(n²) coverage check with O(n log n) sweep-line
- Fix NaN odds ratio ranking (worst rank instead of arbitrary placement)
- Add FDR correction to Python and R LOLA bindings
- Fix R LOLA contingency values (i32→f64) to avoid truncation
- Fix R io.rs unwrap → proper error propagation
- Fix R genomePartitionList: cast f64 coordinates to integer for stranded path
- Revert R output types for counts/widths/distances back to i32 (only
  coordinates need f64 for u32 range)
- Fix R r_narrow type error propagation
- Remove dead sample_odds_ratio, leftover println
- Add IGD hits debug_assert, contigs pub(crate) with accessor
- Document IGD i32 coordinate limit and uniwig sorted-input requirement
- Add regression tests for overflow, NaN, depletion direction
- Add Python LOLA binding tests
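
The two overflow fixes in the first bullet follow a common pattern, sketched here with hypothetical helper names: compute the midpoint from the width instead of from `start + end`, and accumulate `u32` widths in a `u64`.

```rust
/// Midpoint without computing `start + end`, which can exceed u32::MAX
/// for coordinates near the top of the range.
fn midpoint(start: u32, end: u32) -> u32 {
    start + end.saturating_sub(start) / 2
}

/// Sum of region widths using a u64 accumulator so that many wide
/// regions cannot overflow the total.
fn total_width(regions: &[(u32, u32)]) -> u64 {
    regions
        .iter()
        .map(|&(s, e)| u64::from(e.saturating_sub(s)))
        .sum()
}
```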

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch CLI handlers and R bindings from legacy create_igd_f/igd_search
to the new unified Igd struct. Add Igd::from_bed_files() constructor
that accepts explicit file paths, enabling directory, .txt filelist,
and stdin input modes. R search now returns structured data (named list)
instead of TSV strings. Remove dead CLI search flags that were never
wired up. Legacy create.rs/search.rs kept in place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Python 3.12 + uv + maturin + pytest steps to CI workflow
- Create proper testthat framework with 4 test files (82 tests):
  test_lola.R (12), test_refget.R (30), test_regionset.R (28),
  test_genomicdist.R (12) — all using tracked fixture data
- Add tracked LOLA test fixture (lola_multi_db from LOLA R extdata)
- Migrate ad-hoc R test scripts into tests/testthat/, remove originals
- Move Rmd examples to tests/examples/, update paths
- Update README with LOLA/IGD mentions, add LOLA Python example
- Add testthat to DESCRIPTION Suggests
- Clean up .gitignore for new directory structure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates the unified Igd struct against legacy igd_t/igd_t_from_disk
using real genomic data (6 BED files: CpG islands, lamin B1 LADs, VISTA
enhancers across 2 collections) for broader coverage before merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
add gtars-lola crate, support in-memory db in IGD struct
R was installed via setup-r but the shared library path wasn't exported,
causing gtars-r test binaries to fail at runtime with "libR.so not found".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…group

pytest-cov>=6.0.0 requires Python >=3.9, and test/build tools shouldn't
be runtime dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv fails to parse pyproject.toml with maturin's dynamic version field
when doing editable installs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The optional-dependencies section displaced dynamic = ["version"] out
of the [project] table, causing maturin to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix 11 Python tests to match actual Rust binding behavior
- Add R package build and testthat to CI workflow
- Gitignore uv.lock (library, not application)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename BedSet → RegionSetList for naming consistency (Region → RegionSet
→ RegionSetList). Add names, get(), iter(), concat() methods. Wire
get_region_set_list() through RegionDB and all binding layers (R S4 class,
Python pyclass, WASM wasm_bindgen). Add getRegionSets() accessor with
deprecation of regionDBRegionSet(). Include R and Python binding-level
test suites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix min-rank computation (ties.method="min") and re-enable result sorting
- Auto-detect CSV vs TSV delimiter in LOLA database index.txt parsing
- R bindings: return data.table, integer types for support/b/c/d, empty→NA
- Python bindings: convert empty metadata strings to None
- WASM bindings: add annotate_results call, annotation columns, and size
- Truncate description to 80 chars in annotate_results (matches R LOLA)
- Fix FHR test path after sidecar move to fhr/ subdirectory
- Fix interval_ranges sweep-line to use delta map
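
Min-rank tie handling, in the sense of R's `ties.method="min"`, gives every tied value the smallest rank in its group. A quadratic-time sketch follows; it assumes that higher scores rank first and that NaN scores receive the worst rank, and the function name is hypothetical.

```rust
/// Rank scores in descending order; ties share the minimum rank and NaN
/// values are ranked after every real number.
fn min_rank_desc(scores: &[f64]) -> Vec<usize> {
    scores
        .iter()
        .map(|&s| {
            // Treat NaN as lower than any real score.
            let s = if s.is_nan() { f64::NEG_INFINITY } else { s };
            // Rank = 1 + number of strictly better (non-NaN) scores.
            1 + scores.iter().filter(|&&t| !t.is_nan() && t > s).count()
        })
        .collect()
}
```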

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
R LOLA uses `disjoin(unlist(userSets))` which breaks overlapping
intervals at every boundary. gtars was using `reduce()` which merges
them, producing fewer regions and mismatched contingency tables.
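
The difference between the two operations is easy to see on two overlapping intervals. The sketch below uses simplified single-chromosome helpers with hypothetical names; `disjoin` here is a straightforward quadratic version, not the sweep-line implementation.

```rust
type Interval = (u32, u32);

/// Merge overlapping or touching intervals into maximal blocks.
fn reduce(mut regions: Vec<Interval>) -> Vec<Interval> {
    regions.sort_unstable();
    let mut merged: Vec<Interval> = Vec::new();
    for (start, end) in regions {
        match merged.last_mut() {
            Some(last) if start <= last.1 => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
}

/// Split the covered space at every interval boundary, keeping only the
/// pieces that lie inside at least one input interval.
fn disjoin(regions: &[Interval]) -> Vec<Interval> {
    let mut cuts: Vec<u32> = Vec::with_capacity(regions.len() * 2);
    for &(s, e) in regions {
        cuts.push(s);
        cuts.push(e);
    }
    cuts.sort_unstable();
    cuts.dedup();
    cuts.windows(2)
        .map(|w| (w[0], w[1]))
        .filter(|&(s, e)| regions.iter().any(|&(rs, re)| rs <= s && e <= re))
        .collect()
}
```

For the input `[(0, 10), (5, 15)]`, `reduce` yields one region and `disjoin` yields three, which is why the two approaches produce different region counts and therefore different contingency tables.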

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

4 participants