Implements trim, promoters, reduce, setdiff, and pintersect as a trait on RegionSet in gtars-genomicdist. Uses natural chromosome sort order, preserves zero-width intervals, and saturates at 0 for promoters. Includes 26 unit tests, demo binary, and benchmark example. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
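As a rough illustration of "saturates at 0 for promoters", the sketch below computes an unstranded promoter window around each region start. The `Interval` type and the `promoters` signature are hypothetical stand-ins for this example, not the actual gtars-genomicdist API.

```rust
/// A half-open interval on one chromosome (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Interval {
    pub start: u32,
    pub end: u32,
}

/// Unstranded promoter window around each region start:
/// [start - upstream, start + downstream), clamped at 0 on the left.
pub fn promoters(regions: &[Interval], upstream: u32, downstream: u32) -> Vec<Interval> {
    regions
        .iter()
        .map(|r| Interval {
            // saturating_sub clamps to 0 instead of wrapping or panicking
            start: r.start.saturating_sub(upstream),
            end: r.start.saturating_add(downstream),
        })
        .collect()
}
```

A region starting at position 100 with `upstream = 2000` therefore yields a window starting at 0 rather than underflowing.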
Ports GenomicDistributions calc functions to Rust:
Partition system:
- genome_partition_list, calc_partitions, calc_expected_partitions
- GTF/BED gene model loading with GENCODE UTR classification
- Chi-square expected partition analysis with regularized incomplete gamma
Statistics:
- calc_nearest_neighbors (min upstream/downstream per region)
- calc_widths (region end - start)
- calc_feature_distances (signed distance to nearest feature, matching R convention)
Performance:
- is_sorted flag on RegionSet: reduce() checks this flag and skips the clone+sort when input is known-sorted (e.g. after BED loading or sort()). Cuts ~27% off genome_partition_list, which calls reduce() ~8 times on already-sorted intermediate RegionSets.
PR review feedback addressed:
- Lexicographic chromosome sort in reduce() (matches BED convention)
- setdiff/pintersect docstring examples
- Document rest field not preserved, zero-width region behavior
- pintersect truncation behavior documented
- &str spacing fixes
All 57 tests pass. Cross-validated against R on 4 ENCODE BED files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the is_sorted field from RegionSet in gtars-core to avoid a breaking change to the shared struct. Instead, adds SortedRegionSet newtype in gtars-genomicdist that takes ownership and sorts in place (move, not clone). reduce() uses this internally. This keeps the optimization local to the crate that needs it without modifying the core type that all other crates depend on. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
benchmark_interval_ranges.rs is now gitignored along with other benchmark files. Kept locally for development use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port R's calcSummarySignal from GenomicDistributions. Overlaps query regions with a signal matrix (TSV of region × condition values) using per-chromosome AIList indexes with row indices in the val field, takes MAX signal per condition across overlapping rows, and computes Tukey boxplot statistics matching R's fivenum/boxplot.stats. New module: signal.rs with SignalMatrix, calc_summary_signal, ConditionStats, and 8 unit tests covering TSV parsing, malformed row skipping, boxplot stats (odd/even/outlier cases), end-to-end overlap aggregation, and no-overlap edge case. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
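The boxplot statistics mentioned above can be sketched as follows. This follows the published definitions of R's `fivenum` (Tukey hinges) and `boxplot.stats` with the default `coef = 1.5`; the struct and function names are assumptions for illustration and need not match `signal.rs`.

```rust
/// Whisker ends, hinges, and median, plus the outliers beyond the fences.
#[derive(Debug, PartialEq)]
pub struct BoxplotStats {
    pub stats: [f64; 5],
    pub outliers: Vec<f64>,
}

/// Tukey five-number summary as defined by R's `fivenum`.
/// `sorted` must be non-empty, ascending, and free of NaN.
pub fn fivenum(sorted: &[f64]) -> [f64; 5] {
    let n = sorted.len() as f64;
    let n4 = ((n + 3.0) / 2.0).floor() / 2.0;
    // 1-based depths of min, lower hinge, median, upper hinge, max
    let depths = [1.0, n4, (n + 1.0) / 2.0, n + 1.0 - n4, n];
    let mut out = [0.0; 5];
    for (slot, d) in out.iter_mut().zip(depths.iter()) {
        let lo = sorted[d.floor() as usize - 1];
        let hi = sorted[d.ceil() as usize - 1];
        *slot = 0.5 * (lo + hi);
    }
    out
}

/// Boxplot statistics with 1.5 * IQR fences; NaN inputs are dropped.
pub fn boxplot_stats(values: &[f64]) -> Option<BoxplotStats> {
    let mut sorted: Vec<f64> = values.iter().copied().filter(|v| !v.is_nan()).collect();
    if sorted.is_empty() {
        return None;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mut stats = fivenum(&sorted);
    let iqr = stats[3] - stats[1];
    let (lo, hi) = (stats[1] - 1.5 * iqr, stats[3] + 1.5 * iqr);
    let (inside, outliers): (Vec<f64>, Vec<f64>) =
        sorted.iter().copied().partition(|v| *v >= lo && *v <= hi);
    // Whiskers extend to the most extreme values that are not outliers
    if let (Some(first), Some(last)) = (inside.first(), inside.last()) {
        stats[0] = *first;
        stats[4] = *last;
    }
    Some(BoxplotStats { stats, outliers })
}
```

Filtering NaN up front is a choice made for this sketch; the real module may handle missing values differently.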
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…vals, distances
Complete extendr-based R binding layer for gtars-genomicdist with drop-in compatible API matching GenomicDistributions. Includes:
- Load/convert between RegionSet pointers and GRanges/data.frame/BED
- Statistics (widths, neighbor distances, nearest neighbors, chrom stats, region distribution)
- GC content and dinucleotide frequency via GenomeAssembly pointer
- Interval ranges (trim, promoters, reduce, setdiff, pintersect)
- Partition system with strand-aware and GTF-based gene model builders
- Summary signal matrix overlap with boxplot statistics
- TSS/feature distances with proper NA sentinel handling
Also updates gitignore to exclude R test/benchmark files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose statistics (widths, neighbor distances, nearest neighbors), interval ranges (trim, promoters, reduce, setdiff, pintersect), partitions (calcPartitions, calcExpectedPartitions with GeneModel), signal (calcSummarySignal with SignalMatrix), and TSS/feature distance calculations through gtars-wasm for use in bedbase-ui. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement four set-theoretic operations for comparing and combining genomic interval sets, enabling replicate concordance analysis and multi-BED summarization in BEDbase.
Core Rust (gtars-genomicdist):
- Add concat, union, jaccard to IntervalRanges trait and RegionSet impl. Jaccard uses inclusion-exclusion on reduced sets (no new intersection algorithm needed). Union delegates to concat + reduce.
- New consensus module: given N region sets, computes the union of all regions and annotates each with the count of input sets overlapping it. Uses MultiChromOverlapper (AIList) per input set for O(N*M*log n) queries.
- Tests for all new functions including edge cases (identical, disjoint, empty sets).
WASM bindings (gtars-wasm):
- concat, union, jaccard as methods on JsRegionSet.
- ConsensusBuilder class with add()/compute() pattern to work around wasm_bindgen limitations on passing arrays of user-defined types.
R bindings (gtars-r):
- gtars_concat, gtars_union, gtars_jaccard, gtars_consensus R wrappers with auto-conversion from GRanges/paths/data.frames via .ensure_regionset().
- Rust extendr functions: r_concat, r_union, r_jaccard, r_consensus.
- Generated man pages via rextendr::document().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
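The inclusion-exclusion idea behind jaccard can be sketched on a single chromosome: with |A|, |B|, and |A ∪ B| measured on reduced sets, the intersection follows as |A| + |B| - |A ∪ B|. The version below assumes base-pair coverage as the size measure, half-open `(start, end)` tuples with `start <= end`, and a result of 0.0 for two empty sets; none of these details are confirmed by the commit message.

```rust
/// Merge overlapping or adjacent half-open intervals on one chromosome.
fn reduce(mut iv: Vec<(u32, u32)>) -> Vec<(u32, u32)> {
    iv.sort_unstable();
    let mut out: Vec<(u32, u32)> = Vec::with_capacity(iv.len());
    for (s, e) in iv {
        if let Some(last) = out.last_mut() {
            if s <= last.1 {
                last.1 = last.1.max(e);
                continue;
            }
        }
        out.push((s, e));
    }
    out
}

/// Total base pairs covered by a reduced (non-overlapping) set.
fn covered_bp(iv: &[(u32, u32)]) -> u64 {
    iv.iter().map(|&(s, e)| u64::from(e - s)).sum()
}

/// Jaccard index via inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
pub fn jaccard(a: &[(u32, u32)], b: &[(u32, u32)]) -> f64 {
    let a_bp = covered_bp(&reduce(a.to_vec()));
    let b_bp = covered_bp(&reduce(b.to_vec()));
    // Union = concat + reduce, as described in the commit message
    let mut both = a.to_vec();
    both.extend_from_slice(b);
    let union_bp = covered_bp(&reduce(both));
    if union_bp == 0 {
        return 0.0; // assumption: two empty sets score 0
    }
    let inter_bp = a_bp + b_bp - union_bp;
    inter_bp as f64 / union_bp as f64
}
```

Because only reduce and a sum are needed, no dedicated intersection routine has to exist for this to work.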
Covers edge cases: empty sets, disjoint/adjacent/overlapping regions, multi-chromosome inputs, symmetry, containment, and duplicate handling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose gtars-genomicdist library functions as CLI commands behind the `genomicdist` feature flag:
- `gtars genomicdist`: compute genomic distribution statistics (widths, partitions, TSS distances, etc.) and output JSON
- `gtars ranges`: interval set algebra (reduce, trim, promoters, setdiff, pintersect, concat, union, jaccard) with BED output
- `gtars consensus`: consensus peak calling across multiple BED files with min-count filtering
Also adds serde Serialize/Deserialize derives to library types (ChromosomeStatistics, RegionBin, PartitionResult, etc.) so the CLI can serialize them directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in genomicdist, set operations, partitions, and signal functionality for CLI, WASM, and R bindings. Resolved conflicts by keeping dev's newer crate versions while adding new genomicdist dependencies and R wrapper exports. Bumps gtars-wasm to 0.7.1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Node 20.x OIDC publishing broke (npm/cli#8730). Add Node 24.x, NPM_CONFIG_PROVENANCE env var, and publishConfig in package.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
actions/setup-node with registry-url creates .npmrc with a
${NODE_AUTH_TOKEN} placeholder that prevents npm from falling
through to OIDC trusted publishing. Remove it and add debug
logging to inspect npm config on failure.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
registry-url is needed so npm knows the registry endpoint for OIDC token exchange, but the _authToken placeholder it creates blocks OIDC fallback. Strip the token line from .npmrc before publishing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node doesn't create ~/.npmrc on current runners. Write it manually with just the registry URL (no _authToken placeholder) so npm knows the endpoint for OIDC exchange without a stale token blocking it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without registry-url npm gives ENEEDAUTH (doesn't try auth at all). With it, npm at least enters the auth path. Adding debug for OIDC env vars and NODE_AUTH_TOKEN to understand why the token is rejected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node injects a short-lived NODE_AUTH_TOKEN at step time that expires during the ~5min WASM build. npm uses this stale token instead of doing a fresh OIDC exchange. Fix by unsetting the token and stripping _authToken from .npmrc in the same shell as publish. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup-node obtains a short-lived OIDC token that expires during the ~5min WASM compilation. Move setup-node to after the build so the token is seconds old at publish time. wasm-pack only needs Rust, not Node.js. Also removed debug logging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	gtars-r/src/rust/Cargo.toml
CLI additions:
- Add --signal-matrix flag to `gtars genomicdist` with automatic
format detection (.bin = packed binary, .gz/.txt = TSV)
- Add `gtars prep` subcommand to pre-serialize GTF gene models and
signal matrices into binary cache files for fast repeated loading
- Add serde derives to Region, RegionSet, Strand, StrandedRegionSet
to support binary serialization of gene models and signal matrices
Packed binary format for SignalMatrix:
- Flatten values from Vec<Vec<f64>> to row-major Vec<f64>, eliminating
2.6M individual Vec heap allocations during deserialization (one per
row in the signal matrix). The flat layout enables a single memcpy
of the entire 1.5GB f64 array instead of 2.6M separate allocations.
- Use a string intern table (~25 entries) for chromosome names, read
back as u16 IDs and resolved to Strings, replacing 2.6M individual
String deserializations with 2.6M cheap clone-from-intern-table ops.
- Column-oriented region storage (chr_ids[], starts[], ends[]) for
sequential memory access during deserialization.
- Magic number validation (0x5349474D "SIGM") rejects old-format files
with a clear "regenerate with gtars prep" error message.
Packed binary format for GeneModel:
- Same intern table + column-oriented pattern for each StrandedRegionSet
component (genes, exons, three_utr, five_utr).
- Strand encoded as single byte (0=Plus, 1=Minus, 2=Unstranded).
- Flags field tracks presence of optional UTR components.
- Magic number 0x474D444C ("GMDL") for format validation.
- File size reduced from 9.7MB to 4.2MB (57% smaller).
Performance (signal matrix deserialization):
- Before (bincode): 2.6M Vec<f64> allocs + 2.6M String allocs = 1.08s
- After (packed): 1 Vec<f64> alloc (memcpy) + intern table = 0.66s
Full pipeline wall time (encode_303, 5751 regions): 1.87s -> 1.42s
Full pipeline wall time (encode_4, 105K regions): 2.50s -> 1.81s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
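The two central ideas of the packed format, a flat row-major value buffer and a magic-number check, can be sketched as below. The type and function names are hypothetical, and the byte order of the magic number on disk is an assumption (little-endian here); only the constant 0x5349474D and the error wording come from the commit message.

```rust
/// Magic number from the commit message: ASCII "SIGM".
pub const SIGM_MAGIC: u32 = 0x5349_474D;

/// Row-major matrix backed by one allocation instead of one Vec per row.
pub struct FlatMatrix {
    n_cols: usize,
    values: Vec<f64>,
}

impl FlatMatrix {
    /// Flatten nested rows; returns None if the rows are ragged.
    pub fn from_rows(rows: &[Vec<f64>]) -> Option<Self> {
        let n_cols = rows.first().map_or(0, Vec::len);
        if rows.iter().any(|r| r.len() != n_cols) {
            return None;
        }
        let mut values = Vec::with_capacity(rows.len() * n_cols);
        for r in rows {
            values.extend_from_slice(r);
        }
        Some(Self { n_cols, values })
    }

    /// Borrow row `i` as a contiguous slice of the flat buffer.
    pub fn row(&self, i: usize) -> &[f64] {
        &self.values[i * self.n_cols..(i + 1) * self.n_cols]
    }
}

/// Validate the 4-byte header and return the remaining payload.
/// Assumption: the magic is stored little-endian.
pub fn strip_magic(bytes: &[u8]) -> Result<&[u8], String> {
    match bytes {
        [a, b, c, d, rest @ ..] if u32::from_le_bytes([*a, *b, *c, *d]) == SIGM_MAGIC => Ok(rest),
        _ => Err("unrecognized signal matrix file; regenerate with gtars prep".to_string()),
    }
}
```

Row access stays a cheap slice operation, which is what makes a single bulk read of the value buffer possible.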
Pretty-print remains the default for interactive use. Pipelines like bedboss can pass --compact to halve intermediate file size. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use calc_feature_distances (signed i64) instead of calc_tss_distances (unsigned u32) for proper upstream/downstream TSS distance reporting
- Extract actual TSS positions from gene model using strand info (Plus → gene start, Minus → gene end) instead of gene body midpoints
- Add --promoter-upstream (default 200) and --promoter-downstream (default 2000) CLI params for partition definitions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
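The strand rule and the signed distance can be sketched as follows. The names are hypothetical, the treatment of unstranded genes is an assumption, and the sign convention used here (feature position minus query position) is chosen for illustration only; the commit states that the real function matches the R convention without spelling it out.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Strand {
    Plus,
    Minus,
    Unstranded,
}

/// TSS of a gene: start on the plus strand, end on the minus strand.
/// Assumption: unstranded genes are treated like plus-strand genes.
pub fn tss(start: u32, end: u32, strand: Strand) -> u32 {
    match strand {
        Strand::Minus => end,
        Strand::Plus | Strand::Unstranded => start,
    }
}

/// Signed distance from `query` to the nearest position in `sorted_tss`
/// (ascending). Negative means the feature lies at a lower coordinate.
pub fn signed_distance_to_nearest(query: u32, sorted_tss: &[u32]) -> Option<i64> {
    // Index of the first TSS at or after the query position
    let idx = sorted_tss.partition_point(|&t| t < query);
    let right = sorted_tss.get(idx).map(|&t| i64::from(t) - i64::from(query));
    let left = idx
        .checked_sub(1)
        .map(|i| i64::from(sorted_tss[i]) - i64::from(query));
    match (left, right) {
        // Ties go to the lower-coordinate feature in this sketch
        (Some(l), Some(r)) => Some(if l.abs() <= r.abs() { l } else { r }),
        (l, r) => l.or(r),
    }
}
```

Returning `i64` rather than `u32` is what allows upstream and downstream hits to be told apart.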
…sums
Three bugs in calcExpectedPartitions that caused expected value mismatches:
1. Strand-aware setdiff for proximal promoters: into_regionset() was called before setdiff(), dropping strand info. R's setdiff is strand-aware, so +/- strand promoters at the same position were incorrectly subtracted.
2. trim() silently dropped regions on chromosomes not in chrom_sizes (e.g., chrMT vs chrM naming mismatch). Now keeps such regions unchanged, matching R's trim() behavior.
3. R binding pre-reduced genes before computing promoters, collapsing overlapping genes and losing their individual promoters. R computes promoters(rawGenes) first, then reduces. Removed the pre-reduce.
Also adds promoters_stranded() to preserve strand through the promoter pipeline, documents the R neighbordt() bug in an Rmd, and updates the dispatch to pass chromSizes for calcExpectedPartitions.
Results: 5/8 tourney functions exact match, 3 intentional divergences (2 where gtars is more correct than R, 1 deliberate design choice).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- IGD: 3 new tests comparing old vs new API (single-file, multi-file, disk format compatibility)
- LOLA: 10 new tests for enrichment edge cases, FDR correction, and database loading
- gtars-py: disambiguate count_overlaps/find_overlaps where both IntervalRanges and RegionSetOverlaps traits provide them
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… RegionSetOverlaps
- Remove count_overlaps/find_overlaps from IntervalRanges trait (genomicdist)
- Remove gtars-igd dependency from gtars-genomicdist
- Add min_overlap: Option<i32> to all 4 RegionSetOverlaps methods (overlaprs)
- Rewrite universe.rs to accept pre-built &Igd instead of rebuilding per call
- Switch R bindings to RegionSetOverlaps with min_overlap for countOverlaps/findOverlaps
- Update Python, WASM bindings for new signatures
- Fix loadRegionDB description fallback to use collection folder name (matches R)
Resolves E0034 trait ambiguity in gtars-py. Tourney: 24/27 pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ings
- Fix iter_chroms() to preserve insertion order instead of HashSet non-deterministic order (fixes GC content correlation bug)
- Fix S4 dispatch: register methods for character/data.frame instead of ANY for Bioconductor generics (narrow, shift, etc.) to avoid hijacking GRanges dispatch
- Switch LOLA odds ratio from sample OR (a*d)/(b*c) to Fisher conditional MLE matching R's fisher.test()$estimate, using Brent root-finding on the noncentral hypergeometric distribution
- Add missing IntervalRanges wasm bindings (shift, flank, resize, narrow, disjoin, gaps, intersect) to match R bindings
- Update compare_lola.Rmd for LOLACore compatibility and oddsRatio check
- Update tutorial_regionset.Rmd with tabset comparisons and known diffs
- Gitignore LOLACore database and ext/ test data
- Remove test_dropin.Rmd
- Regenerate roxygen docs via rextendr::document()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Igd::from_named_region_sets (takes &RegionSet references) instead of from_region_sets (takes copied (String, i32, i32) tuples), avoiding a full allocation pass over all regions during database loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add shift, flank, resize, narrow, disjoin, gaps, and intersect subcommands to match R binding coverage of IntervalRanges methods. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, R casts
- Fix closest() nearest-neighbor search to use bidirectional expanding scan instead of fixed ±2 window that missed true nearest neighbors (5 new tests)
- Replace unsafe native-endian f64 serialization in signal.rs with safe explicit little-endian encoding, bump SIGM_VERSION to 2 (3 new tests)
- Fix WASM classify_bed_js to return Result instead of panicking on error
- Fix R binding integer casts: add checked_u32 helper for i32→u32 inputs, change outputs from Vec<i32> to Vec<f64> to prevent truncation (10 new tests)
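Two of these fixes lend themselves to a short sketch: a checked i32→u32 conversion and explicit little-endian f64 encoding. The helper name `checked_u32` appears in the commit message, but its signature and error type here are assumptions, as are the encode/decode function names.

```rust
/// Convert an R-side i32 into a u32 coordinate, rejecting negatives
/// instead of letting an `as` cast wrap them into huge values.
pub fn checked_u32(value: i32, name: &str) -> Result<u32, String> {
    if value < 0 {
        Err(format!("{} must be non-negative, got {}", name, value))
    } else {
        Ok(value as u32)
    }
}

/// Serialize f64 values with an explicit little-endian byte order,
/// so files are portable across host architectures.
pub fn encode_f64_le(values: &[f64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Inverse of `encode_f64_le`; None if the length is not a multiple of 8.
pub fn decode_f64_le(bytes: &[u8]) -> Option<Vec<f64>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                f64::from_le_bytes(buf)
            })
            .collect(),
    )
}
```

No `unsafe` is needed for the byte conversion, which is the point of the second fix.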
Fix genomicdist bugs
Resolve conflicts from gdist-edits PR (#244) merge:
- regionset_ops: Keep min_overlap on RegionSetOverlaps trait (needed by R bindings), delegate to MCO batch methods for fast path (min_bp <= 1), fall back to in-place iteration with overlap-size filtering for min_bp > 1. Add 7 new tests for min_overlap edge cases.
- genomicdist.rs: Combine gtars-lola's chrom_sizes params and no-reduce partition fix with dev's checked_u32 input validation.
- multi_chrom_overlapper.rs: Fix MCO consistency test to pass min_overlap.
- region_set.rs (Python): Update callers for min_overlap signature.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…API safety
Batch 1 (correctness):
- LOLA: Replace saturating_sub with validated subtraction + NegativeContingency error
- IGD: Guard from_region_sets/from_named_region_sets against end < start intervals
- LOLA: Use from_named_region_sets in merge() to avoid tuple copy overhead
- LOLA: Warn on skipped BED files instead of silent continue
Batch 2 (API safety):
- IGD: Guard add() against negative coordinates
- IGD: Clamp negative query coords in count_overlaps()
- IGD: Remove chr-prefix filter from parse_bed_line (support non-UCSC assemblies)
- IGD: Remove 321M coordinate cap from from_bed_dir
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c comments
Batch 3 (code quality):
- IGD: Merge validation into parse pass, eliminating double file read
- LOLA: Remove unused EmptyUserSet error variant
- IGD: Fix TSV format (remove extra spaces, write avg_region_width as f64)
- genomicdist: Document bp-mode double-counting behavior in partitions
- IGD: Add # Panics doc comments for add(), save(), count_overlaps(), find_overlaps_regionset(), count_overlaps_per_query()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
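The first Batch 1 item, replacing saturating_sub with validated subtraction, can be sketched as below. The error variant name comes from the commit message, but its payload, the function signature, and the cell formulas (the usual LOLA definitions: b = database hits minus support, c = user set size minus support, d = the remainder of the universe) are assumptions for this example.

```rust
#[derive(Debug, PartialEq)]
pub enum LolaError {
    /// A contingency cell would have been negative (hypothetical payload).
    NegativeContingency { cell: &'static str },
}

/// Build the 2x2 contingency table [a, b, c, d] with checked arithmetic.
/// saturating_sub would silently clamp an inconsistent input to 0;
/// checked_sub surfaces it as an error instead.
pub fn contingency(
    support: u64,
    db_hits: u64,
    user_size: u64,
    universe_size: u64,
) -> Result<[u64; 4], LolaError> {
    let a = support;
    let b = db_hits
        .checked_sub(a)
        .ok_or(LolaError::NegativeContingency { cell: "b" })?;
    let c = user_size
        .checked_sub(a)
        .ok_or(LolaError::NegativeContingency { cell: "c" })?;
    // d = universe - a - b - c, and a + b == db_hits
    let d = universe_size
        .checked_sub(db_hits)
        .and_then(|rest| rest.checked_sub(c))
        .ok_or(LolaError::NegativeContingency { cell: "d" })?;
    Ok([a, b, c, d])
}
```

A negative cell usually signals a user set that is not contained in the universe, which is worth reporting rather than hiding behind a zero.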
Factor out the tile-walk + binary-search + overlap-check pattern that was copy-pasted across count_overlaps, find_overlaps_regionset, and count_overlaps_per_query into a single walk_tile_overlaps method that accepts a closure for per-hit processing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…docs
- Fix resize("center") midpoint overflow and chromosome_statistics width
sum overflow (u32→u64 accumulator)
- Fix boxplot_stats NaN panic and narrow() underflow with saturating_sub
- Replace disjoin() O(n²) coverage check with O(n log n) sweep-line
- Fix NaN odds ratio ranking (worst rank instead of arbitrary placement)
- Add FDR correction to Python and R LOLA bindings
- Fix R LOLA contingency values (i32→f64) to avoid truncation
- Fix R io.rs unwrap → proper error propagation
- Fix R genomePartitionList: cast f64 coordinates to integer for stranded path
- Revert R output types for counts/widths/distances back to i32 (only
coordinates need f64 for u32 range)
- Fix R r_narrow type error propagation
- Remove dead sample_odds_ratio, leftover println
- Add IGD hits debug_assert, contigs pub(crate) with accessor
- Document IGD i32 coordinate limit and uniwig sorted-input requirement
- Add regression tests for overflow, NaN, depletion direction
- Add Python LOLA binding tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
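The two overflow fixes at the top of this list follow a common pattern, sketched here with hypothetical function names. Computing a midpoint as `(start + end) / 2` can overflow u32 for large coordinates, and summing u32 widths can overflow a u32 accumulator.

```rust
/// Midpoint without the intermediate `start + end` sum, which can
/// exceed u32::MAX even when both inputs are valid coordinates.
pub fn midpoint(start: u32, end: u32) -> u32 {
    let (lo, hi) = if start <= end { (start, end) } else { (end, start) };
    lo + (hi - lo) / 2
}

/// Sum region widths in a u64 accumulator so that many large regions
/// cannot overflow, even though each width fits in u32.
pub fn total_width(regions: &[(u32, u32)]) -> u64 {
    regions
        .iter()
        .map(|&(s, e)| u64::from(e.saturating_sub(s)))
        .sum()
}
```

Both functions return the same values as the naive versions whenever the naive versions do not overflow.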
Switch CLI handlers and R bindings from legacy create_igd_f/igd_search to the new unified Igd struct. Add Igd::from_bed_files() constructor that accepts explicit file paths, enabling directory, .txt filelist, and stdin input modes. R search now returns structured data (named list) instead of TSV strings. Remove dead CLI search flags that were never wired up. Legacy create.rs/search.rs kept in place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add Python 3.12 + uv + maturin + pytest steps to CI workflow
- Create proper testthat framework with 4 test files (82 tests): test_lola.R (12), test_refget.R (30), test_regionset.R (28), test_genomicdist.R (12), all using tracked fixture data
- Add tracked LOLA test fixture (lola_multi_db from LOLA R extdata)
- Migrate ad-hoc R test scripts into tests/testthat/, remove originals
- Move Rmd examples to tests/examples/, update paths
- Update README with LOLA/IGD mentions, add LOLA Python example
- Add testthat to DESCRIPTION Suggests
- Clean up .gitignore for new directory structure
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates the unified Igd struct against legacy igd_t/igd_t_from_disk using real genomic data (6 BED files: CpG islands, lamin B1 LADs, VISTA enhancers across 2 collections) for broader coverage before merge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
add gtars-lola crate, support in-memory db in IGD struct
R was installed via setup-r but the shared library path wasn't exported, causing gtars-r test binaries to fail at runtime with "libR.so not found". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…group pytest-cov>=6.0.0 requires Python >=3.9, and test/build tools shouldn't be runtime dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
uv fails to parse pyproject.toml with maturin's dynamic version field when doing editable installs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The optional-dependencies section displaced dynamic = ["version"] out of the [project] table, causing maturin to fail. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix 11 Python tests to match actual Rust binding behavior
- Add R package build and testthat to CI workflow
- Gitignore uv.lock (library, not application)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename BedSet → RegionSetList for naming consistency (Region → RegionSet → RegionSetList). Add names, get(), iter(), concat() methods. Wire get_region_set_list() through RegionDB and all binding layers (R S4 class, Python pyclass, WASM wasm_bindgen). Add getRegionSets() accessor with deprecation of regionDBRegionSet(). Include R and Python binding-level test suites. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix min-rank computation (ties.method="min") and re-enable result sorting
- Auto-detect CSV vs TSV delimiter in LOLA database index.txt parsing
- R bindings: return data.table, integer types for support/b/c/d, empty→NA
- Python bindings: convert empty metadata strings to None
- WASM bindings: add annotate_results call, annotation columns, and size
- Truncate description to 80 chars in annotate_results (matches R LOLA)
- Fix FHR test path after sidecar move to fhr/ subdirectory
- Fix interval_ranges sweep-line to use delta map
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
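Min-rank with ties (the behavior of R's `rank(..., ties.method = "min")`) can be sketched as follows. Ranking in descending order and placing NaN last are assumptions made for this example; the function name is hypothetical.

```rust
use std::cmp::Ordering;

/// Order two scores descending, with NaN always sorting last.
fn cmp_desc_nan_last(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        (false, false) => b.partial_cmp(&a).unwrap_or(Ordering::Equal),
    }
}

/// 1-based ranks where the highest score gets rank 1 and tied scores
/// all receive the smallest rank of their group.
pub fn min_rank_desc(scores: &[f64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&i, &j| cmp_desc_nan_last(scores[i], scores[j]));
    let mut ranks = vec![0; scores.len()];
    let mut current = 0;
    for (pos, &idx) in order.iter().enumerate() {
        // A new rank starts only when the score differs from the previous one
        if pos == 0 || cmp_desc_nan_last(scores[order[pos - 1]], scores[idx]) != Ordering::Equal {
            current = pos + 1;
        }
        ranks[idx] = current;
    }
    ranks
}
```

With scores `[3.0, 5.0, 3.0, 1.0]` the two tied values share rank 2 and the next value gets rank 4, not 3.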
R LOLA uses `disjoin(unlist(userSets))` which breaks overlapping intervals at every boundary. gtars was using `reduce()` which merges them, producing fewer regions and mismatched contingency tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
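The difference between the two operations can be seen in a single-chromosome sketch. The delta-map sweep below is one straightforward way to implement disjoin and is not taken from the gtars source; it uses half-open `(start, end)` tuples and, as a simplification, ignores zero-width intervals.

```rust
use std::collections::BTreeMap;

/// Merge overlapping or adjacent intervals into maximal covered runs.
pub fn reduce(mut iv: Vec<(u32, u32)>) -> Vec<(u32, u32)> {
    iv.sort_unstable();
    let mut out: Vec<(u32, u32)> = Vec::new();
    for (s, e) in iv {
        if let Some(last) = out.last_mut() {
            if s <= last.1 {
                last.1 = last.1.max(e);
                continue;
            }
        }
        out.push((s, e));
    }
    out
}

/// Split the covered space at every interval boundary.
pub fn disjoin(iv: &[(u32, u32)]) -> Vec<(u32, u32)> {
    // +1 at each start, -1 at each end; keys are the sorted breakpoints
    let mut delta: BTreeMap<u32, i64> = BTreeMap::new();
    for &(s, e) in iv {
        if s < e {
            *delta.entry(s).or_insert(0) += 1;
            *delta.entry(e).or_insert(0) -= 1;
        }
    }
    let mut out = Vec::new();
    let mut depth = 0i64;
    let mut prev: Option<u32> = None;
    for (&pos, &d) in &delta {
        // Emit the segment since the previous breakpoint if it was covered
        if let Some(p) = prev {
            if depth > 0 {
                out.push((p, pos));
            }
        }
        depth += d;
        prev = Some(pos);
    }
    out
}
```

For the inputs `(0, 10)` and `(5, 15)`, reduce yields one region while disjoin yields three, which is why the two approaches produce different region counts and contingency tables.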
No description provided.