
Releases: epigen/enrichment_analysis

v3.0.0 - Configurable GREAT region annotation and smaller aggregate outputs

08 Jun 13:27
0829687

This release changes how GREAT-associated regions and genes are exported to reduce runtime, file size, duplicated data, and Excel compatibility problems caused by very large annotation cells.

Features (breaking change)

  • Added great_parameters:map_associated_regions in the config to control how many significant GREAT terms are annotated with associated query regions and genes in individual query result tables. This new configuration key is the breaking change in this release.
  • The new default is 1, so users can inspect an example annotated term without paying the cost of annotating every significant term.
  • Supported values:
    • 0: do not annotate GREAT terms with associated regions/genes
    • positive integer, for example 5: annotate that many top significant terms ranked by the configured adjusted p-value column
    • -1: annotate all significant terms, restoring the previous behavior
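The three supported values can be read as a small selection rule. The following is a minimal sketch of that rule, not the workflow's actual implementation; the function name select_terms_to_annotate, the p_adjust key, and the dict-based term records are hypothetical and chosen only for illustration.

```python
def select_terms_to_annotate(significant_terms, map_associated_regions, padj_key="p_adjust"):
    """Return the significant terms that would be annotated with regions/genes.

    significant_terms: list of dicts, assumed to be already filtered to
    significant terms; padj_key is a hypothetical name for the configured
    adjusted p-value column.
    """
    if map_associated_regions == 0:
        # 0: no term is annotated
        return []
    # Rank terms from most to least significant
    ranked = sorted(significant_terms, key=lambda term: term[padj_key])
    if map_associated_regions == -1:
        # -1: annotate all significant terms (previous behavior)
        return ranked
    if map_associated_regions > 0:
        # positive integer: annotate only the top N terms
        return ranked[:map_associated_regions]
    raise ValueError("map_associated_regions must be -1, 0, or a positive integer")


terms = [
    {"id": "A", "p_adjust": 0.03},
    {"id": "B", "p_adjust": 0.001},
    {"id": "C", "p_adjust": 0.01},
]
top_one = select_terms_to_annotate(terms, 1)  # the default value of 1
```

With the default of 1, only the single most significant term is selected, which matches the intent of inspecting one annotated example term.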

Changes

  • GREAT group-level aggregated result CSV files no longer include the regions and annotated_genes columns.
  • These columns remain available in individual GREAT query result CSV files when annotation is enabled with map_associated_regions.
  • This avoids duplicating very large region/gene annotation strings across aggregate files and prevents aggregate outputs from becoming unnecessarily large.
  • GREAT parameters are passed through Snakemake rule params instead of being read directly from the config inside the GREAT analysis script.

Documentation

  • Updated the example config and schema with great_parameters:map_associated_regions.
  • Updated the README and config documentation to warn that GREAT region/gene annotation can take a long time, substantially increase file size, and break Excel usage because cells can exceed Excel's 32,767 character limit.
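To check whether an existing result table would run into the Excel cell limit mentioned above, a scan like the following can help. This is a sketch using only the Python standard library; the function name find_oversized_cells and the example column names are hypothetical and not part of the workflow.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # character limit per cell, as stated in the release notes


def find_oversized_cells(csv_text, limit=EXCEL_CELL_LIMIT):
    """Return (row_number, column_name, length) for every cell longer than limit."""
    oversized = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows start at 2
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if isinstance(value, str) and len(value) > limit:
                oversized.append((row_number, column, len(value)))
    return oversized


# Hypothetical table: one term with a very long region annotation string
example_csv = "term,regions\nT1," + "x" * 40000 + "\nT2,short\n"
report = find_oversized_cells(example_csv)
```

For a file on disk, the same function can be applied to the text returned by reading the file. Python's csv module has its own default field size limit, which may need raising for extremely long cells.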

Full Changelog: v2.1.0...v3.0.0

v2.1.0 - Additional genomic regions analysis outputs and new summary visualization

07 May 10:25
921efa4

This release expands the enrichment workflow with the addition of a 'specific' group summary plot, additional pycisTarget and GREAT outputs, more robust handling of sparse or empty inputs, and updated documentation and infrastructure.

Features

  • The group summary outputs are now *_summary_topTerms.png and *_summary_specificTerms.png. The latter is a new group-level summary plot highlighting terms that are more specific to individual groups.
  • Extended pycisTarget outputs with additional exported tables for motif hits and cistromes.
  • Extended GREAT result tables with associated query regions and annotated genes for significant terms.
  • Added helpers/features_to_bed.py, a helper script for converting feature ID lists into BED files by mapping IDs to genomic coordinates from an annotation table.
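The idea behind converting feature ID lists into BED records can be sketched as follows. This is an illustration under stated assumptions and does not reproduce helpers/features_to_bed.py: the function name, the dict-based annotation lookup, and the four-column output layout are all hypothetical.

```python
def features_to_bed_lines(feature_ids, annotation):
    """Map feature IDs to tab-separated BED lines.

    annotation: dict mapping feature ID -> (chrom, start, end), assumed to
    already hold BED-style coordinates. IDs without an annotation entry are
    collected separately instead of raising an error.
    """
    lines = []
    missing = []
    for feature_id in feature_ids:
        if feature_id not in annotation:
            missing.append(feature_id)
            continue
        chrom, start, end = annotation[feature_id]
        # BED columns used here: chrom, start, end, name
        lines.append(f"{chrom}\t{start}\t{end}\t{feature_id}")
    return lines, missing


# Hypothetical annotation table and query list
annotation = {
    "peak_1": ("chr1", 100, 250),
    "peak_2": ("chr2", 5000, 5600),
}
bed_lines, missing_ids = features_to_bed_lines(["peak_2", "peak_9", "peak_1"], annotation)
```

Reporting unmapped IDs rather than failing is a design choice of this sketch; the actual helper may behave differently.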
  • Added schema validation for workflow configuration and annotation files in workflow/schemas/.

Changes

  • Removed the previous heatmap-based group summary outputs; group summaries are now provided only as bubble plots.
  • Removed filtered aggregate _sig.csv outputs; group-level aggregation now focuses on the complete _all.csv table and summary plots.
  • Updated the example configuration and test setup. The default config now runs a minimal test spanning all workflow functionalities. The data for this test is restored via test/setup_test_resources.sh.

Bug Fixes

  • Improved handling of empty BED, query, and background inputs in region-based workflows.
  • Fixed coordinate convention mismatches between BED inputs and GREAT-derived region-gene association outputs. Coordinates are now exported using BED-style indexing (0-based start, half-open interval) as the convention.
  • Improved aggregation and summary plotting for sparse result sets, including very small matrices and empty outputs.
  • Added stable fallback behavior for empty enrichment and summary plots so reporting remains informative when no significant results are found.
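The coordinate convention fix above concerns how interval starts are counted. BED uses 0-based, half-open intervals; the sketch below assumes the other side of the mismatch uses 1-based, closed intervals, which is common in R/Bioconductor objects but is an assumption here, as are the function names.

```python
def one_based_closed_to_bed(start, end):
    """Convert a 1-based, closed interval [start, end] to BED (0-based, half-open)."""
    # Only the start shifts; the closed end equals the half-open end
    return start - 1, end


def bed_to_one_based_closed(start, end):
    """Convert a BED interval [start, end) to a 1-based, closed interval."""
    return start + 1, end


bed_start, bed_end = one_based_closed_to_bed(101, 200)
# In BED coordinates the interval length is simply end - start
length = bed_end - bed_start
```

Mixing the two conventions shifts every region by one base at its start, which is enough to break exact matching between input regions and exported region-gene associations.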

Documentation

  • Expanded guidance on skipping selected enrichment tools when they are not needed.
  • Added example Snakemake rule templates for downloading common enrichment resources, including gene set databases, cisTarget resources, and LOLA region databases in /helpers/database_download_rules.md.
  • Updated the README to describe the current output structure, BED indexing conventions, and new summary plot outputs.

Infrastructure

  • Added GitHub Actions for CI, container image generation, and conda environment pinning.
  • Added Snakemake containerization support.

Full Changelog: v2.0.3...v2.1.0

v2.0.3 - minor improvement

25 Jun 17:59

v2.0.2 - Minor fixes

27 May 15:17

  • Make all resource files input to rules

Full Changelog: v2.0.1...v2.0.2

v2.0.1 - enable module usage using `github()` directive

20 Dec 14:29
6be93bc

  • to enable module usage using github() directive
    • source utils.R via `params` instead of `snakemake@source`
    • comment global.yaml (now requires full snakemake installation, not minimal)
  • add nodefaults to all env YAML and comment global.env usage
  • fix stringi version

Full Changelog: v2.0.0...v2.0.1

v2.0.0 - Snakemake 8 compatible

13 Sep 13:39

Breaking change: Requires Snakemake >= v8.20.1

Full Changelog: v1.0.1...v2.0.0

v1.0.1 - bug fixes and exception handling

07 Jul 14:27

Bug fixes and exception handling.

Full Changelog: v1.0.0...v1.0.1

v1.0.0 - stable version with new features, complete docs and examples

12 Jun 17:04

Features

  • Enrichment Analysis Methods:

    • Region Set Analysis:
      • LOLA: Genomic Locus Overlap Enrichment Analysis.
      • GREAT: Genomic Regions Enrichment of Annotations Tool using rGREAT.
      • pycisTarget: Motif enrichment analysis in region sets to identify high-confidence transcription factor (TF) cistromes.
    • Gene Set Analysis:
      • Over-representation Analysis (ORA): Using GSEApy's enrich() function.
      • RcisTarget: Motif enrichment analysis in gene sets to identify high-confidence TF cistromes.
    • Region-based Gene Set Analysis:
      • Region-gene associations obtained using (r)GREAT.
      • Complementary ORA using GSEApy and TFBS motif enrichment analysis using RcisTarget.
    • Preranked Gene Set Analysis:
      • Preranked GSEA using GSEApy's prerank() function.
  • Database Support:

    • Local databases for GSEApy and (r)GREAT
      • GMT files e.g., from MSigDB or Enrichr.
      • (custom) JSON file support.
    • LOLA databases from LOLA Region Databases or custom created.
    • cisTarget databases for pycisTarget and RcisTarget.
  • Group Aggregation:

    • Aggregation of results per method and database.
    • Filtered aggregation retaining only statistically significant terms.
  • Visualization:

    • Enrichment dot plots for each query, method, and database combination.
    • Hierarchically clustered heatmaps and bubble plots for group summaries.

Documentation

  • Usage Instructions:

    • Steps to download relevant databases and configure the analysis.
    • Commands for running the workflow and generating reports.
  • Examples: Provided example queries and databases with instructions for running a complete analysis.

  • Links and Resources:

    • GitHub repository, Zenodo repository, and Snakemake Workflow Catalog entry.
    • Recommended compatible MR.PARETO modules for upstream processing and analyses.
    • Web versions of some tools and databases for region/gene sets.

Beware: All packages were updated to their latest versions, so results might differ. If possible, rerunning is recommended. The workflow's functionality expanded significantly, so many changes were introduced, especially in the configuration.

Thanks to early adopters @dariarom94, @Rubbert, and @bednarsky for testing and providing constructive feedback.

Bug fixes and performance improvements are not listed.

Full Changelog: v0.1.1...v1.0.0

v0.1.1 - small improvements, documentation and citation information

08 Apr 14:00
e1a7e2d

v0.1.0 - stable version with complete docs and examples

15 Jan 13:11

features

  • enrichment analysis methods
    • region-sets
      • LOLA: Genomic Locus Overlap Enrichment Analysis is run locally.
      • GREAT using rGREAT: Genomic Regions Enrichment of Annotations Tool is queried remotely (requires a working internet connection).
    • gene-sets
      • over-representation analysis (ORA) using GSEApy enrich() function performs Fisher’s exact test (i.e., hypergeometric test) and is run locally.
      • preranked gene-set enrichment analysis (preranked GSEA) using GSEApy prerank() function performs preranked GSEA and is run locally.

Note: All genomic region sets are also subjected to gene-set ORA, using the region-gene associations of each query and background region set obtained from GREAT. This provides an extended region-set enrichment perspective by querying databases that are not supported by region-based tools.
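The ORA statistic mentioned above (a one-sided hypergeometric test) can be written out directly. The following is a minimal standard-library sketch of the test itself, not GSEApy's implementation; the function and parameter names are hypothetical.

```python
from math import comb


def ora_p_value(overlap, query_size, term_size, background_size):
    """One-sided hypergeometric p-value P(X >= overlap).

    X is the number of term genes found when drawing query_size genes
    without replacement from a background of background_size genes, of
    which term_size belong to the term.
    """
    max_overlap = min(query_size, term_size)
    total = comb(background_size, query_size)
    # Sum the probability mass of all outcomes at least as extreme as observed
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 3 of 3 query genes hit a 4-gene term in a 10-gene background
p = ora_p_value(overlap=3, query_size=3, term_size=4, background_size=10)
```

The choice of background matters: for region-based gene sets, the note above describes deriving it from the background region set rather than from all annotated genes.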

  • resources (databases) for both gene-based analyses are either downloaded (Enrichr) or copied from local JSON or GMT files.

    • all Enrichr databases can be queried (enrichr_dbs).
    • local JSON database files can be queried (local_json_dbs).
    • local GMT database files (e.g., from MSigDB) can be queried (local_gmt_dbs).
  • group aggregation of results per method and database

    • results of all queries belonging to the same group are aggregated per method and database.
    • a filtered version taking the union of all statistically significant terms per query is also saved.
  • visualization

    • region/gene-set specific enrichment dot plots are generated for each query, method, and database combination. The top terms are ranked (along the y-axis) by the mean rank of statistical significance, effect-size, and overlap, with the goal of making the results more balanced and interpretable.
    • group summary/overview
      • the union of the most significant terms per query, method, and database within a group is determined.
      • their effect-size and statistical significance are visualized as hierarchically clustered heatmaps.
      • a hierarchically clustered bubble plot encoding both effect-size and significance is provided.
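The mean-rank ordering used for the enrichment dot plots can be illustrated with a short sketch. It assumes ordinal ranks (ties broken by input order), smaller adjusted p-values ranking better, and larger effect-size and overlap ranking better; the key names p_adjust, effect_size, and overlap are hypothetical, and the workflow's tie handling may differ.

```python
def rank_values(values, descending=False):
    """Return ordinal ranks (1 = best) for a list of values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def mean_rank_order(terms):
    """Order term names by the mean of their three per-metric ranks."""
    significance = rank_values([t["p_adjust"] for t in terms])  # smaller is better
    effect = rank_values([t["effect_size"] for t in terms], descending=True)
    overlap = rank_values([t["overlap"] for t in terms], descending=True)
    mean_ranks = [(s + e + o) / 3 for s, e, o in zip(significance, effect, overlap)]
    order = sorted(range(len(terms)), key=lambda i: mean_ranks[i])
    return [terms[i]["name"] for i in order]


# Hypothetical enrichment results for three terms
terms = [
    {"name": "A", "p_adjust": 0.001, "effect_size": 2.0, "overlap": 5},
    {"name": "B", "p_adjust": 0.01, "effect_size": 5.0, "overlap": 20},
    {"name": "C", "p_adjust": 0.05, "effect_size": 1.0, "overlap": 2},
]
ordered_names = mean_rank_order(terms)
```

In this example term B is placed above term A even though A is more significant, because B ranks first on both effect-size and overlap. This is the balancing effect the mean rank is meant to achieve.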

documentation

  • complete documentation of used software, all features, and methods
  • a minimal example to test all supported features
  • external resources