Releases: epigen/enrichment_analysis
v3.0.0 - Configurable GREAT region annotation and smaller aggregate outputs
This release changes how GREAT-associated regions and genes are exported to reduce runtime, file size, duplicated data, and Excel compatibility problems caused by very large annotation cells.
Features (breaking change)
- Added `great_parameters:map_associated_regions` in the config to control how many significant GREAT terms are annotated with associated query regions and genes in individual query result tables. This update to the configuration file represents the breaking change.
  - The new default is `1`, so users can inspect an example annotated term without paying the cost of annotating every significant term.
  - Supported values:
    - `0`: do not annotate GREAT terms with associated regions/genes
    - positive integer, for example `5`: annotate that many top significant terms ranked by the configured adjusted p-value column
    - `-1`: annotate all significant terms, restoring the previous behavior
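The three supported values can be illustrated with a minimal sketch. This is not the workflow's GREAT script: the function name, the `p_adjust` column name, the `alpha` significance threshold, and the list-of-dicts layout are all assumptions made for illustration.

```python
def select_terms_to_annotate(terms, map_associated_regions,
                             padj_col="p_adjust", alpha=0.05):
    """Return the significant terms that would receive region/gene annotation.

    terms: list of dicts, one per GREAT term (hypothetical layout).
    map_associated_regions: 0 (none), positive integer N (top N), or -1 (all).
    """
    if map_associated_regions < -1:
        raise ValueError("expected -1, 0, or a positive integer")
    # Keep significant terms only, most significant first.
    significant = sorted(
        (t for t in terms if t[padj_col] <= alpha),
        key=lambda t: t[padj_col],
    )
    if map_associated_regions == -1:
        return significant
    # Slicing with 0 yields an empty list, i.e. no annotation at all.
    return significant[:map_associated_regions]


example_terms = [
    {"id": "GO:1", "p_adjust": 0.01},
    {"id": "GO:2", "p_adjust": 0.2},    # not significant at alpha=0.05
    {"id": "GO:3", "p_adjust": 0.001},
    {"id": "GO:4", "p_adjust": 0.03},
]
top_one = select_terms_to_annotate(example_terms, 1)     # the default value
all_sig = select_terms_to_annotate(example_terms, -1)    # previous behavior
none = select_terms_to_annotate(example_terms, 0)
```

With the default of `1`, only the single most significant term is selected, which is what keeps annotation cost low.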
Changes
- GREAT group-level aggregated result CSV files no longer include the `regions` and `annotated_genes` columns.
  - These columns remain available in individual GREAT query result CSV files when annotation is enabled with `map_associated_regions`.
  - This avoids duplicating very large region/gene annotation strings across aggregate files and prevents aggregate outputs from becoming unnecessarily large.
- GREAT parameters are passed through Snakemake rule `params` instead of being read directly from the config inside the GREAT analysis script.
Documentation
- Updated the example config and schema with `great_parameters:map_associated_regions`.
- Updated the README and config documentation to warn that GREAT region/gene annotation can take a long time, substantially increase file size, and break Excel usage because cells can exceed Excel's 32,767 character limit.
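To check whether a result table is affected by the Excel cell limit mentioned above, a small scan like the following can help. It is a sketch using only the Python standard library; the function name and the sample column names are hypothetical and not part of the workflow.

```python
import csv
import io

EXCEL_CELL_CHAR_LIMIT = 32767  # maximum number of characters Excel keeps per cell


def find_oversized_cells(csv_text, limit=EXCEL_CELL_CHAR_LIMIT):
    """Return (row_number, column_name, length) for each cell longer than limit."""
    oversized = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, value in row.items():
            # Skip surplus or missing fields, which DictReader reports as None.
            if column is None or value is None:
                continue
            if len(value) > limit:
                oversized.append((row_number, column, len(value)))
    return oversized


# Hypothetical table: one term carries a very long region annotation string.
long_value = "chr1:1-100;" * 4000
sample = "term,regions\nGO:1," + long_value + "\nGO:2,chr1:1-100\n"
problems = find_oversized_cells(sample)
```

Any entry in `problems` points at a cell that Excel would not be able to hold in full.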
What's Changed
- Limit GREAT region annotations by @martin5555555555 in #63
Full Changelog: v2.1.0...v3.0.0
v2.1.0 - Additional genomic regions analysis outputs and new summary visualization
This release expands the enrichment workflow with the addition of a 'specific' group summary plot, additional pycisTarget and GREAT outputs, more robust handling of sparse or empty inputs, and updated documentation and infrastructure.
Features
- The group summary outputs are now `*_summary_topTerms.png` and `*_summary_specificTerms.png`, a new group-level summary plot highlighting terms that are more specific to individual groups.
- Extended pycisTarget outputs with additional exported tables for motif hits and cistromes.
- Extended GREAT result tables with associated query regions and annotated genes for significant terms.
- Added `helpers/features_to_bed.py`, a helper script for converting feature ID lists into BED files by mapping IDs to genomic coordinates from an annotation table.
- Added schema validation for workflow configuration and annotation files in `workflow/schemas/`.
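The idea behind converting feature ID lists into BED records can be sketched as follows. This is not the code of `helpers/features_to_bed.py`; the function signature, the annotation column names (`feature_id`, `chr`, `start`, `end`), and the assumption that the annotation already stores BED-style coordinates are illustrative only.

```python
import csv
import io


def features_to_bed(feature_ids, annotation_csv, id_col="feature_id",
                    chrom_col="chr", start_col="start", end_col="end"):
    """Map feature IDs to BED lines (chrom, start, end, name).

    Returns the BED lines and the list of IDs missing from the annotation.
    """
    # Build an ID -> coordinates lookup from the annotation table.
    coords = {}
    for row in csv.DictReader(io.StringIO(annotation_csv)):
        coords[row[id_col]] = (row[chrom_col], int(row[start_col]), int(row[end_col]))

    bed_lines, missing = [], []
    for fid in feature_ids:
        if fid in coords:
            chrom, start, end = coords[fid]
            bed_lines.append(f"{chrom}\t{start}\t{end}\t{fid}")
        else:
            missing.append(fid)
    return bed_lines, missing


# Hypothetical annotation table and feature list.
annotation = "feature_id,chr,start,end\npeak1,chr1,100,200\npeak2,chr2,500,650\n"
bed, not_found = features_to_bed(["peak2", "peak9", "peak1"], annotation)
```

Reporting unmatched IDs separately, rather than dropping them silently, makes it easier to spot a mismatch between the feature list and the annotation table.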
Changes
- Removed the previous heatmap-based group summary outputs; group summaries are now provided only as bubble plots.
- Removed filtered aggregate `_sig.csv` outputs; group-level aggregation now focuses on the complete `_all.csv` table and summary plots.
- Updated the example configuration and test setup. The default config now runs a minimal test spanning all workflow functionalities. The data for this test is restored via `test/setup_test_resources.sh`.
Bug Fixes
- Improved handling of empty BED, query, and background inputs in region-based workflows.
- Fixed coordinate convention mismatches between BED inputs and GREAT-derived region-gene association outputs. Adopted exporting coordinates in BED-style indexing as the convention.
- Improved aggregation and summary plotting for sparse result sets, including very small matrices and empty outputs.
- Added stable fallback behavior for empty enrichment and summary plots so reporting remains informative when no significant results are found.
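The coordinate fix above concerns the difference between BED-style indexing (0-based start, end-exclusive) and 1-based inclusive intervals. The sketch below shows the conversion in both directions, assuming the GREAT-derived coordinates are 1-based and inclusive, as is typical for R genomic range objects; the function names are hypothetical.

```python
def one_based_to_bed(start, end):
    """Convert a 1-based inclusive interval to BED (0-based, half-open)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the inclusive end equals the exclusive BED end.
    return start - 1, end


def bed_to_one_based(start, end):
    """Convert a BED (0-based, half-open) interval to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


# The first 100 bases of a chromosome:
bed_interval = one_based_to_bed(1, 100)      # (0, 100)
round_trip = bed_to_one_based(*bed_interval)  # (1, 100)
```

In BED-style indexing the interval length is simply `end - start`, whereas in the 1-based inclusive convention it is `end - start + 1`.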
Documentation
- Expanded guidance on skipping selected enrichment tools when they are not needed.
- Added example Snakemake rule templates for downloading common enrichment resources, including gene set databases, cisTarget resources, and LOLA region databases in `/helpers/database_download_rules.md`.
- Updated the README to describe the current output structure, BED indexing conventions, and new summary plot outputs.
Infrastructure
- Added GitHub Actions for CI, container image generation, and conda environment pinning.
- Added Snakemake containerization support.
What's Changed
- Add summary plot for specific terms & remove summary heatmaps. by @bednarsky in #37
- accept empty bed files, inform users for empty plots, tested the top-n plot for all softwares, handles the +1 shift of Irange ok by @martin5555555555 in #40
- Pycis and Great retrieval, download rules helpers, config validation schema by @martin5555555555 in #42
- new tested wget download rules by @martin5555555555 in #46
- Refactor plot messages and annotation loading by @martin5555555555 in #48
- Test data and github action yaml by @martin5555555555 in #47
- Containerize by @martin5555555555 in #49
- Containerize + Pinning envs by @martin5555555555 in #50
- Update container.yaml to setup resources by @martin5555555555 in #51
- Still containerize by @martin5555555555 in #52
- new secret by @martin5555555555 in #53
- Pycis env version + Pin action by @martin5555555555 in #54
- Update pin-conda-envs.yaml by @martin5555555555 in #55
- Update conda env pins by @github-actions[bot] in #56
- Update container.yaml by @martin5555555555 in #57
- update pycisTarget env by @martin5555555555 in #59
- Update conda env pins by @github-actions[bot] in #60
- Update pycisTarget.yaml by @martin5555555555 in #61
New Contributors
- @martin5555555555 made their first contribution in #40
- @github-actions[bot] made their first contribution in #56
Full Changelog: v2.0.3...v2.1.0
v2.0.3 - minor improvement
Full Changelog: v2.0.2...v2.0.3
v2.0.2 - Minor fixes
- Make all resource files input to rules
Full Changelog: v2.0.1...v2.0.2
v2.0.1 - enable module usage using `github()` directive
- to enable module usage using `github()` directive
  - source utils.R via `params` instead of `snakemake@source`
  - comment `global.yaml` (now requires full snakemake installation, not minimal)
- add nodefaults to all env YAML and comment global.env usage
- fix stringi version
What's Changed
- Fixing stringi version to fix env by @bednarsky in #27
New Contributors
- @bednarsky made their first contribution in #27
Full Changelog: v2.0.0...v2.0.1
v2.0.0 - Snakemake 8 compatible
Breaking change: Requires Snakemake >= v8.20.1
Full Changelog: v1.0.1...v2.0.0
v1.0.1 - bug fixes and exception handling
Bug fixes and exception handling.
Full Changelog: v1.0.0...v1.0.1
v1.0.0 - stable version with new features, complete docs and examples
Features
- Enrichment Analysis Methods:
  - Region Set Analysis:
    - LOLA: Genomic Locus Overlap Enrichment Analysis.
    - GREAT: Genomic Regions Enrichment of Annotations Tool using rGREAT.
    - pycisTarget: Motif enrichment analysis in region sets to identify high-confidence transcription factor (TF) cistromes.
  - Gene Set Analysis:
    - Over-representation Analysis (ORA): Using GSEApy's enrich() function.
    - RcisTarget: Motif enrichment analysis in gene sets to identify high-confidence TF cistromes.
  - Region-based Gene Set Analysis:
    - Region-gene associations obtained using (r)GREAT.
    - Complementary ORA using GSEApy and TFBS motif enrichment analysis using RcisTarget.
  - Preranked Gene Set Analysis:
    - Preranked GSEA using GSEApy's prerank() function.
- Database Support:
  - Local databases for GSEApy and (r)GREAT
    - GMT files e.g., from MSigDB or Enrichr.
    - (custom) JSON file support.
  - LOLA databases from LOLA Region Databases or custom created.
  - cisTarget databases for pycisTarget and RcisTarget.
- Group Aggregation:
  - Aggregation of results per method and database.
  - Filtered aggregation retaining only statistically significant terms.
- Visualization:
  - Enrichment dot plots for each query, method, and database combination.
  - Hierarchically clustered heatmaps and bubble plots for group summaries.
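Over-representation analysis, listed above, is typically a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below computes that p-value with the standard library only; it illustrates the statistic and is not GSEApy's implementation, and the function and argument names are hypothetical.

```python
from math import comb


def ora_pvalue(overlap, query_size, term_size, background_size):
    """P(X >= overlap) when drawing query_size genes from the background.

    X counts how many drawn genes belong to a term of size term_size.
    """
    total = comb(background_size, query_size)
    upper = min(query_size, term_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Background of 10 genes, term of 5 genes, query of 5 genes, all 5 in the term:
p_full_overlap = ora_pvalue(5, 5, 5, 10)   # 1 / 252
p_no_evidence = ora_pvalue(0, 5, 5, 10)    # 1.0, since any overlap is >= 0
```

The choice of background matters: the same overlap yields a different p-value if `background_size` changes, which is why a matching background set is needed for a meaningful test.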
Documentation
- Usage Instructions:
  - Steps to download relevant databases and configure the analysis.
  - Commands for running the workflow and generating reports.
- Examples: Provided example queries and databases with instructions for running a complete analysis.
- Links and Resources:
  - GitHub repository, Zenodo repository, and Snakemake Workflow Catalog entry.
  - Recommended compatible MR.PARETO modules for upstream processing and analyses.
  - Web versions of some tools and databases for region/gene sets.
Beware: All packages were updated to their latest versions, so results might differ; rerunning previous analyses is recommended where possible. The workflow's functionality expanded significantly, so many changes were introduced, especially in the configuration.
Thanks to early adopters @dariarom94, @Rubbert, and @bednarsky for testing and providing constructive feedback.
Bug fixes and performance improvements are not mentioned.
Full Changelog: v0.1.1...v1.0.0
v0.1.1 - small improvements, documentation and citation information
v0.1.0 - stable version with complete docs and examples
features
- enrichment analysis methods
  - region-sets
  - gene-sets
    - over-representation analysis (ORA) using GSEApy enrich() function performs Fisher’s exact test (i.e., hypergeometric test) and is run locally.
    - preranked gene-set enrichment analysis (preranked GSEA) using GSEApy prerank() function performs preranked GSEA and is run locally.
Note: All genomic region sets are subjected to gene-set ORA, leveraging the region-gene associations of each query and background region set obtained using GREAT. This provides an extended region-set enrichment perspective by querying databases that are not supported by region-based tools.
- resources (databases) for both gene-based analyses are either downloaded (Enrichr) or copied from local JSON or GMT files.
  - all Enrichr databases can be queried (enrichr_dbs).
  - local JSON database files can be queried (local_json_dbs).
  - local GMT database files (e.g., from MSigDB) can be queried (local_gmt_dbs).
- group aggregation of results per method and database
  - results of all queries belonging to the same group are aggregated per method and database.
  - a filtered version taking the union of all statistically significant terms per query is also saved.
- visualization
  - region/gene-set specific enrichment dot plots are generated for each query, method, and database combination where the top terms are ranked (along the y-axis) by the mean rank of statistical significance, effect-size, and overlap with the goal to make the results more balanced and interpretable.
  - group summary/overview
    - the union of the most significant terms per query, method, and database within a group is determined.
    - their effect-size and statistical significance are visualized as hierarchically clustered heatmaps.
    - a hierarchically clustered bubble plot encoding both effect-size and significance is provided.
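The mean-rank ordering used for the dot plots above can be sketched as follows. This is an illustration rather than the workflow's plotting code: the field names (`p_adjust`, `effect_size`, `overlap`), the average-rank handling of ties, and the assumption that larger effect sizes and overlaps rank better are all choices made for this example.

```python
def rank_values(values, descending=False):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        tie_end = pos
        while (tie_end + 1 < len(order)
               and values[order[tie_end + 1]] == values[order[pos]]):
            tie_end += 1
        average_rank = (pos + tie_end) / 2 + 1
        for i in order[pos:tie_end + 1]:
            ranks[i] = average_rank
        pos = tie_end + 1
    return ranks


def order_terms_by_mean_rank(terms):
    """Order term names by the mean of their three per-criterion ranks."""
    p_ranks = rank_values([t["p_adjust"] for t in terms])                 # small is better
    e_ranks = rank_values([t["effect_size"] for t in terms], descending=True)
    o_ranks = rank_values([t["overlap"] for t in terms], descending=True)
    mean_ranks = [(p + e + o) / 3 for p, e, o in zip(p_ranks, e_ranks, o_ranks)]
    order = sorted(range(len(terms)), key=lambda i: mean_ranks[i])
    return [terms[i]["term"] for i in order]


example = [
    {"term": "A", "p_adjust": 0.001, "effect_size": 2.0, "overlap": 5},
    {"term": "B", "p_adjust": 0.01, "effect_size": 4.0, "overlap": 20},
    {"term": "C", "p_adjust": 0.02, "effect_size": 3.0, "overlap": 4},
]
ordered = order_terms_by_mean_rank(example)
```

In this example term A has the smallest adjusted p-value, yet term B comes first because it ranks best on effect size and overlap, which is the balancing effect the mean rank is meant to achieve.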
documentation
- complete documentation of used software, all features, and methods
- a minimal example to test all supported features
- external resources