A set of tools to use Bayesian inference to infer enrichment of sequencing fragments in an extracted sample, i.e., Chip-seq, IPOD-HR, HBD-seq, etc., vs. input DNA fragments.
We note that Enricherator takes the term "sequencing fragments" seriously. Its statistical model works well for count data that quantify alignments of paried-end sequencing fragments. If you have single-end data, you must estimate the fragment size and extend the 3-prime ends of your alignments by the appropriate ammount to count pseudo-fragments.
This version (v 1.0.0) of Enricherator is a substantial improvement over the version most recentply published in JMB1, and of course, its original implementation on HBD-seq data2. Version 1.0.0 has the following benefits:
- Uses median read count normalization instead of a trimmed mean. Empirically, we have found that this just works much better, as mean normalization often resulted in enrichment estimates that were systematically shifted by a baseline factor. Median normalization does not have this problem.
- MUCH lower memory usage during post-fit steps
- Better use of shrinkage prior, user now specifies their prior belief for the fraction of the genome truly enriched in both their control data and their experimental (often called "extract" below) data
- Able to handle arbitrary contrasts between conditions
- Can subtract control enrichments from the experimental enrichments. For example, if, in a ChIP-seq experiment, the ChIP-seq against a protein of interest (we'll call this the "experimental ChIP-seq"), a ChIP against a strain knocked out for the protein of interest (we'll call this the "control ChIP-seq"), and input sequencing for both the expeperimental and control ChIP-seq, were performed, the enrichments inferred in the control ChIP-seq relative to its input signal can be subtracted from the enrichments inferred for the experimental ChIP-seq relative to its input signal. This is useful for removing spurious enrichments that have nothing to do with the actual signal of interest, so that genomic loci with spurious enrichments are not considered in downstream analyses. We often use it to subtract RNAP ChIP-seq signals from our IPOD signals to generate IPOD-HR signals.
Enricherator uses bedgraph files (in the near future we look forward to supporting use of bigwig files) containing fragment counts to infer enrichments. To use enricherator, the user must have counted paired-end alignment counts at their desired genome resolution (we often use 5-bp resolution), and have estimated the mean fragment size for their extracted data, their control data, if it exists, and their input sequencing data. Enricherator will use the fragment sizes to inform how much local genomic space over which to smooth enrichment estimates.
Once the above information and files are in hand, an "info file" must be prepared to indicate to Enircherator the salient information about each sequencing library, i.e., the condition of interest, the biological repliate id a given library represents, the strand represented by the data if a strand-specific analysis is desired, and to which input, control, and extract groups each sample belongs (see below for explanation of setting input, control, and extract groups in an info file).
Examples of bedgraph files and info files can be found in the "examples" directory of this repository.
The steps of running enricherator include:
- Prepare the data for Enricherator
- Fit the Enricherator model to data
- Enricherator will fit the model 5 independent times, concurrently, and will choose to proceed with the fit that achieved the highest ELBO.
- Parse the csv file with all parameter estimates to a binary file
- Parse the binary file
- Generates bedgraph files for all desired enrichment estimates and contrasts.
- Perform Bayesian peak calling (optional)
We provide a containerized version of Enricherator that should simplify its use. The container is build using Apptainer, which will allow the container to run on any Linux operating system with Apptainer installed. We do not yet have the container posted to a public repository such as SylabsCloud, so please reach out the Lydia (lydsf@umich.edu) to obtain the container if you would like to use Enricherator.
If you are able to use our containerized application, skip to the Running enrichment inference section.
To start using Enricherator, clone this repository and enter the directory created.
Then, to set up the conda environment, run
conda env create -f conda_environment.yamlThis will create the rstan conda environment.
NOTE: The environment still will not contain cmdstanr,
and you will have to install that separately, from
within the environment.
To set up cmdstanr, run the following:
conda activate rstan
RThen, from the R terminal, run:
install.packages("stringi", configure.args="--disable-pkg-config")
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
# check whether cmdstanr is properly set up
library(cmdstanr)
check_cmdstan_toolchain()The check_cmdstan_toolchain call has always returned that
the toolchain is properly set up, so I haven't had to
debug any issues that could arise here. However, if you
have issues, note that check_cmdstan_toolchain has
a fix argument that could be helpful.
We'll use the examples/subtraction_example directory for these instructions,
but bear in mind that references to files and directories in our example
directory should be adjusted to meet your needs.
Enter directory containing sample info file
(see examples/subtraction_example/example_info.csv and
examples/no_subtraction_example/example_info.csv for sample info file templates.
Run the code below; you will need to adjust several of the arguments
passed to the scripts below for your own usage. First,
set --extract_subsample_dist to half the mean extracted (experimental group; not control; not input)
data fragment length (<ext_frag_len/2> below),
set --extract_fragment_length to the mean extracted data fragment length (<ext_frag_len> below),
if doing control subtraction, set --control_subsample_dist
to half the mean control data fragment length (<cont_frag_len/2> below),
and set --control_fragment_length to the mean control data fragment length (<cont_frag_len> below).
set --input_subsample_dist to half the mean input data fragment length (<inp_frag_len/2> below),
and set --input_fragment_length to the mean input data fragment length (<inp_frag_len> below).
Depending on the type of experiment being performed, set --extract_frac_genome_enriched and
--control_frac_genome_enriched to values you feel appropriate. For these example data,
we used IPOD and RNAP ChIP-seq data from the original IPOD-HR paper3. Therefore,
it is reasonable to expect 10% of the genome to be occupied by protein in the IPOD signal (extract),
or by RNAP in the RNAP ChIP-seq signal (technically referred to as "control" here, although
the purpose of RNAP ChIP-seq signal subtraction is not actually to serve as a control per se in
IPOD-HR experiments). Think about the value you want to set here, because setting it much
higher than appropriate may result in false-positive enrichment calls, and setting
it much lower than appropriate may result in false-negative enrichment calls.
If you want to ignore any contigs in your reference fasta file, you
can set those contigs as a comma-separated list using the --ignore_ctgs
argument as shown below. To include all contigs, simply omit the
--ignore_ctgs argument.
In future versions of Enricherator we plan to support spike-in normalization,
but for now, the --libsize_key argument should be exactly as
shown below in order
to use median coverage as the size factor for normalization.
If not using apptainer, substitute your location of the Enricherator source code for "/src" in the code examples below and omit the line beginning with "apptainer".
cd examples/subtraction_example
direc=$(pwd)/example_results
apptainer exec -B $(pwd) /path/to/enricherator.sif \
Rscript /src/R/prep_data.R \
--info example_info.csv \
--data_direc "${direc}/data" \
--norm_method libsize \
--control_subsample_dist 50 \
--control_fragment_length 100 \
--input_subsample_dist 40 \
--input_fragment_length 80 \
--extract_subsample_dist 50 \
--extract_fragment_length 100 \
--extract_frac_genome_enriched 0.1 \
--control_frac_genome_enriched 0.1 \
--libsize_key "median_size_factors" \
--cores 4If this step ran successfully, you should see a directory called "data" in
the examples/subtraction_example directory. It should contain
files:
config.toml
data.RData
expt_key.json
position_map.txt To run enrichment inference, enter the examples/subtraction_examples
directory and run the following:
cd examples/subtraction_example
direc=$(pwd)/example_results
# this script will perform 5 independent fits of the enricherator model
# to the input data. It can transiently require a lot of storage space (100s of GB)
# after the fits have all converged this script will choose which fit
# acheived the highest ELBO and will remove files associated with the other 4 fits.
apptainer exec -B $(pwd) /path/to/enricherator.sif \
Rscript /src/R/enricherator_fit.R \
--compiled_model "/src/stan/relu_control_sub" \
--data_direc "${direc}/data" \
--cores 5 \
--grad_samps 1If inference fails, increment the --grad_samps argument by one and try again.
Once complete, the following files should be present in examples/subtraction_example/draws,
fit.log
fit.err
fit.RData
data.RData
draws/draws-1.csvNote that if you fit the model to genome-wide data,
draws-1.csv will be several hundreds of Gigabytes,
since it is the full output for 500 samples of the
approximate posterior for tens- to hundreds-of-millions of
parameters. These parameter estimates are filtered and written to
a binary file in the next step to retain a compressed version of
only the necessary information from draws-1.csv.
Once all steps of Enricherator are successfully run, we typically
delete draws-1.csv.
After the fitting step completes, YOU MUST EDIT YOUR config.toml file
in the data directory created in the first step. Here's where you'll
need to familiarize yourself with the parameter names in the Enricherator model.
- Alpha - inferred intercept values (represents input sequencing coverage)
- Beta - inferred control enrichments
- Gamma - inferred control-subtracted enrichments in your extracted (experimental) samples
- non_sub - inferred raw extract enrichments, with no control subtraction performed
You'll need to add a "parse" section to config.toml that could look something
like the following:
[parse]
# Tell Enricherator which parameters you want extracted from the draws-1.csv file.
# Here we're doing all of them.
vars = ["Alpha", "Beta", "Gamma", "non_sub"]
# Tell Enricherator the threshold for calculating evidence ratios for enrichment.
# NOTE: this is on the log2 scale, so the value of 1.0 below is 2^1, so 2-fold enrichment
threshold = 1.0
# Set the region of practical equivalence (ROPE) within which there is to be considered NO enrichment.
rope = 0.5To perform contrasts between genotypes (technically, the contrasts are performed in the following step,
but it's useful to bring this up here), you must tell Enricherator which parameters you want
contrasted, and within each parameter, which genotype are included in the contrast. Of course,
these conditions are not technically genotypes, but Enricherator doesn't care about those semantics.
It just want to do some math, so as long as the "genotypes" listed here match a "genotype" in you
sample info file from the prep_data.R call, this should work..
[geno_contrasts]
# Get the difference in control-subtracted enrichment (IPOD-HR signal in this example)
# between rich defined medium and minimal medium
Gamma = ["wt_m9_rdm_glu-wt_m9_min_glu"]
# Get the difference in control enrichment (RNAP ChIP-seq in this example,
# so here this is biologically meaninful!)
# between rich defined medium and minimal medium
Beta = ["wt_m9_rdm_glu-wt_m9_min_glu"]
# Same as for Gamma contrast above, but for the raw IPOD enrichments, which includes RNAP signal
non_sub = ["wt_m9_rdm_glu-wt_m9_min_glu"]
# threshold and rope here are the same as for the "parse" section above, but for the genoytpe contrasts
threshold = 0.0
rope = 0.5Note that multiple contrasts can be performed for any parameter. For example, if we also had, in addition to the glucose-containing rich defined medium (rdm_glu) and minimal medium (min_glu) conditions, a condition with succinate-minimal medium (min_suc), we could do the following to get contrasts for both rdm_glu-min_glu and min_suc-min_glu for the Gamma parameter:
[geno_contrasts]
Gamma = ["wt_m9_rdm_glu-wt_m9_min_glu", "wt_m9_min_suc-wt_m9_min_glu"]
threshold = 0.0
rope = 0.5Once you've updated the config.toml file with your desired "parse" and "geno_contrasts"
sections, run the following
to filter the large csv file, keeping the parameters of interest and writing
them to a binary file.
cd examples/subtraction_example
direc=$(pwd)/example_results
apptainer exec -B $(pwd) /path/to/enricherator.sif \
Rscript /src/R/parse_csv_to_blob.R \
--cfg_file ${direc}/data/config.tomlWe write a bedgraph file for each of the following summary statistics for each
genotype and strand:
mean of the 500 samples, lower 90% credible interval, upper 90% credible interval,
the evidence ratio
(K) for the enrichment being greater than the value provided at the
--threshold argument in config.toml,
and K for the enrichment being less than the --threshold value.
In addition, evidence ratios are written for the enrichment being in the ROPE,
or not in the ROPE.
Again, note that --threshold and --rope are on the log2-scale, so setting --threshold 1.0
denotes a fold-enrichment of 2, and setting --threshold 2.0 would
denote a fold-enrichment of 4. Likewise, setting --rope 0.5 would define a ROPE
interval on the log2 scale of [-0.5,0.5], which in fold-enrichment space would
be an interval of approximately [0.71,1.41].
Finally, to write the bedgraph files, run the following:
cd examples/subtraction_example
direc=$(pwd)/example_results
apptainer exec -B $(pwd) /path/to/enricherator.sif \
Rscript /src/R/parse_blob_file.R \
--cfg_file ${direc}/data/config.toml \Several new files will be produced by parse_blob_file.R.
- Bedgraph files containing enrichment scores on the log2 scale will have the suffix "_mean.bedgraph", and will appear in
example_results/draws. The files with "Beta" in their name contain log2(enrichment) scores for the control enrichments, the files with "Alpha" in their name are a measure of the log2(median-normalized input abundance), and files with "Gamma" in their name contain log2(control-subtracted enrichment). Robust z-scores, rolling medians, and rolling means can be calculated usingbgtools. - Evidence ratio files ending with "Kge.begraph" and "Klt.begraph" contain, for each genome position, the evidence ratio in favor of enrichment at this locus being greater than, or less than or equal to, respectively, the user-defined threshold in
config.toml. - Evidence ratio files ending with "K_in_rope.begraph" and "K_not_in_rope.begraph" contain, for each genome position, the evidence ratio in favor of enrichment at this locus being within, or outside, respectively, the user-defined ROPE in
config.toml. - Any png files in
example_results/drawscan be inspected to verify control subtraction performed as expected.
At this point, if you have inspected the quantities output by the
above script, and if they seem reasonable, it is recommended that you
delete the draws-1.csv file.
The evidence ratios provided by Enricherator
can be used to call peaks at any desired
level of evidence. We often use the evidence ratios pointed to by
Kass and Raftery4, with K
While we do not provide peak calling utilities directly in this repository or in the Enricherator apptainer container, we share our basic workflow for peak calling below.
Let us say that we ran Enricherator on ChIP-seq data and want to call peaks in genome-wide Gamma estimates, and that we define our threshold on the log2-scale to be 1.0. To call peaks we would then need to identify contiguous regions of the genome with an evidence ratio supporting the hypothesis for enrichment above 1.0 that exceeds our evidence threshold.
For this example, we will require "very strong" evidence (K bgtools
in this example to merge contiguous bedgraph files into bed formatted regions.
OUTDIR="example_results/draws"
K_thresh=150
min_width=50
# we will call peaks in every condition using the following loop
# The bedgraph files with K_gt in their names contain the evidence ratios for
# samples from the approximate posterior being greater than the threshold, so
# those are the files we'll use for peak calling
for bgfile in ${OUTDIR}/Gamma_*_Kge.bedgraph; do
# strip the file extension and path to generate new file name
base=$(basename $bgfile .bedgraph)
# get the path only
direc=$(dirname $bgfile)
# now make the new file path
outfile="${direc}/${base}_Kgeq_${K_thresh}.bed"
echo "Creating ${outfile} by filtering ${bgfile}"
# use awk to print only lines exceeding the evidence threshold,
# pipe the result to bgtools to get just the contiguous regions as bed format
# pipe the bed formatted regions to awk to filter by size.
awk -v thresh=$K_thresh '$4 >= thresh {print}' $bgfile \
| bgtools contiguous_regions -i - \
| awk -v width=$min_width 'BEGIN{OFS=FS="\t"} $3-$2 >= width {print}' \
> $outfile
doneTo run your own Enricherator jobs, we recommend using our example info csv files as templates to build out your own csv file. A few notes to clarify the purpose of the final three columns:
- input_group
- control_group
- extract_group
The values in these columns must be non-negative integers, as they are indicators used to assign each bedgraph file's count data to a specific role in the Enricherator model. Input sequencing counts should have a non-zero value in the input_group column, with zeros for the other two columns. Control sequencing counts should have the same number in their input_group column as the input sequencing counts that should be used as their input, a non-zero number in their control_group column, and zero in their extract_group column. Extract sequencing counts should have the same number in their input_group column as the input data that should be used for this particular set of extract data, the same number in their control_group column as the control data that should be used to subtract control enrichments from extract enrichments, and their own value denoting to which extract_group this row's sequencing counts belong.
Here are two examples.
Consider the following schematic for an experiment in which we want to identify changes in protein enrichment in Condition B vs. Condition A. We did three biological replicates.
Each extract ChIP-seq dataset has its corresponding input DNA dataset, and the same is true for each control ChIP-seq dataset. The sample information csv's input, control, and extract_group columns should look like the following (here we're only including the necessary columns of an info sheet for the purposes of explaining the groupings, and we leave it to the user to fill in the rest for their own purposes, in their own experimental designs):
| genotype | sample | rep | input_group | control_group | extract_group |
|---|---|---|---|---|---|
| condition_A | input | 1 | 1 | 0 | 0 |
| condition_A | input | 2 | 1 | 0 | 0 |
| condition_A | input | 3 | 1 | 0 | 0 |
| condition_A | control | 1 | 1 | 1 | 0 |
| condition_A | control | 2 | 1 | 1 | 0 |
| condition_A | control | 3 | 1 | 1 | 0 |
| condition_A | input | 1 | 2 | 0 | 0 |
| condition_A | input | 2 | 2 | 0 | 0 |
| condition_A | input | 3 | 2 | 0 | 0 |
| condition_A | extract | 1 | 2 | 1 | 1 |
| condition_A | extract | 2 | 2 | 1 | 1 |
| condition_A | extract | 3 | 2 | 1 | 1 |
| condition_B | input | 1 | 3 | 0 | 0 |
| condition_B | input | 2 | 3 | 0 | 0 |
| condition_B | input | 3 | 3 | 0 | 0 |
| condition_B | control | 1 | 3 | 2 | 0 |
| condition_B | control | 2 | 3 | 2 | 0 |
| condition_B | control | 3 | 3 | 2 | 0 |
| condition_B | input | 1 | 4 | 0 | 0 |
| condition_B | input | 2 | 4 | 0 | 0 |
| condition_B | input | 3 | 4 | 0 | 0 |
| condition_B | extract | 1 | 4 | 2 | 2 |
| condition_B | extract | 2 | 4 | 2 | 2 |
| condition_B | extract | 3 | 4 | 2 | 2 |
Note that all three biological replicates for a given group share group identifiers!
As a bonus, let's look at what an example "parse" and "geno_contrasts" section
of your config.toml file might look like.
We care about the enrichments of the ALFA-tagged protein over control (Gamma parameter), and we also want to record the non-control-subtracted enrichments, just to be safe (non_sub parameter). And of course, the whole point was to compare protein enrichment between conditions B and A. Let's say that we're looking for extreme enrichments, so anything with 4-fold (2.0 on the log2 scale) enrichment is what we want to consider "enriched". We'll also say that any 2-fold difference in enrichment in condition B vs. condition A is something we're interested in, which will define our choice of "threshold" in the "geno_contrasts" section.
Here's what the "parse" and "geno_contrasts" sections of config.toml should look like.
[parse]
# Tell Enricherator to gather Gamma and non_sub parameters from the draws-1.csv file.
vars = ["Gamma", "non_sub"]
# Tell Enricherator the threshold for calculating evidence ratios for enrichment.
# NOTE: this is on the log2 scale, so the value of 2.0 below is 2^2, so 4-fold enrichment
threshold = 2.0
# Set the region of practical equivalence (ROPE) within which there is to be considered NO enrichment.
rope = 0.5
[geno_contrasts]
# Get the difference in control-subtracted enrichment between conditions B and A
# note these names must match the names in the sample sheet csv
Gamma = ["condition_B - condition_A"]
# Same as for Gamma contrast above, but for the non-control-subtracted enrichments
non_sub = ["condition_B - condition_A"]
threshold = 1.0
rope = 0.5Now consider the following schematic representing a typical IPOD experiment. Here things are a little different, in that the IPOD and RNAP ChIP-seq samples came from the same original tube, and thus share input DNA. We did two biological replicates this time.
Here's what the salient columns of the sample info file would look like:
| genotype | sample | rep | input_group | control_group | extract_group |
|---|---|---|---|---|---|
| condition_A | input | 1 | 1 | 0 | 0 |
| condition_A | input | 2 | 1 | 0 | 0 |
| condition_A | control | 1 | 1 | 1 | 0 |
| condition_A | control | 2 | 1 | 1 | 0 |
| condition_A | extract | 1 | 1 | 1 | 1 |
| condition_A | extract | 2 | 1 | 1 | 1 |
| condition_B | input | 1 | 2 | 0 | 0 |
| condition_B | input | 2 | 2 | 0 | 0 |
| condition_B | control | 1 | 2 | 2 | 0 |
| condition_B | control | 2 | 2 | 2 | 0 |
| condition_B | extract | 1 | 2 | 2 | 2 |
| condition_B | extract | 2 | 2 | 2 | 2 |
Note how the identifiers in the grouping columns point the conrol and extract data within a given condition to the same input DNA.
Footnotes
-
Schroeder, Jeremy W., and P. Lydia Freddolino. 2024. “Enricherator: A Bayesian Method for Inferring Regularized Genome-Wide Enrichments from Sequencing Count Data.” Journal of Molecular Biology, April, 168567. ↩
-
Schroeder, Jeremy W., Rebecca L. Hurto, Justin R. Randall, Katherine J. Wozniak, Taylor A. Timko, Taylor M. Nye, Jue D. Wang, Peter L. Freddolino, and Lyle A. Simmons. 2023. “RNase H Genes Cause Distinct Impacts on RNA:DNA Hybrid Formation and Mutagenesis Genome Wide.” Science Advances 9 (30): eadi5945. ↩
-
Freddolino, P.L., Amemiya, H.M., Goss, T.J., and Tavazoie, S. (2021). Dynamic landscape of protein occupancy across the Escherichia coli chromosome. PLoS Biol. 19, e3001306. ↩
-
Kass, Robert E., and Adrian E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95. ↩

