Skip to content

Commit e1322a2

Browse files
committed
Merge branch 'dev' of github.com:databio/MIRA
2 parents e5e8d1d + cb48f63 commit e1322a2

File tree

3 files changed

+37
-19
lines changed

3 files changed

+37
-19
lines changed

DESCRIPTION

+8-5
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,14 @@ Package: MIRA
22
Version: 1.7.0
33
Date: 2018-10-24
44
Title: Methylation-Based Inference of Regulatory Activity
5-
Description: MIRA measures the degree of "dip" in methylation
6-
level surrounding a regulatory site of interest, such as a
7-
transcription factor binding site, for instances of that
8-
type of site across the genome which can then be used to
9-
infer regulatory activity.
5+
Description: DNA methylation contains information about the regulatory state of the cell.
6+
MIRA aggregates genome-scale DNA methylation data into a DNA methylation profile
7+
for a given region set with shared biological annotation. Using this profile,
8+
MIRA infers and scores the collective regulatory activity for the region set.
9+
MIRA facilitates regulatory analysis in situations where classical regulatory
10+
assays would be difficult and allows public sources of region sets to be
11+
leveraged for novel insight into the regulatory state of DNA methylation
12+
datasets.
1013
Authors@R: c(person("Nathan", "Sheffield", role=c("aut")),
1114
person("John", "Lawson", email="[email protected]", role=c("aut", "cre")),
1215
person("Christoph", "Bock", role=c("ctb")))

R/MIRA.R

+15
Original file line numberDiff line numberDiff line change
@@ -202,6 +202,21 @@ aggregateMethylInt <- function(BSDT, GRList, binNum = 11, minBaseCovPerBin = 500
202202
stop("GRList should be a GRangesList or a list of data.tables")
203203
}
204204

205+
# warning user if not all regions in a region set are the same length
206+
# which could make the resulting MIRA profile hard to interpret
207+
hasUniformRegionSizes = all(sapply(GRDTList, function(x)
208+
diff(range(x$end - x$start)) < 0.000001))
209+
if (!hasUniformRegionSizes) {
210+
warning(cleanws("For at least one of the region sets, not all regions
211+
in the set were the same length. It is recommended
212+
that all regions in a given region
213+
set be the same length so the final profile will be
214+
easier to interpret. Regions can be resized outside MIRA
215+
to meet this recommendation."))
216+
}
217+
218+
219+
205220

206221
# adding a methylProp column if it is not already in the BSDT
207222
if (!("methylProp" %in% names(BSDT))) {

vignettes/GettingStarted.Rmd

+14-14
Original file line numberDiff line numberDiff line change
@@ -10,9 +10,10 @@ vignette: >
1010
%\VignetteEncoding{UTF-8}
1111
---
1212

13-
This vignette gives a broad overview of MIRA. For a more realistic example workflow, see the vignette "Applying MIRA to a Biological Question."
13+
# The MIRA Bioconductor package
14+
15+
*This vignette gives a broad overview of MIRA. You can find a realistic example workflow, in "Applying MIRA to a Biological Question."*
1416

15-
# MIRA Overview
1617
MIRA (Methylation-based Inference of Regulatory Activity) is an R package that infers regulatory activity from DNA methylation data. It does this by aggregating DNA methylation data from a set of regions across the genome, producing a single summary profile of DNA methylation for those regions. MIRA then uses this profile to produce a score (the MIRA score) that infers the level of regulatory activity on the basis of the shape of the DNA methylation profile. The region set is provided by the user and should contain regions that correspond to a shared feature, such as binding of a particular transcription factor or DNase hypersensitivity sites. The concept of MIRA relies on the observation that DNA methylation tends to be lower in regions where transcription factors are bound. Since DNA methylation will generally be lower in active regions, the shape of the MIRA profile and the associated score can be used as a metric to compare regulatory activity in different samples and conditions. MIRA thus allows one to predict transcription factor activity data given only DNA methylation data as input. MIRA works with genome-scale, single-nucleotide-resolution methylation data, such as reduced representation bisulfite sequencing (RRBS) or whole genome bisulfite sequencing (WGBS) data. MIRA overcomes sparsity in DNA methylation data by aggregating across many regions. Below are examples of methylation profiles and associated MIRA scores for two (contrived) samples. This vignette will demonstrate how to obtain a methylation profile and MIRA score from the starting inputs of DNA methylation data and a region set.
1718

1819
```{r, echo=FALSE, warning=FALSE}
@@ -29,7 +30,7 @@ exScores <- cbind(exScores, sampleType)
2930
exScores
3031
```
3132

32-
# Required Inputs
33+
## Required inputs
3334

3435
You need 2 things to run MIRA:
3536

@@ -38,7 +39,7 @@ You need 2 things to run MIRA:
3839

3940
Let's describe each one in more detail:
4041

41-
## DNA Methylation Data
42+
### DNA methylation data
4243

4344
MIRA requires DNA methylation data after methylation calling. For a given genomic coordinate (the location of the C in a CpG), MIRA needs the methylation level (0 to 1) or data that can be used to calculate this: counts of methylated reads and total reads. The total number of reads can be used for screening purposes. This data should be represented as a `data.table` for each sample, which we call a BSDT (Bisulfite data.table). The BSDT will have these columns: `chr`, `start` (the coordinate of the C of the CpG) and the `methylProp` column with methylation level at each cytosine. Alternatively, the `methylProp` column may be generated from `methylCount` (number of methylated reads) and `coverage` (total number of reads covering this site) columns where `methylProp` = `methylCount`/`coverage` and these columns may be included in the `data.table` in addition to the `methylProp` column. When running MIRA on multiple samples, it is recommended to make each sample an item of a named list, with sample names as the names for the list. Since some existing R packages for DNA methylation use different formats, we include a format conversion function that can be used to convert `SummarizedExperiment`-based objects like you would obtain from the `bsseq`, `methylPipe`, and `BiSeq` packages to the necessary format for MIRA (`SummarizedExperimentToDataTable` function). A `BSseq` object is an acceptable input to MIRA and will be converted to the right format internally but for the sake of parallelization it is recommended that `BSseq` objects be converted to the right format outside the MIRA function so that MIRA can be run on each sample in parallel. Here is an example of a `data.table` in the right format for input to MIRA:
4445

@@ -47,22 +48,22 @@ data("exampleBSDT", package="MIRA")
4748
head(exampleBSDT)
4849
```
4950

50-
## Region Sets
51+
### Region sets
5152
A region set is a GRanges object containing genomic regions that share a biological annotation. For example, it could be ChIP peaks for a transcription factor. Many types of region sets may be used with MIRA, including ChIP regions for transcription factors or histone modifications, promoters for a set of related genes, sites of motif matches, or DNase hypersensitivity sites. Many such region sets may be found in online repositories and we have pulled together some major sources at http://databio.org/regiondb. At the end of this vignette, we show how to load a database of regions with a few lines of code using the `LOLA` package. You may also want to check out [the `AnnotationHub` Bioconductor package](https://www.bioconductor.org/packages/release/bioc/html/AnnotationHub.html) which gives access to many region sets in an R-friendly format. For use in MIRA, each region set should be a GRanges object and multiple region sets may be passed to MIRA as a GRangesList with each list element being a region set. Here is an example of a region set, which we will use in this vignette:
5253

5354
```{r, message=FALSE}
5455
data("exampleRegionSet", package="MIRA")
5556
head(exampleRegionSet)
5657
```
5758

58-
# Analysis Workflow
59+
## Analysis workflow
5960
The general workflow is as follows:
6061
1. Data inputs: start with single-nucleotide resolution methylation data and one or more sets of genomic regions, as described above.
6162
2. Expand the regions sizes so that MIRA will be able to get a broad methylation profile surrounding your feature of interest. All regions should be expanded to the same final size.
6263
3. Aggregate methylation data across regions to get a MIRA profile.
6364
4. Calculate MIRA score based on shape of MIRA profile.
6465

65-
## The Input
66+
### The input
6667
First let's load the necessary libraries and example data. `exampleBSDT` is a data.table containing genomic location, number of reads methylated (`methylCount`), and total read number (`coverage`) from bisulfite sequencing of a lymphoblastoid cell line. `exampleRegionSet` is a GRanges object with regions that share a biological annotation, in this case Nrf1 binding in a different lymphoblastoid cell line.
6768

6869
```{r, message=FALSE}
@@ -85,7 +86,7 @@ Since our input methylation data did not have a column for proportion of methyla
8586
BSDTList <- addMethPropCol(BSDTList)
8687
```
8788

88-
## Expand Your Regions
89+
### Expand your regions
8990

9091
A short but important step. MIRA scores are based on the difference between methylation surrounding the regions versus the methylation at the region itself, so we need to expand our regions to include surrounding bases. Let's use the `resize` function from the `GenomicRanges` package to increase the size of our regions and make them all the same size. It is recommended to make all regions the same size so the final methylation profile will be easier to interpret. The average size of our regions is currently about 500:
9192

@@ -107,7 +108,7 @@ exampleRegionSetGRL <- GRangesList(exampleRegionSet)
107108
names(exampleRegionSetGRL) <- "lymphoblastoid_NRF1"
108109
```
109110

110-
## Aggregation of Methylation across Regions
111+
### Aggregating methylation across regions
111112
Next we aggregate methylation across regions to get a summary methylation profile with the `aggregateMethyl` function. MIRA divides each region into bins, aggregates methylation levels for cytosines within a given bin for each region individually, and then aggregates matching bins over all the regions (all 1st bins together, 2nd bins together, etc.). A couple important parameters: The binNum argument determines how many approximately equally sized bins each region is split into. This could affect the resolution or noisiness of the MIRA profile because using more bins will result in smaller bin sizes (potentially increasing resolution) but also less reads per bin (potentially increasing noise). The minBaseCovPerBin argument in aggregateMethyl is used to screen out region sets that have any bins with less than a minimum coverage. Here we use the default (500). Let's aggregate the methylation and then view the MIRA profile.
112113

113114
```{r Aggregate_methylation, message=FALSE, warning=FALSE}
@@ -124,7 +125,7 @@ bigBinDT = cbind(bigBinDT, sampleName)
124125
plotMIRAProfiles(binnedRegDT=bigBinDT)
125126
```
126127

127-
## Calculating the MIRA Score
128+
### Calculating the MIRA score
128129
To calculate MIRA scores based on the MIRA profiles, we will apply the scoring function, `calcMIRAScore`, to the `data.table` that contains the methylation aggregated in bins. `calcMIRAScore` calculates a score for each group of bins corresponding to a sample and region set, based on the degree of the dip in the profile. With MIRA's default scoring method (`logratio`), `calcMIRAScore` will take the log of the ratio of the outside edges of the dip (identified by `calcMIRAScore`) to the center of the dip. Higher MIRA scores are associated with deeper dips. A flat MIRA profile would have a score of 0.
129130

130131
```{r Scoring, warning=FALSE}
@@ -134,13 +135,12 @@ sampleScores <- calcMIRAScore(bigBinDT,
134135
head(sampleScores)
135136
```
136137

137-
## A Note on Annotation
138-
For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (`sampleType` column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See "Applying MIRA to a Biological Question" vignette for a workflow including annotation.
138+
*A note on annotation*. For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (`sampleType` column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See "Applying MIRA to a Biological Question" vignette for a workflow including annotation.
139139

140-
# Interpreting the Results
140+
## Interpreting the results
141141
Regulatory information can be inferred from the MIRA scores and profiles but interpretation depends on the type of region set and samples used. The basic assumption is that deeper dips and higher MIRA scores would be associated with generally higher activity at the regions that were tested, which might also inform you about the activity of the regulatory feature that defines the region set (eg the activity of a transcription factor if using ChIP-seq regions). However, MIRA does not directly tell you the activity of the regulatory feature and, in some cases, higher MIRA scores will not be associated with higher activity of the regulatory feature. For example, samples with higher MIRA scores for a transcription factor binding region set could have more activity of that transcription factor but there could be other causes like increased activity of a different transcription factor that binds to many of the same regions. Additionally, it is possible that the samples with higher scores generally had more similar chromatin states to the samples from which the region set was derived than samples with lower scores did (eg when using MIRA on samples from multiple tissue/cell types and the region set was derived from the same tissue/cell type as some but not all of the samples). The general interpretation rule is that a higher MIRA score means more activity but we encourage the user to think carefully about what biological question they are actually addressing based on the region sets and samples used. For more on interpreting the results, see the "Applying MIRA to a Biological Question" vignette.
142142

143-
# Bonus: Loading Region Sets with LOLA
143+
## Bonus: Loading region sets with LOLA
144144
Here is one simple way to get region sets to use with MIRA. First, download the LOLA Core Database (actual files, not the cached version) from [here](http://cloud.databio.org/regiondb/) (this might take a while).
145145
Next, let's load the `LOLA` package and use it to load some of the region sets into R!
146146

0 commit comments

Comments
 (0)