Merge branch 'dev' of github.com:databio/MIRA

j-lawson · j-lawson · commit e1322a2f66f0 · 2019-10-24T16:11:22.000-04:00
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -2,11 +2,14 @@ Package: MIRA
 Version: 1.7.0
 Date: 2018-10-24
 Title: Methylation-Based Inference of Regulatory Activity
-Description: MIRA measures the degree of "dip" in methylation
-    level surrounding a regulatory site of interest, such as a
-    transcription factor binding site, for instances of that
-    type of site across the genome which can then be used to 
-    infer regulatory activity.
+Description: DNA methylation contains information about the regulatory state of the cell.
+	MIRA aggregates genome-scale DNA methylation data into a DNA methylation profile
+	for a given region set with shared biological annotation. Using this profile,
+	MIRA infers and scores the collective regulatory activity for the region set.
+	MIRA facilitates regulatory analysis in situations where classical regulatory
+	assays would be difficult and allows public sources of region sets to be
+	leveraged for novel insight into the regulatory state of DNA methylation
+	datasets.
 Authors@R: c(person("Nathan", "Sheffield", role=c("aut")),
 	person("John", "Lawson", email="jtl2hk@virginia.edu", role=c("aut", "cre")),
 	person("Christoph", "Bock", role=c("ctb")))
diff --git a/R/MIRA.R b/R/MIRA.R
@@ -202,6 +202,21 @@ aggregateMethylInt <- function(BSDT, GRList, binNum = 11, minBaseCovPerBin = 500
         stop("GRList should be a GRangesList or a list of data.tables")
     }
     
+    # warning user if not all regions in a region set are the same length
+    # which could make the resulting MIRA profile hard to interpret
+    hasUniformRegionSizes = all(sapply(GRDTList, function(x) 
+                                    diff(range(x$end - x$start)) < 0.000001)) 
+    if (!hasUniformRegionSizes) {
+        warning(cleanws("For at least one of the region sets, not all regions
+                        in the set were the same length. It is recommended 
+                        that all regions in a given region
+                        set be the same length so the final profile will be 
+                        easier to interpret. Regions can be resized outside MIRA
+                        to meet this recommendation."))
+    }
+    
+    
+    
     
     # adding a methylProp column if it is not already in the BSDT
     if (!("methylProp" %in% names(BSDT))) {
diff --git a/vignettes/GettingStarted.Rmd b/vignettes/GettingStarted.Rmd
@@ -10,9 +10,10 @@ vignette: >
   %\VignetteEncoding{UTF-8}
 ---
 
-This vignette gives a broad overview of MIRA. For a more realistic example workflow, see the vignette "Applying MIRA to a Biological Question."
+# The MIRA Bioconductor package
+
+*This vignette gives a broad overview of MIRA. You can find a realistic example workflow, in "Applying MIRA to a Biological Question."*
 
-# MIRA Overview
 MIRA (Methylation-based Inference of Regulatory Activity) is an R package that infers regulatory activity from DNA methylation data. It does this by aggregating DNA methylation data from a set of regions across the genome, producing a single summary profile of DNA methylation for those regions. MIRA then uses this profile to produce a score (the MIRA score) that infers the level of regulatory activity on the basis of the shape of the DNA methylation profile. The region set is provided by the user and should contain regions that correspond to a shared feature, such as binding of a particular transcription factor or DNase hypersensitivity sites. The concept of MIRA relies on the observation that DNA methylation tends to be lower in regions where transcription factors are bound. Since DNA methylation will generally be lower in active regions, the shape of the MIRA profile and the associated score can be used as a metric to compare regulatory activity in different samples and conditions. MIRA thus allows one to predict transcription factor activity data given only DNA methylation data as input. MIRA works with genome-scale, single-nucleotide-resolution methylation data, such as reduced representation bisulfite sequencing (RRBS) or whole genome bisulfite sequencing (WGBS) data. MIRA overcomes sparsity in DNA methylation data by aggregating across many regions. Below are examples of methylation profiles and associated MIRA scores for two (contrived) samples. This vignette will demonstrate how to obtain a methylation profile and MIRA score from the starting inputs of DNA methylation data and a region set.
 
 ```{r, echo=FALSE, warning=FALSE}
@@ -29,7 +30,7 @@ exScores <- cbind(exScores, sampleType)
 exScores
 ```
 
-# Required Inputs
+## Required inputs
 
 You need 2 things to run MIRA:
 
@@ -38,7 +39,7 @@ You need 2 things to run MIRA:
 
 Let's describe each one in more detail:
 
-## DNA Methylation Data
+### DNA methylation data
 
 MIRA requires DNA methylation data after methylation calling. For a given genomic coordinate (the location of the C in a CpG), MIRA needs the methylation level (0 to 1) or data that can be used to calculate this: counts of methylated reads and total reads. The total number of reads can be used for screening purposes. This data should be represented as a `data.table` for each sample, which we call a BSDT (Bisulfite data.table). The BSDT will have these columns: `chr`, `start` (the coordinate of the C of the CpG) and the `methylProp` column with methylation level at each cytosine. Alternatively, the `methylProp` column may be generated from `methylCount` (number of methylated reads) and `coverage` (total number of reads covering this site) columns where `methylProp` = `methylCount`/`coverage` and these columns may be included in the `data.table` in addition to the `methylProp` column. When running MIRA on multiple samples, it is recommended to make each sample an item of a named list, with sample names as the names for the list. Since some existing R packages for DNA methylation use different formats, we include a format conversion function that can be used to convert `SummarizedExperiment`-based objects like you would obtain from the `bsseq`, `methylPipe`, and `BiSeq` packages to the necessary format for MIRA (`SummarizedExperimentToDataTable` function). A `BSseq` object is an acceptable input to MIRA and will be converted to the right format internally but for the sake of parallelization it is recommended that `BSseq` objects be converted to the right format outside the MIRA function so that MIRA can be run on each sample in parallel. Here is an example of a `data.table` in the right format for input to MIRA:
 
@@ -47,22 +48,22 @@ data("exampleBSDT", package="MIRA")
 head(exampleBSDT)
 ```
 
-## Region Sets
+### Region sets
 A region set is a GRanges object containing genomic regions that share a biological annotation. For example, it could be ChIP peaks for a transcription factor. Many types of region sets may be used with MIRA, including ChIP regions for transcription factors or histone modifications, promoters for a set of related genes, sites of motif matches, or DNase hypersensitivity sites. Many such region sets may be found in online repositories and we have pulled together some major sources at http://databio.org/regiondb. At the end of this vignette, we show how to load a database of regions with a few lines of code using the `LOLA` package. You may also want to check out [the `AnnotationHub` Bioconductor package](https://www.bioconductor.org/packages/release/bioc/html/AnnotationHub.html) which gives access to many region sets in an R-friendly format. For use in MIRA, each region set should be a GRanges object and multiple region sets may be passed to MIRA as a GRangesList with each list element being a region set. Here is an example of a region set, which we will use in this vignette:
 
 ```{r, message=FALSE}
 data("exampleRegionSet", package="MIRA")
 head(exampleRegionSet)
 ```
 
-# Analysis Workflow
+## Analysis workflow
 The general workflow is as follows:  
 1. Data inputs: start with single-nucleotide resolution methylation data and one or more sets of genomic regions, as described above.
 2. Expand the regions sizes so that MIRA will be able to get a broad methylation profile surrounding your feature of interest. All regions should be expanded to the same final size.
 3. Aggregate methylation data across regions to get a MIRA profile.
 4. Calculate MIRA score based on shape of MIRA profile.
 
-## The Input
+### The input
 First let's load the necessary libraries and example data. `exampleBSDT` is a data.table containing genomic location, number of reads methylated (`methylCount`), and total read number (`coverage`) from bisulfite sequencing of a lymphoblastoid cell line. `exampleRegionSet` is a GRanges object with regions that share a biological annotation, in this case Nrf1 binding in a different lymphoblastoid cell line.
 
 ```{r, message=FALSE}
@@ -85,7 +86,7 @@ Since our input methylation data did not have a column for proportion of methyla
 BSDTList <- addMethPropCol(BSDTList)
 ```
 
-## Expand Your Regions
+### Expand your regions
 
 A short but important step. MIRA scores are based on the difference between methylation surrounding the regions versus the methylation at the region itself, so we need to expand our regions to include surrounding bases. Let's use the `resize` function from the `GenomicRanges` package to increase the size of our regions and make them all the same size. It is recommended to make all regions the same size so the final methylation profile will be easier to interpret. The average size of our regions is currently about 500:
 
@@ -107,7 +108,7 @@ exampleRegionSetGRL <- GRangesList(exampleRegionSet)
 names(exampleRegionSetGRL) <- "lymphoblastoid_NRF1"
 ```
 
-## Aggregation of Methylation across Regions
+### Aggregating methylation across regions
 Next we aggregate methylation across regions to get a summary methylation profile with the `aggregateMethyl` function. MIRA divides each region into bins, aggregates methylation levels for cytosines within a given bin for each region individually, and then aggregates matching bins over all the regions (all 1st bins together, 2nd bins together, etc.). A couple important parameters: The binNum argument determines how many approximately equally sized bins each region is split into. This could affect the resolution or noisiness of the MIRA profile because using more bins will result in smaller bin sizes (potentially increasing resolution) but also less reads per bin (potentially increasing noise). The minBaseCovPerBin argument in aggregateMethyl is used to screen out region sets that have any bins with less than a minimum coverage. Here we use the default (500). Let's aggregate the methylation and then view the MIRA profile.
 
 ```{r Aggregate_methylation, message=FALSE, warning=FALSE}
@@ -124,7 +125,7 @@ bigBinDT = cbind(bigBinDT, sampleName)
 plotMIRAProfiles(binnedRegDT=bigBinDT)
 ```
 
-## Calculating the MIRA Score
+### Calculating the MIRA score
 To calculate MIRA scores based on the MIRA profiles, we will apply the scoring function, `calcMIRAScore`, to the `data.table` that contains the methylation aggregated in bins. `calcMIRAScore` calculates a score for each group of bins corresponding to a sample and region set, based on the degree of the dip in the profile. With MIRA's default scoring method (`logratio`), `calcMIRAScore` will take the log of the ratio of the outside edges of the dip (identified by `calcMIRAScore`) to the center of the dip. Higher MIRA scores are associated with deeper dips. A flat MIRA profile would have a score of 0.
 
 ```{r Scoring, warning=FALSE}
@@ -134,13 +135,12 @@ sampleScores <- calcMIRAScore(bigBinDT,
 head(sampleScores)
 ```
 
-## A Note on Annotation
-For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (`sampleType` column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See "Applying MIRA to a Biological Question" vignette for a workflow including annotation.
+*A note on annotation*. For real uses of MIRA, samples and region sets should be annotated. In order to save memory, it is not recommended that sample name or sample type be included as columns in the BSDT before the aggregation step. An annotation file that matches sample name with sample type (`sampleType` column) can be used to add sample type information to the data.table after aggregating methylation. A sampleType column is not required for the main functions but is used for the plotting functions. See "Applying MIRA to a Biological Question" vignette for a workflow including annotation.
 
-# Interpreting the Results
+## Interpreting the results
 Regulatory information can be inferred from the MIRA scores and profiles but interpretation depends on the type of region set and samples used. The basic assumption is that deeper dips and higher MIRA scores would be associated with generally higher activity at the regions that were tested, which might also inform you about the activity of the regulatory feature that defines the region set (eg the activity of a transcription factor if using ChIP-seq regions). However, MIRA does not directly tell you the activity of the regulatory feature and, in some cases, higher MIRA scores will not be associated with higher activity of the regulatory feature. For example, samples with higher MIRA scores for a transcription factor binding region set could have more activity of that transcription factor but there could be other causes like increased activity of a different transcription factor that binds to many of the same regions. Additionally, it is possible that the samples with higher scores generally had more similar chromatin states to the samples from which the region set was derived than samples with lower scores did (eg when using MIRA on samples from multiple tissue/cell types and the region set was derived from the same tissue/cell type as some but not all of the samples). The general interpretation rule is that a higher MIRA score means more activity but we encourage the user to think carefully about what biological question they are actually addressing based on the region sets and samples used. For more on interpreting the results, see the "Applying MIRA to a Biological Question" vignette.
 
-# Bonus: Loading Region Sets with LOLA
+## Bonus: Loading region sets with LOLA
 Here is one simple way to get region sets to use with MIRA. First, download the LOLA Core Database (actual files, not the cached version) from [here](http://cloud.databio.org/regiondb/) (this might take a while). 
 Next, let's load the `LOLA` package and use it to load some of the region sets into R!