-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSearch_Visualize_Motif.Rmd
297 lines (263 loc) · 15.1 KB
/
Search_Visualize_Motif.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
title: "ATAC-Seq Data Analysis with Mouse Tissue Data - Search and Visualize Motifs"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
html_document:
code_folding: show
---
# Find motifs (set of related, short sequences preferred by a given transcription factor)
* Why identify motifs in ATAC-seq?
+ Motif analysis allows us to predict the potential binding sites for transcription factors (TFs) in the accessible chromatin regions. TFs play a crucial role in gene regulation by binding to specific DNA motifs. Identifying these motifs can help us understand which TFs are likely to be active in a given cell type or condition.
+ ATAC-seq data can reveal regions of open chromatin associated with regulatory elements such as enhancers and promoters. Motif analysis can help in pinpointing the specific motifs that are enriched in these regions. This information can be used to infer the regulatory networks and pathways active in the analyzed samples.
+ Different cell types may have distinct chromatin accessibility patterns and regulatory elements. By analyzing motifs, we can identify cell type-specific motifs that are enriched in open chromatin regions. This can aid in characterizing cell identity and differentiation states.
+ Motif analysis can provide functional annotations to the open chromatin regions. For example, if we find motifs associated with a particular TF known to regulate immune response genes, it suggests that these regions may be involved in immune-related functions.
+ Combining ATAC-seq motif analysis with RNA-seq data can provide a more comprehensive view of gene regulation. We can link the presence of specific motifs in open chromatin regions to changes in gene expression, helping to elucidate the regulatory mechanisms.
* Recommended open-source tool: [Fast Motif Matching in R](https://bioconductor.org/packages/release/bioc/vignettes/motifmatchr/inst/doc/motifmatchr.html)
* Recommended open-source motif database:
[JASPAR](https://jaspar.elixir.no)
[MotifDB](https://bioconductor.org/packages/release/bioc/vignettes/MotifDb/inst/doc/MotifDb.html)
## Structures and operations of motif databases
### MotifDb
```{r}
##----We load MotifDb----
library(MotifDb)
MotifDb
# MotifDb object of length 12657
# | Created from downloaded public sources, last update: 2022-Mar-04
class(MotifDb)
# [1] "MotifList"
# attr(,"package")
# [1] "MotifDb"
length(MotifDb) # [1] 12657
motif_names <- names(MotifDb)
length(motif_names) # [1] 12657
motif_names |> tail()
# [1] "Mmusculus-UniPROBE-Zfp691.UP00095"
# [2] "Mmusculus-UniPROBE-Zfp740.UP00022"
# [3] "Mmusculus-UniPROBE-Zic1.UP00102"
# [4] "Mmusculus-UniPROBE-Zic2.UP00057"
# [5] "Mmusculus-UniPROBE-Zic3.UP00006"
# [6] "Mmusculus-UniPROBE-Zscan4.UP00026"
##----We retrieve the last MotifList----
MotifDb[12657]
# MotifDb object of length 1
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 1 position frequency matrices from 1 source:
# | UniPROBE: 1
# | 1 organism/s
# | Mmusculus: 1
# Mmusculus-UniPROBE-Zscan4.UP00026
##----We retrieve the position probability matrix (PPM); each column represents the position of a motif----
MotifDb[[12657]]
# 1 2 3 4 5 6
# A 0.2039268 0.3603414 0.2511950 0.4871860 0.12283797 0.020466874
# C 0.1572597 0.2652161 0.2988057 0.1224721 0.05106196 0.009991709
# G 0.3070709 0.1473273 0.2410505 0.2143623 0.05577333 0.965816157
# T 0.3317426 0.2271152 0.2089488 0.1759796 0.77032674 0.003725260
# 7 8 9 10 11 12
# A 0.005886533 0.030656083 0.002078162 0.960642728 0.003725260 0.77032674
# C 0.026662635 0.002166648 0.965099108 0.006808103 0.965816157 0.05577333
# G 0.006808103 0.965099108 0.002166648 0.026662635 0.009991709 0.05106196
# T 0.960642728 0.002078162 0.030656083 0.005886533 0.020466874 0.12283797
# 13 14 15 16 17
# A 0.04480760 0.75131969 0.36274223 0.4366353 0.3039305
# C 0.38230731 0.04741718 0.22889829 0.1114793 0.2853735
# G 0.04292044 0.04448209 0.08537279 0.2172844 0.1958724
# T 0.52996466 0.15678104 0.32298669 0.2346011 0.2148237
colSums(MotifDb[[12657]])
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##----We retrieve the motif metadata information----
values(MotifDb) # DataFrame with 12657 rows and 15 columns
# ! geneSymbol shows the gene name of transcription factor associated with each motif
values(MotifDb)[12657, ]
# DataFrame with 1 row and 15 columns
# providerName providerId dataSource geneSymbol geneId geneIdType proteinId proteinIdType organism sequenceCount bindingSequence bindingDomain tfFamily experimentType pubmedID
# <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character>
# Mmusculus-UniPROBE-Zscan4.UP00026 SCI09/Zscan4_pwm_pri.. UP00026 UniPROBE Zscan4 NA NA NA NA Mmusculus NA NA ZnF_C2H2 NA protein binding micr.. 19443739
##----We retrieve the MotifList by restricting the metadata information----
# Only select MotifList whose metadata includes FOX
(fox <- query(object = MotifDb,
andStrings = c("FOX"))) # CTCF must be found in the metadata
# MotifDb object of length 353
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 353 position frequency matrices from 17 sources:
# | FlyFactorSurvey: 1
# | HOCOMOCOv10: 47
# | HOCOMOCOv11-core-A: 8
# | HOCOMOCOv11-core-B: 3
# | HOCOMOCOv11-core-C: 5
# | HOCOMOCOv11-secondary-B: 1
# | HOCOMOCOv11-secondary-D: 11
# | HOMER: 14
# | hPDI: 4
# | JASPAR_2014: 14
# | JASPAR_CORE: 10
# | jaspar2016: 40
# | jaspar2018: 43
# | jaspar2022: 68
# | jolma2013: 51
# | SwissRegulon: 28
# | UniPROBE: 5
# | 10 organism/s
# | Hsapiens: 224
# | Mmusculus: 57
# | NA: 20
# | Scerevisiae: 12
# | Rnorvegicus: 11
# | Dmelanogaster: 7
# | other: 22
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197
# Hsapiens-HOCOMOCOv10-FOXA1_HUMAN.H10MO.A
# Hsapiens-HOCOMOCOv10-FOXA2_HUMAN.H10MO.A
# Hsapiens-HOCOMOCOv10-FOXA3_HUMAN.H10MO.C
# Hsapiens-HOCOMOCOv10-FOXB1_HUMAN.H10MO.D
# ...
# Mmusculus-UniPROBE-Foxa2.UP00073
# Mmusculus-UniPROBE-Foxj1.UP00041
# Mmusculus-UniPROBE-Foxj3.UP00039
# Mmusculus-UniPROBE-Foxk1.UP00025
# Mmusculus-UniPROBE-Foxl1.UP00061
values(fox)[1, ]
# DataFrame with 1 row and 15 columns
# providerName providerId dataSource geneSymbol geneId geneIdType proteinId proteinIdType organism sequenceCount bindingSequence bindingDomain tfFamily
# <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character>
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197 foxo_SANGER_10_FBgn0.. FBgn0038197 FlyFactorSurvey foxo 41709 ENTREZ Q95V55 UNIPROT Dmelanogaster 20 NA Fork_head NA
# experimentType pubmedID
# <character> <character>
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197 bacterial 1-hybrid, .. NA
# Only select MotifList whose metadata includes FOX, hsapiens, jaspar2022
(fox_2 <- query(object = MotifDb,
andStrings = c("FOX", "hsapiens", "jaspar2022")))
# MotifDb object of length 34
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 34 position frequency matrices from 1 source:
# | jaspar2022: 34
# | 1 organism/s
# | Hsapiens: 34
# Hsapiens-jaspar2022-FOXF2-MA0030.1
# Hsapiens-jaspar2022-FOXD1-MA0031.1
# Hsapiens-jaspar2022-FOXH1-MA0479.1
# Hsapiens-jaspar2022-FOXP2-MA0593.1
# Hsapiens-jaspar2022-FOXG1-MA0613.1
# ...
# Hsapiens-jaspar2022-FOXO1::ELK3-MA1955.1
# Hsapiens-jaspar2022-FOXO1::FLI1-MA1956.1
# Hsapiens-jaspar2022-FOXD3-MA0041.2
# Hsapiens-jaspar2022-FOXD2-MA0847.3
# Hsapiens-jaspar2022-FOXE1-MA1487.2
values(fox_2)[1, ]
# DataFrame with 1 row and 15 columns
# providerName providerId dataSource geneSymbol geneId geneIdType proteinId proteinIdType organism sequenceCount bindingSequence bindingDomain tfFamily experimentType pubmedID
# <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character> <character>
# Hsapiens-jaspar2022-FOXF2-MA0030.1 MA0030.1 MA0030.1 jaspar2022 FOXF2 NA NA NA NA Hsapiens 28 NA NA FOX SELEX 7957066
```
### [JASPAR](https://jaspar.elixir.no)
[JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles](https://academic.oup.com/nar/article/50/D1/D165/6446529)
The JASPAR packages are updated more frequently, and may include motifs not captured in the MotifDb package.
```{r}
##----We load JASPAR2022 database----
BiocManager::install(version = "3.18")
devtools::install_github("da-bar/JASPAR2022")
library(JASPAR2022)
##----We use TFBStools to access and manipulate information from JASPAR2022----
library(TFBSTools)
?getMatrixSet # This function fetches matrix data for all matrices in the database matching criteria defined by the named arguments and returns a PFMatrixList object
?getMatrixByID # This method fetches matrix data under the given ID or name from the database and returns a XMatrix object.
?getMatrixByName # This method fetches matrix data under the given ID or name from the database and returns a XMatrix object.
hoxb1_mat <- getMatrixByName(x = JASPAR2022, name = "HOXB1") # name: transcription factor name; https://en.wikipedia.org/wiki/List_of_human_transcription_factors
class(hoxb1_mat)
# [1] "PFMatrix" # This is different from the common matrix in R
# attr(,"package")
# [1] "TFBSTools"
ID(hoxb1_mat) # "UN0124.1"
spi1_mat <- getMatrixByID(x = JASPAR2022, ID = "MA0080.1") # Select the exact motif ID of HOXB1; sourced from https://jaspar.elixir.no
ID(spi1_mat) # "MA0080.1"
##----We retrieve the position frequency matrix----
(mat <- Matrix(spi1_mat))
#
# [,1] [,2] [,3] [,4] [,5] [,6]
# A 14 4 3 56 56 3
# C 21 2 0 1 0 18
# G 19 48 52 0 0 34
# T 3 3 2 0 1 2
mat_2 <- as.matrix(spi1_mat)
mat == mat_2
# [,1] [,2] [,3] [,4] [,5] [,6]
# A TRUE TRUE TRUE TRUE TRUE TRUE
# C TRUE TRUE TRUE TRUE TRUE TRUE
# G TRUE TRUE TRUE TRUE TRUE TRUE
# T TRUE TRUE TRUE TRUE TRUE TRUE
# ! What do those numbers in the mat?
# Among 57 motifs, 21 of those motifs have the first position carrying the nucleotide C. The probability of C in the first position = 21/57
##----We transform a position frequency matrix into a position probability matrix----
mat/colSums(mat)
# [,1] [,2] [,3] [,4] [,5] [,6]
# A 0.24561404 0.07017544 0.05263158 0.98245614 0.98245614 0.05263158
# C 0.36842105 0.03508772 0.00000000 0.01754386 0.00000000 0.31578947
# G 0.33333333 0.84210526 0.91228070 0.00000000 0.00000000 0.59649123
# T 0.05263158 0.05263158 0.03508772 0.00000000 0.01754386 0.03508772
##----We retrieve sets of motifs----
# opts: a search options list.
# collection=c("CORE", "CNE", "PHYLOFACTS", "SPLICE", "POLII", "FAM", "PBM", "PBM_HOMEO", "PBM_HLH", "UNVALIDATED")
# species: The species source for the sequences, in Latin (Homo sapiens) or NCBI tax IDs (9606).
# matrixtype=c("PFM", "PWM", "ICM")
# tax_group: Group of species, currently consisting of "plants", "vertebrates", "insects", "urochordat", "nematodes", "fungi".
opts <- list()
opts[["collection"]] <- "CORE"
opts[["tax_group"]] <- "vertebrates"
(motif_list <- getMatrixSet(x=JASPAR2022, opts=opts))
# PFMatrixList of length 841
# names(841): MA0004.1 MA0006.1 MA0019.1 MA0029.1 ... MA1633.2 MA1647.2 MA0597.2 MA1540.2
```
# Visualize motifs from 2 motif databases
A `seqLogo` illustrates the relative frequency of a base at each motif position by representing the base's size relative to other bases at that particular position. The larger a base is at a position, the more likely this position carries this base.
## MotifDb
```{r}
library(seqLogo)
fox[[1]] # Return a position probability matrix (PPM) per base
# 1 2 3 4 5 6 7 8 9
# A 0 0.1 0 0.0 0.15 0.6 0.05 0.10 0.55
# C 0 0.0 0 0.0 0.00 0.0 0.55 0.15 0.15
# G 0 0.9 0 0.2 0.05 0.0 0.00 0.60 0.30
# T 1 0.0 1 0.8 0.80 0.4 0.40 0.15 0.00
seqLogo::seqLogo(pwm = fox[[1]], ic.scale = F) # ic.scale = F -> show the probability
seqLogo::seqLogo(pwm = fox[[1]], ic.scale = T) # Recommend ic.scale = T -> easily identify the important bases: positions with equal probabilities for each base A/T/C/G will score 0, while positions with only 1 possible base will score 2.
# Alternative approach 2 - can customize the seqlogo graph under the ggplot2 framework
# More info: https://omarwagih.github.io/ggseqlogo/
devtools::install_github("omarwagih/ggseqlogo")
library(ggseqlogo)
library(ggplot2)
ggseqlogo(fox[[1]], method = 'bits') + # position probability matrix
theme_minimal()
ggseqlogo(fox[[1]], method = 'prob') + # position probability matrix
theme_minimal()
```
## JASPAR
```{r}
# Covert the position frequency matrix to the position probability matrix
(spi1_mat_prob <- mat/colSums(mat))
# [,1] [,2] [,3] [,4] [,5] [,6]
# A 0.24561404 0.07017544 0.05263158 0.98245614 0.98245614 0.05263158
# C 0.36842105 0.03508772 0.00000000 0.01754386 0.00000000 0.31578947
# G 0.33333333 0.84210526 0.91228070 0.00000000 0.00000000 0.59649123
# T 0.05263158 0.05263158 0.03508772 0.00000000 0.01754386 0.03508772
seqLogo::seqLogo(pwm = spi1_mat_prob, ic.scale = F)
seqLogo::seqLogo(pwm = spi1_mat_prob, ic.scale = T)
# Alternative approach 1: the resulting visuals are the same
spi1_mat_icm <- toICM(mat) # Convert a raw frequency matrix (PFMatrix) to a information content matrix (ICMatrix).
TFBSTools::seqLogo(x = spi1_mat_icm, ic.scale = F)
TFBSTools::seqLogo(x = spi1_mat_icm, ic.scale = T)
# Alternative approach 2 - can customize the seqlogo graph under the ggplot2 framework
# The following 2 output graph are the same
ggseqlogo(mat, method = 'prob') + # position frequency matrix
theme_minimal()
ggseqlogo(spi1_mat_prob, method = 'prob') + # position probability matrix
theme_minimal()
# The following 2 output graph are the same
ggseqlogo(mat, method = 'bits') + # position frequency matrix
theme_minimal()
ggseqlogo(spi1_mat_prob, method = 'bits') + # position probability matrix
theme_minimal()
```