ATACSeq/Search_Visualize_Motif.Rmd

---
title: "ATAC-Seq Data Analysis with Mouse Tissue Data - Search and Visualize Motifs"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  html_document:
    code_folding: show
---

# Find motifs (set of related, short sequences preferred by a given transcription factor)
* Why identify motifs in ATAC-seq?
  + Motif analysis allows us to predict the potential binding sites for transcription factors (TFs) in the accessible chromatin regions. TFs play a crucial role in gene regulation by binding to specific DNA motifs. Identifying these motifs can help us understand which TFs are likely to be active in a given cell type or condition.
  + ATAC-seq data can reveal regions of open chromatin associated with regulatory elements such as enhancers and promoters. Motif analysis can help in pinpointing the specific motifs that are enriched in these regions. This information can be used to infer the regulatory networks and pathways active in the analyzed samples.
  + Different cell types may have distinct chromatin accessibility patterns and regulatory elements. By analyzing motifs, we can identify cell type-specific motifs that are enriched in open chromatin regions. This can aid in characterizing cell identity and differentiation states.
  + Motif analysis can provide functional annotations to the open chromatin regions. For example, if we find motifs associated with a particular TF known to regulate immune response genes, it suggests that these regions may be involved in immune-related functions.
  + Combining ATAC-seq motif analysis with RNA-seq data can provide a more comprehensive view of gene regulation. We can link the presence of specific motifs in open chromatin regions to changes in gene expression, helping to elucidate the regulatory mechanisms.
* Recommended open-source tool: [Fast Motif Matching in R](https://bioconductor.org/packages/release/bioc/vignettes/motifmatchr/inst/doc/motifmatchr.html)
* Recommended open-source motif database:
[JASPAR](https://jaspar.elixir.no)
[MotifDB](https://bioconductor.org/packages/release/bioc/vignettes/MotifDb/inst/doc/MotifDb.html)


## Structures and operations of motif databases
### MotifDb
```{r}
##----We load MotifDb----
library(MotifDb)
MotifDb 
# MotifDb object of length 12657
# | Created from downloaded public sources, last update: 2022-Mar-04
class(MotifDb)
# [1] "MotifList"
# attr(,"package")
# [1] "MotifDb"
length(MotifDb) # [1] 12657
motif_names <- names(MotifDb)
length(motif_names) # [1] 12657
motif_names |> tail()
# [1] "Mmusculus-UniPROBE-Zfp691.UP00095"
# [2] "Mmusculus-UniPROBE-Zfp740.UP00022"
# [3] "Mmusculus-UniPROBE-Zic1.UP00102"  
# [4] "Mmusculus-UniPROBE-Zic2.UP00057"  
# [5] "Mmusculus-UniPROBE-Zic3.UP00006"  
# [6] "Mmusculus-UniPROBE-Zscan4.UP00026"

##----We retrieve the last MotifList----
MotifDb[12657]
# MotifDb object of length 1
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 1 position frequency matrices from 1 source:
# |           UniPROBE:    1
# | 1 organism/s
# |          Mmusculus:    1
# Mmusculus-UniPROBE-Zscan4.UP00026 

##----We retrieve the position probability matrix (PPM); each column represents the position of a motif----
MotifDb[[12657]]
#           1         2         3         4          5           6
# A 0.2039268 0.3603414 0.2511950 0.4871860 0.12283797 0.020466874
# C 0.1572597 0.2652161 0.2988057 0.1224721 0.05106196 0.009991709
# G 0.3070709 0.1473273 0.2410505 0.2143623 0.05577333 0.965816157
# T 0.3317426 0.2271152 0.2089488 0.1759796 0.77032674 0.003725260
#             7           8           9          10          11         12
# A 0.005886533 0.030656083 0.002078162 0.960642728 0.003725260 0.77032674
# C 0.026662635 0.002166648 0.965099108 0.006808103 0.965816157 0.05577333
# G 0.006808103 0.965099108 0.002166648 0.026662635 0.009991709 0.05106196
# T 0.960642728 0.002078162 0.030656083 0.005886533 0.020466874 0.12283797
#           13         14         15        16        17
# A 0.04480760 0.75131969 0.36274223 0.4366353 0.3039305
# C 0.38230731 0.04741718 0.22889829 0.1114793 0.2853735
# G 0.04292044 0.04448209 0.08537279 0.2172844 0.1958724
# T 0.52996466 0.15678104 0.32298669 0.2346011 0.2148237

colSums(MotifDb[[12657]])
 # 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 
 # 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 

##----We retrieve the motif metadata information----
values(MotifDb) # DataFrame with 12657 rows and 15 columns
# ! geneSymbol shows the gene name of transcription factor associated with each motif

values(MotifDb)[12657, ]
# DataFrame with 1 row and 15 columns
#                                             providerName  providerId  dataSource  geneSymbol      geneId  geneIdType   proteinId proteinIdType    organism sequenceCount bindingSequence bindingDomain    tfFamily         experimentType    pubmedID
#                                              <character> <character> <character> <character> <character> <character> <character>   <character> <character>   <character>     <character>   <character> <character>            <character> <character>
# Mmusculus-UniPROBE-Zscan4.UP00026 SCI09/Zscan4_pwm_pri..     UP00026    UniPROBE      Zscan4          NA          NA          NA            NA   Mmusculus            NA              NA      ZnF_C2H2          NA protein binding micr..    19443739

##----We retrieve the MotifList by restricting the metadata information----
# Only select MotifList whose metadata includes FOX
(fox <- query(object = MotifDb, 
              andStrings = c("FOX"))) # CTCF must be found in the metadata

# MotifDb object of length 353
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 353 position frequency matrices from 17 sources:
# |    FlyFactorSurvey:    1
# |        HOCOMOCOv10:   47
# | HOCOMOCOv11-core-A:    8
# | HOCOMOCOv11-core-B:    3
# | HOCOMOCOv11-core-C:    5
# | HOCOMOCOv11-secondary-B:    1
# | HOCOMOCOv11-secondary-D:   11
# |              HOMER:   14
# |               hPDI:    4
# |        JASPAR_2014:   14
# |        JASPAR_CORE:   10
# |         jaspar2016:   40
# |         jaspar2018:   43
# |         jaspar2022:   68
# |          jolma2013:   51
# |       SwissRegulon:   28
# |           UniPROBE:    5
# | 10 organism/s
# |           Hsapiens:  224
# |          Mmusculus:   57
# |                 NA:   20
# |        Scerevisiae:   12
# |        Rnorvegicus:   11
# |      Dmelanogaster:    7
# |              other:   22
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197 
# Hsapiens-HOCOMOCOv10-FOXA1_HUMAN.H10MO.A 
# Hsapiens-HOCOMOCOv10-FOXA2_HUMAN.H10MO.A 
# Hsapiens-HOCOMOCOv10-FOXA3_HUMAN.H10MO.C 
# Hsapiens-HOCOMOCOv10-FOXB1_HUMAN.H10MO.D 
# ...
# Mmusculus-UniPROBE-Foxa2.UP00073 
# Mmusculus-UniPROBE-Foxj1.UP00041 
# Mmusculus-UniPROBE-Foxj3.UP00039 
# Mmusculus-UniPROBE-Foxk1.UP00025 
# Mmusculus-UniPROBE-Foxl1.UP00061 

values(fox)[1, ]
# DataFrame with 1 row and 15 columns
#                                                                    providerName  providerId      dataSource  geneSymbol      geneId  geneIdType   proteinId proteinIdType      organism sequenceCount bindingSequence bindingDomain    tfFamily
#                                                                     <character> <character>     <character> <character> <character> <character> <character>   <character>   <character>   <character>     <character>   <character> <character>
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197 foxo_SANGER_10_FBgn0.. FBgn0038197 FlyFactorSurvey        foxo       41709      ENTREZ      Q95V55       UNIPROT Dmelanogaster            20              NA     Fork_head          NA
#                                                                  experimentType    pubmedID
#                                                                     <character> <character>
# Dmelanogaster-FlyFactorSurvey-foxo_SANGER_10_FBgn0038197 bacterial 1-hybrid, ..          NA


# Only select MotifList whose metadata includes FOX, hsapiens, jaspar2022
(fox_2 <- query(object = MotifDb, 
                andStrings = c("FOX", "hsapiens", "jaspar2022")))

# MotifDb object of length 34
# | Created from downloaded public sources, last update: 2022-Mar-04
# | 34 position frequency matrices from 1 source:
# |         jaspar2022:   34
# | 1 organism/s
# |           Hsapiens:   34
# Hsapiens-jaspar2022-FOXF2-MA0030.1 
# Hsapiens-jaspar2022-FOXD1-MA0031.1 
# Hsapiens-jaspar2022-FOXH1-MA0479.1 
# Hsapiens-jaspar2022-FOXP2-MA0593.1 
# Hsapiens-jaspar2022-FOXG1-MA0613.1 
# ...
# Hsapiens-jaspar2022-FOXO1::ELK3-MA1955.1 
# Hsapiens-jaspar2022-FOXO1::FLI1-MA1956.1 
# Hsapiens-jaspar2022-FOXD3-MA0041.2 
# Hsapiens-jaspar2022-FOXD2-MA0847.3 
# Hsapiens-jaspar2022-FOXE1-MA1487.2 

values(fox_2)[1, ]
# DataFrame with 1 row and 15 columns
#                                    providerName  providerId  dataSource  geneSymbol      geneId  geneIdType   proteinId proteinIdType    organism sequenceCount bindingSequence bindingDomain    tfFamily experimentType    pubmedID
#                                     <character> <character> <character> <character> <character> <character> <character>   <character> <character>   <character>     <character>   <character> <character>    <character> <character>
# Hsapiens-jaspar2022-FOXF2-MA0030.1     MA0030.1    MA0030.1  jaspar2022       FOXF2          NA          NA          NA            NA    Hsapiens            28              NA            NA         FOX          SELEX     7957066
```


### [JASPAR](https://jaspar.elixir.no)
[JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles](https://academic.oup.com/nar/article/50/D1/D165/6446529)
The JASPAR packages are updated more frequently, and may include motifs not captured in the MotifDb package. 
```{r}
##----We load JASPAR2022 database----
BiocManager::install(version = "3.18")
devtools::install_github("da-bar/JASPAR2022")
library(JASPAR2022)

##----We use TFBStools to access and manipulate information from JASPAR2022----
library(TFBSTools)
?getMatrixSet # This function fetches matrix data for all matrices in the database matching criteria defined by the named arguments and returns a PFMatrixList object
?getMatrixByID # This method fetches matrix data under the given ID or name from the database and returns a XMatrix object.
?getMatrixByName # This method fetches matrix data under the given ID or name from the database and returns a XMatrix object.
hoxb1_mat <- getMatrixByName(x = JASPAR2022, name = "HOXB1") # name: transcription factor name; https://en.wikipedia.org/wiki/List_of_human_transcription_factors
class(hoxb1_mat)
# [1] "PFMatrix" # This is different from the common matrix in R
# attr(,"package")
# [1] "TFBSTools"
ID(hoxb1_mat) # "UN0124.1"

spi1_mat <- getMatrixByID(x = JASPAR2022, ID = "MA0080.1") # Select the exact motif ID of HOXB1; sourced from https://jaspar.elixir.no
ID(spi1_mat) # "MA0080.1"

##----We retrieve the position frequency matrix----
(mat <- Matrix(spi1_mat))
# 
#   [,1] [,2] [,3] [,4] [,5] [,6]
# A   14    4    3   56   56    3
# C   21    2    0    1    0   18
# G   19   48   52    0    0   34
# T    3    3    2    0    1    2

mat_2 <- as.matrix(spi1_mat)
mat == mat_2
#   [,1] [,2] [,3] [,4] [,5] [,6]
# A TRUE TRUE TRUE TRUE TRUE TRUE
# C TRUE TRUE TRUE TRUE TRUE TRUE
# G TRUE TRUE TRUE TRUE TRUE TRUE
# T TRUE TRUE TRUE TRUE TRUE TRUE

# ! What do those numbers in the mat?
# Among 57 motifs, 21 of those motifs have the first position carrying the nucleotide C. The probability of C in the first position = 21/57

##----We transform a position frequency matrix into a position probability matrix----
mat/colSums(mat)
#         [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
# A 0.24561404 0.07017544 0.05263158 0.98245614 0.98245614 0.05263158
# C 0.36842105 0.03508772 0.00000000 0.01754386 0.00000000 0.31578947
# G 0.33333333 0.84210526 0.91228070 0.00000000 0.00000000 0.59649123
# T 0.05263158 0.05263158 0.03508772 0.00000000 0.01754386 0.03508772

##----We retrieve sets of motifs----
# opts: a search options list. 
# collection=c("CORE", "CNE", "PHYLOFACTS", "SPLICE", "POLII", "FAM", "PBM", "PBM_HOMEO", "PBM_HLH", "UNVALIDATED")
# species: The species source for the sequences, in Latin (Homo sapiens) or NCBI tax IDs (9606).
# matrixtype=c("PFM", "PWM", "ICM")
# tax_group: Group of species, currently consisting of "plants", "vertebrates", "insects", "urochordat", "nematodes", "fungi".
opts <- list()
opts[["collection"]] <- "CORE"
opts[["tax_group"]] <- "vertebrates"
(motif_list <- getMatrixSet(x=JASPAR2022, opts=opts))
# PFMatrixList of length 841
# names(841): MA0004.1 MA0006.1 MA0019.1 MA0029.1 ... MA1633.2 MA1647.2 MA0597.2 MA1540.2
```


# Visualize motifs from 2 motif databases
A `seqLogo` illustrates the relative frequency of a base at each motif position by representing the base's size relative to other bases at that particular position. The larger a base is at a position, the more likely this position carries this base.

## MotifDb
```{r}
library(seqLogo)
fox[[1]] # Return a position probability matrix (PPM) per base
#   1   2 3   4    5   6    7    8    9
# A 0 0.1 0 0.0 0.15 0.6 0.05 0.10 0.55
# C 0 0.0 0 0.0 0.00 0.0 0.55 0.15 0.15
# G 0 0.9 0 0.2 0.05 0.0 0.00 0.60 0.30
# T 1 0.0 1 0.8 0.80 0.4 0.40 0.15 0.00
seqLogo::seqLogo(pwm = fox[[1]], ic.scale = F) # ic.scale = F -> show the probability
seqLogo::seqLogo(pwm = fox[[1]], ic.scale = T) # Recommend ic.scale = T -> easily identify the important bases: positions with equal probabilities for each base A/T/C/G will score 0, while positions with only 1 possible base will score 2. 

# Alternative approach 2 - can customize the seqlogo graph under the ggplot2 framework
# More info: https://omarwagih.github.io/ggseqlogo/
devtools::install_github("omarwagih/ggseqlogo")
library(ggseqlogo)
library(ggplot2)
ggseqlogo(fox[[1]], method = 'bits') + # position probability matrix
  theme_minimal()
ggseqlogo(fox[[1]], method = 'prob') + # position probability matrix
  theme_minimal()
```


## JASPAR
```{r}
# Covert the position frequency matrix to the position probability matrix
(spi1_mat_prob <- mat/colSums(mat))
#         [,1]       [,2]       [,3]       [,4]       [,5]       [,6]
# A 0.24561404 0.07017544 0.05263158 0.98245614 0.98245614 0.05263158
# C 0.36842105 0.03508772 0.00000000 0.01754386 0.00000000 0.31578947
# G 0.33333333 0.84210526 0.91228070 0.00000000 0.00000000 0.59649123
# T 0.05263158 0.05263158 0.03508772 0.00000000 0.01754386 0.03508772

seqLogo::seqLogo(pwm = spi1_mat_prob, ic.scale = F) 
seqLogo::seqLogo(pwm = spi1_mat_prob, ic.scale = T)

# Alternative approach 1: the resulting visuals are the same
spi1_mat_icm <- toICM(mat) # Convert a raw frequency matrix (PFMatrix) to a information content matrix (ICMatrix).
TFBSTools::seqLogo(x = spi1_mat_icm, ic.scale = F)
TFBSTools::seqLogo(x = spi1_mat_icm, ic.scale = T)

# Alternative approach 2 - can customize the seqlogo graph under the ggplot2 framework
# The following 2 output graph are the same
ggseqlogo(mat, method = 'prob') + # position frequency matrix 
  theme_minimal()
ggseqlogo(spi1_mat_prob, method = 'prob') + # position probability matrix
  theme_minimal()

# The following 2 output graph are the same
ggseqlogo(mat, method = 'bits') + # position frequency matrix
  theme_minimal()
ggseqlogo(spi1_mat_prob, method = 'bits') + # position probability matrix
  theme_minimal()
```