BulkRNASeq/GeneSetTCell.Rmd

---
title: "Gene Set Enrichment Analysis Using Mouse T-Regulatory Cell RNA-Seq Data"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  html_document:
    code_folding: show
---

```{r, shorcut, include=FALSE}
## RStudio keyboard shortcut
# Cursor at the beginning of a command line: Ctrl+A
# Cursor at the end of a command line: Ctrl+E
# Clear all the code from your console: Ctrl+L
# Create a pipe operator %>%: Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
# Create an assignment operator <-: Alt+- (Windows) or Option+-(Mac) 
# Knit a document (knitr): Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Comment or uncomment current selection: Ctrl+Shift+C (Windows) or Cmd+Shift+C (Mac)
```


# Load and save images
```{r}
load("2023Feb17RNAseq_GeneSet_ALiu.RData")
```

```{r}
image.date <- format(Sys.Date(), "%Y%b%d")
save.image(file = paste0(image.date, "RNAseq_GeneSet_ALiu.RData"))
```


# Tutorial
# Questions of interest
* Do cell cycle genes change more [enrich] between conditions than other genes?
* Are genes related to "immune response" enriched in any clusters?


# Sources of gene sets
* [GO consortium](http://geneontology.org)
* [REACTOME](https://reactome.org)
* [KEGG](https://www.genome.jp/kegg/)
* [MsigDB](https://www.gsea-msigdb.org/gsea/msigdb/)


# MSigDB and gene set collections
## Read the hallmarks gene set with gene symbols using `getGmt()`
```{r}
library(GSEABase)
hallMarks <- getGmt(con = "./data/h.all.v7.1.symbols.gmt")
hallMarks
class(hallMarks) # GeneSetCollection; list-style
names(hallMarks) # Access the names of all gene sets
hallMarks[[1]] # Read the first gene set which has 200 genes
```


## List the gene symbols within a gene set using `geneIds`
```{r}
geneIds(hallMarks)[1] # Show the gene symbols affiliated with the first gene set
```


## Work with `{msigdbr}`
```{r}
library(msigdbr)
mm_H <- msigdbr(species = "Mus musculus", category = "H") # Extract all the mouse hallmark gene sets
# species: Species name, such as Homo sapiens or Mus musculus.
# category: MSigDB collection abbreviation, such as H or C1.
# subcategory: MSigDB sub-collection abbreviation, such as CGP (chemical and genetic pertubations) or BP (biological process).
# https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp
# Note that these mouse gene sets are computationally derived from orthologous analysis; they may not be perfect

head(mm_H)
```


# *Test gene set enrichment
* First method - functional enrichment: test any association of our gene set with our group of differentially expressed genes 
* Second method - Gene Set Enrichment Analysis (GSEA): test any association of our gene set with the `ranking` of all our genes [not merely differentially expressed genes] 
  + Why GSEA?
  + To study gene expression profiles, researchers commonly identify interesting genes that show different levels of expression. 
  + The functional enrichment analysis method focuses on these differentially expressed genes, but it may not work well in cases where the difference in expression levels is small but coordinated among related genes. 
  + To overcome this limitation, GSEA can be used, which allows for the use of all genes and aggregates the per gene statistics across genes within a gene set to detect small but consistent changes in gene sets. 
  + This is important because small but coordinated changes in a group of genes are likely to be relevant to many phenotypic differences.


# Functional enrichment
## Load the differential expression data
```{r}
Activated_minus_Resting <- read.csv("./data/Group_Activated_minus_Resting.csv")
Activated_minus_Resting[1:6, ]
```


## Screen out NA in `padj`
```{r}
Activated_minus_Resting <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj), ]
Activated_minus_Resting[1:6, ]
```


## *Perform the functional enrichment using `goseq`
```{r}
# Create a named vector of 1s or 0s indicating if a gene is upregulated or downregulated 
UpInAct <- (Activated_minus_Resting$padj < 0.05 & Activated_minus_Resting$log2FoldChange > 0) |> as.integer()
names(UpInAct) <- Activated_minus_Resting$ENTREZID
UpInAct[1:6]
table(UpInAct)

# List supported genomes
# BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(goseq)
supGenomes <- supportedGenomes() 
supGenomes[1:6, ]

# [!I mportant step]Examine the gene length bias in differential expression analysis and remove any artefactual enrichment for long and short genes 
nullp(DEgenes = UpInAct, genome = "mm10", id = "knownGene", plot.fit = T)
pwf <- nullp(DEgenes = UpInAct, genome = "mm10", id = "knownGene", plot.fit = T)

# What can we learn from the plot?
# If there is no bias in differential expression analysis caused by gene length, the line will be horizontal.
# In this plot, the line is not horizontal but bent, indicating the gene length bias in the dataset.

# Identify the top enriched functional terms in GO 
# Specify the genome build and ID (here we match TxDb.UCSC.mm10.knownGene.db) and the categories we plan to test (GO:BP, GO:MF, GO:CC, KEGG). Unfortunately, goseq does not contain the gene set in the msigdbr.
library(org.Mm.eg.db)
GO_UpInAct <- goseq(pwf, genome = "mm10", id = "knownGene",
                    test.cats = c("GO:BP"))
GO_UpInAct[1:6, ]

# "Wallenius" approximates the true distribution of numbers of members of a category amongst DE genes by the Wallenius non-central hypergeometric distribution. This distribution assumes that within a category all genes have the same probability of being chosen. Therefore, this approximation works best when the range in probabilities obtained by the probability weighting function is small. "Wallenius" is the recommended method for calculating p-values.

# Retrieve the genes in the immune response functional group (GO:0006955)
library(org.Mm.eg.db)
keytypes(org.Mm.eg.db) # Equivalent
#  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
#  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
#  [9] "EVIDENCEALL"  "GENENAME"     "GENETYPE"     "GO"          
# [13] "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
# [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
# [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIPROT"     
keys(org.Mm.eg.db, keytype = "GO")
ImmuneResponseGenes <- AnnotationDbi::select(x = org.Mm.eg.db, keytype = "GOALL", keys = "GO:0002376", columns = c("ENTREZID", "ENSEMBL", "SYMBOL"))
ImmuneResponseGenes # Show the gene set of immune response

# Subset the differential gene expression table using the genes in the immune response functional group (GO:0002376)
IRG_Entrez <- unique(ImmuneResponseGenes$ENTREZID)
IRG_Res <-  Activated_minus_Resting[Activated_minus_Resting$ENTREZID %in% IRG_Entrez, ]
write.csv(IRG_Res,
          file = "./data/ImmuneResponseDEGeneTable.csv",
          row.names = F)
```


## ClusterProfiler
### Perform the functional enrichment
```{r}
library(org.Mm.eg.db)
library(clusterProfiler)
sig_genes <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj) & Activated_minus_Resting$padj < 0.05, 1]
sig_genes[1:6]
sig_gene_enr <- enrichGO(sig_genes, OrgDb = org.Mm.eg.db)
sig_gene_enr 
sig_gene_enr@result |> View()
class(sig_gene_enr) # enrichResult
sig_gene_enr_res <- sig_gene_enr |> as.data.frame()
write.csv(sig_gene_enr_res,
          file = "./data/sig_gene_enr_res.csv",
          row.names = F)
```


### *Visualize the results
```{r}
# Dotplot
library(ggplot2)
clusterProfiler::dotplot(sig_gene_enr, showCategory = 8) +
  theme(axis.text.y = element_text(size = 7)) # Make the GO term description smaller

# What can we learn from the plot?
# GeneRatio on the X-axis: proportion of genes significantly changed between the activated status and resting status which are in the gene set among all genes significantly changed between the activated status and resting status, given that these significantly changed genes have the annotations
# Count: number of genes in the gene set that are significantly changed between the activated status and resting status
# p.adjust: the more red, the more significant

# Enrichment maps: show how the significant functional groups relate to each other
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 15, cex_label_category = 0.6) + 
  theme(text = element_text(size = 7))

# What can we learn from the plot?
# Dot: individual gene set
# Thickness of bond between 2 dots: how many overlapping significant genes between 2 individual gene sets?
```


# Gene Set Enrichment Analysis (GSEA)
* Test any association of our gene set with the `ranking` of all our genes [not merely differentially expressed genes]. The functional enrichment analysis may ignore the functional relevance of genes that are filtered out based on the cut-off (e.g., `padj` < 0.05). 
* `lfcShrink()` shrinks the `log2 fold-change` values of genes that do not have much significance. This allows us to use the `log2 fold-change` as a **measure of significance** of change in our ranking for analysis and in programs such as `GSEA`.


## Create a ranked and named vector of gene scores
```{r}
# Create a ranked and named vector of gene scores. Here we rank genes using the `stat` to give sensible measure of differential expression. If we have the modified log2 fold changes using `lfsShrink()`, we could also use log2FoldChange to rank genes
forRNK <- Activated_minus_Resting$stat[order(Activated_minus_Resting$stat, decreasing = T)]
names(forRNK) <- Activated_minus_Resting$ENTREZID
forRNK[1:6]
```


## View the MSigDb enrichment of C7 (immunological signature)
```{r}
library(msigdbr)
mm_c7 <- msigdbr(species = "Mus musculus", category = "C7")[,c("gs_name", "entrez_gene")]
```


## Run the GSEA
```{r}
library(clusterProfiler)
sig_gene_enr_GSEA <- GSEA(forRNK, TERM2GENE = mm_c7, eps = 1e-100) # eps: This parameter sets the boundary for calculating the p value.
sig_gene_enr_GSEA
class(sig_gene_enr_GSEA) # gseaResult
```


## Visualize the results
```{r}
# Create a dotplot
dotplot(sig_gene_enr_GSEA, showCategory = 6) + 
  theme(axis.text.y = element_text(size = 7))

# Create an enrichment map
library(enrichplot)
sig_gene_enr_GSEA <- pairwise_termsim(sig_gene_enr_GSEA)
emapplot(sig_gene_enr_GSEA, showCategory = 10, cex_label_category = 0.6) + 
  theme(text = element_text(size = 7))

# Create a running score plot, which locates the most significant functinonal group
gseaplot(sig_gene_enr_GSEA, geneSetID = 1, by = "runningScore", title = "GSE15330_HSC_VS_LYMPHOID_PRIMED_MULTIPOTENT_PROGENITOR_DN")
```


## Save the GSEA results
```{r}
write.csv(as.data.frame(sig_gene_enr_GSEA), "cluster_profiler_GSEA_result.csv")
```


# Exercise
## *`goseq` cellular component analysis
Identify GO term cellular component groups enriched in genes significantly upregulated in liver with goseq. What are the top 5 terms?
```{r}
# Load the differential expression data
heart.minus.liver <- read.csv("./data/Heart_minus_liver.csv")
heart.minus.liver[1:6, ]

# Screen out NA in `padj`
heart.minus.liver <- heart.minus.liver[!is.na(heart.minus.liver$padj), ]
heart.minus.liver[1:6, ]

# Perform the functional enrichment
# Create a named vector of 1s or 0s indicating if a gene is upregulated or downregulated in the liver
up.in.liver <- (heart.minus.liver$padj < 0.05 & heart.minus.liver$log2FoldChange < 0) |> as.integer()
names(up.in.liver) <- heart.minus.liver$ID
up.in.liver[1:6]
table(up.in.liver)

# List supported genomes
library(goseq)
supGenomes <- supportedGenomes() 
supGenomes[1:6, ]

# Examine the gene length bias in differential expression analysis and remove any artefactual enrichment for long and short genes 
pwf <- nullp(DEgenes = up.in.liver, genome = "mm10", id = "knownGene", plot.fit = T)

# Identify GO term cellular component groups enriched in genes significantly upregulated in liver
# Specify the genome build and ID (here we match TxDb.UCSC.mm10.knownGene.db) and the categories we plan to test (GO:CC - cellular component)
GO.up.in.liver <- goseq(pwf, genome = "mm10", id = "knownGene", test.cats = c("GO:CC"))
GO.up.in.liver[1:6, ]
```


## ClusterProfiler for KEGG enrichment
Plot the -log10 pvalue for top 5 terms in a plot.
```{r}
library(clusterProfiler)

# Create a ranked and named vector of gene scores
gene.score <- heart.minus.liver$stat[order(heart.minus.liver $stat, decreasing = T)]
names(gene.score) <- heart.minus.liver$ID
gene.score[1:6]

KEGG.heart.liver <- gseKEGG(
  geneList = gene.score,
  organism = "mmu",
  keyType = "kegg", 
  eps = 10e-100)

head(KEGG.heart.liver)

plot(factor(KEGG.heart.liver$ID[1:5], levels = c("mmu05330", "mmu00360", "mmu05332", "mmu00040", "mmu04612")), 
     -log10(KEGG.heart.liver$p.adjust)[1:5], 
     main = "Top 5 KEGG enriched terms",
     ylab = "-log10(p.adj)", 
     type = "n",
     las = 2) # las: make the label perpendicular to the x-axis
points(x = -log10(KEGG.heart.liver$p.adjust)[1:5], cex = 2, col = "dark red", bg = "dark red", pch = 21)
```


## Show the top 5 KEGG enriched terms in a dotplot 
```{r}
library(ggplot2)
clusterProfiler::dotplot(KEGG.heart.liver, showCategory = 5) +
  theme(axis.text.y = element_text(size = 7)) 
```


## Show the top 5 KEGG enriched terms in an enrichment map
```{r}
library(enrichplot)
KEGG.heart.liver.pt <- pairwise_termsim(KEGG.heart.liver)
emapplot(sig_gene_enr, showCategory = 5, cex_label_category = 0.6) + 
  theme(text = element_text(size = 7))
```


## *Heatmap of leading edge genes
**Draw a heatmap of the genes driving enrichment in the top 3 KEGG terms**. Use `rlog` counts, scale across rows using a Z-score, and include the kidney data as a reference point.
```{r}
library(pheatmap)
library(DESeq2)
gene.id <- unlist(strsplit(KEGG.heart.liver$core_enrichment[1:3],"/"))

load("data/gC_TissueFull.RData")
dds <- DESeqDataSet(geneCounts, design = ~Tissue)
dds$Tissue <- relevel(dds$Tissue, ref = "Kidney")
dds <- DESeq(dds)
dds.res <- results(dds, contrast = c("Tissue", "Liver", "Heart"))
rlog.tissue <- rlog(dds)
mat <- assay(rlog.tissue)[rownames(rlog.tissue) %in% gene.id,]
anno.df <- as.data.frame(colData(dds)) # colData accesses the metadata [Tissue, sizeFactor]
anno.df
anno.df <- anno.df[, 1, drop = F] # Keep the Tissue column only
anno.df

pheatmap(mat,
         scale="row", 
         show_rownames = F,
         annotation_col = anno.df)
```