-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathGeneSetTCell.Rmd
354 lines (282 loc) · 14.1 KB
/
GeneSetTCell.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
---
title: "Gene Set Enrichment Analysis Using Mouse T-Regulatory Cell RNA-Seq Data"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
html_document:
code_folding: show
---
```{r, shorcut, include=FALSE}
## RStudio keyboard shortcut
# Cursor at the beginning of a command line: Ctrl+A
# Cursor at the end of a command line: Ctrl+E
# Clear all the code from your console: Ctrl+L
# Create a pipe operator %>%: Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
# Create an assignment operator <-: Alt+- (Windows) or Option+-(Mac)
# Knit a document (knitr): Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Comment or uncomment current selection: Ctrl+Shift+C (Windows) or Cmd+Shift+C (Mac)
```
# Load and save images
```{r}
load("2023Feb17RNAseq_GeneSet_ALiu.RData")
```
```{r}
image.date <- format(Sys.Date(), "%Y%b%d")
save.image(file = paste0(image.date, "RNAseq_GeneSet_ALiu.RData"))
```
# Tutorial
# Questions of interest
* Do cell cycle genes change more [enrich] between conditions than other genes?
* Are genes related to "immune response" enriched in any clusters?
# Sources of gene sets
* [GO consortium](http://geneontology.org)
* [REACTOME](https://reactome.org)
* [KEGG](https://www.genome.jp/kegg/)
* [MsigDB](https://www.gsea-msigdb.org/gsea/msigdb/)
# MSigDB and gene set collections
## Read the hallmarks gene set with gene symbols using `getGmt()`
```{r}
library(GSEABase)
hallMarks <- getGmt(con = "./data/h.all.v7.1.symbols.gmt")
hallMarks
class(hallMarks) # GeneSetCollection; list-style
names(hallMarks) # Access the names of all gene sets
hallMarks[[1]] # Read the first gene set which has 200 genes
```
## List the gene symbols within a gene set using `geneIds`
```{r}
geneIds(hallMarks)[1] # Show the gene symbols affiliated with the first gene set
```
## Work with `{msigdbr}`
```{r}
library(msigdbr)
mm_H <- msigdbr(species = "Mus musculus", category = "H") # Extract all the mouse hallmark gene sets
# species: Species name, such as Homo sapiens or Mus musculus.
# category: MSigDB collection abbreviation, such as H or C1.
# subcategory: MSigDB sub-collection abbreviation, such as CGP (chemical and genetic pertubations) or BP (biological process).
# https://www.gsea-msigdb.org/gsea/msigdb/human/genesets.jsp
# Note that these mouse gene sets are computationally derived from orthologous analysis; they may not be perfect
head(mm_H)
```
# *Test gene set enrichment
* First method - functional enrichment: test any association of our gene set with our group of differentially expressed genes
* Second method - Gene Set Enrichment Analysis (GSEA): test any association of our gene set with the `ranking` of all our genes [not merely differentially expressed genes]
+ Why GSEA?
+ To study gene expression profiles, researchers commonly identify interesting genes that show different levels of expression.
+ The functional enrichment analysis method focuses on these differentially expressed genes, but it may not work well in cases where the difference in expression levels is small but coordinated among related genes.
+ To overcome this limitation, GSEA can be used, which allows for the use of all genes and aggregates the per gene statistics across genes within a gene set to detect small but consistent changes in gene sets.
+ This is important because small but coordinated changes in a group of genes are likely to be relevant to many phenotypic differences.
# Functional enrichment
## Load the differential expression data
```{r}
Activated_minus_Resting <- read.csv("./data/Group_Activated_minus_Resting.csv")
Activated_minus_Resting[1:6, ]
```
## Screen out NA in `padj`
```{r}
Activated_minus_Resting <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj), ]
Activated_minus_Resting[1:6, ]
```
## *Perform the functional enrichment using `goseq`
```{r}
# Create a named vector of 1s or 0s indicating if a gene is upregulated or downregulated
UpInAct <- (Activated_minus_Resting$padj < 0.05 & Activated_minus_Resting$log2FoldChange > 0) |> as.integer()
names(UpInAct) <- Activated_minus_Resting$ENTREZID
UpInAct[1:6]
table(UpInAct)
# List supported genomes
# BiocManager::install("TxDb.Mmusculus.UCSC.mm10.knownGene")
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(goseq)
supGenomes <- supportedGenomes()
supGenomes[1:6, ]
# [!I mportant step]Examine the gene length bias in differential expression analysis and remove any artefactual enrichment for long and short genes
nullp(DEgenes = UpInAct, genome = "mm10", id = "knownGene", plot.fit = T)
pwf <- nullp(DEgenes = UpInAct, genome = "mm10", id = "knownGene", plot.fit = T)
# What can we learn from the plot?
# If there is no bias in differential expression analysis caused by gene length, the line will be horizontal.
# In this plot, the line is not horizontal but bent, indicating the gene length bias in the dataset.
# Identify the top enriched functional terms in GO
# Specify the genome build and ID (here we match TxDb.UCSC.mm10.knownGene.db) and the categories we plan to test (GO:BP, GO:MF, GO:CC, KEGG). Unfortunately, goseq does not contain the gene set in the msigdbr.
library(org.Mm.eg.db)
GO_UpInAct <- goseq(pwf, genome = "mm10", id = "knownGene",
test.cats = c("GO:BP"))
GO_UpInAct[1:6, ]
# "Wallenius" approximates the true distribution of numbers of members of a category amongst DE genes by the Wallenius non-central hypergeometric distribution. This distribution assumes that within a category all genes have the same probability of being chosen. Therefore, this approximation works best when the range in probabilities obtained by the probability weighting function is small. "Wallenius" is the recommended method for calculating p-values.
# Retrieve the genes in the immune response functional group (GO:0006955)
library(org.Mm.eg.db)
keytypes(org.Mm.eg.db) # Equivalent
# [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
# [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
# [9] "EVIDENCEALL" "GENENAME" "GENETYPE" "GO"
# [13] "GOALL" "IPI" "MGI" "ONTOLOGY"
# [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
# [21] "PROSITE" "REFSEQ" "SYMBOL" "UNIPROT"
keys(org.Mm.eg.db, keytype = "GO")
ImmuneResponseGenes <- AnnotationDbi::select(x = org.Mm.eg.db, keytype = "GOALL", keys = "GO:0002376", columns = c("ENTREZID", "ENSEMBL", "SYMBOL"))
ImmuneResponseGenes # Show the gene set of immune response
# Subset the differential gene expression table using the genes in the immune response functional group (GO:0002376)
IRG_Entrez <- unique(ImmuneResponseGenes$ENTREZID)
IRG_Res <- Activated_minus_Resting[Activated_minus_Resting$ENTREZID %in% IRG_Entrez, ]
write.csv(IRG_Res,
file = "./data/ImmuneResponseDEGeneTable.csv",
row.names = F)
```
## ClusterProfiler
### Perform the functional enrichment
```{r}
library(org.Mm.eg.db)
library(clusterProfiler)
sig_genes <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj) & Activated_minus_Resting$padj < 0.05, 1]
sig_genes[1:6]
sig_gene_enr <- enrichGO(sig_genes, OrgDb = org.Mm.eg.db)
sig_gene_enr
sig_gene_enr@result |> View()
class(sig_gene_enr) # enrichResult
sig_gene_enr_res <- sig_gene_enr |> as.data.frame()
write.csv(sig_gene_enr_res,
file = "./data/sig_gene_enr_res.csv",
row.names = F)
```
### *Visualize the results
```{r}
# Dotplot
library(ggplot2)
clusterProfiler::dotplot(sig_gene_enr, showCategory = 8) +
theme(axis.text.y = element_text(size = 7)) # Make the GO term description smaller
# What can we learn from the plot?
# GeneRatio on the X-axis: proportion of genes significantly changed between the activated status and resting status which are in the gene set among all genes significantly changed between the activated status and resting status, given that these significantly changed genes have the annotations
# Count: number of genes in the gene set that are significantly changed between the activated status and resting status
# p.adjust: the more red, the more significant
# Enrichment maps: show how the significant functional groups relate to each other
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 15, cex_label_category = 0.6) +
theme(text = element_text(size = 7))
# What can we learn from the plot?
# Dot: individual gene set
# Thickness of bond between 2 dots: how many overlapping significant genes between 2 individual gene sets?
```
# Gene Set Enrichment Analysis (GSEA)
* Test any association of our gene set with the `ranking` of all our genes [not merely differentially expressed genes]. The functional enrichment analysis may ignore the functional relevance of genes that are filtered out based on the cut-off (e.g., `padj` < 0.05).
* `lfcShrink()` shrinks the `log2 fold-change` values of genes that do not have much significance. This allows us to use the `log2 fold-change` as a **measure of significance** of change in our ranking for analysis and in programs such as `GSEA`.
## Create a ranked and named vector of gene scores
```{r}
# Create a ranked and named vector of gene scores. Here we rank genes using the `stat` to give sensible measure of differential expression. If we have the modified log2 fold changes using `lfsShrink()`, we could also use log2FoldChange to rank genes
forRNK <- Activated_minus_Resting$stat[order(Activated_minus_Resting$stat, decreasing = T)]
names(forRNK) <- Activated_minus_Resting$ENTREZID
forRNK[1:6]
```
## View the MSigDb enrichment of C7 (immunological signature)
```{r}
library(msigdbr)
mm_c7 <- msigdbr(species = "Mus musculus", category = "C7")[,c("gs_name", "entrez_gene")]
```
## Run the GSEA
```{r}
library(clusterProfiler)
sig_gene_enr_GSEA <- GSEA(forRNK, TERM2GENE = mm_c7, eps = 1e-100) # eps: This parameter sets the boundary for calculating the p value.
sig_gene_enr_GSEA
class(sig_gene_enr_GSEA) # gseaResult
```
## Visualize the results
```{r}
# Create a dotplot
dotplot(sig_gene_enr_GSEA, showCategory = 6) +
theme(axis.text.y = element_text(size = 7))
# Create an enrichment map
library(enrichplot)
sig_gene_enr_GSEA <- pairwise_termsim(sig_gene_enr_GSEA)
emapplot(sig_gene_enr_GSEA, showCategory = 10, cex_label_category = 0.6) +
theme(text = element_text(size = 7))
# Create a running score plot, which locates the most significant functinonal group
gseaplot(sig_gene_enr_GSEA, geneSetID = 1, by = "runningScore", title = "GSE15330_HSC_VS_LYMPHOID_PRIMED_MULTIPOTENT_PROGENITOR_DN")
```
## Save the GSEA results
```{r}
write.csv(as.data.frame(sig_gene_enr_GSEA), "cluster_profiler_GSEA_result.csv")
```
# Exercise
## *`goseq` cellular component analysis
Identify GO term cellular component groups enriched in genes significantly upregulated in liver with goseq. What are the top 5 terms?
```{r}
# Load the differential expression data
heart.minus.liver <- read.csv("./data/Heart_minus_liver.csv")
heart.minus.liver[1:6, ]
# Screen out NA in `padj`
heart.minus.liver <- heart.minus.liver[!is.na(heart.minus.liver$padj), ]
heart.minus.liver[1:6, ]
# Perform the functional enrichment
# Create a named vector of 1s or 0s indicating if a gene is upregulated or downregulated in the liver
up.in.liver <- (heart.minus.liver$padj < 0.05 & heart.minus.liver$log2FoldChange < 0) |> as.integer()
names(up.in.liver) <- heart.minus.liver$ID
up.in.liver[1:6]
table(up.in.liver)
# List supported genomes
library(goseq)
supGenomes <- supportedGenomes()
supGenomes[1:6, ]
# Examine the gene length bias in differential expression analysis and remove any artefactual enrichment for long and short genes
pwf <- nullp(DEgenes = up.in.liver, genome = "mm10", id = "knownGene", plot.fit = T)
# Identify GO term cellular component groups enriched in genes significantly upregulated in liver
# Specify the genome build and ID (here we match TxDb.UCSC.mm10.knownGene.db) and the categories we plan to test (GO:CC - cellular component)
GO.up.in.liver <- goseq(pwf, genome = "mm10", id = "knownGene", test.cats = c("GO:CC"))
GO.up.in.liver[1:6, ]
```
## ClusterProfiler for KEGG enrichment
Plot the -log10 pvalue for top 5 terms in a plot.
```{r}
library(clusterProfiler)
# Create a ranked and named vector of gene scores
gene.score <- heart.minus.liver$stat[order(heart.minus.liver $stat, decreasing = T)]
names(gene.score) <- heart.minus.liver$ID
gene.score[1:6]
KEGG.heart.liver <- gseKEGG(
geneList = gene.score,
organism = "mmu",
keyType = "kegg",
eps = 10e-100)
head(KEGG.heart.liver)
plot(factor(KEGG.heart.liver$ID[1:5], levels = c("mmu05330", "mmu00360", "mmu05332", "mmu00040", "mmu04612")),
-log10(KEGG.heart.liver$p.adjust)[1:5],
main = "Top 5 KEGG enriched terms",
ylab = "-log10(p.adj)",
type = "n",
las = 2) # las: make the label perpendicular to the x-axis
points(x = -log10(KEGG.heart.liver$p.adjust)[1:5], cex = 2, col = "dark red", bg = "dark red", pch = 21)
```
## Show the top 5 KEGG enriched terms in a dotplot
```{r}
library(ggplot2)
clusterProfiler::dotplot(KEGG.heart.liver, showCategory = 5) +
theme(axis.text.y = element_text(size = 7))
```
## Show the top 5 KEGG enriched terms in an enrichment map
```{r}
library(enrichplot)
KEGG.heart.liver.pt <- pairwise_termsim(KEGG.heart.liver)
emapplot(sig_gene_enr, showCategory = 5, cex_label_category = 0.6) +
theme(text = element_text(size = 7))
```
## *Heatmap of leading edge genes
**Draw a heatmap of the genes driving enrichment in the top 3 KEGG terms**. Use `rlog` counts, scale across rows using a Z-score, and include the kidney data as a reference point.
```{r}
library(pheatmap)
library(DESeq2)
gene.id <- unlist(strsplit(KEGG.heart.liver$core_enrichment[1:3],"/"))
load("data/gC_TissueFull.RData")
dds <- DESeqDataSet(geneCounts, design = ~Tissue)
dds$Tissue <- relevel(dds$Tissue, ref = "Kidney")
dds <- DESeq(dds)
dds.res <- results(dds, contrast = c("Tissue", "Liver", "Heart"))
rlog.tissue <- rlog(dds)
mat <- assay(rlog.tissue)[rownames(rlog.tissue) %in% gene.id,]
anno.df <- as.data.frame(colData(dds)) # colData accesses the metadata [Tissue, sizeFactor]
anno.df
anno.df <- anno.df[, 1, drop = F] # Keep the Tissue column only
anno.df
pheatmap(mat,
scale="row",
show_rownames = F,
annotation_col = anno.df)
```