Skip to content

Commit bb9ed34

Browse files
template for WGCNA
1 parent 8c7a25d commit bb9ed34

File tree

2 files changed

+377
-0
lines changed

2 files changed

+377
-0
lines changed
Lines changed: 359 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,359 @@
1+
---
2+
title: "Weighted Gene Co-Expression Network Analysis (WGCNA)"
3+
author: "Harvard Chan Bioinformatics Core"
4+
date: "`r Sys.Date()`"
5+
output:
6+
html_document:
7+
code_folding: hide
8+
df_print: paged
9+
highlights: pygments
10+
number_sections: true
11+
self_contained: true
12+
theme: default
13+
toc: true
14+
toc_float:
15+
collapsed: true
16+
smooth_scroll: true
17+
editor_options:
18+
chunk_output_type: inline
19+
params:
20+
minModuleSize: 30
21+
maxBlockSize: 4000
22+
mergeCutHeight: 0.25
23+
column: "sample_type"
24+
project_file: ../information.R
25+
params_file: params_de-example.R
26+
functions_file: ../libs
27+
---
28+
29+
```{r, message=FALSE, warning=FALSE}
30+
# This set up the working directory to this file so all files can be found
31+
library(rstudioapi)
32+
library(tidyverse)
33+
setwd(fs::path_dir(getSourceEditorContext()$path))
34+
# NOTE: This code will check version, this is our recommendation, it may work
35+
#. other versions
36+
stopifnot(R.version$major>= 4) # requires R4
37+
if (compareVersion(R.version$minor,"3.1")<0) warning("We recommend >= R4.3.1")
38+
stopifnot(compareVersion(as.character(BiocManager::version()), "3.18")>=0)
39+
```
40+
41+
This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.
42+
43+
44+
45+
```{r load_params, cache = FALSE, message = FALSE, warning=FALSE}
46+
source(params$project_file)
47+
source(params$params_file)
48+
map(list.files(params$functions_file,pattern = "*.R$",full.names = T),source) %>% invisible()
49+
column=params$column
50+
contrasts=params$contrasts
51+
subset_column=params$subset_column
52+
subset_value=params$subset_value
53+
54+
```
55+
56+
57+
```{r sanitize_datatable}
58+
sanitize_datatable = function(df, ...) {
59+
# remove dashes which cause wrapping
60+
DT::datatable(df, ..., rownames=gsub("-", "_", rownames(df)),
61+
colnames=gsub("-", "_", colnames(df)))
62+
}
63+
```
64+
65+
66+
```{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE}
67+
library(WGCNA)
68+
library(tidyverse)
69+
library(DESeq2)
70+
library(magrittr)
71+
library(rtracklayer)
72+
library(flashClust)
73+
library(gridExtra)
74+
library(DT)
75+
library(AnnotationDbi)
76+
library(bcbioR)
77+
78+
#library(fgsea)
79+
#library(org.Hs.eg.db)
80+
library(knitr)
81+
#library(EnhancedVolcano)
82+
#library(bcbioR)
83+
library(ggprism)
84+
#library(viridis)
85+
#library(pheatmap)
86+
#library(janitor)
87+
#library(ggforce)
88+
#library(vegan)
89+
#library(htmltools)
90+
91+
colors=cb_friendly_cols(1:15)
92+
ggplot2::theme_set(theme_prism(base_size = 14))
93+
opts_chunk[["set"]](
94+
cache = F,
95+
cache.lazy = FALSE,
96+
dev = c("png", "pdf"),
97+
error = TRUE,
98+
highlight = TRUE,
99+
message = FALSE,
100+
prompt = FALSE,
101+
tidy = FALSE,
102+
warning = FALSE,
103+
echo = T,
104+
fig.height = 4)
105+
106+
# set seed for reproducibility
107+
set.seed(1234567890L)
108+
```
109+
110+
# Overview
111+
112+
- Project: `r project`
113+
- PI: `r PI`
114+
- Analyst: `r analyst`
115+
- Experiment: `r experiment`
116+
- Aim: `r aim`
117+
118+
```{r load_data}
119+
120+
coldata <- load_coldata(coldata_fn, column,
121+
subset_column, subset_value)
122+
#coldata[[contrasts[[1]][1]]] = relevel(as.factor(coldata[[contrasts[[1]][1]]]), contrasts[[1]][3])
123+
coldata$sample=row.names(coldata)
124+
125+
counts <- load_counts(counts_fn)
126+
counts <- counts[,colnames(counts) %in% coldata$sample]
127+
128+
# It's easier for future items if the metadata(genotype) is already set up as a factor
129+
#coldata$sample_type <- as.factor(coldata$sample_type)
130+
131+
```
132+
133+
134+
# WGCNA
135+
136+
WGCNA carries out a correlation network construction. Networks are visual representations of interactions between `nodes` in a system. The nodes in WGCNA are individual genes. So, it helps to visualize patterns and relationships between gene expression profiles to identify genes associated with measured traits as well as identifying genes that are consistently co-expressed and could be contributing to similar molecular pathways. It also accounts for inter-individual variation in gene expression and alleviates issues associated with multiple testing.
137+
138+
We will use DESeq2 to transform our RNA-seq count data before running WGCNA. Removing the low count genes before the WGCNA is recommended and can help improve the WGCNA results.
139+
140+
For this analysis we are keeping genes with **10 or more reads in total and expressed in at least 3 samples.**
141+
142+
For the data normalization we will use variance stabilizing transformation as recommended by the WGCNA's authors and transpose the dataset to prepare for WGCNA analysis.
143+
144+
145+
# Data
146+
147+
```{r show_coldata}
148+
coldata %>% sanitize_datatable()
149+
```
150+
151+
```{r normalize_data}
152+
dds <- DESeqDataSetFromMatrix(counts,
153+
colData = coldata,
154+
design = ~ 1)
155+
156+
## Filtering lowly expressed genes
157+
# We are filtering out genes with fewer than 10 raw counts in total and are present in fewer than 3 samples.
158+
keep <- rowSums(counts(dds)>=10) >=4
159+
dds <- dds[keep, ]
160+
161+
# Retrive the normalized data from the DESeqDataSet
162+
dds_norm <- vst(dds)
163+
normalized_counts <- assay(dds_norm) %>%
164+
t() # Transpose this data
165+
#At this point you can look into the data and remove any outlier samples, as those outlier samples can affect the WGCNA results. However, our dataset as analysed earlier doesnot have obvious outliers.
166+
167+
```
168+
169+
170+
## Parameter settings for WGCNA
171+
WGCNA creates a weighted network to define which genes are near each other. The measure of adjacency used is based on the correlation matrix, which requires the definition of a threshold value, which in turn depends on a "power" parameter that defines the exponent used when transforming the correlation values.
172+
173+
The choice of power parameter will affect the number of modules identified. WGCNA has a function called pickSoftThreshold() to help determine the soft power threshold.
174+
175+
176+
```{r, message=FALSE, warning=FALSE,fig.align='center'}
177+
178+
# determine parameters for WGCNA
179+
sft <- pickSoftThreshold(normalized_counts,
180+
dataIsExpr = TRUE,
181+
corFnc = cor,
182+
networkType = "signed"
183+
)
184+
# We have to calculate a measure of the model fit, the signed R2, and make that a new variable.
185+
186+
sft_df <- data.frame(sft$fitIndices) %>%
187+
dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
188+
189+
190+
ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
191+
geom_point() +
192+
geom_text(nudge_y = 0.1) +
193+
geom_hline(yintercept = 0.80, col = "red") +
194+
ylim(c(min(sft_df$model_fit), 1.1)) +
195+
xlab("Soft Threshold (power)") +
196+
ylab("Scale Free Topology Model Fit, signed R^2") +
197+
ggtitle("Scale independence") +
198+
theme_classic()
199+
```
200+
201+
202+
It is recommend to use a power that has an signed R2 above 0.80, otherwise the results may be too noisy to draw any meanings. In case of multiple power values with signed R2 above 0.80, we need to pick the one at an inflection point or where the R2 value seem to have reached their saturation. You want a power that gives a big enough R2 but is not excessively large.
203+
204+
205+
## One-step blockwise module detection:
206+
207+
We will use the blockwiseModules() with power threshold of 20 to find co-expressed genes. It will construct modules with group of genes that are highly correlated.
208+
209+
```{r, message=FALSE, warning=FALSE}
210+
# picking a soft threshold power near the curve of the plot
211+
picked_power = 20 ## pick a power here based on the above plot and instruction
212+
temp_cor <- cor
213+
cor <- WGCNA::cor
214+
bwnet <- blockwiseModules(normalized_counts,
215+
216+
# == Adjacency Function ==
217+
power = picked_power,
218+
networkType = "signed", # there is option to run unsigned networktype as well.
219+
220+
# == Set Tree and Block parameters ==
221+
deepSplit = 2,
222+
pamRespectsDendro = F,
223+
# detectCutHeight = 0.75,
224+
minModuleSize = params$minModuleSize,
225+
maxBlockSize = params$maxBlockSize,
226+
227+
# == Module Adjustments ==
228+
reassignThreshold = 0,
229+
mergeCutHeight = params$mergeCutHeight,
230+
231+
# == TOM == Archive the run results in TOM file (saves time)
232+
# saveTOMs = T,
233+
# saveTOMFileBase = "ER",
234+
235+
# == Output Options
236+
numericLabels = T,
237+
verbose = F)
238+
239+
# Write down main WGCNA results object to file
240+
readr::write_rds(bwnet,
241+
file = "WGCNA_results.RDS")
242+
```
243+
244+
## Eigengene module dataframe
245+
246+
A module eigengene is the standardized gene expression profile for a given module. An eigengene is the gene whose expression is representative of the majority of the genes within a module.
247+
248+
```{r, message=FALSE, warning=FALSE}
249+
module_eigengens <- bwnet$MEs
250+
datatable(module_eigengens, options = list(pageLength =10, scrollX=TRUE))
251+
252+
```
253+
254+
255+
## Dendogram for the modules
256+
257+
```{r, fig.align='center', fig.width=12, message=FALSE, warning=FALSE}
258+
# Convert labels to colors for plotting
259+
mergedColors =labels2colors(bwnet$colors)
260+
# Plot the dendrogram and the module colors underneath
261+
plotDendroAndColors(bwnet$dendrograms[[1]],
262+
mergedColors[bwnet$blockGenes[[1]]],
263+
"Module colors",
264+
dendroLabels = FALSE,
265+
hang = 0.03,
266+
addGuide = TRUE,
267+
guideHang = 0.05)
268+
269+
```
270+
271+
272+
### Relating modules (Clusters) to geneIds.
273+
GeneIds with their module color and number are given on the table below.
274+
275+
```{r, message=FALSE, warning=FALSE}
276+
# Relating modules (Clusters) to genotypes
277+
module_df <- data.frame(
278+
gene_id = names(bwnet$colors),
279+
module_no = bwnet$colors,
280+
colors = labels2colors(bwnet$colors)
281+
282+
)
283+
datatable(module_df, rownames = FALSE)
284+
# lets write out the file lising the genes and their modules
285+
write_delim(module_df, file = "List_of_genes_with_modules.txt", delim = "\t")
286+
```
287+
288+
289+
## Relating modules with genotypes
290+
291+
WGCNA calcualtes and Eigengene (hypothetical central gene) for each module which makes it easier to associate sample groups with module cluster.
292+
293+
```{r, message=FALSE, warning=FALSE}
294+
# Get Module Eigengens per cluster
295+
MEs0 <- moduleEigengenes(normalized_counts, mergedColors)$eigengenes
296+
297+
# Reorder modules so similar modules are next to each other
298+
MEs0 <- orderMEs(MEs0)
299+
module_order = names(MEs0) %>% gsub("ME","", .)
300+
301+
# Add sample names
302+
MEs0$samples = row.names(MEs0)
303+
304+
# tidy the data
305+
mME = MEs0 %>%
306+
pivot_longer(-samples) %>%
307+
mutate(
308+
name = gsub("ME", "", name),
309+
name = factor(name, levels = module_order)
310+
)
311+
```
312+
313+
314+
```{r, fig.align='center', fig.height=8, message=FALSE, warning=FALSE}
315+
mME %>% ggplot(., aes(x=samples, y=name, fill=value)) +
316+
geom_tile() +
317+
theme_bw() +
318+
scale_fill_gradient2(
319+
low = "blue",
320+
high = "red3",
321+
mid = "white",
322+
midpoint = 0,
323+
limit = c(-1,1)) +
324+
theme(axis.text.x = element_text(angle=90)) +
325+
labs(title = "Module-trait Relationships", y = "Modules", fill="corr")
326+
327+
```
328+
329+
On the heatmap above, samples are on the x-axis and different modules on y-axis.
330+
We can visually inspect this heatmap to understand how different modules are correlated to different sample groups.
331+
332+
We can also use another approach to extract only the modules with biggest differences and then correlate those modules for conditions.
333+
334+
## Modules with biggest differences across treatments
335+
We will create a design matrix using `sample_type` in our metadata and run linear model on each module.
336+
337+
```{r, message=FALSE, warning=FALSE}
338+
# to confirm our eigengens relate to our metadata from deseq object run the line below
339+
#all.equal(colnames(dds), rownames(module_eigengens))
340+
341+
# Create a design matrix from the genotype variable
342+
des_mat <- model.matrix(~ coldata$sample_type)
343+
344+
345+
# lmFit() needs a transposed version of the matrix
346+
fit <- limma::lmFit(t(module_eigengens), design = des_mat)
347+
348+
# Apply empirical Bayes to smooth standard errors
349+
fit <- limma::eBayes(fit)
350+
351+
# Apply multiple testing correction and obtain stats
352+
stats_df <- limma::topTable(fit, number = ncol(module_eigengens)) %>%
353+
tibble::rownames_to_column("module")
354+
355+
datatable(stats_df, options = list(pageLength =10, scrollX=TRUE))
356+
```
357+
358+
Modules with highest padj values are the ones have bigger differences between sample_groups. Look for these modules in the heatmap above.
359+
Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
# project params
2+
date = "YYYYMMDD"
3+
basedir <- './' # where to write down output files
4+
5+
# params for bcbio
6+
# coldata_fn = "https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/coldata.csv"
7+
# counts_fn = 'https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/tximport-counts.csv'
8+
# se_object=url("https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/bcbio-se.rds")
9+
#
10+
11+
# Example data
12+
coldata_fn='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/coldata.csv'
13+
counts_fn=url('https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/star_salmon/salmon.merged.gene_counts.tsv')
14+
# This folder is in the output directory inside multiqc folder
15+
multiqc_data_dir='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/multiqc/star_salmon/multiqc-report-data/'
16+
# This file is inside the genome folder in the output directory
17+
gtf_fn='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/devel/nf-core/genome/genome.filtered.gtf.gz'
18+
se_object = NA

0 commit comments

Comments
 (0)