template for WGCNA

upendrabhattarai · upendrabhattarai · commit bb9ed3458135 · 2025-01-30T16:15:39.000-05:00
diff --git a/inst/templates/rnaseq/WGCNA/WGCNA.Rmd b/inst/templates/rnaseq/WGCNA/WGCNA.Rmd
@@ -0,0 +1,359 @@
+---
+title: "Weighted Gene Co-Expression Network Analysis (WGCNA)"
+author: "Harvard Chan Bioinformatics Core"
+date: "`r Sys.Date()`"
+output:
+   html_document:
+      code_folding: hide
+      df_print: paged
+      highlights: pygments
+      number_sections: true
+      self_contained: true
+      theme: default
+      toc: true
+      toc_float:
+         collapsed: true
+         smooth_scroll: true
+editor_options: 
+  chunk_output_type: inline
+params:
+  minModuleSize: 30
+  maxBlockSize: 4000
+  mergeCutHeight: 0.25
+  column: "sample_type"
+  project_file: ../information.R
+  params_file: params_de-example.R
+  functions_file: ../libs 
+---
+
+```{r, message=FALSE, warning=FALSE}
+# This set up the working directory to this file so all files can be found
+library(rstudioapi)
+library(tidyverse)
+setwd(fs::path_dir(getSourceEditorContext()$path))
+# NOTE: This code will check version, this is our recommendation, it may work
+#.      other versions
+stopifnot(R.version$major>= 4) # requires R4
+if (compareVersion(R.version$minor,"3.1")<0) warning("We recommend >= R4.3.1") 
+stopifnot(compareVersion(as.character(BiocManager::version()), "3.18")>=0)
+```
+
+This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.
+
+
+
+```{r load_params, cache = FALSE, message = FALSE, warning=FALSE}
+source(params$project_file)
+source(params$params_file)
+map(list.files(params$functions_file,pattern = "*.R$",full.names = T),source) %>% invisible()
+column=params$column
+contrasts=params$contrasts
+subset_column=params$subset_column
+subset_value=params$subset_value
+
+```
+
+
+```{r sanitize_datatable}
+sanitize_datatable = function(df, ...) {
+ # remove dashes which cause wrapping
+ DT::datatable(df, ..., rownames=gsub("-", "_", rownames(df)),
+                   colnames=gsub("-", "_", colnames(df)))
+}
+```
+
+
+```{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE}
+library(WGCNA)
+library(tidyverse)
+library(DESeq2)
+library(magrittr)
+library(rtracklayer)
+library(flashClust)
+library(gridExtra)
+library(DT)
+library(AnnotationDbi)
+library(bcbioR)
+
+#library(fgsea)
+#library(org.Hs.eg.db)
+library(knitr)
+#library(EnhancedVolcano)
+#library(bcbioR)
+library(ggprism)
+#library(viridis)
+#library(pheatmap)
+#library(janitor)
+#library(ggforce)
+#library(vegan)
+#library(htmltools)
+
+colors=cb_friendly_cols(1:15)
+ggplot2::theme_set(theme_prism(base_size = 14))
+opts_chunk[["set"]](
+    cache = F,
+    cache.lazy = FALSE,
+    dev = c("png", "pdf"),
+    error = TRUE,
+    highlight = TRUE,
+    message = FALSE,
+    prompt = FALSE,
+    tidy = FALSE,
+    warning = FALSE,
+    echo = T, 
+    fig.height = 4)
+
+# set seed for reproducibility
+set.seed(1234567890L)
+```
+
+# Overview
+
+-   Project: `r project`
+-   PI: `r PI`
+-   Analyst: `r analyst`
+-   Experiment: `r experiment`
+-   Aim: `r aim`
+
+```{r load_data}
+
+coldata <- load_coldata(coldata_fn, column,
+                        subset_column, subset_value)
+#coldata[[contrasts[[1]][1]]] = relevel(as.factor(coldata[[contrasts[[1]][1]]]), contrasts[[1]][3])
+coldata$sample=row.names(coldata)
+
+counts <- load_counts(counts_fn)
+counts <- counts[,colnames(counts) %in% coldata$sample]
+
+# It's easier for future items if the metadata(genotype) is already set up as a factor
+#coldata$sample_type <- as.factor(coldata$sample_type)
+
+```
+
+
+# WGCNA
+
+WGCNA carries out a correlation network construction. Networks are visual representations of interactions between `nodes` in a system. The nodes in WGCNA are individual genes. So, it helps to visualize patterns and relationships between gene expression profiles to identify genes associated with measured traits as well as identifying genes that are consistently co-expressed and could be contributing to similar molecular pathways. It also accounts for inter-individual variation in gene expression and alleviates issues associated with multiple testing.
+
+We will use DESeq2 to transform our RNA-seq count data before running WGCNA. Removing the low count genes before the WGCNA is recommended and can help improve the WGCNA results. 
+
+For this analysis we are keeping genes with **10 or more reads in total and expressed in at least 3 samples.**
+
+For the data normalization we will use variance stabilizing transformation as recommended by the WGCNA's authors and transpose the dataset to prepare for WGCNA analysis. 
+
+
+# Data
+
+```{r show_coldata}
+coldata %>% sanitize_datatable()
+```
+
+```{r normalize_data}
+dds <- DESeqDataSetFromMatrix(counts, 
+                              colData = coldata, 
+                              design = ~ 1)
+
+## Filtering lowly expressed genes
+# We are filtering out genes with fewer than 10 raw counts in total and are present in fewer than 3 samples.
+keep <- rowSums(counts(dds)>=10) >=4
+dds <- dds[keep, ]
+
+# Retrive the normalized data from the DESeqDataSet
+dds_norm <- vst(dds)
+normalized_counts <- assay(dds_norm) %>%
+  t() # Transpose this data
+#At this point you can look into the data and remove any outlier samples, as those outlier samples can affect the WGCNA results. However, our dataset as analysed earlier doesnot have obvious outliers.
+
+```
+
+
+## Parameter settings for WGCNA
+WGCNA creates a weighted network to define which genes are near each other. The measure of adjacency used is based on the correlation matrix, which requires the definition of a threshold value, which in turn depends on a "power" parameter that defines the exponent used when transforming the correlation values. 
+
+The choice of power parameter will affect the number of modules identified. WGCNA has a function called pickSoftThreshold() to help determine the soft power threshold.
+
+
+```{r, message=FALSE, warning=FALSE,fig.align='center'}
+
+# determine parameters for WGCNA
+sft <- pickSoftThreshold(normalized_counts,
+  dataIsExpr = TRUE,
+  corFnc = cor,
+  networkType = "signed"
+)
+# We have to calculate a measure of the model fit, the signed R2, and make that a new variable.
+
+sft_df <- data.frame(sft$fitIndices) %>%
+  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+
+
+ggplot(sft_df, aes(x = Power, y = model_fit, label = Power)) +
+  geom_point() +
+  geom_text(nudge_y = 0.1) +
+  geom_hline(yintercept = 0.80, col = "red") +
+  ylim(c(min(sft_df$model_fit), 1.1)) +
+  xlab("Soft Threshold (power)") +
+  ylab("Scale Free Topology Model Fit, signed R^2") +
+  ggtitle("Scale independence") +
+  theme_classic()
+```
+
+
+It is recommend to use a power that has an signed R2 above 0.80, otherwise the results may be too noisy to draw any meanings. In case of multiple power values with signed R2 above 0.80, we need to pick the one at an inflection point or where the R2 value seem to have reached their saturation. You want a power that gives a big enough R2 but is not excessively large. 
+
+
+## One-step blockwise module detection:
+
+We will use the blockwiseModules() with power threshold of 20 to find co-expressed genes. It will construct modules with group of genes that are highly correlated.
+
+```{r, message=FALSE, warning=FALSE}
+# picking a soft threshold power near the curve of the plot 
+picked_power = 20  ## pick a power here based on the above plot and instruction
+temp_cor <- cor
+cor <- WGCNA::cor         
+bwnet <- blockwiseModules(normalized_counts,                
+
+                          # == Adjacency Function ==
+                          power = picked_power,                
+                          networkType = "signed", # there is option to run unsigned networktype as well.
+
+                          # == Set Tree and Block parameters ==
+                          deepSplit = 2,
+                          pamRespectsDendro = F,
+                          # detectCutHeight = 0.75,
+                          minModuleSize = params$minModuleSize,
+                          maxBlockSize = params$maxBlockSize,
+
+                          # == Module Adjustments ==
+                          reassignThreshold = 0,
+                          mergeCutHeight = params$mergeCutHeight,
+
+                          # == TOM == Archive the run results in TOM file (saves time)
+                        # saveTOMs = T,
+                        # saveTOMFileBase = "ER",
+
+                          # == Output Options
+                          numericLabels = T,
+                          verbose = F)
+
+# Write down main WGCNA results object to file
+readr::write_rds(bwnet,
+                 file = "WGCNA_results.RDS")
+```
+
+## Eigengene module dataframe
+
+A module eigengene is the standardized gene expression profile for a given module. An eigengene is the gene whose expression is representative of the majority of the genes within a module.
+
+```{r, message=FALSE, warning=FALSE}
+module_eigengens <- bwnet$MEs
+datatable(module_eigengens, options = list(pageLength =10, scrollX=TRUE))
+
+```
+
+
+## Dendogram for the modules
+
+```{r, fig.align='center', fig.width=12, message=FALSE, warning=FALSE}
+# Convert labels to colors for plotting
+mergedColors =labels2colors(bwnet$colors)
+# Plot the dendrogram and the module colors underneath
+plotDendroAndColors(bwnet$dendrograms[[1]],
+                    mergedColors[bwnet$blockGenes[[1]]],
+                    "Module colors",
+                    dendroLabels = FALSE,
+                    hang = 0.03,
+                    addGuide = TRUE,
+                    guideHang = 0.05)
+
+```
+
+
+### Relating modules (Clusters) to geneIds. 
+GeneIds with their module color and number are given on the table below.
+
+```{r, message=FALSE, warning=FALSE}
+# Relating modules (Clusters) to genotypes
+module_df <- data.frame(
+  gene_id = names(bwnet$colors),
+  module_no = bwnet$colors,
+  colors = labels2colors(bwnet$colors)
+  
+)
+datatable(module_df, rownames = FALSE)
+# lets write out the file lising the genes and their modules
+write_delim(module_df, file = "List_of_genes_with_modules.txt", delim = "\t")
+```
+
+
+## Relating modules with genotypes
+
+WGCNA calcualtes and Eigengene (hypothetical central gene) for each module which makes it easier to associate sample groups with module cluster.
+
+```{r, message=FALSE, warning=FALSE}
+# Get Module Eigengens per cluster
+MEs0 <- moduleEigengenes(normalized_counts, mergedColors)$eigengenes
+
+# Reorder modules so similar modules are next to each other
+MEs0 <- orderMEs(MEs0)
+module_order = names(MEs0) %>% gsub("ME","", .)
+
+# Add sample names
+MEs0$samples = row.names(MEs0)
+
+# tidy the data
+mME = MEs0 %>%
+  pivot_longer(-samples) %>%
+  mutate(
+    name = gsub("ME", "", name),
+    name = factor(name, levels = module_order)
+  )
+```
+
+
+```{r, fig.align='center', fig.height=8, message=FALSE, warning=FALSE}
+mME %>% ggplot(., aes(x=samples, y=name, fill=value)) +
+  geom_tile() +
+  theme_bw() +
+  scale_fill_gradient2(
+    low = "blue",
+    high = "red3",
+    mid = "white",
+    midpoint = 0,
+    limit = c(-1,1)) +
+  theme(axis.text.x = element_text(angle=90)) +
+  labs(title = "Module-trait Relationships", y = "Modules", fill="corr")
+
+```
+
+On the heatmap above, samples are on the x-axis and different modules on y-axis.
+We can visually inspect this heatmap to understand how different modules are correlated to different sample groups. 
+
+We can also use another approach to extract only the modules with biggest differences and then correlate those modules for conditions.
+
+## Modules with biggest differences across treatments
+We will create a design matrix using `sample_type` in our metadata and run linear model on each module.
+
+```{r, message=FALSE, warning=FALSE}
+# to confirm our eigengens relate to our metadata from deseq object run the line below
+#all.equal(colnames(dds), rownames(module_eigengens))
+
+# Create a design matrix from the genotype variable
+des_mat <- model.matrix(~ coldata$sample_type)
+
+
+# lmFit() needs a transposed version of the matrix
+fit <- limma::lmFit(t(module_eigengens), design = des_mat)
+
+# Apply empirical Bayes to smooth standard errors
+fit <- limma::eBayes(fit)
+
+# Apply multiple testing correction and obtain stats
+stats_df <- limma::topTable(fit, number = ncol(module_eigengens)) %>%
+  tibble::rownames_to_column("module")
+
+datatable(stats_df, options = list(pageLength =10, scrollX=TRUE))
+```
+
+Modules with highest padj values are the ones have bigger differences between sample_groups. Look for these modules in the heatmap above.
+
diff --git a/inst/templates/rnaseq/WGCNA/params_de-example.R b/inst/templates/rnaseq/WGCNA/params_de-example.R
@@ -0,0 +1,18 @@
+# project params
+date = "YYYYMMDD"
+basedir <- './' # where to write down output files
+
+# params for bcbio
+# coldata_fn = "https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/coldata.csv"
+# counts_fn = 'https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/tximport-counts.csv'
+# se_object=url("https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/rnaseq/bcbio/bcbio-se.rds")
+#
+
+# Example data
+coldata_fn='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/coldata.csv'
+counts_fn=url('https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/star_salmon/salmon.merged.gene_counts.tsv')
+# This folder is in the output directory inside multiqc folder
+multiqc_data_dir='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/devel/rnaseq/nf-core/multiqc/star_salmon/multiqc-report-data/'
+# This file is inside the genome folder in the output directory
+gtf_fn='https://raw.githubusercontent.com/bcbio/bcbioR-test-data/main/devel/nf-core/genome/genome.filtered.gtf.gz'
+se_object = NA