BulkRNASeq/DEAnalysisTCell.Rmd

---
title: "Differential expression analysis using Mouse T-Regulatory Cell RNA-Seq Data"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  html_document:
    code_folding: show
---

```{r, shorcut, include=FALSE}
## RStudio keyboard shortcut
# Cursor at the beginning of a command line: Ctrl+A
# Cursor at the end of a command line: Ctrl+E
# Clear all the code from your console: Ctrl+L
# Create a pipe operator %>%: Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
# Create an assignment operator <-: Alt+- (Windows) or Option+-(Mac) 
# Knit a document (knitr): Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Comment or uncomment current selection: Ctrl+Shift+C (Windows) or Cmd+Shift+C (Mac)
```


# Load and save images
```{r}
load("2023Feb02RNAseq_DEAnalysis_ALiu.RData")
```

```{r}
image.date <- format(Sys.Date(), "%Y%b%d")
save.image(file = paste0(image.date, "RNAseq_DEAnalysis_ALiu.RData"))
```

[Manual](https://rockefelleruniversity.github.io/RU_RNAseq/)

[Video](https://www.youtube.com/watch?v=SgqDsgaIkLU)


# Tutorial
# Count multiple raw RNAseq FASTQ data
## Use summarizeOverlaps
```{r}
library(Rsamtools)
bamFilesToCount <- c("Sorted_Treg_1.bam", "Sorted_Treg_2.bam",
                     "Sorted_Treg_act_1.bam", "Sorted_Treg_act_2.bam", "Sorted_Treg_act_3.bam")
names(bamFilesToCount) <- c("Sorted_Treg_1","Sorted_Treg_2",
                            "Sorted_Treg_act_1","Sorted_Treg_act_2", "Sorted_Treg_act_3")
myBams <- BamFileList(bamFilesToCount, yieldSize = 10000)

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicAlignments)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
geneCounts <- summarizeOverlaps(geneExons,  myBams, ignore.strand = T)
geneCounts
```


## Parallel counting on the same machine or high performance computing
```{r}
library(BiocParallel)
paramMulti <- MulticoreParam(workers = 4) # Use four cores
# paramSerial <- SerialParam() # No parallelization
register(paramMulti)
```


## Load counts
```{r}
library(SummarizedExperiment)
load("./data/GeneCounts.RData")
class(geneCounts) # RangedSummarizedExperiment
assay(geneCounts)[1:2, ] # The row is Entrez gene ID; the column is sample name
```


## Extract the gene position
```{r} 
rowRanges(geneCounts)[1:2, ] # Take a peek at the position information of the first two genes: 497097 and 19888
```


# Differential expression analysis
```{r}
metaData <- data.frame(Group = c("Naive", "Naive", "Act", "Act", "Act"),
                       row.names = colnames(geneCounts)) # Sample names
metaData
```


## Create a DESeq2 object
### Use DESeqDataSetFromMatrix
```{r}
library(DESeq2)
countMatrix <- assay(geneCounts)
countGRanges <- rowRanges(geneCounts) # Assign ranges to the Entrez gene ID
dds <- DESeqDataSetFromMatrix(countMatrix, 
                              colData = metaData, 
                              design = ~Group, # Compare naive and activated
                              rowRanges = countGRanges)
dds
```


### Use DESeqDataSet
```{r}
colData(geneCounts)$Group <- metaData$Group
geneCounts
dds <- DESeqDataSet(geneCounts, design = ~Group)
dds
```


## Normalization
`DEseq()` normalizes the data by estimating the median expression of genes across all samples to produce a per sample normalization factor (library size). Normalized counts are counts divided by the library scaling factor, accounting for the sequencing depth.


### Retrieve normalized and unnormalized values
```{r}
dds <- DESeq(dds)
# estimating size factors
# estimating dispersions
# gene-wise dispersion estimates
# mean-dispersion relationship
# final dispersion estimates
# fitting model and testing
normCounts <- counts(dds, normalized = T)
normCounts[1:2, ]

oriCounts <- counts(dds, normalized = F)
oriCounts[1:2, ]
```


## Estimate variance
`DEseq()` shrinks the gene expression variance depending on the mean to increase the likelihood of detecting the significant differentially expressed genes with the low biological replicate number.
```{r}
plotDispEsts(dds)
# Black dots: (mean of normalized estimates of gene counts, variance)
# Blue line: adjusted dispersions, which are shrunk to be more similar to each other; the outliers are not shrunk, indicating there may exist the real high dispersion/variance 
```


## Extract the countrast of interest - pairwise comparison
```{r}
# Log-fold change will consider the order - Act and Naive
# ! Typically put the mutant/treatment as the first and the wildtype/control as the second 
myRes <- results(dds, contrast = c("Group", "Act", "Naive"))
myRes
myRes.ordered <- myRes[order(myRes$pvalue), ]
myRes.ordered[1:3, ]

# baseMean: average normalized gene counts across all sample replicates; give an idea if this gene is highly expressed or lowly expressed
# log2FoldChange: log2 of fold change between groups being compared
# pvalue: significance of changes between groups
# padj: p-value corrected for false discovery rate
```


## Global view of differential expression changes
```{r}
summary(myRes.ordered)
# summary(myRes) # equivalent
```


## Visulize the relationship between fold-changes and expression levels
```{r}
plotMA(myRes.ordered) # Blue dots are significant differentially expressed genes; lower counts are hard to call significance, although the variance of lower counts have been shrunk
```


## Downweight genes with high-fold changes but low significance
This allows us to use the `log2 fold-change` as a **measure of significance** of change in our ranking for analysis and in programs such as `GSEA`. 
```{r}
resultsNames(dds) # Extract the coefficient name: "Group_Naive_vs_Act"
myRes.lfc <- lfcShrink(dds, coef = "Group_Naive_vs_Act")
plotMA(myRes.lfc)
```


## Convert DESeqResults objects into a data frame
```{r}
class(myRes.ordered) # DESeqResults
myRes.ordered.DF <- as.data.frame(myRes.ordered)
myRes.ordered.DF[1:2, ]
```


## Show the results of `padj < 0.05` and `log2FoldChange > 2`
```{r}
myRes.DF.s <- with(myRes.DF.new, myRes.DF.new[padj < 0.05 & !is.na(padj) & log2FoldChange > 2, ])
myRes.DF.s[1:3, ]
```


## Link transcript quantification by Salmon to DESeq2
### Load sf data
```{r}
library(tximport)
temp <- read.delim("./data/TReg_2_Quant/quant.sf")
temp[1:3, ] # Data is shown at the transcript level, not the gene level
```


### Map the transcript ID to the gene ID
```{r}
library(TxDb.Mmusculus.UCSC.mm10.knownGene) # Load the annotation database
keytypes(TxDb.Mmusculus.UCSC.mm10.knownGene)
# [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME"
# [5] "GENEID"   "TXID"     "TXNAME"  
Tx2Gene <- AnnotationDbi::select(x = TxDb.Mmusculus.UCSC.mm10.knownGene, 
                  keys = temp["Name"] |> unlist(), 
                  keytype = "TXNAME",
                  columns = c("GENEID", "TXNAME"))
Tx2Gene <- Tx2Gene[!is.na(Tx2Gene$GENEID), ]
Tx2Gene[1:10, ]
```


```{r}
sf.path <- dir(path = "data/", recursive = T, pattern = "quant.sf", full.names = T) # recursive: look for the target file inside the folder inside the data folder; full.names = T provides the full path name
sf.path

salmonCounts <- tximport(files = sf.path, type = "salmon", tx2gene = Tx2Gene)

salmonCounts$abundance[1:2, ] # Show the number of transcripts per million transcripts (TPM); TPM accounts for sequencing depth and gene length
salmonCounts$counts[1:2, ] # Show the estimated counts; used for DESeq2
```


### Build DESeq2 object from Tximport
```{r}
ddsSalmon <- DESeqDataSetFromTximport(txi = salmonCounts, colData = metaData, design = ~Group)
```


### Run DeSeq
```{r}
ddsSalmon <- DESeq(ddsSalmon)
# Extract the countrast of interest - pairwise comparison
myRes.Salmon <- results(ddsSalmon, contrast = c("Group", "Act", "Naive"))
myRes.Salmon.ordered <- myRes.Salmon[order(myRes.Salmon$pvalue), ]
myRes.Salmon.ordered[1:3, ]
```


### Show the results of `padj < 0.05` and `log2FoldChange > 2`
```{r}
myRes.Salmon.s <- with(myRes.Salmon, myRes.Salmon[padj < 0.05 & !is.na(padj) & log2FoldChange > 2, ])
myRes.Salmon.s[1:3, ]
```


### Test the consistency between DEGs detected by `summarizeOverlaps` and `Salmon`
```{r}
library(ggvenn)
ggvenn(data = list(summarizeOverlaps = rownames(myRes.DF.s), 
                   Salmon = rownames(myRes.Salmon.s)))
```


## Add gene names or symbols
```{r}
library(org.Mm.eg.db) # Load mouse organism annotation database; it contains GO term and ENTREZ IDs
eToSym <- select(org.Mm.eg.db,
                 keys = rownames(myRes.DF.s),
                 keytype = "ENTREZID",
                 columns = "SYMBOL")
eToSym[1:6, ]
```


```{r}
myRes.annotated <- merge(x = eToSym, 
                         y = myRes.DF.s,
                         by.x = 1, # We are interested in the first column of eToSym
                         by.y = 0, # We are interested in the rownames of myRes.DF.s
                         all.x = F, # Non-matching cases in eToSym will not appear in the merged dataset
                         all.y = T) # Non-matching cases in myRes.DF.s will appear in the merged dataset
myRes.annotated[1:3, ]
```


# Exercise - work with RNAseq data of adult mouse liver and heart tissues
## Get counts into DESeq2
```{r}
library(tximport)
# sf.path <- dir(path = "data/Salmon_Tissue/", recursive = T, pattern = "quant.sf", full.names = T)
# salmonCounts <- tximport(files = sf.path, type = "salmon", tx2gene = Tx2Gene)

# Load the DESeqDataSet
load("data/geneCounts_Tissue.RData")
colData(geneCounts_Tissue)$Tissue <- factor(c("Heart","Heart","Liver","Liver"))
dds <- DESeqDataSet(geneCounts_Tissue, design = ~Tissue)
```


Add 0.25 to every count and plot a boxplot of the log2 of these updated counts across 4 sample replicates.
```{r}
library(RColorBrewer)
boxplot(log2(counts(dds, normalized = F) + 0.25),
        names = c("Heart1","Heart2","Liver1","Liver2"), col = brewer.pal(n = 4, name = "Paired"))
```


## Normalize counts
Run the DEseq workflow function and retrieve normalized counts. Add 0.25 to every normalized count and plot a boxplot of the log2 of these updated counts.
```{r}
dds.tissue <- DESeq(dds)
boxplot(log2(counts(dds.tissue, normalized = T) + 0.25),
        names = c("Heart1","Heart2","Liver1","Liver2"), col = brewer.pal(n = 4, name = "Accent"))
```


## Dispersion
Plot the dispersion fit and estimate the variance for our DEseq2 object.
```{r}
plotDispEsts(dds.tissue)
```


## Multiple testing
Add a new `padj` value for all genes. Make a barplot of the number of significantly (`padj` < 0.05) up and down regulated genes by the original `padj` values and our new `padj` values.
```{r}
library(ggplot2)
myRes.DF <- results(dds.tissue, contrast = c("Tissue", "Heart", "Liver")) |> as.data.frame()
myRes.DF$newPadj <- p.adjust(myRes.DF$pvalue, method = "BH")
myRes.DF[1:6, ]
myRes.DF <- within(myRes.DF, {
  newPadj <- ifelse(newPadj < 0.05 & log2FoldChange > 0, "HeartUp",
                       ifelse(newPadj < 0.05 & log2FoldChange < 0, "HeartDown", "NoChange"))
  oriPadj <- ifelse(padj < 0.05 & log2FoldChange > 0, "HeartUp",
                    ifelse(padj < 0.05 & log2FoldChange < 0, "HeartDown", "NoChange"))
  })

data.frame(Method = rep(c("All", "Filtered"), each = nrow(myRes.DF)), padj = c(myRes.DF$newPadj, myRes.DF$oriPadj)) |>
  ggplot(mapping = aes(x = padj, fill = Method)) + 
  geom_bar(position = "dodge") + 
  scale_x_discrete(limits = c("NoChange", "HeartDown", "HeartUp", NA)) + 
  theme_bw()
```


## MA plots
Produce an MA plot before and after logFC shrinkage.
```{r}
myRes <- results(dds.tissue, contrast = c("Tissue", "Heart", "Liver"))
plotMA(myRes) 
```


```{r}
resultsNames(dds.tissue) 
myRes.lfc <- lfcShrink(dds.tissue, coef = "Tissue_Liver_vs_Heart")
plotMA(myRes.lfc)
```