v2.1.Integration.UMAP.Rmd

---
title: "v2 Integrating Data and Generating UMAP"
author: "D. Ford Hannum Jr."
date: "8/24/2020"
output: 
        html_document:
                toc: true
                toc_depth: 3
                number_sections: false
                theme: united
                highlight: tango
---

```{r setup, include=TRUE, message=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
library(Seurat)
library(ggplot2)
library(data.table)
library(MAST)
library(SingleR)
library(dplyr)
library(tidyr)
library(limma)
```

```{r printing session info, include = T}
sessionInfo()
```

# Introduction

In v2 of the analysis we decided to include the control mice from the Nbeal experiment with the Migr1 and Mpl mice. The thought is that it may be good to have another control, since the Migr1 control has irradiated and had a bone marrow transplantation. I'm going to split the Rmarkdown files into separate part, to better organize my analysis.

## This File

Here the control from the Nbeal and the data from the Migr1/Mpl experiment are going to be integrated together and we are going to do all the normalization, scaling, and data reduction. I'm following the [integration guide](https://satijalab.org/seurat/v3.2/integration.html) from the Satija Lab and some of my previous files (integrating_with_hwbm_to...)

# Loading the data and getting HTO labels

Loading the raw matrices supplied by 10X for the two experiment and find the cells that have an HTO

## Migr1/Mpl Experiment

```{r loading data from 10X migr1/mpl}
wbm.data <- Read10X(data.dir = './data/filtered_feature_bc_matrix//')

# Creating Seurat Object

wbm <- CreateSeuratObject(counts = wbm.data, project = 'Mpl', min.cells = 3, min.features = 200)

# Reading in the HTO labels provided by Dr. Brian Parkin

htos <- read.csv('./data/HTOs.csv')

#dim(htos)

# Removing all the cells without a label

htos <- htos[htos$HTOs != '',]

#dim(htos)

# Adding the conditioni from the HTO tagged, from an e-mail from Priya

htos$condition <- ifelse(htos$HTOs == 'HTO-1', 'Mpl',
                         ifelse(htos$HTOs == 'HTO-2', 'enrMpl',
                                ifelse(htos$HTOs == 'HTO-3', 'Migr1','enrMigr1')))

# Subsetting the two data frames so they only include cells that overlap

wbm <- wbm[,colnames(wbm) %in% htos$Barcode]

htos <- htos[htos$Barcode %in% colnames(wbm),]

# Making sure the cell order is maintained between the two dataframes, so I can
# just add the condition to the meta data

#summary(rownames(wbm@meta.data) == htos$Barcode)

# Adding the condition to the meta data

wbm@meta.data$Condition <- htos$condition

```

We started out with **11,278** cells in the experiment. After a basic filtering when creating the Seurat object (minimum of 200 features, also filtered for genes that were expressed in a minimum of 3 cells) we had **10,157** cells. There were only **7,542** cells that had an HTO label, of which **7,228** passed the basic Seurat filtering.

```{r setting cutoffs for cell inclusion migr1/mpl}
# Adding a variable for percentage of features coming from mitochondrial genes

wbm[['percent.mt']] <- PercentageFeatureSet(wbm, pattern = '^mt')

ggplot(data = wbm@meta.data, aes(x = nCount_RNA)) + geom_density() +
        geom_vline(xintercept = 50000, col = 'red', linetype = 2) +
        theme_bw()

ggplot(data = wbm@meta.data, aes(x = nFeature_RNA)) + geom_density() +
        geom_vline(xintercept = c(500,5000), col = 'red', linetype = 2) +
        theme_bw()

ggplot(data = wbm@meta.data, aes(x = percent.mt)) + geom_density() +
        geom_vline(xintercept = 10, col = 'red', linetype = 2) +
        theme_bw()

#summary(wbm[['nCount_RNA']] < 50000)
#summary(wbm[['nFeature_RNA']] > 500 & wbm[['nFeature_RNA']] < 5000)
#summary(wbm[['percent.mt']] > 10)

# nCount_RNA.passed.cells <- colnames(wbm[,wbm[['nCount_RNA']] < 50000])
# nFeature_RNA.passed.cells <- colnames(wbm[,wbm[['nFeature_RNA']] < 5000 & wbm[['nFeature_RNA']] > 500])
# mt.percent.passed.cells <- colnames(wbm[,wbm[['percent.mt']] < 10])

# Creating a Venn diagram to illustrate this

#head(wbm@meta.data)
wbm[['nCount_RNA.passed']] <- wbm[['nCount_RNA']] < 50000
wbm[['nFeature_RNA.passed']] <- wbm[['nFeature_RNA']] < 5000 & wbm[['nFeature_RNA']] > 500
wbm[['percent.mt.passed']] <- wbm[['percent.mt']] < 10

venn.counts <- vennCounts(wbm@meta.data[,c('nCount_RNA.passed','nFeature_RNA.passed','percent.mt.passed')])

vennDiagram(venn.counts)

# Lesser cutoff suggested by Jun to see where those other cells lie
less.subset.wbm <- subset(wbm, subset = nFeature_RNA > 500 & percent.mt < 10)

wbm <- subset(wbm, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 &
                    percent.mt < 10 & nCount_RNA < 50000)
#dim(wbm)
``` 

Dashed lines indicate cutoffs I used for subsetting the Seurat dataset. We end up with **6,121** cells for the Migr1/Mpl experiment. Venn diagram is just for our use, and can be made nicer if we wish to publish something like that.

## Nbeal Controls

```{r nbeal data}
cnt.data <- Read10X(data.dir = './data/Experiment2/filtered_feature_bc_matrix/')

cnt <- CreateSeuratObject(counts = cnt.data, project = 'Nbeal', min.cells = 3, min.features = 200)

# Getting the HTOs

nbeal_hto <- read.table('./data/Experiment2/hto_labels.txt')

nbeal_hto <- nbeal_hto[nbeal_hto$V2 %in% c('HTO3','HTO4'),]

nbeal_hto$condition <- ifelse(nbeal_hto$V2 == 'HTO3', 'Nbeal_cntrl', 'enrNbeal_cntrl')

nbeal_hto$cell <- paste0(nbeal_hto$V1, '-1')

# summary(nbeal_hto$cell %in% colnames(cnt))

cnt <- cnt[,colnames(cnt) %in% nbeal_hto$cell]

# Making sure the cell order is maintained between the two dataframes, so I can
# just add the condition to the meta data

#summary(rownames(wbm@meta.data) == htos$Barcode)

# Adding the condition to the meta data

cnt@meta.data$Condition <- nbeal_hto$condition

```

We started **6,376** that passed basic Seurat filtering, and **1,991* were labeled with either the HTO3 or HTO4 tag, which indicates they came from the control mouse. 

```{r filtering steps for nbeal_cnt}
# Adding a variable for percentage of features coming from mitochondrial genes

cnt[['percent.mt']] <- PercentageFeatureSet(cnt, pattern = '^mt')

ggplot(data = cnt@meta.data, aes(x = nCount_RNA)) + geom_density() +
        geom_vline(xintercept = 50000, col = 'red', linetype = 2) +
        theme_bw()

ggplot(data = cnt@meta.data, aes(x = nFeature_RNA)) + geom_density() +
        geom_vline(xintercept = c(500,5000), col = 'red', linetype = 2) +
        theme_bw()

ggplot(data = cnt@meta.data, aes(x = percent.mt)) + geom_density() +
        geom_vline(xintercept = 10, col = 'red', linetype = 2) +
        theme_bw()

# Creating a Venn Diagram

cnt[['nCount_RNA.passed']] <- cnt[['nCount_RNA']] < 50000
cnt[['nFeature_RNA.passed']] <- cnt[['nFeature_RNA']] < 5000 & cnt[['nFeature_RNA']] > 500
cnt[['percent.mt.passed']] <- cnt[['percent.mt']] < 10

venn.counts <- vennCounts(cnt@meta.data[,c('nCount_RNA.passed','nFeature_RNA.passed','percent.mt.passed')])

vennDiagram(venn.counts)

table(cnt$Condition)
table(cnt$nFeature_RNA > 500)
table(cnt$percent.mt.passed)

less.cnt <- subset(cnt, subset = nFeature_RNA > 500 & percent.mt < 10)

table(less.cnt$Condition)

cnt <- subset(cnt, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 &
                    percent.mt < 10 & nCount_RNA < 50000)
```

We end up with **1,754** control cells from the Nbeal experiment (normal and enriched).

# Integrating the datasets

```{r integration steps}
comb <- merge(x = wbm, y = cnt, 
              add.cell.ids = c('Mpl','Nbeal'),
              project = "WBM_combo",
              merge.data = T)

comb.list <- SplitObject(comb, split.by = 'orig.ident')


for(i in 1:length(comb.list)){
        
        # Normalizing the data separately for each experiment
        comb.list[[i]] <- NormalizeData(comb.list[[i]], verbose = F)
        
        # Finding hte top 2000 most variable features in each experiment
        comb.list[[i]] <- FindVariableFeatures(comb.list[[i]], 
                                               selection.method = 'vst',
                                               nfeatures = 2000,
                                               verbose = F)
}

ref.list <- comb.list

# This may be something I should play around with...

comb.anchors <- FindIntegrationAnchors(object.list = ref.list, dims = 1:30)

comb.int <- IntegrateData(anchorset = comb.anchors, dims = 1:30)

DefaultAssay(comb.int) <- 'integrated'

comb.int <- ScaleData(comb.int, verbose = F)
comb.int <- RunPCA(comb.int, npcs = 30, verbose = F)

ElbowPlot(comb.int, ndims = 30) # Going with 20
print("Going with the first 20 PCs")

comb.int <- FindNeighbors(comb.int, dims = 1:20)

# Looking at multiple clustering resolutions
res <- seq(0,1, by = .05)
cluster_list <- c()

for (i in res){
        # Finding clusters for resolution i
        clusters <- FindClusters(comb.int, resolution = i)
        
        # Counting the number of clusters
        clusters <- length(levels(clusters$seurat_clusters))
        
        # Appending the number of clusters to a list
        cluster_list <- c(cluster_list,clusters)
}

plot(cluster_list)

print(paste0("Going with index 7 = ", res[7]))

comb.int <- FindClusters(comb.int, resolution = 0.3)

comb.int <- RunUMAP(comb.int, dims = 1:20)

DimPlot(comb.int, reduction = 'umap')
DimPlot(comb.int, reduction = 'umap', split.by = 'orig.ident')
DimPlot(comb.int, reduction = 'umap', group.by = 'orig.ident')
DimPlot(comb.int, reduction = 'umap', group.by = 'Condition')

#head(comb.int@meta.data)
comb.int$enriched <- grepl('enr', comb.int@meta.data$Condition)

DimPlot(comb.int, reduction = 'umap', group.by = 'enriched') +
        ggtitle('Enriched Cells')

#saveRDS(comb.int, './data/v2/combined.integrated.rds')
```

```{r integration steps lesser}
comb2 <- merge(x = less.subset.wbm, y = less.cnt, 
              add.cell.ids = c('Mpl','Nbeal'),
              project = "WBM_combo",
              merge.data = T)

comb2.list <- SplitObject(comb2, split.by = 'orig.ident')


for(i in 1:length(comb2.list)){
        
        # Normalizing the data separately for each experiment
        comb2.list[[i]] <- NormalizeData(comb2.list[[i]], verbose = F)
        
        # Finding hte top 2000 most variable features in each experiment
        comb2.list[[i]] <- FindVariableFeatures(comb2.list[[i]], 
                                               selection.method = 'vst',
                                               nfeatures = 2000,
                                               verbose = F)
}

ref.list <- comb2.list

# This may be something I should play around with...

comb2.anchors <- FindIntegrationAnchors(object.list = ref.list, dims = 1:30)

comb2.int <- IntegrateData(anchorset = comb2.anchors, dims = 1:30)

DefaultAssay(comb2.int) <- 'integrated'

comb2.int <- ScaleData(comb2.int, verbose = F)
comb2.int <- RunPCA(comb2.int, npcs = 30, verbose = F)

ElbowPlot(comb2.int, ndims = 30) # Going with 20
print("Going with the first 20 PCs")

comb2.int <- FindNeighbors(comb2.int, dims = 1:20)

# Looking at multiple clustering resolutions
res <- seq(0,1, by = .05)
cluster_list <- c()

for (i in res){
        # Finding clusters for resolution i
        clusters <- FindClusters(comb2.int, resolution = i)
        
        # Counting the number of clusters
        clusters <- length(levels(clusters$seurat_clusters))
        
        # Appending the number of clusters to a list
        cluster_list <- c(cluster_list,clusters)
}

plot(cluster_list)

print(paste0("Going with index 7 = ", res[7]))

comb2.int <- FindClusters(comb2.int, resolution = 0.3)

comb2.int <- RunUMAP(comb2.int, dims = 1:20)

DimPlot(comb2.int, reduction = 'umap')
DimPlot(comb2.int, reduction = 'umap', split.by = 'orig.ident')
DimPlot(comb2.int, reduction = 'umap', group.by = 'orig.ident')
DimPlot(comb2.int, reduction = 'umap', group.by = 'Condition')

#head(comb2.int@meta.data)
comb2.int$enriched <- grepl('enr', comb2.int@meta.data$Condition)

DimPlot(comb2.int, reduction = 'umap', group.by = 'enriched') +
        ggtitle('Enriched Cells')

table(comb2.int$Condition)
table(comb2.int$orig.ident)

#saveRDS(comb2.int, './data/v2/lesser.combined.integrated.rds')
```