---
title: "scRNAseq data analysis pipeline"
author: "Monica L. Rojas-Pena"
format: html
editor: visual
---
# Non-small cell lung cancer (NSCLC) dissociated tumor cells from 7 donors
Data source: <https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-multiome-v-1-0-chromium-controller-1-standard-2-0-0>
Non-small cell lung cancer (NSCLC) dissociated tumor cells from 7 donors were obtained from Discovery Life Sciences. Cells were labeled with TotalSeq™-B Human TBNK Cocktail (BioLegend). Each donor was CellPlexed and pooled at equal proportions. Viable cells in the pool were identified by 7AAD staining and sorted via FACS.
Gene Expression and CellPlex libraries were generated from \~33,000 cells as described in the Chromium Single Cell 3' Reagent Kits User Guide (v3.1 Chemistry Dual Index) with Feature Barcode technology for Cell Surface Protein and Cell Multiplexing (CG000390 Rev B) using the Chromium X and sequenced on an Illumina NovaSeq 6000 to a read depth of approximately 70,000 mean reads per cell for Gene Expression and 25,000 mean reads per cell for CellPlex.
Paired-end, dual indexing:
- Read 1: 28 cycles (16 bp barcode, 12 bp UMI)
- i5 index: 10 cycles (sample index)
- i7 index: 10 cycles (sample index)
- Read 2: 90 cycles (transcript)
Analysis parameters used: **`expect-cells=20000`**
Pooled multiplexed sample - Key metrics:
- Estimated number of cells: 16,443
- Cells assigned to a sample: 12,231
```{r}
# This is a script that performs the standard workflow steps to analyze single-cell RNA-seq data using the Seurat package
# About the data: 20k mixture of NSCLC DTCs from 7 donors, 3' v3.1
# setwd("~/Documents/Regresando a ser Computational_Biologist/scRNA/data_sets")
# Install packages (uncomment if needed)
# remotes::install_github("satijalab/seurat", "seurat5", quiet = TRUE)
# Install tidyverse from CRAN
# install.packages("tidyverse")
# Load libraries
library(Seurat)
library(tidyverse)
```
Reading the count data
```{r}
# Load the NSCLC dataset
nsclc.sparse.m <- Read10X_h5(filename = '20k_NSCLC_DTC_3p_nextgem_Multiplex_count_raw_feature_bc_matrix.h5')
# Exploring the structure of the file
str(nsclc.sparse.m)
# Extracting only the Gene Expression modality
counts <- nsclc.sparse.m$`Gene Expression`
counts[1:10,1:10]
```
Exploring the dataset, we see the message "Genome matrix has multiple modalities, returning a list of matrices for this genome", meaning Read10X_h5 returns a list of matrices, one per modality. The modalities present are Gene Expression, Antibody Capture, and Multiplexing Capture. We are only interested in the Gene Expression modality, so we save it in the variable counts.
Inspecting the counts object (first 10 rows and first 10 columns), we should see a count matrix stored as a sparse matrix, where the rows are features/genes and the columns are cell barcodes.
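If you want to confirm what was loaded before moving on, a quick optional check of the list returned by Read10X_h5 and of the dimensions of the Gene Expression matrix looks like this:
```{r}
# Optional sanity check: list the modalities returned by Read10X_h5
names(nsclc.sparse.m)
# Dimensions of the Gene Expression matrix (genes x cell barcodes)
dim(counts)
```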
In the next step we create the Seurat object using CreateSeuratObject, where the first argument is our counts matrix and the project name is "NSCLC". We also set min.cells = 3, keeping only features detected in at least 3 cells, and min.features = 200, keeping only cells in which at least 200 features are detected.
```{r}
# Initialize the Seurat object with the raw (non-normalized data).
nsclc.seurat.obj <- CreateSeuratObject(counts = counts, project = "NSCLC", min.cells = 3, min.features = 200)
str(nsclc.seurat.obj)
nsclc.seurat.obj
```
The Seurat Object contains 29552 features across 42081 samples before quality control.
Now that the Seurat object is created, we perform quality control. In this section we filter out low-quality cells. For this we take into account the number of features/genes detected in each cell (nFeature_RNA) and the total number of molecules/transcripts in each cell (nCount_RNA). A low number of genes and a low number of molecules suggests a poor-quality cell; at the other end, a very high number of genes or molecules can indicate a doublet, i.e. multiple cells captured and sequenced together but labeled as a single cell.
```{r}
# Checking what the metadata looks like; the nCount_RNA and nFeature_RNA columns are already calculated
View(nsclc.seurat.obj@meta.data)
```
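Besides viewing the table, a quick numerical summary of these two columns can help when choosing filtering thresholds later; an optional sketch:
```{r}
# Optional: summary statistics of the QC columns already present in the metadata
summary(nsclc.seurat.obj@meta.data[, c("nCount_RNA", "nFeature_RNA")])
```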
We also need to look at the percentage of reads mapping to mitochondrial genes; low-quality or dying cells usually show a higher percentage. We compute it with PercentageFeatureSet.
```{r}
# Quality control
# This step adds another column to the metadata with the percentage of mitochondrial genes (MT reads)
nsclc.seurat.obj[["percent.mt"]] <- PercentageFeatureSet(nsclc.seurat.obj, pattern = "^MT-")
# We can see the new column has been created
View(nsclc.seurat.obj@meta.data)
# Visualization of these features in the metadata
VlnPlot(nsclc.seurat.obj, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# Visualizing the metrics together
FeatureScatter(nsclc.seurat.obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + geom_smooth(method = 'lm')
```
The violin plots show that the number of genes per cell is spread over a wide range (left panel), that many cells have a high number of molecules detected (middle panel), and that many cells have a high percentage of mitochondrial reads (right panel). The latter need to be filtered out, since a high mitochondrial fraction is an indication of low quality.
Looking at these metrics separately can be misleading, so we also need to look at them together.
To look at the metrics together we use FeatureScatter, which plots one metric on each axis: here the number of counts (x axis) against the number of features (y axis). Good-quality cells should have both a good number of genes and a good number of molecules detected.
Good-quality data should roughly follow a straight line; this dataset does, except that some cells plateau a bit. To interpret this figure: low-quality cells show up in the lower right portion of the panel, where only a few genes were captured but sequenced over and over; we do not see that here. Cells in the upper left corner of the panel have a good number of genes but were not sequenced deeply enough; such cells should be inspected further to verify they are not an artifact. In both cases those cells would need to be removed.
Overall this dataset has good quality, but we still need to remove some low-quality cells. We are going to remove cells based on the number of genes detected and on a high mitochondrial percentage.
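Before filtering, it can be useful to quantify how many cells each criterion would flag. This is an optional sketch using the same thresholds applied in the filtering chunk below:
```{r}
# Optional: count how many cells each QC criterion would flag,
# using the same thresholds applied in the filtering step below
md <- nsclc.seurat.obj@meta.data
sum(md$nFeature_RNA <= 200)   # too few genes detected
sum(md$nFeature_RNA >= 2500)  # possible doublets
sum(md$percent.mt >= 5)       # high mitochondrial content
```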
There are additional metrics you can take into account, such as the percentage of ribosomal genes, but this depends on the source of the data being analyzed and on the biological question.
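As a sketch of one such additional metric (not used in this pipeline), the percentage of ribosomal protein genes could be added in the same way as percent.mt; the `^RP[SL]` pattern assumes standard human gene symbols:
```{r}
#| eval: false
# Optional sketch (not run in this pipeline): percentage of ribosomal protein genes,
# assuming human gene symbols starting with RPS/RPL
nsclc.seurat.obj[["percent.ribo"]] <- PercentageFeatureSet(nsclc.seurat.obj, pattern = "^RP[SL]")
VlnPlot(nsclc.seurat.obj, features = "percent.ribo")
```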
For doublets there is a package called DoubletFinder (there are other packages too) that can be used to filter them out; we are not using it in this pipeline.
In the next chunk we keep only the cells with more than 200 and fewer than 2,500 detected genes and with a mitochondrial percentage below 5%. One can try different combinations of these thresholds.
```{r}
# Filtering and updating the Seurat Object
nsclc.seurat.obj <- subset(nsclc.seurat.obj, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
nsclc.seurat.obj
```
After quality control we end up with 29552 features across 24708 samples.
Now, in order to compare gene expression across cells, we normalize the data. For this, the counts of each gene in a cell are divided by the total counts for that cell, multiplied by a scaling factor, and then log-transformed. The Seurat function NormalizeData performs this normalization; the default values are normalization.method = "LogNormalize" and scale.factor = 10000.
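To make the arithmetic concrete, here is a minimal sketch of the LogNormalize calculation on a toy vector of counts for one hypothetical cell (the pipeline itself uses NormalizeData below):
```{r}
# Toy illustration of LogNormalize for a single hypothetical cell with five genes
toy.counts <- c(3, 0, 12, 1, 4)
# Divide by the cell's total counts, multiply by the scale factor, then log-transform (log1p)
log1p(toy.counts / sum(toy.counts) * 10000)
```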
```{r}
# Normalize data
# If we want to change the parameters explicitly, these are the defaults:
# nsclc.seurat.obj <- NormalizeData(nsclc.seurat.obj, normalization.method = "LogNormalize", scale.factor = 10000)
# OR
nsclc.seurat.obj <- NormalizeData(nsclc.seurat.obj)
# The normalization command and its parameters are saved into the Seurat object; further commands are appended to this slot from here on
str(nsclc.seurat.obj)
```
Once the data are normalized, the next step is to identify highly variable features. We want to select a subset of features that exhibit high cell-to-cell variation; it has been shown that focusing on this subset of genes in downstream analysis helps highlight the biological signal in scRNA-seq data. We find these features using the function FindVariableFeatures.
```{r}
# Identify highly variable features (vst is the default method, and 2000 features)
nsclc.seurat.obj <- FindVariableFeatures(nsclc.seurat.obj, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10.variablefeatures <- head(VariableFeatures(nsclc.seurat.obj), 10)
top10.variablefeatures
# Visualizing variable features with and without labels, using VariableFeaturePlot
plot.variablefeatures <- VariableFeaturePlot(nsclc.seurat.obj)
LabelPoints(plot = plot.variablefeatures, points = top10.variablefeatures, repel = TRUE)
```
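If you want to inspect the per-gene statistics behind this selection, Seurat stores them in the object; an optional sketch, left unevaluated here, assuming the vst method used above:
```{r}
#| eval: false
# Optional: per-gene mean, variance, and standardized variance computed by the vst method
head(HVFInfo(nsclc.seurat.obj))
```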
In single-cell data there are many unwanted sources of variation, either technical (e.g. batch) or biological (e.g. cell cycle). We want to account for these so that cells do not cluster based on them, but rather due to true biological effects.
Here we use the function ScaleData, which can regress out such sources of variation; in this case I am simply scaling the data with the default parameters (an optional example of regressing out a covariate follows the next chunk).
After scaling, we perform PCA to reduce the dimensionality of the data and identify sources of heterogeneity.
```{r}
# Scaling
#Using all the genes as features for the scaling step
all.genes <- rownames(nsclc.seurat.obj)
nsclc.seurat.obj <- ScaleData(nsclc.seurat.obj, features = all.genes)
str(nsclc.seurat.obj) #count: raw data, data: log normalized counts, scale.data: scaled data
# Linear dimensionality reduction with PCA, using the variable features identified above
nsclc.seurat.obj <- RunPCA(nsclc.seurat.obj, features = VariableFeatures(object = nsclc.seurat.obj))
# Examine the top 5 features for the first 5 PCs, then visualize PC_1 in a heatmap
print(nsclc.seurat.obj[["pca"]], dims = 1:5, nfeatures = 5)
DimHeatmap(nsclc.seurat.obj, dims = 1, cells = 500, balanced = TRUE)
```
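The chunk above scales with the default settings. If you did want to regress out a specific covariate (for example the mitochondrial percentage computed earlier), ScaleData accepts a vars.to.regress argument; an optional sketch, not run in this pipeline:
```{r}
#| eval: false
# Optional sketch (not run): regress out mitochondrial content while scaling
nsclc.seurat.obj <- ScaleData(nsclc.seurat.obj, features = all.genes, vars.to.regress = "percent.mt")
```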
We can explore the PCA results to choose which PCs to use for downstream analysis.
Now we determine the dimensionality of the data by choosing the PCs that capture the majority of the signal (heterogeneity); only those PCs are considered downstream. The method we use here is the elbow plot: based on the figure, we keep the PCs that explain the highest percentage of the variance.
```{r}
#| label: fig-elbowplot
#| fig-cap: "Elbow plot showing the principal components (PC) on the x axis, ranked by their standard deviation on the y axis."
# determine dimensionality of the data
ElbowPlot(nsclc.seurat.obj)
```
In this case we choose PCs 1 to 15 for downstream analysis, since the variability does not change much beyond that point.
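To back up this choice with numbers, the standard deviation of each PC is stored in the reduction, so the variance explained can be computed directly; a small optional sketch:
```{r}
# Optional: percentage of variance explained by each PC (relative to the PCs computed),
# and the cumulative percentage for the first 20 PCs
pct.var <- nsclc.seurat.obj[["pca"]]@stdev^2 / sum(nsclc.seurat.obj[["pca"]]@stdev^2) * 100
round(cumsum(pct.var)[1:20], 1)
```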
The next step is to cluster similar cells together based on the similarity of their expression patterns. For this we first find the neighbors using the function FindNeighbors, and then find the clusters with FindClusters, choosing a suitable resolution.
```{r}
# Clustering: dims are the PCs we chose (1:15 in this case)
nsclc.seurat.obj <- FindNeighbors(nsclc.seurat.obj, dims = 1:15)
# Now we find the clusters; the resolution controls granularity (lower values give fewer clusters, higher values give more). Here we try a range of values.
nsclc.seurat.obj <- FindClusters(nsclc.seurat.obj, resolution = c(0.1, 0.3, 0.5, 0.7, 1))
View(nsclc.seurat.obj@meta.data)
# Here we are looking at a resolution of 0.1 (larger datasets may require a higher resolution)
DimPlot(nsclc.seurat.obj, group.by = "RNA_snn_res.0.1", label = TRUE)
# Setting the identity of the clusters (we choose a resolution of 0.1)
Idents(nsclc.seurat.obj)
Idents(nsclc.seurat.obj) <- "RNA_snn_res.0.1"
Idents(nsclc.seurat.obj)
```
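A quick way to see how the clustering changes across the resolutions we tried is to tabulate the assignments stored in the metadata; an optional sketch:
```{r}
# Optional: number of cells per cluster at the chosen resolution,
# and the number of clusters found at each resolution tried
table(nsclc.seurat.obj@meta.data$RNA_snn_res.0.1)
sapply(paste0("RNA_snn_res.", c(0.1, 0.3, 0.5, 0.7, 1)),
       function(res) length(unique(nsclc.seurat.obj@meta.data[[res]])))
```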
After the linear dimensionality reduction, the Seurat package offers t-SNE or UMAP for non-linear dimensionality reduction. The goal is to place cells of similar type close together so we can further explore and visualize the data. Here we use UMAP.
```{r}
# non-linear dimensionality reduction
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
nsclc.seurat.obj <- RunUMAP(nsclc.seurat.obj, dims = 1:15)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(nsclc.seurat.obj, reduction = "umap")
```
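As the comment in the chunk above notes, cluster labels can be drawn directly on the plot; a short usage example:
```{r}
# Same UMAP, with the cluster labels (identities set above at resolution 0.1) drawn on the plot
DimPlot(nsclc.seurat.obj, reduction = "umap", label = TRUE)
```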
```{r}
sessionInfo()
```