Intro-to-scRNAseq/lessons/seurat_cheatsheet.Rmd at master · yana-kel/Intro-to-scRNAseq · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
---
title: "Seurat Cheatsheet"
output:
   html_document:
      code_folding: show
      df_print: paged
      highlights: pygments
      number_sections: true
      self_contained: true
      theme: default
      toc: true
      toc_float:
         collapsed: true
         smooth_scroll: true
---

This Seurat cheatsheet is meant to provide a summary of the different functionalities of Seurat. This includes how to access certain pieces of information, handy function, and visualization functions built into the package. We have pulled together all of this information with examples using the dataset used throughout this workshop so that there are clear visuals on what the use case of each function is.

These materials were developed by referencing the following pages from the Seurat website:

```{r warning=FALSE, message=FALSE}
library(Seurat)
library(tidyverse)
```

# Dataset

```{r}
load(bzfile("data/additional_data/seurat_integrated.RData.bz2"))
```

Based on the following resources from the seurat webpage:

- https://satijalab.org/seurat/articles/essential_commands.html
- https://satijalab.org/seurat/articles/visualization_vignette.html


# Dataset

Load in the Seurat `integrated_seurat` object that is available in the `addtional_data` folder from [data](https://www.dropbox.com/s/vop78wq76h02a2f/single_cell_rnaseq.zip?dl=1) provided for the workshow.

```{r warning=FALSE, message=FALSE}
library(Seurat)
library(tidyverse)

load(bzfile("data/additional_data/seurat_integrated.RData.bz2"))
```


# Basic information on your seurat object

## Cell barcodes

Within seurat, there are multiple different ways to access the cell barcode IDs. If you recall, seurat stores your count matrix as cells (columns) x genes (rows).

Therefore we can call the `colnames()` function to get a vector of cell barcodes in the same order as they appear in the seurat object.

Recall that when we merged the stim and ctrl datasets into seurat, we specified that we wanted a "stim_" or "ctrl_" prefix before each cell barcode.

```{r}
colnames(seurat_integrated) %>% head()
```

Similarly we can call the `Cells()` function within seurat to get the same output.

```{r}
Cells(seurat_integrated) %>% head()
```

It is **very important** that the values stores in `Cells()` is the same as the rownames in your `meta.data` object or seurat will start throwing errors at you!

```{r}
all(rownames(seurat_integrated@meta.data) == Cells(seurat_integrated))
```


## Features/genes

Now we want to be able to access the rows, or genes, in our seurat object. Rather than calling these values "genes", many tools will call them "features" as different assays (CITE-seq, ATAC-seq) provide alternative information that genes as output.

So we can use the `Features()` function to get a vector of all features/genes in our dataset in the same order as it appears in the seurat object.

```{r}
Features(seurat_integrated) %>% head()
```

The `rownames()` function provides the same output.

```{r}
rownames(seurat_integrated) %>% head()
```

## Number of cells and genes/features

We can access the number of cells by using the `ncol()` function.

```{r}
ncol(seurat_integrated)
```

Similarly, we get the number of features with the `nrow()` function.

```{r}
nrow(seurat_integrated)
```

The `dim()` function provides both the number of cells and genes for the **default assay**. Here we see the number of features following by the number of cells.

```{r}
dim(seurat_integrated)
```

***

**Exercise**

Show code for how you could view the last 5 cells barcodes and the last 5 genes in the integrated seurat object.

***

# Idents

In Seurat, each cell can have a label which can be accessed using the `Idents()` function. These are the default labels used for each cell and are used interally by Seurat plotting functions.

Common information set as the identity for cells include: clusters (as in our example dataset), celltype, sample, etc. You'll notice that identities are automatically stored as factors, which means we can re-organize the levels at any point to change their order for plotting purposes.

```{r}
Idents(seurat_integrated) %>% head()
```

## Rename Idents

To quickly make modifications to identities, you can use the `RenameIdents()` function with identities mapped to new values to change the identities. This is particularly helpful when annotating your cells from clusters to celltypes as showcased here. Bear in mind that these new identities are not stores in the `@meta.data` slot. We recommend adding these identities as a new column in the seurat object to keep track of it for future use.

```{r}
# Rename all identities
seurat_integrated <- RenameIdents(object = seurat_integrated,
                               "0" = "Naive or memory CD4+ T cells",
                               "1" = "CD14+ monocytes",
                               "2" = "Activated T cells",
                               "3" = "CD14+ monocytes",
                               "4" = "Stressed cells / Unknown",
                               "5" = "CD8+ T cells",
                               "6" = "Naive or memory CD4+ T cells",
                               "7" = "B cells",
                               "8" = "NK cells",
                               "9" = "CD8+ T cells",
                               "10" = "FCGR3A+ monocytes",
                               "11" = "B cells",
                               "12" = "NK cells",
                               "13" = "B cells",
                               "14" = "Conventional dendritic cells",
                               "15" = "Megakaryocytes",
			       "16" = "Plasmacytoid dendritic cells")

# These new celltype values are only stored in the idents
# So good practice is to store these changes in a column
seurat_integrated$celltype <- Idents(seurat_integrated)
```


***

**Exercise**

Show code for how you could view the last 5 identities for the cells in the integrated seurat object.

***

# Highly variable features

## Accessing variable features

To get a vector of all highly variable genes that were selected after running `FindVariableFeatures()`, we can use the `VariableFeatures()` function.

```{r}
VariableFeatures(seurat_integrated) %>% head()
```

## Setting variable features

Using the same `VariableFeatures()` function, we can set our own custom set of genes as our highly variable genes.

For example, maybe you want to omit mitochondrial genes from your list of variable genes.

```{r}
# Get list of all variable genes
# Remove variable genes that start with MT-
var_genes <- VariableFeatures(seurat_integrated)
var_genes <- var_genes[!startsWith(var_genes, "MT-")]

# Now we set our vector of gene names back to VariableFeatures()
VariableFeatures(seurat_integrated) <- var_genes
```

***

**Exercise**

Show code for how you could view the last 5 least variable genes in the integrated seurat object.

***

# Assays and layers

## Assays

Within a seurat object you can have multiple "assays". Each assay contains its own count matrix that is separate from the other assays in the object. This structure was created with multimodal datasets in mind so we can store, for example, ATAC peaks within the same seurat object as your RNA counts.

SCTransform also makes use of these assays to store the normalized matrix in a separate assay called "SCT".

To access the list of assays in your seurat object, you can call `@assays`.

```{r}
seurat_integrated@assays
```

We can additionally see which of the assays in our dataset is set as the default with the `DefaultAssays()` function. This is helpful information to know which counts matrix is being accessed when we use other seurat functions by default.

```{r}
DefaultAssay(seurat_integrated)
```

Here we can see that the default assay is set to "integrated". If we instead wanted to primarily use the RNA counts, we can set a new default by once again calling the `DefaultAssay()` function.

```{r}
# Set new default assay
DefaultAssay(seurat_integrated) <- "RNA"

# Print out the new default to see if it changed
DefaultAssay(seurat_integrated)
```

We can access each assay as if the seurat object was a named list with double brackets:

```{r}
seurat_integrated[["SCT"]]
```

And similarly run any function on it:

```{r}
dim(seurat_integrated[["integrated"]])
```

***

**Exercise**

What are the dimensions for each assay in the integrated seurat object?

***

## Layers

Layers are different counts matrices that you can access within each assay (prior to Seurat version 5, this feature was known as "slots").

Following the standard seurat workflow should give you the following matrices:

- counts (raw counts matrix)
- data (normalized count matrix (generated after `SCTransform()` or `NormalizeData()`))
- scaled.data (output from the `ScaleData()`)

We can see which layers are accessible with the `Layers()` function.

```{r}
Layers(seurat_integrated[["RNA"]])
```

In this object we can see that we do not have the scaled.data layer currently. So if we run `ScaleData()` we will be able to access this layer/matrix.

```{r}
seurat_integrated <- ScaleData(seurat_integrated)
Layers(seurat_integrated)
```

## Accessing full count matrix

You can grab the entire counts matrix by making use of the `LayerData()` function.

```{r}
# Subsetting to the first 5 genes and cells to easy viewing
LayerData(seurat_integrated, assay="RNA", layer="counts")[1:5, 1:5]
```

***

**Exercise**

Show the code to get the entire SCT normalized (data) count matrix.

***

## Accessing specific features and metadata

The `FetchData()` function is useful to directly accessing the count for a particular feature of each cell in your object or a single metadata column. You can also specify the layer and assay to specify which piece of information you want.

```{r}
# Normalized counts for the gene PTPRC in the assay SCT
FetchData(seurat_integrated, vars=c("PTPRC", "sample"), assay="SCT", layer="data") %>% head()
```

Conveniently, you can also get information from multiple assays at the same time. To do so, you prepend the assay name (in lowercase format) for the feature you supply to the `FetchData()` function.

```{r}
# Grab the normalized counts in the integrated and RNA assays
FetchData(seurat_integrated, vars=c("rna_PTPRC", "integrated_PTPRC"), layer="data") %>% head()
```

***

**Exercise**

Show how you would use the `FetchData()` function to generate a dataframe of UMAP_1, UMAP_2, and sample values for each cell.

***

# Accessing dimensional reductions

## PCA

The scores for each PC is stored within the embeddings slot of the seurat object. These can be accessed by uisng the `Embeddings()` function.

```{r}
# Alternative method of accessing PCA values
# seurat_integrated[['pca']]@cell.embeddings

Embeddings(seurat_integrated, reduction="pca")[1:5, 1:5]
```

The weight (loadings) for each feature is also stored and can be accessed with `Loadings()` function.
```{r}
# pbmc[['pca]]@feature.loadings
Loadings(seurat_integrated, reduction="pca")[1:5, 1:5]
```

We can also view more information about the top PCs, like the genes that are most strongly correlated with the first few PCs with the `ProjectDim()` function.

```{r}
ProjectDim(seurat_integrated, reduction="pca")
```

## UMAP/tSNE

To access the coordinates used for UMAP/tSNE plots, we specify the reduction of interest in the `Embeddings()` function.

```{r}
# seurat_integrated[['umap']]@cell.embeddings
Embeddings(seurat_integrated, reduction="umap") %>% head()
```


# Data visualization

Underneath the hood, all of Seurat's plotting functions make use of ggplot which means that we can add more details to our plots using ggplot functions.

## DimPlot

The `DimPlot()` function allows us to visualize metadata that is categorical on different reductions (PCA, UMAP).

By default `DimPlot()` will color cells by the `Idents()` and use UMAP as the default reduction.

```{r}
DimPlot(seurat_integrated) + ggtitle("Seurat clusters")
```

We can specify a different metadata column using the `group.by` argument

```{r}
DimPlot(seurat_integrated, group.by="sample")
```

We can also use the `split.by` argument to create multiple plots that only show cells that have the same value for the metadata column specified.

```{r fig.width=11}
DimPlot(seurat_integrated, split.by="sample", group.by="Phase")
```


## FeaturePlot

The `DimPlot()` function allows us to visualize both metadata and features that are continuous on different reductions (PCA, UMAP).

```{r fig.width=12}
FeaturePlot(seurat_integrated, features = c("FCGR3A", "MS4A7"))
```

We can additionally `order` the values in a way that cells with higher values are shown in front (to avoid other cells drowning out the).

To identify cells that show the highest expression of a feature, we can set a `min.cutoff` based upon quantiles, where cells below the the threshold will show no expression.

```{r fig.width=12}
FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = c("FCGR3A", "MS4A7"),
            order = TRUE,
            min.cutoff = 'q10')
```

We can also add labels onto our UMAP to easily identify which groups of cells we are seeing the expression using the `LabelClusters()` function. The parameters show here put a white background behind the text to make it easier to see the labels.

```{r}
Idents(seurat_integrated) <- "integrated_snn_res.0.8"
p <- FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = "FCGR3A",
            order = TRUE,
            min.cutoff = 'q10')
LabelClusters(p, id = "ident",  fontface = "bold", size = 3, bg.colour = "white", bg.r = .2, force = 0)
```

## FeatureScatter

`FeatureScatter()` creates a scatterplot of expression values for two features with each cell being colored by the ident. Bear in mind that you can also specify a continuous metadata column and not just 2 genes/features.

```{r}
Idents(seurat_integrated) <- "celltype"
FeatureScatter(seurat_integrated, feature1 = "MT-ND5", feature2 = "mitoRatio") + ggtitle("MitoRatio vs MT-ND5 expression")
```

## CellScatter

To visualize the differences between two specific cells, you can use the `CellScatter()` function to get a scatterplot of values for each feature in both cells.

```{r}
cell1 <- Cells(seurat_integrated)[1]
cell2 <- Cells(seurat_integrated)[2]

# Here we can see th emetadata for the first two cells in teh dataset
# We are comparing "Activated T cell" vs "CD14+ monocytes" (so they should be very different)
seurat_integrated@meta.data %>% subset(cells %in% c(cell1, cell2)) %>% select(sample, celltype)
```

```{r}
CellScatter(seurat_integrated, cell1=cell1, cell2=cell2)
```

## VlnPlot

We can create violin plot to compare the distribution of gene expression across different populations using the `VlnPlot()` function.

This is a very customizable function, with many parameters to customize the look of the plots.

```{r}
VlnPlot(seurat_integrated, c("CD14", "CD79A"))
```

In this example, I am grouping expression by sample, showing 2 plots per column, and removing the points (cells) by setting their size to 0.

```{r}
VlnPlot(seurat_integrated, c("IFIT1", "CD53", "CD52", "CXCL8"), group.by="sample", ncol=2, pt.size=0)
```

## RidgePlot

Ridge plots are most commonly used with CITE-seq or hashtagged dataset as they provide an easy way to identify cells that express a protein/antibody.

For our scRNA dataset, when we call `RidgePlot()`, on the y-axis we see the unique identities assigned for each cell. The x-axis shows us the expression level for whichever feature we chose. This is a great visualization to use when justifying annotation decisions.

```{r}
RidgePlot(seurat_integrated, "CD3D", assay="RNA") + NoLegend()
```

## DimHeatmap

To see the effect of genes on the principal component, we can see the top and bottom features in PC1 using the `DimHeatmap()` function.

```{r, warning=FALSE}
DimHeatmap(seurat_integrated, nfeatures = 10)
```

## DoHeatmap

To plot the expression values for genes across all cells (grouped by their identity), you can call Seurat's `DoHeatmap()` function to identify which populations certain genes are lowly or hihgly expressed.

```{r fig.width=10}
Idents(seurat_integrated) <- "celltype"
DoHeatmap(seurat_integrated, features=c("CD14", "FCGR3A", "FCER1A", "IL3RA", "CD79A", "CD3D"))
```


## DotPlot

Seurat also has a built in visualization tool which allows us to view the average expression of genes groups clusters called `DotPlot()`. The size of the circle represents the number of cells that express the gene within a group and the hue reprents the average expression of the feature.

If you supply a named list with labels annotating genes, those labels will appear at the top of the plot for easier visualization.

```{r fig.width=15, warning=FALSE}
# List of known celltype markers
markers <- list()
markers[["CD14+ monocytes"]] <- c("CD14", "LYZ")
markers[["FCGR3A+ monocyte"]] <- c("FCGR3A", "MS4A7")
markers[["Macrophages"]] <- c("MARCO", "ITGAM", "ADGRE1")
markers[["Conventional dendritic"]] <- c("FCER1A", "CST3")
markers[["Plasmacytoid dendritic"]] <- c("IL3RA", "GZMB", "SERPINF1", "ITM2C")

# Create dotplot based on RNA expression
DotPlot(seurat_integrated, markers, assay="RNA", group.by = "integrated_snn_res.0.8")
```