single_cell_RNA/Seurat4.Rmd at master · the8thday/single_cell_RNA · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
---
title: "Seurat4"
author: "liuc"
date: '2022-04-12'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

> https://www.singlecellcourse.org/index.html
> https://github.com/hbctraining/scRNA-seq_online
> https://hbctraining.github.io/scRNA-seq_online/schedule/links-to-lessons.html

完整而简略的单细胞转录组分析流程，对于多样本数据集。
假设实验组和对照组各四例样本。


```{r, include=FALSE}
library(tidyverse)
library(patchwork)

library(SingleCellExperiment)
library(Seurat)
library(sctransform)
library(clustree)
```


## Complete Seurat4 pipeline

1. 首先是对10X数据的整合

在10X数据下机后，还需要·scrublet·进行doublet detection ·soupX·等进行ambient RNA correction等质控的过程。
不过对于doublet的检测，似乎也不是很着急，以下笔记仅从测试数据开始。

此示例样本包括8个PBMC样本，其中4个为control组，四个为stim刺激过后的样本。

本示例中没有按照每个样本进行分析，而是将每个分组的样本pool在一起，注意这只是一个测试。

NOTE: You should always work with non-pooled samples from the beginning of the scRNA-seq workflow, if possible.

```{r}
# 除merge函数外，还可以通过目录向量进行合并，此处选择merge函数
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("./datasets/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data,
                                         min.features = 100,
                                         project = file)
        assign(file, seurat_obj)
}

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix,
                       y = stim_raw_feature_bc_matrix,
                       add.cell.id = c("ctrl", "stim"))
```

2. 对Seurat对象的metadata增加所需的样本信息

已知的meta信息

```{r}
# Check the metadata in the new Seurat objects
head(merged_seurat@meta.data)

merged_seurat@meta.data$cells <- rownames(merged_seurat@meta.data)

# Create sample column
merged_seurat@meta.data$sample <- NA
merged_seurat@meta.data$sample[which(str_detect(merged_seurat@meta.data$cells, "^ctrl_"))] <- "ctrl"
merged_seurat@meta.data$sample[which(str_detect(merged_seurat@meta.data$cells, "^stim_"))] <- "stim"

```

计算后的一些meta信息

```{r}
# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100


# Create .RData object to load at any time
save(merged_seurat, file="datasets//merged_filtered_seurat.RData")
```


3. 对metadata的metrix进行绘图，并确定质控阈值

```{r}
metadata <- merged_seurat@meta.data

# Visualize the number UMIs/transcripts per cell
# 对于UMI或是feature数目的分布，单一的峰往往是较好的结果
metadata %>%
  	ggplot(aes(color=sample, x=nCount_RNA, fill= sample)) +
  	geom_density(alpha = 0.2) +
  	scale_x_log10() +
  	theme_classic() +
  	ylab("Cell density") +
  	geom_vline(xintercept = 500)
```


```{r}
# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
metadata %>%
  	ggplot(aes(x=nCount_RNA, y=nFeature_RNA, color=mitoRatio)) +
  	geom_point() +
	scale_colour_gradient(low = "gray90", high = "black") +
  	stat_smooth(method=lm) +
  	scale_x_log10() +
  	scale_y_log10() +
  	theme_classic() +
  	geom_vline(xintercept = 500) +
  	geom_hline(yintercept = 250) +
  	facet_wrap(~sample)

# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI
# 一般大于0.8为佳
metadata %>%
  	ggplot(aes(x=log10GenesPerUMI, color = sample, fill=sample)) +
  	geom_density(alpha = 0.2) +
  	theme_classic() +
  	geom_vline(xintercept = 0.8)
```


4. filtering
QC质控在scRNA中应该结合多种metric的分布进行确定，有些细胞可能是质控不达标，也可能是生物学意义上本就如此的细胞亚群。不同的质控的理解会对下游分析有些许的影响，下面结合对QC的理解进行过滤。
在过滤过后推荐将上面的QC分析步骤再运行一次，以检验是否达到预期的QC目的。

```{r}
# Cell-level filtering
# Filter out low quality reads using selected thresholds - these will change with experiment
filtered_seurat <- subset(x = merged_seurat,
                         subset= (nCount_RNA >= 500) &
                           (nFeature_RNA >= 250) &
                           (log10GenesPerUMI > 0.80) &
                           (mitoRatio < 0.20)
                         )

# Gene-level filtering
# Output a logical vector for every gene on whether the more than zero counts per cell
# Extract counts
counts <- GetAssayData(object = filtered_seurat, slot = "counts")
# Output a logical vector for every gene on whether the more than zero counts per cell
nonzero <- counts > 0
# Sums all TRUE values and returns TRUE if more than 10 TRUE values per gene
keep_genes <- Matrix::rowSums(nonzero) >= 10
# Only keeping those genes expressed in more than 10 cells
filtered_counts <- counts[keep_genes, ]
# Reassign to filtered Seurat object
filtered_seurat <- CreateSeuratObject(filtered_counts, meta.data = filtered_seurat@meta.data)


# Create .RData object to load at any time
save(filtered_seurat, file="datasets/seurat_filtered.RData")
```


5. Normalization & PCA & clustreing

此处选择SCTransform方法进行normalization、scale等。聚类之前对counts数据进行normalization.


以下操作通过PCA确定cell cycle是否需要在SCTransform中regress out。
```{r}
# Normalize the counts
# 此时的normalization还比较的粗糙
seurat_phase <- NormalizeData(filtered_seurat)

# Identify the most variable genes
seurat_phase <- FindVariableFeatures(seurat_phase,
                     selection.method = "vst",
                     nfeatures = 2000,
                     verbose = FALSE)

# Scale the counts
seurat_phase <- ScaleData(seurat_phase)

# Score cells for cell cycle
seurat_phase <- CellCycleScoring(seurat_phase,
                                 g2m.features = cc.genes.updated.2019$g2m.genes,
                                 s.features = cc.genes.updated.2019$s.genes)

# Perform PCA
seurat_phase <- RunPCA(seurat_phase)

# perform TSNE, UMAP
seurat_phase <- RunUMAP(seurat_phase, dims = 1:20)
# seurat_phase <- RunTSNE(object = seurat_phase, dims = 1:20, do.fast = TRUE)

# Plot the PCA colored by cell cycle phase
DimPlot(seurat_phase,
        reduction = "pca",
        group.by = "Phase",
        split.by = "Phase")
```


```{r}
DefaultAssay(seurat_phase)
```


本例子不需要对细胞周期进行scale，但如果需要的话，Seurat推荐如下的方法，因为直接的回归所有的细胞周期效应也会模糊干细胞和祖细胞之间的区别。
```{r}
seurat_phase$CC.Difference <- seurat_phase$S.Score - seurat_phase$G2M.Score

seurat_phase <- ScaleData(seurat_phase,
                          vars.to.regress = "CC.Difference",
                          features = rownames(seurat_phase))
seurat_phase <- RunPCA(seurat_phase, features = VariableFeatures(marrow),
                       nfeatures.print = 10)
```

以下操作通过PCA确定线粒体基因是否需要在SCTransform中regress out。
```{r}
# Check quartile values
summary(seurat_phase@meta.data$mitoRatio)

# Turn mitoRatio into categorical factor vector based on quartile values
seurat_phase@meta.data$mitoFr <- cut(seurat_phase@meta.data$mitoRatio,
                   breaks=c(-Inf,
                            summary(seurat_phase@meta.data$mitoRatio)[2],
                            summary(seurat_phase@meta.data$mitoRatio)[3],
                            summary(seurat_phase@meta.data$mitoRatio)[4],
                            Inf),
                   labels=c("Low","Medium","Medium high", "High"))

# Plot the PCA colored by cell cycle phase
DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "mitoFr",
        split.by = "mitoFr")
```


6. SCTransform

use the sctransform method as a more accurate method of normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes.

在进行SCTransform时，一般会对cell cycle和线粒体等会对聚类引入variation进行regress out，但如果线粒体本就在研究对象 中有意义的话，就不推荐regerss out了。

```{r}
# Split seurat object by condition to perform cell cycle scoring and SCT on all samples
# ?sctransform::vst
split_seurat <- SplitObject(seurat_phase, split.by = "sample")

split_seurat <- split_seurat[c("ctrl", "stim")]


for (i in 1:length(split_seurat)) {
    split_seurat[[i]] <- SCTransform(split_seurat[[i]],
                                     vars.to.regress = c("mitoRatio"),
                                     return.only.var.genes = TRUE,
                                     method = "glmGamPoi"
                                     )
    }

# Check which assays are stored in objects
split_seurat$ctrl@assays

```


```{r}
class(split_seurat)

class(split_seurat$ctrl)
```


7. integration

在上一步的SCT后，有一些基因在assay SCT中相较RNA assay少了一些, 是矫正后的效果，split_seurat[[“SCT”]]@counts
做integration的目的在于align same cell type across conditoins，同一群细胞不可以因为批次不同，来源不同而出现差异， 这要是规避的。

decide wether we need the integration
```{r}
colnames(seurat_phase@meta.data)

p1.compare <- wrap_plots(ncol = 3,
                      DimPlot(seurat_phase, reduction = "pca", group.by = "orig.ident") +NoAxes() + ggtitle("Before_PCA"),
                      DimPlot(seurat_phase, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("Before_tSNE"),
                      DimPlot(seurat_phase, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("Before_UMAP"),
                      guides = "collect"
)
p1.compare
```

CCA/RPCA整合
```{r}
# Select the most variable features to use for integration
integ_features <- SelectIntegrationFeatures(object.list = split_seurat,
                                            nfeatures = 3000 # default is 2000
                                            )

# Prepare the SCT list object for integration
split_seurat <- PrepSCTIntegration(object.list = split_seurat,
                                   anchor.features = integ_features)

# Find best buddies - can take a while to run
# 如果用CCA的话下一步可以不用跑
split_seurat <- lapply(X = split_seurat, FUN = function(x) {
    x <- RunPCA(x, features = integ_features, verbose = FALSE)
    }
    )
integ_anchors <- FindIntegrationAnchors(object.list = split_seurat,
                                        normalization.method = "SCT",
                                        anchor.features = integ_features,
                                        reduction = 'rpca',
                                        k.anchor = 5 # 越大整合力度越大
                                        )

# Integrate across conditions
seurat_integrated <- IntegrateData(anchorset = integ_anchors,
                                   normalization.method = "SCT")

print(DefaultAssay(seurat_integrated))

# Save integrated seurat object, 毕竟运行时间较久
# saveRDS(seurat_integrated, "results/integrated_seurat.rds")
```

harmony整合, SCT是推荐单独在每一个样本运行的，运行完后merge在一起,此处需要理清默认的assay是啥.

To perform integration, Harmony takes as input a merged Seurat object, containing data that has been appropriately normalized (i.e. here, normalized using SCTransform) and for which highly variable features and PCs are defined.
```{r}

split_mergerd <- merge(split_seurat[[1]], y = pancreas.list[2:length(split_seurat)],merge.data = TRUE)

VariableFeatures(split_mergerd) <- integ_features
split_mergerd <- RunPCA(object = split_mergerd, assay = "SCT", npcs = 50,
                        # features = integ_features
                        )


split_mergerd <- RunHarmony(split_mergerd,
                            group.by.vars = "group",
                            assay.use = "SCT",
                            reduction = "pca",
                            dims.use = 1:50,
                            plot_convergence = TRUE
                            ) # 后续进行RunUMAP等

split_mergerd <- RunUMAP(split_mergerd,
                         assay = "SCT",
                         reduction="harmony",
                         dims=1:50) %>%
  FindNeighbors(reduction="harmony", assay = "SCT", dims=1:50) %>%
  FindClusters(resolution=0.8)
```


```{r}
harmony_embeddings <- Embeddings(split_mergerd, 'harmony')
harmony_embeddings[1:5, 1:5]
```


通过下面的PCA和UMAP检验数据整合的情况。

PCA, 以下的PCA等会在integrated的assay下运行
```{r}
# Run PCA
print(DefaultAssay(seurat_integrated))

seurat_integrated <- RunPCA(object = seurat_integrated)

# Plot PCA
PCAPlot(seurat_integrated,
        split.by = "sample")
```

通过UMAP查看数据整合的情况
```{r}
# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated,
                             dims = 1:40,
                             reduction = "pca")

# Plot UMAP
DimPlot(seurat_integrated)
```


8. cluster

Seurat利用Seurat assigns cells to clusters based on their PCA scores derived from the expression of the integrated most variable genes.

Determining how many PCs to include in the clustering step is therefore important to ensure that we are capturing the majority of the variation, or cell types, present in our dataset.

Seurat uses a graph-based clustering approach using a K-nearest neighbor approach, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.

*确定后续所用的PC数:*
不过对于采用了SCTransform方法的分析，PC数目的选择没有以前显的那么重要了。
In theory, with SCTransform, the more PCs we choose the more variation is accounted for when performing the clustering。
```{r}
# Printing out the most variable genes driving PCs
print(x = seurat_integrated[["pca"]],
      dims = 1:10,
      nfeatures = 5)

# Plot the elbow plot
ElbowPlot(object = seurat_integrated,
          ndims = 40)
```


The resolution is an important argument that sets the "granularity" of the downstream clustering and will need to be optimized for every individual experiment.
```{r}
# Determine the K-nearest neighbor graph
# 虽则ElbowPLot显示拐弯处在8，9个PC之间，但是考虑到所使用的SCT的方法，以下的dims可以适当的多一些，甚至是全部，不过 比较耗费时间
seurat_integrated <- FindNeighbors(object = seurat_integrated,
                                dims = 1:40)

# Determine the clusters for various resolutions
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))


# Assign identity of clusters
# 具体到这里的选择：
Idents(object = seurat_integrated) <- "integrated_snn_res.0.8"
```

此步骤可跑可不跑
```{r}
seurat_integrated <- RunUMAP(seurat_integrated,
                 reduction = "pca",
                 dims = 1:40)
```


```{r}
# Plot the UMAP
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
```


```{r}
# Extract identity and sample information from seurat object to determine the number of cells per cluster per sample
n_cells <- FetchData(seurat_integrated,
                     vars = c("ident", "orig.ident")) %>%
        dplyr::count(ident, orig.ident) %>%
        tidyr::spread(ident, n)

# View table
View(n_cells)

# UMAP of cells in each cluster by sample
# 分析不同样本间的细胞亚群间的分布，不同处理下样本的细胞组成是否相同。此处还应该对Phase、等继续进行质控分析
DimPlot(seurat_integrated,
        label = TRUE,
        split.by = "sample")  + NoLegend()
```


查看在不同的metrics中细胞分布的是否均匀
```{r}
# Determine metrics to plot present in seurat_integrated@meta.data
metrics <-  c("nCount_RNA", "nFeature_RNA", "S.Score", "G2M.Score", "mitoRatio")

FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)
```

Seurat's FeaturePlot() function let's us easily explore the known markers on top of our UMAP visualizations. Let's go through and determine the identities of the clusters

```{r}
# Select the RNA counts slot to be the default assay
# The SCTransform normalization was performed only on the 3000 most variable genes, so many of our genes of interest may not be present in this data.
DefaultAssay(seurat_integrated) <- "RNA"

# Normalize RNA data for visualization purposes
seurat_integrated <- NormalizeData(seurat_integrated, verbose = FALSE)


FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = c("CD14", "LYZ"),
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)
```


9. marker identification analysis

FindAllMarkers函数在计算差异时将所有细胞都当作重复。一般用于单一样本或组别时使用。


Identification of conserved markers in all conditions:
考虑不同分组信息求的cluster比之其他cluster的差异基因; FindConservedMarkers() accepts a single cluster at a time.

在Seurat文档中，是用到SCT做差异分析的，此处应该选择啥呢？

we suggest looking for markers with large differences in expression between pct.1 and pct.2 and larger fold changes.


```{r}
DefaultAssay(seurat_integrated) <- "RNA"

cluster0_conserved_markers <- FindConservedMarkers(seurat_integrated,
                              ident.1 = 0,
                              grouping.var = "sample",
                              only.pos = TRUE,
                              logfc.threshold = 0.25
                              )

```


考虑到FindConservedMarkers每次只能运行一个cluster
```{r}
# Create function to get conserved markers for any given cluster
get_conserved <- function(cluster){
  FindConservedMarkers(seurat_integrated,
                       ident.1 = cluster,
                       grouping.var = "sample",
                       only.pos = TRUE) %>%
    rownames_to_column(var = "gene") %>%
    left_join(y = unique(annotations[, c("gene_name", "description")]),
               by = c("gene" = "gene_name")) %>%
    cbind(cluster_id = cluster, .)
}


# Iterate function across desired clusters
conserved_markers <- map_dfr(c(7,20), get_conserved)
```

利用上述得到的Marker基因identify celltype
```{r}
# Extract top 10 markers per cluster
top10 <- conserved_markers %>%
  mutate(avg_fc = (ctrl_avg_logFC + stim_avg_logFC) /2) %>%
  group_by(cluster_id) %>%
  top_n(n = 10,
        wt = avg_fc)

# Visualize top 10 markers per cluster
View(top10)


```

```{r}
# Plot interesting marker gene expression for cluster 20
FeaturePlot(object = seurat_integrated,
                        features = c("TPSAB1", "TPSB2", "FCER1A", "GATA1", "GATA2"),
                         sort.cell = TRUE,
                         min.cutoff = 'q10',
                         label = TRUE,
			 repel = TRUE)

# Vln plot - cluster 20
VlnPlot(object = seurat_integrated,
        features = c("TPSAB1", "TPSB2", "FCER1A", "GATA1", "GATA2"))
```


在探索了不同的cluster的标记基因后，我们还对同一个cluster群中（同一种细胞亚群）在不同条件下的差异基因(pseudobulk differential expression analysis);
Biological replicates are necessary to proceed with this analysis. 目前的测试数据集因不满足此分析条件，故
```{r}
# 详细内容见DE_analysis_scrnaseq.Rmd
```


```{r}

# Rename all identities
seurat_integrated <- RenameIdents(object = seurat_integrated,
                               "0" = "Naive or memory CD4+ T cells",
                               "1" = "CD14+ monocytes",
                               "2" = "Activated T cells",
                               "3" = "CD14+ monocytes",
                               "4" = "Stressed cells / Unknown",
                               "5" = "CD8+ T cells",
                               "6" = "Naive or memory CD4+ T cells",
                               "7" = "B cells",
                               "8" = "NK cells",
                               "9" = "CD8+ T cells",
                               "10" = "FCGR3A+ monocytes",
                               "11" = "B cells",
                               "12" = "NK cells",
                               "13" = "B cells",
                               "14" = "Conventional dendritic cells",
                               "15" = "Megakaryocytes",
			       "16" = "Plasmacytoid dendritic cells")


# Plot the UMAP
DimPlot(object = seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 3,
        repel = TRUE)
```


最后的保存
```{r}
# Save final R object
write_rds(seurat_integrated,
          file = "./Results/seurat_labelled.rds")

# Create and save a text file with sessionInfo
sink(glue::glue("./Results/sessionInfo_scrnaseq_{lubridate::today()}.txt"))
sessionInfo()
sink()
```