singlecell-reports/02_integration/norm_integration.qmd at main · bcbio/singlecell-reports · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
---
title: "scRNA normalization and clustering"
author: "Harvard Chan Bioinformatics Core"
date: "`r Sys.Date()`"
html:
    number-sections: false
    default-image-extension: svg
    lightbox: true
    callout-icon: false
    format-links: true
    toc: true
    theme: sandstone
    echo: true
    eval: true
    message: false
    warning: false
    code-copy: true
    code-overflow: wrap
    code-fold: true
    code-line-numbers: true
    embed-resources: true
    standalone: true
    html-math-method: katex
    grid:
      sidebar-width: 250px
      body-width: 900px
      margin-width: 300px
    comments:
      hypothesis: true
params:
   project_file: ../information.R
knitr:
  opts_chunk:
    audodep: true
    cache: false
    cache.lazy: false
    error: true
    echo: true
    fig-height: 5
    fig.retina: 2
    fig-width: 9.6
    message: false
    tidy: true
    warning: true
---

Template developed with materials from https://hbctraining.github.io/main/.

```{r }
#| message: false
#| warning: false
stopifnot(R.version$major >= 4) # requires R4
if (compareVersion(R.version$minor, "3.1") < 0) warning("We recommend >= R4.3.1")
stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >= 0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0") >= 0)
```

This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.

```{r}
# parameters
## Cell cycle markers for c.elegans, human, mouse, D. rerio, and D. melanogaster can be found here: https://github.com/hbc/tinyatlas/tree/1e2136a35e773f14d97ae9cbdb6c375327b2dd2b/cell_cycle
## This files needs gene_name and phase columns to work with this template

cell_cycle_file <- "https://github.com/bcbio/resources/raw/refs/heads/main/singlecell/human_cell_cycle.csv"
seurat_obj <- "https://github.com/bcbio/bcbioR-test-data/raw/refs/heads/main/singlecell/tiny.rds"
seurat_output <- "./seurat_clust.rds"
invisible(list2env(params, environment()))
source(project_file)
```

```{r setup}
#| cache: false
#| message: false
#| warning: false
#| echo: false
knitr::opts_chunk$set(echo = TRUE)
# Load libraries
library(Seurat)
library(harmony)
library(knitr)
library(rmarkdown)
library(data.table)
library(DT)
library(patchwork)
library(clustree)
library(ggprism)
library(grafify)
library(R.utils)
library(readr)

ggplot2::theme_set(theme_prism(base_size = 12))
# https://grafify-vignettes.netlify.app/colour_palettes.html
# NOTE change colors here if you wish
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = as.vector(grafify:::graf_palettes[["kelly"]]))
}

# Set seed for reproducibility
set.seed(1454944673L)
```

# Overview

-   Project: `r project`
-   PI: `r PI`
-   Analyst: `r analyst`
-   Experiment: `r experiment`
-   Aim: `r aim`

## Dataset

The Seurat object used as input for this report was prepared with the thresholds detailed below applied.

-nGenes \> `nFeature_RNA_cutoff` -nUMI \> `nCount_RNA_cutoff` -complexity \> `Log10GenesPerUMI_cutoff` -percent mitochondrial reads \< `mitoRatio_cutoff`

```{r load_data}
#| cache: true
# Source cell cycle markers
cc_markers <- read_csv(cell_cycle_file)
stopifnot(c("gene_name", "phase") %in% colnames(cc_markers))

# Loading QC'd object
if (isUrl(seurat_obj)) {
  seurat_qc <- readRDS(url(seurat_obj))
} else {
  seurat_qc <- readRDS(seurat_obj)
}

DefaultAssay(seurat_qc) <- "RNA"

# Define color scales for up to 24 clusters/samples
colsD <- RColorBrewer::brewer.pal(8, "Dark2")
colsM <- RColorBrewer::brewer.pal(8, "Set2")
colsL <- RColorBrewer::brewer.pal(8, "Pastel2")
# Stack same colors from dark to pastel
cols3 <- unlist(strsplit(paste(colsD, colsM, colsL, sep = "_"), "_"))
cols2 <- c(unlist(strsplit(paste(colsD, colsM, sep = "_"), "_")), "deepskyblue2")
```

After filtering, each sample contributed the following number of cells to the analysis:

```{r meta pre doub}
table(seurat_qc$orig.ident)
```

# Sources of variability Log normalization {.tabset}

In this section, we look at potential confounding variables in our (post-QC) dataset, to determine whether their effect needs to be accounted for before normalizing and integrating the data.

To enable meaningful visualization of the data, we apply a minimal normalization to our raw data (log-normalization). We then identify the top 2000 most variable genes across the log-normalized data, i.e. those with the greatest variability in expression level from one cell to the next. Finally, we calculate principal components (PCs) based on these top 2000 most variable genes, and use the first 50 PCs to derive reduced UMAP (Uniform Manifold Approximation and Projection) components.

**We start with log normalization because it is good to observe the data and any trends using a simple transformation. More complex methods like SCT can alter the data in a way that is not as intuitive to interpret.**

```{r }
#| eval: !expr file.exists("seurat_lognorm.rds")
# NOTE run the chunk below to create this object, and loading will be used while
#     knitting to speed up the rendering
seurat_lognorm <- readRDS("seurat_lognorm.rds")
```

```{r rna_norm0}
#| eval: !expr "!exists('seurat_lognorm')"
#| warning: false
#| message: false
# Normalize data
seurat_lognorm <- NormalizeData(seurat_qc,
  normalization.method = "LogNormalize",
  scale.factor = 10000
)

# Find variable genes (largest dispersion in expression across cells)
seurat_lognorm <- FindVariableFeatures(seurat_lognorm, nfeatures = 2000)

# Scale and center data
seurat_lognorm <- ScaleData(seurat_lognorm, model.use = "linear")

# Calculate PCs and UMAP
seurat_lognorm <- RunPCA(seurat_lognorm)
seurat_lognorm <- RunUMAP(seurat_lognorm, 1:40)

saveRDS(seurat_lognorm, file = "seurat_lognorm.rds")
```

## Examine highly variable genes

Highly variable gene selection is extremely important since many downstream steps are computed only on these genes. Seurat allows us to access the ranked highly variable genes with the VariableFeatures() function. We can additionally visualize the dispersion of all genes using Seurat’s VariableFeaturePlot(), which shows a gene’s average expression across all cells on the x-axis and variance on the y-axis. Ideally we want to use genes that have high variance since this can indicate a change in expression depending on populations of cells. Adding labels using the LabelPoints() helps us understand which genes will be driving shape of our data.

```{r}
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_lognorm)
top_genes <- ranked_variable_genes[1:15]

# Plot the average expression and variance of these genes
# With labels to indicate which genes are in the top 15
p <- VariableFeaturePlot(seurat_lognorm)
LabelPoints(plot = p, points = top_genes, repel = TRUE)
```

## Sample x covariates

We then use the UMAP reduction to explore our dataset and assess how different variables influence cell clustering. Throughout this report, **UMAP representations are split by various covariates**, to enable checking for potential phenotype-specific clustering.

```{r}
## Below is an example plot, change the group.by and split.by parameters to make plots with your own covariates.

UMAPPlot(seurat_lognorm, group.by = "orig.ident") + ggtitle("UMAP")
```

## Cell cycle

The phase of the cell cycle that cells are in at the time of sample preparation can introduce some variability in the transcriptome that we are not interested in exploring.

To examine cell cycle variation in our data, we assign a score to each cell, derived from the overall expression level of known markers of the G2/M and S phase in that cell. We then display the cells, color-coded by inferred cell cycle phase, on our UMAP.

Unless cells very strongly cluster by phase of the cell cycle (which is not the case here), we do not recommend to regress out the effect of the cell cycle.

```{r cell_cycle_scoring}
#| message: false
#| warning: false
# Step 1 - Get cell cycle markers

# Compute cell cycle score for each cell
## NOTE use the right column names for cc_markers if they are different than
#   external_gene_name and phase
seurat_lognorm <- CellCycleScoring(seurat_lognorm,
  g2m.features = cc_markers$gene_name[cc_markers$phase == "G2/M"],
  s.features = cc_markers$gene_name[cc_markers$phase == "S"]
)

## Plot cell cycle (grouped by) along with covariates (split.by).  Add in your covariates of interest

UMAPPlot(seurat_lognorm, group.by = "Phase") + ggtitle("UMAP")
```

## mitoRatio

The mitochondrial to nuclear gene ratio (mitoRatio) is a marker of cellular stress and might also affect cell clustering. For this dataset, we have seen during QC that the fraction of mitochondrial genes was negligible (which is good). Therefore, we do not expect the need to regress out this variable for normalization purposes, but it's always good to check.

```{r mito_ratio}
## This custom function by Amelie Jule creates great plots for looking at different QC parameters across the UMAP

signaturePlot <- function(seurat_object,
                          gene_signature,
                          reduction = "umap",
                          split_var = NULL,
                          pt_size = 0.5) {
  g1 <- FeaturePlot(seurat_object,
    features = gene_signature,
    reduction = reduction,
    split.by = split_var,
    order = TRUE,
    pt.size = pt_size,
    combine = FALSE
  )

  min_val <- min(pull(seurat_object@meta.data, gene_signature))
  max_val <- max(pull(seurat_object@meta.data, gene_signature))
  fix_params <- scale_color_gradientn(
    colours = c("grey80", "blue"),
    limits = c(min_val, max_val)
  )

  g2 <- lapply(g1, function(x) {
    x + fix_params +
      theme_minimal() +
      theme(
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)
      ) +
      ggtitle("")
  })

  g2
}

### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_lognorm,
  gene_signature = "mitoRatio",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## nUMIs (nCount)

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_lognorm,
  gene_signature = "nCount_RNA",
  split_var = "subj"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## nGenes (nFeature)

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_lognorm,
  gene_signature = "nFeature_RNA",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## Complexity

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_lognorm,
  gene_signature = "Log10GenesPerUMI",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

# SCT Normalization {.tabset}

Now that we have established which effects are observed in our data, we can use the SCTransform method to regress out these effects. The SCTransform method was proposed as a better alternative to the log transform normalization method that we used for exploring sources of unwanted variation. The method not only normalizes data, but it also performs a variance stabilization and allows for additional covariates to be regressed out.

All genes cannot be treated the same, as such, the SCTransform method constructs a generalized linear model (GLM) for each gene with UMI counts as the response and sequencing depth as the explanatory variable. Information is pooled across genes with similar abundances, to regularize parameter estimates and obtain residuals which represent effectively normalized data values which are no longer correlated with sequencing depth.

We searched for the top 3000 genes with the largest variability in expression level from cell to cell after SCT-normalization, and re-calculated our principal and UMAP components based on the SCT-normalized data for these top genes.

**We keep each sample separate for SCT normalization.**

```{r }
#| eval: !expr file.exists("seurat_sctnorm.rds")
# NOTE run the chunk below to create this object, and loading will be used while
#     knitting to speed up the rendering
seurat_sctnorm <- readRDS("seurat_sctnorm.rds")
```

```{r}
#| eval: !expr "!exists('seurat_sctnorm')"
#| warning: false
#| message: false
# NOTE: this should be ran previous rendering to prepare the object
## Note that this single command replaces NormalizeData(), ScaleData(), and FindVariableFeatures()

## SCT can be run with and without regressing out variables. Generally we do not regress out covariates. However, we provide both options below.
seurat_lognorm[["RNA"]] <- split(seurat_lognorm[["RNA"]],
  f = seurat_lognorm$orig.ident
)

seurat_sctnorm <- SCTransform(seurat_lognorm,
  vst.flavor = "v2",
  variable.features.n = 3000
)
seurat_sctnorm <- RunPCA(seurat_sctnorm)
seurat_sctnorm <- RunUMAP(seurat_sctnorm, 1:40)

saveRDS(seurat_sctnorm, file = "seurat_sctnorm.rds")
```

### Look at UMAPs post SCT

The plots below show the same variables as before, this time **displayed on the UMAP calculated after applying SCT-normalization**.

We qualitatively reviewed the "structure" in our normalized data projection . We were particularly interested in seeing whether similar cell populations across samples clustered together (i.e. overlapped on the UMAP).

```{r }
#| fig-width: 10
DefaultAssay(seurat_sctnorm) <- "SCT"
UMAPPlot(seurat_sctnorm, group.by = "orig.ident") + ggtitle("UMAP")
```

## Cell cycle

The phase of the cell cycle that cells are in at the time of sample preparation can introduce some variability in the transcriptome that we are not interested in exploring.

To examine cell cycle variation in our data, we assign a score to each cell, derived from the overall expression level of known markers of the G2/M and S phase in that cell. We then display the cells, color-coded by inferred cell cycle phase, on our UMAP.

Unless cells very strongly cluster by phase of the cell cycle (which is not the case here), we do not recommend to regress out the effect of the cell cycle.

```{r cell_cycle_scoring2}
#| message: false
#| warning: false
# Step 1 - Get cell cycle markers
## Cell cycle markers for c.elegans, human, mouse, D. rerio, and D. melanogaster can be found here: https://github.com/hbc/tinyatlas/tree/1e2136a35e773f14d97ae9cbdb6c375327b2dd2b/cell_cycle
# Compute cell cycle score for each cell
seurat_sctnorm <- CellCycleScoring(seurat_sctnorm,
  g2m.features = cc_markers$gene_name[cc_markers$phase == "G2/M"],
  s.features = cc_markers$gene_name[cc_markers$phase == "S"]
)

## Plot cell cycle (grouped by) along with covariates (split.by).  Add in your covariates of interest
UMAPPlot(seurat_sctnorm, group.by = "Phase", split.by = "orig.ident") + ggtitle("UMAP")
```

## mitoRatio

The mitochondrial to nuclear gene ratio (mitoRatio) is a marker of cellular stress and might also affect cell clustering. For this dataset, we have seen during QC that the fraction of mitochondrial genes was negligible (which is good). Therefore, we do not expect the need to regress out this variable for normalization purposes, but it's always good to check.

```{r mito_ratio2}
## This custom function by Amelie Jule creates great plots for looking at different QC parameters across the UMAP

signaturePlot <- function(seurat_object,
                          gene_signature,
                          reduction = "umap",
                          split_var = NULL,
                          pt_size = 0.5) {
  g1 <- FeaturePlot(seurat_object,
    features = gene_signature,
    reduction = reduction,
    split.by = split_var,
    order = TRUE,
    pt.size = pt_size,
    combine = FALSE
  )

  min_val <- min(pull(seurat_object@meta.data, gene_signature))
  max_val <- max(pull(seurat_object@meta.data, gene_signature))
  fix_params <- scale_color_gradientn(
    colours = c("grey80", "blue"),
    limits = c(min_val, max_val)
  )

  g2 <- lapply(g1, function(x) {
    x + fix_params +
      theme_minimal() +
      theme(
        legend.position = "bottom",
        plot.title = element_text(hjust = 0.5)
      ) +
      ggtitle("")
  })

  g2
}

### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_sctnorm,
  gene_signature = "mitoRatio",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## nUMIs (nCount)

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_sctnorm,
  gene_signature = "nCount_RNA",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## nGenes (nFeature)

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_sctnorm,
  gene_signature = "nFeature_RNA",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

## Complexity

```{r}
### Here we apply the function. gene_signature is whatever qc metric you care about and split_var should be a covariate of interest. Adjust the titles to match the covariate groups. If you have more than 2 covariate groups then you will have multiple plots g[[3]]....g[[n]]

g <- signaturePlot(seurat_sctnorm,
  gene_signature = "Log10GenesPerUMI",
  split_var = "orig.ident"
)
g[[1]] + ggtitle("S1") | g[[2]] + ggtitle("S2")
```

# Integration

## CCA integration

```{r }
#| eval: !expr file.exists("seurat_cca.rds")
# NOTE run the chunck below to create this object, and loading will be used while
#     knitting to speed up the rendering
seurat_cca <- readRDS("seurat_cca.rds")
```

```{r rna_cca}
#| warning: false
#| message: false
#| eval: !expr "!exists('seurat_cca')"
# NOTE: this should be ran previous rendering to prepare the object
## Note that this single command replaces NormalizeData(), ScaleData(), and FindVariableFeatures()

## SCT can be run with and without regressing out variables. Generally we do not regress out covariates. However, we provide both options below.

## To properly integrate with harmony we split our object by sample first.
split_sctnorm <- SplitObject(seurat_lognorm, split.by = "orig.ident")
# NOTE If we have a large dataset, then we might need to adjust the limit for allowable object sizes within R
# options(future.globals.maxSize = 4000 * 1024^2)
for (i in 1:length(split_sctnorm)) {
  split_sctnorm[[i]] <- SCTransform(split_sctnorm[[i]],
    vst.flavor = "v2",
    variable.features.n = 3000
  )
}

integ_features <- SelectIntegrationFeatures(
  object.list = split_sctnorm,
  nfeatures = 3000
)
split_sctnorm <- PrepSCTIntegration(
  object.list = split_sctnorm,
  anchor.features = integ_features
)

# Find best buddies - can take a while to run
integ_anchors <- FindIntegrationAnchors(
  object.list = split_sctnorm,
  normalization.method = "SCT",
  anchor.features = integ_features
)
# Integrate across conditions
seurat_cca <- IntegrateData(
  anchorset = integ_anchors,
  normalization.method = "SCT"
)

# Rejoin the layers in the RNA assay that we split earlier
seurat_cca[["RNA"]] <- JoinLayers(seurat_cca[["RNA"]])

# Run PCA
seurat_cca <- RunPCA(object = seurat_cca)

# Run UMAP
seurat_cca <- RunUMAP(seurat_cca,
  reduction.name = "umap.cca",
  dims = 1:40
)

saveRDS(seurat_cca, file = "seurat_cca.rds")
```

```{r}
p1 <- DimPlot(seurat_lognorm,
  group.by = "orig.ident",
  reduction = "umap"
) +
  theme(legend.position = "bottom") +
  ggtitle("pre-integration")
p2 <- DimPlot(seurat_cca,
  group.by = "orig.ident",
  reduction = "umap.cca"
) +
  theme(legend.position = "bottom") +
  ggtitle("post-integration")

p1 | p2
```

## Harmony

If cells cluster by sample, condition, batch, dataset, modality, this integration step can greatly improve the clustering and the downstream analyses.

To integrate, we will use the shared highly variable genes (identified using SCTransform) from each group, then, we will “integrate” or “harmonize” the groups to overlay cells that are similar or have a “common set of biological features” between groups.

We use [`Harmony`](https://portals.broadinstitute.org/harmony/articles/quickstart.html), which is based on a transformation of principal components (PCs) to find similarities across datasets. Here we group samples by the original sample id.

```{r rna_hrmny}
#| eval: !expr file.exists("seurat_harmony.rds")
# NOTE run the chunck below to create this object, and loading will be used while
#     knitting to speed up the rendering
seurat_harmony <- readRDS("seurat_harmony.rds")
```

```{r }
#| eval: !expr "!exists('seurat_harmony')"
#| warning: false
#| message: false
## Here seurat will integrate on the level of sample id. If you want to integrate on other aspects the SCT normalization will need to be done with all of the data together.
# seurat_sctnorm[["RNA"]] <- split(seurat_sctnorm[["RNA"]], f = seurat_sctnorm$orig.ident)
seurat_harmony <- IntegrateLayers(
  object = seurat_sctnorm,
  method = HarmonyIntegration,
  orig.reduction = "pca",
  new.reduction = "harmony",
  assay = "SCT", verbose = FALSE
)
seurat_harmony <- RunPCA(seurat_harmony)
seurat_harmony <- RunUMAP(seurat_harmony,
  reduction = "harmony",
  dims = 1:40,
  reduction.name = "umap.harmony"
)
saveRDS(seurat_harmony, file = "seurat_harmony.rds")
```

## Pre vs. Post integration

```{r dimplot_both all}
#| echo: false
p1 <- DimPlot(seurat_sctnorm,
  group.by = "orig.ident",
  reduction = "umap"
) +
  theme(legend.position = "bottom") +
  ggtitle("pre-integration")
p2 <- DimPlot(seurat_cca,
  group.by = "orig.ident",
  reduction = "umap.cca"
) +
  theme(legend.position = "bottom") +
  ggtitle("post-integration CCA")
p3 <- DimPlot(seurat_harmony,
  group.by = "orig.ident",
  reduction = "umap.harmony"
) +
  theme(legend.position = "bottom") +
  ggtitle("post-integration Harmony")

p1 | p2 | p3
```

## Clustering

For single-modality scRNA-seq analysis, `Seurat` clusters the cells using a Louvain clustering approach. First, a K-nearest neighbor (KNN) graph is built, where cells are connected if they have a similar transcriptome, as determined from their scores on the first PCs. Then, the graph is partitioned into "communities" or "clusters" of interconnected cells that are more tightly connected with each other than with cells outside of the corresponding cluster.

A limitation of this approach is that the number of identified clusters depends on the chosen resolution, a parameter that must be set by the user and does not necessarily reflect the underlying biology of the dataset. For most single-cell datasets, a resolution of 0.1 to 1 will provide a reasonable number of clusters. Complex datasets with multiple cell types may require a larger resolution, and vice versa.

The code below will generate clusters for resolutions 0.1, 0.2, 0.4, 0.6, 0.8, and 1.0. Note that the names of the cluster will include important information about which slot in seurat was used for clustering.

-   Harmony with log normalized data : SNN_res.

-   Harmony with SCT normalized data : SCT_res.

-   CCA : integrated_snn_res.

After generating these clusters we will examine them below using a tree based approach and simply overlaying cluster ids onto the umap. We recommend keeping all of the clusters in the metadata as you move forward in case you want to go back and try a different one. To switch between different resolutions you simply need to reset your cell identities to the correct column in the metadata. For example: `Idents(object = seurat_clust) <- "SCT_res.0.2"`

```{r find_neighbors all}
#| echo: true
# NOTE use seurat_harmony or seurat_cca
# seurat_clust <- FindNeighbors(seurat_harmony, assay = "SCT",
#                                 reduction = "harmony", dims = 1:40)
seurat_clust <- FindNeighbors(seurat_cca, assay = "SCT", dims = 1:40)
# check graph names names(seurat_harmony@graphs)
# DefaultAssay(object = seurat_harmony[["pca"]])
seurat_clust <- FindClusters(
  object = seurat_clust,
  resolution = c(0.1, 0.2, 0.4, 0.6, 0.8, 1.0),
  verbose = FALSE
)
```

## Clustering Tree

We build a clustering tree using the [clustree](https://lazappi.github.io/clustree/articles/clustree.html) package to show how cells move as the clustering resolution is increased. Each cluster forms a node in the tree and edges are constructed by considering the cells in a cluster at a lower resolution (say 𝑘=2) that end up in a cluster at the next highest resolution (say 𝑘=3). By connecting clusters in this way we can see how clusters are related to each other, which are clearly distinct and which are unstable. The size of each node is related to the number of samples in each cluster and the color indicates the clustering resolution. Edges are colored according to the number of samples they represent and the transparency shows the incoming node proportion, the number of samples in the edge divided by the number of samples in the node it points to.

```{r }
#| fig-height: 10
#| fig-width: 8
library(clustree)

meta <- seurat_clust@meta.data
meta <- na.omit(meta)

## Change the prefix to match your clusters
# NOTE: this if you have run HARMONY
# prefix_clu <- "SNN_res."
# show_this <- "umap.harmony"
# NOTE: this if you have run CCA
prefix_clu <- "integrated_snn_res."
show_this <- "umap.cca"
clustree(meta, prefix = prefix_clu)
```

## Visualize clusters {.tabset}

We take a look at how the clusters look at resolutions 0.1, 0.2,0.4, and 0.6

### 0.1

```{r umap_0.1}
cluster_res <- 0.1
Idents(object = seurat_clust) <- paste0(prefix_clu, cluster_res)
DimPlot(seurat_clust,
  reduction = show_this,
  split.by = "orig.ident",
  label = TRUE
)
```

------------------------------------------------------------------------

### 0.2

```{r umap_0.2}
cluster_res <- 0.2
Idents(object = seurat_clust) <- paste0(prefix_clu, cluster_res)
DimPlot(seurat_clust,
  reduction = show_this,
  split.by = "orig.ident",
  label = TRUE
)
```

------------------------------------------------------------------------

### 0.4

```{r umap_0.4}
cluster_res <- 0.4
Idents(object = seurat_clust) <- paste0(prefix_clu, cluster_res)
DimPlot(seurat_clust,
  reduction = show_this,
  split.by = "orig.ident",
  label = TRUE
)
```

------------------------------------------------------------------------

### 0.6

```{r umap_0.6}
cluster_res <- 0.6
Idents(object = seurat_clust) <- paste0(prefix_clu, cluster_res)
DimPlot(seurat_clust,
  reduction = show_this,
  split.by = "orig.ident",
  label = TRUE
)
```

------------------------------------------------------------------------

```{r}
saveRDS(seurat_clust, file = seurat_output)
```

# R session

List and version of tools used for the QC report generation.

```{r}
sessionInfo()
```