singlecell-reports/03_differential_expression/scRNA_pseudobulk.qmd at main · bcbio/singlecell-reports · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
---
title: "scRNA pseudobulk"
author: "Harvard Chan Bioinformatics Core"
date: "`r Sys.Date()`"
format:
   html:
    number-sections: false
    default-image-extension: svg
    lightbox: true
    callout-icon: false
    format-links: true
    toc: true
    theme: sandstone
    echo: true
    eval: true
    message: false
    warning: false
    code-copy: true
    code-overflow: wrap
    code-fold: true
    code-line-numbers: true
    embed-resources: true
    standalone: true
    html-math-method: katex
    grid:
      sidebar-width: 250px
      body-width: 900px
      margin-width: 300px
    comments:
      hypothesis: true
params:
   project_file: ../information.R
   seurat_obj: "https://github.com/bcbio/bcbioR-test-data/raw/refs/heads/main/singlecell/tiny.rds"
   column: "age_cluster" # this column should be a dummy variable consisting of a concatenation of the metadata factor of interest and the clusters/cell types
   contrasts: !expr list(c("age_cluster", "OLD-cluster2", "YOUNG-cluster2"), c("age_cluster", "OLD-cluster3", "YOUNG-cluster3"))
editor_options:
   chunk_output_type: console
knitr:
  opts_chunk:
    audodep: true
    cache: false
    cache.lazy: false
    error: true
    echo: true
    warning: false
    fig-height: 5
    fig.retina: 2
    fig-width: 9.6
    message: false
    tidy: true
---

Template developed with materials from <https://hbctraining.github.io/main/>.

```{r }
#| message: false
#| warning: false
stopifnot(R.version$major >= 4) # requires R4
if (compareVersion(R.version$minor, "3.1") < 0) warning("We recommend >= R4.3.1")
stopifnot(compareVersion(as.character(BiocManager::version()), "3.18") >= 0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.0.0") >= 0)
```

```{r setup}
#| cache: false
#| message: false
#| warning: false
#| echo: false

# Load libraries'
library(SeuratObject)
library(Seurat)
library(harmony)
library(knitr)
library(rmarkdown)
library(data.table)
library(DT)
library(patchwork)
library(clustree)
library(ggprism)
library(grafify)
library(R.utils)
library(tidyverse)
library(DESeq2)
library(DEGreport)
library(EnhancedVolcano)
library(pheatmap)
library(viridis)
library(caTools)
library(shiny)
library(bcbioR)
library(ashr)
ggplot2::theme_set(theme_prism(base_size = 12))
scale_colour_discrete <- function(...) {
  scale_colour_manual(..., values = as.vector(grafify:::graf_palettes[["kelly"]]))
}
scale_fill_discrete <- function(...) {
  scale_fill_manual(..., values = as.vector(grafify:::graf_palettes[["kelly"]]))
}

set.seed(1454944673L)

invisible(list2env(params, environment()))
source(project_file)
```

```{r sanitize-datatable}
# Create a function to clean up data frames
sanitize_datatable <- function(df, ...) {
  # Remove dashes from row names and column names which cause wrapping
  DT::datatable(df, ...,
    rownames = gsub("-", "_", rownames(df)),
    colnames = gsub("-", "_", colnames(df))
  ) %>%
    formatRound(columns = names(df)[sapply(df, is.numeric)], digits = 3)
}
```

This code is in this ![](https://img.shields.io/badge/status-alpha-yellow) revision.

# Overview

-   Project: `r project`
-   PI: `r PI`
-   Analyst: `r analyst`
-   Experiment: `r experiment`
-   Aim: `r aim`

Here we will apply a pseudobulk approach to look for differentially expressed genes in one of our cell types.

Using a pseudobulk approach involves the following steps:

1.  Subsetting to the cells for the cell type(s) of interest to perform the DE analysis.
2.  Extracting the raw counts after QC filtering of cells to be used for the DE analysis.
3.  Aggregating the counts and metadata to the sample level.
4.  Performing the DE analysis (you need at least two biological replicates per condition to perform the analysis, but more replicates are recommended).

## Dataset

```{r load_data}
#| cache: false
# Loading object
if (isUrl(seurat_obj)) {
  seurat <- readRDS(url(seurat_obj))
} else {
  seurat <- readRDS(seurat_obj)
}

DefaultAssay(seurat) <- "RNA"
```

After filtering, each sample contributed the following number of cells to the analysis:

```{r meta pre doub}
seurat@meta.data %>%
  group_by(orig.ident) %>%
  summarize(n_bins = n()) %>%
  sanitize_datatable()
```

# Aggregate counts

## Aggregate metadata at the sample level

```{r rna_norm0}
#| warning: false
#| message: false
meta_columns <- c("orig.ident", column)
meta <- seurat@meta.data %>%
  select(meta_columns) %>%
  unique() %>%
  remove_rownames()

meta %>% sanitize_datatable()
```

## Aggregate counts

To aggregate the counts, we will use the AggregateExpression() function from Seurat. It will take as input a Seurat object, and return summed counts ("pseudobulk") for each identity class. The default is to return a matrix with genes as rows, and identity classes as columns. We have set return.seurat to TRUE, which means rather than a matrix we will get an object of class Seurat. We have also specified which factors to aggregate on, using the group.by argument.

```{r}
seurat$sample <- seurat$orig.ident

bulk <- AggregateExpression(
  seurat,
  return.seurat = T,
  assays = "RNA",
  group.by = c("sample", column)
)
```

## Add number of cells per sample per cluster to the metadata

```{r}
# Number of cells by sample and celltype
n_cells <- seurat@meta.data %>%
  dplyr::count(sample, .data[[column]]) %>%
  dplyr::rename("n_cells" = "n")
# n_cells$sample <- str_replace(n_cells$sample, "_", "-")

## extra check if aggregated sample names start with numbers
n_cells$sample <- ifelse(grepl("^\\d", n_cells$sample), paste0("g", n_cells$sample), n_cells$sample)


meta_bulk <- bulk@meta.data
meta_bulk$sample <- str_replace_all(meta_bulk$sample, "-", "_")

meta_bulk <- meta_bulk %>% left_join(n_cells, by = c("sample", column))

rownames(meta_bulk) <- meta_bulk$orig.ident
bulk@meta.data <- meta_bulk

# Turn column into a factor
# bulk[[column]] <- as.factor(bulk[[column]])
# test for consistency

stopifnot(all(Cells(bulk) == row.names(bulk@meta.data)))

bulk@meta.data %>%
  head() %>%
  sanitize_datatable()
```

# DE analysis with DESeq2

## Subset to cell type of interest

```{r}
Idents(object = bulk) <- column
```

## Check that we have enough cells

Before moving on to a pseudobulk DGE analysis, it is important to identify how many cells we aggregated for each sample. We need to make sure that we have enough cells per sample after subsetting to one celltype. We recommend 50 cells per sample to move forward with confidence.

```{r}
ggplot(bulk@meta.data, aes(x = orig.ident, y = n_cells)) +
  geom_bar(stat = "identity", color = "black", aes(fill = .data[[column]])) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(x = "Sample name", y = "Number of cells") +
  geom_text(aes(label = n_cells), vjust = -0.5)
```

## Run DE analysis

Differential gene expression analysis of count data was performed using the Bioconductor R package, DESeq2, which fits the count data to a negative binomial model.

Before fitting the model, we often look at a metric called dispersion, which is a measure for variance which also takes into consideration mean expression. A dispersion value is estimated for each individual gene, then 'shrunken' to a more accurate value based on expected variation for a typical gene exhibiting that level of expression. Finally, the shrunken dispersion value is used in the final GLM fit.

We use the below dispersion plot, which should show an inverse relationship between dispersion and mean expression, to get an idea of whether our data is a good fit for the model.

```{r}
cluster_counts <- t(FetchData(bulk, layer = "counts", vars = rownames(bulk)))
formula <- as.formula(paste0("~ ", " + ", column))


dds_to_use <- DESeqDataSetFromMatrix(cluster_counts, bulk@meta.data, design = formula)
de <- DESeq(dds_to_use)
DESeq2::plotDispEsts(de)
norm_matrix <- as.data.frame(counts(de, normalized = T))

vsd <- vst(de)
vsd_matrix <- assay(vsd)
```

Because it is difficult to accurately detect and quantify the expression of lowly expressed genes, differences in their expression between treatment conditions can be unduly exaggerated after the model is fit. We correct for this so that gene LFC is not dependent overall on basal gene expression level.

In cases there are multiple groups and conditions across groups is recommended to use dummy variables instead of interaction terms: <https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions>.

The LRT is useful for testing multiple terms at once, for example testing 3 or more levels of a factor at once, or all interactions between two variables. The LRT for count data is conceptually similar to an analysis of variance (ANOVA) calculation in linear regression, except that in the case of the Negative Binomial GLM, we use an analysis of deviance (ANODEV), where the deviance captures the difference in likelihood between a full and a reduced model.

```{r lfc_shrink}
# NOTE As a note: Use `ashr` for comparisons with many groups to be able to pull out all the contrasts; otherwise `apeglm` is fine. It shrinks less.

# NOTE We recommend LRT for time series

# resultsNames(de) # check the order is right
names_to_use <- lapply(contrasts, function(contrast) {
  coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
})
names(contrasts) <- names_to_use
de_list <- lapply(contrasts, function(contrast) {
  resLFC <- results(de, contrast = contrast)
  coef <- paste0(contrast[1], "_", contrast[2], "_vs_", contrast[3])
  # resLFCS <- lfcShrink(de, coef=coef, type="apeglm")
  resLFCS <- lfcShrink(de, contrast = contrast, type = "ash")

  res <- as.data.frame(resLFCS) %>%
    rownames_to_column("gene_id") %>%
    dplyr::rename(lfc = log2FoldChange) %>%
    mutate(pi = abs(lfc) * -log10(padj)) %>%
    arrange(-pi)

  ## Filter out genes that have no expression or were filtered out by DESEQ2
  res <- res[res$baseMean > 0, ] %>%
    drop_na(padj) %>%
    drop_na(pvalue)

  res_sig <- res %>%
    filter(padj < 0.05) %>%
    arrange(padj)
  results <- list(lfc = resLFC, lfcs = resLFCS, all = res, sig = res_sig)
  return(results)
})

# NOTE if you add manually any other comparison to the list with the following variables,
# the code below will make the plots for those as wells:
# de_list=c(de_list, new_comparison=list(lfc=resLFC, lfcs=resLFCS, all=res, sig=res_sig))
```

## MA plot

This plot can help to:

-   Identify Differential Expression: Genes that show a significant log-fold change (M value away from 0) indicate changes in expression between conditions.
-   Assess Data Quality: The plot can help in identifying biases or systematic errors in the data. Ideally, most points should scatter around the M=0 line, indicating that there is no significant systematic difference between the conditions.
-   Visualize data dispersion: The distribution of points along the A-axis gives a sense of the spread of expression levels and any patterns or anomalies in the dataset.

::: {.panel-tabset}

```{r after_lfc_shrink}
#| results: 'asis'
#| message: false
#| warning: false
for (contrast in names(de_list)) {
  cat("### ", contrast, "\n\n")
  p1 <- degMA(as.DEGSet(de_list[[contrast]]$lfc)) + ggtitle("Before LFC Shrinking")
  print(p1)
  p2 <- degMA(as.DEGSet(de_list[[contrast]]$lfcs), limit = 2) + ggtitle("After LFC Shrinking")
  print(p2)
  cat("\n\n")
}
```

:::

## Volcano plot

This volcano plot shows the genes that are significantly up- and down-regulated as a result of the analysis comparison. The points highlighted in red are genes that have padj \< 0.05 and a log2-fold change \> 1. Points in blue have a padj \< 0.05 and a log2-fold change \< 1 and points in green have a padj \> 0.05 and a log2-fold change \> 2. Grey points are non-significant. The dashed lines correspond to the cutoff values of log2 foldchance and padj that we have chosen.

::: {.panel-tabset}

```{r volcano_plot}
#| fig-height: 6
#| results: 'asis'
# degVolcano(res_mod[,c('lfc', 'padj')], plot_text = show)
for (contrast in names(de_list)) {
  cat("### ", contrast, "\n\n")
  res <- de_list[[contrast]][["all"]]
  res_mod <- res
  show <- as.data.frame(res_mod[1:10, c("lfc", "padj", "gene_id")])
  p1 <- EnhancedVolcano(res_mod,
    lab = res_mod$gene_id,
    pCutoff = 0.05,
    selectLab = c(show$gene_id),
    FCcutoff = 0.5,
    x = "lfc",
    y = "padj",
    title = contrast,
    subtitle = "", drawConnectors = T, max.overlaps = Inf
  )
  print(p1)
  cat("\n\n")
}
```
:::

## Heatmap

::: {.panel-tabset}

```{r heapmap}
#| results: 'asis'
### Run pheatmap using the metadata data frame for the annotation
for (contrast in names(de_list)) {
  cat("### ", contrast, "\n\n")
  res_sig <- de_list[[contrast]][["sig"]]
  ma <- norm_matrix[res_sig$gene_id, ]
  if (length(res_sig$gene_id) > 2) {
    p1 <- pheatmap(ma,
      color = inferno(10),
      cluster_rows = T,
      show_rownames = F,
      border_color = NA,
      fontsize = 10,
      scale = "row",
      fontsize_row = 10,
      height = 20
    )
    print(p1)
  } else {
    print("Need >2 DEGs to make heatmap")
  }
  cat("\n\n")
}
```
:::

## Differentially Expressed Genes

::: {.panel-tabset}
```{r sig_genes_table}
#| results: 'asis'
for (contrast in names(de_list)) {
  cat("## ", contrast, " \n\n")
  res_sig <- de_list[[contrast]][["sig"]]
  t <- res_sig %>%
    sanitize_datatable()
  print(htmltools::tagList(t))
  cat(" \n\n")
}
```
:::


## Plot top 16 genes - pseudobulk

::: {.panel-tabset}

```{r top n DEGs pseudobulk}
#| fig-height: 6
#| fig-width: 8
#| results: 'asis'
n <- 16

for (contrast in names(de_list)) {
  cat("### ", contrast, "\n\n")
  res_sig <- de_list[[contrast]][["sig"]]

  if (nrow(res_sig) > 0) {
    top_n <- res_sig %>%
      slice_min(order_by = padj, n = n, with_ties = F) %>%
      dplyr::select(gene_id)
    top_n_exp <- vsd_matrix %>%
      as.data.frame() %>%
      rownames_to_column("gene_id") %>%
      # dplyr::select(-group, -group_name) %>%
      pivot_longer(!gene_id, names_to = "sample", values_to = "normalized_counts") %>%
      right_join(top_n, relationship = "many-to-many") %>%
      left_join(meta_bulk, by = c("sample" = "orig.ident")) # %>%
    # filter(.data[[column]] %in% contrasts[[contrast]][2:3]) # can uncomment this line if you want to include only the groups being compared

    p1 <- ggplot(top_n_exp, aes(x = .data[[column]], y = normalized_counts)) +
      geom_boxplot(outlier.shape = NA, linewidth = 0.5, color = "grey") +
      geom_point() +
      facet_wrap(~gene_id, scales = "free_y") +
      ggtitle(str_interp("Expression of Top ${n} DEGs - Pseudobulk")) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
    print(p1)
  } else {
    print("Need at least 1 DEG to make plot")
  }
  cat("\n\n")
}
```
:::

## Plot top 16 genes - single cell

::: {.panel-tabset}

```{r top n DEGs single cell}
#| fig-height: 6
#| fig-width: 8
#| results: 'asis'

n <- 16

# subset seurat object to include only sig DE genes, helps with speed
top_n_all <- lapply(names(de_list), function(contrast) {
  res_sig <- de_list[[contrast]][["sig"]]
  res_sig %>%
    slice_min(order_by = padj, n = n, with_ties = F) %>%
    dplyr::select(gene_id) %>%
    mutate(contrast = contrast)
}) %>% bind_rows()

seurat_topn_all <- subset(seurat, features = unique(top_n_all$gene_id))
Idents(seurat_topn_all) <- column

for (contrast in names(de_list)) {
  cat("### ", contrast, "\n\n")
  res_sig <- de_list[[contrast]][["sig"]]
  if (nrow(res_sig) > 0) {
    top_n <- res_sig %>%
      slice_min(order_by = padj, n = n, with_ties = F) %>%
      dplyr::select(gene_id)

    p <- DotPlot(seurat_topn_all,
      # idents = contrasts[[contrast]][2:3],
      features = top_n$gene_id
    ) &
      scale_color_cb_friendly(discrete = F, palette = "heatmap") &
      scale_x_discrete(guide = guide_axis(angle = 45))

    print(p)
  } else {
    print("Need at least 1 DEG to make plot")
  }

  cat("\n\n")
}
```
:::


# Methods

Seurat ([package](https://satijalab.org/seurat/), [paper](https://www.nature.com/articles/s41587-023-01767-y)) is used to aggregate the single cell expression data into pseudobulk samples, and DESeq2 ([package](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8)) is used to perform statistical analysis.

# R session

List and version of tools used for the QC report generation.

```{r}
sessionInfo()
```