bcbioR/inst/templates/spatial/visium/01_quality_assessment/qc.Rmd at main · bcbio/bcbioR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Visium QC"
author: "Harvard Chan Bioinformatics Core"
date: "`r Sys.Date()`"
output:
   html_document:
      code_folding: hide
      df_print: paged
      highlights: pygments
      number_sections: false
      self_contained: true
      theme: default
      toc: true
      toc_float:
         collapsed: true
         smooth_scroll: true
params:
  project_file: ../information.R
  results_dir: ./results
  visiumHD_obj: "MsBrain_FF-A1_subset.qs"
---

This code is in this ![](https://img.shields.io/badge/status-draft-grey) revision.

## Visium Description

This report was adapted from [this tutorial](https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_08_spatial.html).

Spatial transcriptomic data with the Visium (HD) platform is in many ways similar to scRNAseq data. The data is represented per spot/bin on the slide, as we have spatial barcode but no cellular barcodes.

For Visium data, each spot contains UMI counts for 5-20 cells instead of single cells, but is still quite sparse in the same way as scRNAseq data is, but with the additional information about spatial location in the tissue.

For Visium HD, the slide contain two 6.5 x 6.5 mm Capture Areas with a continuous lawn of oligonucleotides arrayed in millions of 2 x 2 µm barcoded squares without gaps, achieving **single cell–scale spatial resolution**. The data is output at 2 µm, as well as multiple bin sizes. You can choose the bin resolution for downstream visualization and analysis.

The term **spot(s)/bin(s)** are used throughout this tutorial which corresponds to two technology Visium and Visium HD.

The main objective of quality control is to filter the data so that we include only data from spots/bins that are of high quality. This makes it so that when we cluster our spots/bins, it is easier to identify distinct cell type populations.

In spatial transcriptomic data, the main challenge is in **delineating spots/bins that are poor quality from spots/bins containing reads from less complex cells**. If you expect a particular cell type in your dataset to be less transcriptionally active as compared other cell types in your dataset, the spots/bins underneath this cell type will naturally have fewer detected genes and transcripts. However, having fewer detected genes and transcripts can also be a technical artifact and not a result of biological signal.

Various metrics can be used to filter low-quality cells from high-quality ones, including:

- **UMI counts per spot/bin** - This is the number of unique transcripts detected per spot/bin. Because the spot/bin are very small, this number is less than what we would expect for non-spatial scRNAseq data.
- **Genes detected per spot/bin** - This is the number of unique genes detected per spot/bin. Again, because the spots/bins are very small, this number is less than what we would expect for non-spatial scRNAseq data.
- **Complexity (novelty score)** - The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a spot, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) spots/bins could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to an artifact or contamination. Generally, we expect the novelty score to be above 0.80 for good quality spots/bins.
- **Mitochondrial counts ratio** - This metric can identify whether there is a large amount of mitochondrial contamination from dead or dying cells. We define poor quality samples for mitochondrial counts as spots/bins which surpass the 0.2 (20%) mitochondrial ratio mark, unless of course you are expecting this in your sample.
- **Hemoglobin counts ratio** - This metric can identify whether there is a large amount of hemoglobin gene contamination from blood. We define poor quality samples for hemoglobin counts as spots/bins which surpass the 0.2 (20%) hemoglobin ratio mark, unless of course you are expecting this in your sample.

```{r, cache = FALSE, message = FALSE, warning=FALSE}
# This set up the working directory to this file so all files can be found
library(rstudioapi)
setwd(fs::path_dir(getSourceEditorContext()$path))
stopifnot(R.version$major>= 4) # requires R4
if (compareVersion(R.version$minor,"3.1")<0) warning("We recommend >= R4.3.1")
stopifnot(compareVersion(as.character(BiocManager::version()), "3.16")>=0)
stopifnot(compareVersion(as.character(packageVersion("Seurat")), "5.1")>=0)
```


```{r load_libraries, cache = FALSE, message = FALSE, warning=FALSE, echo=FALSE,}
library(bcbioR)
library(knitr)
library(glue)
library(qs)
library(dplyr)
library(purrr)
library(ggplot2)
library(ggprism)
library(grafify)
library(ggpubr)
library(gridExtra)
library(scales)
library(Seurat)

import::from(magrittr,set_colnames,set_rownames,"%<>%")

invisible(list2env(params,environment()))
source(project_file)

ggplot2::theme_set(theme_prism(base_size = 12))
# https://grafify-vignettes.netlify.app/colour_palettes.html
# NOTE change colors here if you wish
scale_colour_discrete <- function(...)
  scale_colour_manual(...,
                      values = as.vector(grafify:::graf_palettes[["kelly"]]))
scale_fill_discrete <- function(...)
  scale_fill_manual(...,
                      values = as.vector(grafify:::graf_palettes[["kelly"]]))

opts_chunk[["set"]](
    cache = F,
    cache.lazy = FALSE,
    dev = c("png", "pdf"),
    error = TRUE,
    highlight = TRUE,
    message = FALSE,
    prompt = FALSE,
    tidy = FALSE,
    warning = FALSE,
    echo = T,
    fig.height = 4)

# set seed for reproducibility
set.seed(1234567890L)
```

## Project details

-   Project: `r project`
-   PI: `r PI`
-   Analyst: `r analyst`
-   Experiment: `r experiment`
-   Aim: `r aim`

```{r metric-calc}
# Metrics like `nCount` and `nfeature` are named with the suffix of default assay name, to make the variable usage more generalizable, we removed the suffix by pulling out the default assay of the visium `seurat` object.
visium <- qread(visiumHD_obj)
visium <- PercentageFeatureSet(visium, "^mt-", col.name = "percent_mito")
visium <- PercentageFeatureSet(visium, "^Hb.*-", col.name = "percent_hb")
metaD <- visium@meta.data
metaD$log10GenesPerUMI <- log10(metaD$nFeature)/log10(metaD$nCount)
colnames(metaD)%<>%gsub(pattern=glue("_{DefaultAssay(visium)}"),replacement="")
```

Let's take a quick look at the data and make a decision on whether we need to apply any filtering.

## Quality control per spot/bin{.tabset}

### Number of UMIs and genes detected per spot/bin

Those two metrics is really dependent on tissue type, RNA quality, and sequencing depth. Since the test data is generated from Visium HD technology, we use bin and corresponding reference thresholds in the plot. Reference line at 100 is plotted as the suggested cut-offs for both metrics.

```{r}
summary_metaD <- apply(metaD[,-1],2,mean)
metacol_label <- list("nFeature"="Genes","nCount"="UMI")
refs <- list("nFeature"=100,"nCount"=100)
dists_before <- imap(metacol_label,\(label,col)
  ggdensity(metaD,
          x = col,xscale="log10",add = "mean", rug = TRUE,
          alpha = 0.2,fill = "lightgray",
          xlab=glue("Number of {label} per bin(in log10 scale)"),
          ylab="Cell density",
          title=glue('Pre-QC {label}/Bin'))+
  geom_vline(xintercept = refs[[col]],color="darkred",cex=rel(1.3),linetype="dashed")+
  annotate("text",x=summary_metaD[col],y = Inf,
           label = glue("Mean \n = {round(summary_metaD[col],0)}"),
            vjust = 1,hjust=2)
)
dists_before[[1]] | dists_before[[2]]
```


### Overall complexity of transcriptional profile per spot/bin

We can evaluate each spot/bin in terms of how complex the RNA species are by using a measure called the novelty score. The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again.

With scRNA-seq this is more easily interpreted for a single cell, but for spatial data this would give us complexity of the spot, which is across multiple cells.

```{r complexity}
col <- "log10GenesPerUMI"
ggdensity(metaD,x = col,add = "mean", rug = TRUE,
          alpha = 0.2,fill = "lightgray",
          xlab="complexity",ylab="Cell density",title=glue('Novelty score'))+
  geom_vline(xintercept = 0.8,color="darkred",cex=rel(1.3),linetype="dashed")+
  annotate("text",x=summary_metaD[col],y = Inf,
           label = glue("Mean = {round(summary_metaD[col],0)}"),
            vjust = 1,hjust=2)+
    theme(plot.title = element_text(hjust=0.5, face="bold"))
```

### mitochondria & hemoglospot/bingene ratios

```{r}
ggplot(metaD %>%
         select(orig.ident,starts_with("percent_")) %>%
         tidyr::gather(class,percent_unexpected,-orig.ident),
        aes_string(x = "orig.ident", y = "percent_unexpected")) +
    geom_violin(position=position_dodge(1),alpha=1, na.rm=TRUE,trim=FALSE)+
    ggbeeswarm::geom_quasirandom(na.rm=TRUE,dodge.width=0.5,
                                 method='quasirandom',alpha=0.01)+
    geom_boxplot(width=0.1,outliers = F)+
   geom_hline(yintercept=20)+
   facet_grid(~class,scales = "free")+
    theme(
      axis.text.x = element_text(size=rel(1),face="bold"),
      plot.title = element_text(hjust = 0.5),
      strip.text.x = element_text(size = rel(1.5), colour = "black"),
      legend.position = "none"
          )+
  scale_y_log10(breaks=c(1,5,10,20,100))+
  # ylim(c(0,100))+
  labs(x="",y="% of contamination genes")
```

## QC metrics visualized on slides{.tabset}

Here, we can look at all the QC metrics we discussed above on the individual tissue slide.

```{r}
features2check <- c(glue('nCount_{DefaultAssay(visium)}'),
                    glue('nFeature_{DefaultAssay(visium)}'),
                    "percent_mito","percent_hb")
```

```{r spatial-plot,fig.height=5,fig.width=5,eval=T,results='asis'}
for(f in features2check){
  cat("### ", f, "\n\n")
  p1 <- SpatialFeaturePlot(visium,
                   feature = f,
                  pt.size.factor = 8)
  print(p1)
  cat("\n\n")
}
```

## Top expressed genes

Now, it is time to choose some cut-offs for QC metrics mentioned above and removing low-quality cells, as well as mitochondria, hemoglobin genes from the feature space and we can take a quick look at what are our top 20 expressed genes.

```{r filtering,fig.height=7,fig.width=7}
GeneVar <- glue('nFeature_{DefaultAssay(visium)}')
UMIVar <- glue('nCount_{DefaultAssay(visium)}')
cutoffs <- list("nFeature"=100,"nCount"=100,"hb"=20,"mito"=20)
Qced <-  visium@meta.data[,GeneVar] > cutoffs$nFeature &
     visium@meta.data[,UMIVar] > cutoffs$nCount &
     visium$percent_hb < cutoffs$hb &
     visium$percent_mito < cutoffs$mito
visium <- visium[,Qced]
# Filter Mitocondrial
visium <- visium[!grepl("^mt-", rownames(visium)), ]
# Filter Hemoglobin gene (optional if that is a problem on your data)
visium <- visium[!grepl("^Hb.*-", rownames(visium)), ]

C <- GetAssayData(visium, slot = "counts")
C@x <- C@x / rep.int(colSums(C), diff(C@p))
most_expressed <- order(Matrix::rowSums(C), decreasing = T)[20:1]
exprD <- as.data.frame(t(C[most_expressed, ])) %>%
  tibble::rownames_to_column("bin") %>%
  tidyr::gather(gene,expr,-bin)


ggplot(exprD,aes(x=gene,y=expr,color=gene,fill=gene))+
  geom_violin(position=position_dodge(1),alpha=0.5,
              na.rm=TRUE,trim=FALSE)+
  geom_boxplot(width=0.1,outliers = F,color="black")+
  theme_minimal()+
  theme(
      axis.text.x = element_text(size=rel(1),face="bold"),
      plot.title = element_text(hjust = 0.5),
      legend.position = "none"
    )+
  scale_y_log10(breaks=c(0.001,.01,.1,1),labels=c(0.1,1,10,100))+
  labs(x="",y="% of total UMIs/bin \n (log10 scaled)")+
  coord_flip()
```


```{r save_seurat}
if(!dir.exists(results_dir)){
  system(glue("mkdir -p {results_dir}"))
}
saveRDS(visium, file.path(results_dir, "01_qc.RDS"))
outputPath = file.path(results_dir, "01_qc.RDS")
```

We saved your qc-filled Seurat object in **`r outputPath`**.

## Methods


### Citation

```{r citations}
citation("Seurat")
```

### Session Information

```{r}
sessionInfo()
```