-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathscATAC-Seq_preprocessing_PBMCs_Signac.qmd
186 lines (131 loc) · 5.85 KB
/
scATAC-Seq_preprocessing_PBMCs_Signac.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
title: "Scrip to pre-process Single Cell ATAC-Seq data"
author: "Monica L. Rojas-Pena"
format: html
editor: visual
---
## scATAC-Seq analysis
This a pipeline for preprocessing scATAC-Seq data to identify clusters
## Dataset
Single-cell ATAC-Seq (scATAC-Seq) dataset from human peripheral blood mononuclear cells (PBMCs) from 10X genomics.
The goal is to preprocess scATAC-Seq data using the Signac R package
```{r}
#Vignette: https://stuartlab.org/signac/articles/pbmc_vignette
#Intall packages
remotes::install_github("stuart-lab/signac", ref="develop")
install.packages("Matrix", type = "source")
install.packages("irlba", type = "source")
bio_pkgs = c("EnsDb.Hsapiens.v75", "biovizBase")
BiocManager::install(bio_pkgs)
#libraries
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v75)
library(biovizBase)
library(tidyverse)
library(irlba)
library(Matrix)
```
The files we will be using for the analysis are:
1. Fragment file
2. Fragment index file (.tbi)
3. Raw data
4. Metadata
```{r}
#reading raw data
counts <- Read10X_h5('atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5')
counts[1:10, 1:10]
metadata <- read.csv(file = 'atac_v1_pbmc_10k_singlecell.csv', head = T, row.names= 1)
head(metadata)
```
Here we can see the regions and rows and each value in the matrix represents the number of Tn5 integration sites for each single barcode.
```{r}
#first we have to create an assay object
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = 'atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
str(chrom_assay)
#creating a Seurat object
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "ATAC",
meta.data = metadata
)
str(pbmc)
```
Next we add the gene annotation
```{r}
#adding gene annotations to the seurat object (NULL)
pbmc@assays$ATAC@annotation
# extract gene annotations from EnsDb
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75)
# change to UCSC style since the data was mapped to hg19 (since the chromosomes are missing the chr prefix)
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
# add the gene information to the object
Annotation(pbmc) <- annotations
pbmc@assays$ATAC@annotation
```
## Quality control
To asses the quality of the data we need to see at these 5 metrics:
1. Nucleosome banding pattern/Nucleosome Signal: ratio of mononucleosome fragments over nucleosome-free fragment. Cells with low nucleosome signal, are cells that have an enrichment of fragments that are not bind by nucleosomes.
2. Transcriptional start site (TSS) enrichment score: The fragments from the nucleosome free regions are expected to be enriched around the TSS, is the regions are from the open region not bind by any nucleosome, the gene in this genes will be actively transcribe or regulated, so it is expected to have an enrichment around the transcription start sides of this genes.
TSS enrichment score = Fragments centered at TSS over Fragments in TSS-flanking regions. We would like a high value.
3. Total number of fragments in peaks: Measure of cellular sequencing depth/complexity, low depth reads are removed.
4. Ratio reads in genomic blacklist regions: Cells with \< 15-20% fragments in peaks are low quality or technical artifacts.
5. Ratio reads in genomic blacklist regions (encode project): Cell with high proportion of reads mapping to regions associated with artifactual signals.
```{r}
# compute nucleosome signal score per cell
pbmc <- NucleosomeSignal(pbmc)
# compute TSS enrichment score per cell
pbmc <- TSSEnrichment(object = pbmc, fast = FALSE)
# add blacklist ratio and fraction of reads in peaks
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$passed_filters * 100
View([email protected])
```
Now we visualize the QC results to decide the cut off for further analysis
```{r}
# Visualizing QC
#generate an scater plot with the densities
colnames([email protected])
a1 <- DensityScatter(pbmc, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
a2 <- DensityScatter(pbmc, x = 'nucleosome_signal', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
a1 | a2
VlnPlot(object = pbmc,
features = c('nCount_ATAC', 'nFeature_ATAC', 'TSS.enrichment', 'nucleosome_signal', 'blacklist_ratio', 'pct_reads_in_peaks'),
pt.size = 0.1,
ncol = 6)
```
Based on the visualization now we perform a filter of poor quality data, we retain cells with greater than 3000 and less than 30000 counts, having more than 15% reads in the pics, having a blacklist ration less than 0.05, a nuclosome signal less than 4, and a TSS enrichment greater than 3.
```{r}
# Filtering poor quality cells
pbmc <- subset(x = pbmc,
subset = nCount_ATAC > 3000 &
nCount_ATAC < 30000 &
pct_reads_in_peaks > 15 &
blacklist_ratio < 0.05 &
nucleosome_signal < 4 &
TSS.enrichment > 3)
```
After filtering we run the normalization step, this step normalizes based on differences in cellular sequencing depth across the cells, and also across the peaks.
```{r}
# Normalization and linear dimensional reduction
pbmc <- RunTFIDF(pbmc) # normalization
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0') # selecting top features
pbmc <- RunSVD(pbmc) # dimensionality reduction similar to PCA
DepthCor(pbmc) #correlates sequencing depth with the reduce dimension (first component -technical variation, so we exclude this for following steps)
```
```{r}
#Non-linear dimensional reduction and Clustering
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30) #excluding the first component
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
```
```{r}
sessionInfo()
```