-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCallPeak.Rmd
334 lines (298 loc) · 16.8 KB
/
CallPeak.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "ATAC-Seq Data Analysis with Human Data - Call Peaks, Perform Quality Control, and Annotate Peaks"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
html_document:
code_folding: show
---
```{r, shorcut, include=FALSE}
## RStudio keyboard shortcut
# Cursor at the beginning of a command line: Ctrl+A
# Cursor at the end of a command line: Ctrl+E
# Clear all the code from your console: Ctrl+L
# Create a pipe operator %>%: Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
# Create an assignment operator <-: Alt+- (Windows) or Option+-(Mac)
# Knit a document (knitr): Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Comment or uncomment current selection: Ctrl+Shift+C (Windows) or Cmd+Shift+C (Mac)
```
# Peak calling
## Install MACS2 using `pip`
```{bash}
/Users/your_name/Library/r-miniconda-arm64/bin/pip install macs2
# macs2 and pip are in the same virtual environment
```
## Install MACS2 using `conda`
```{bash}
# Reference: https://docs.conda.io/projects/conda/en/latest/commands/install.html
conda install -p /Users/your_name/Library/r-miniconda-arm64/envs/atac -c "bioconda/label/cf201901" macs2
```
## Install MACS2 using `Herper`
```{r}
# Reference:
# https://www.bioconductor.org/packages/release/bioc/vignettes/Herper/inst/doc/QuickStart.html
# https://anaconda.org/bioconda/macs2
library(Herper)
tool_dir <- install_CondaTools(tools = "macs3", env = "atac", updateEnv = TRUE)
# pathToMiniConda: NULL Path to miniconda installation
# updateEnv Update existing package's conda environment if already installed.
# Retured outcomes:
# * Using Miniconda at: /Users/your_name/Library/r-miniconda-arm64
# The environment 'atac' already exists but the tools were not installed because the 'updateEnv' argument was set to FALSE.
#
# Conda and Environment Information
# pathToConda : /Users/your_name/Library/r-miniconda-arm64/bin/conda
# environment : atac
# pathToEnvBin : /Users/your_name/Library/r-miniconda-arm64/envs/atac/bin
tool_dir
# $pathToConda
# [1] "/Users/your_name/Library/r-miniconda-arm64/bin/conda"
#
# $environment
# [1] "atac"
#
# $pathToEnvBin
# [1] "/Users/your_name/Library/r-miniconda-arm64/envs/atac/bin"
```
## [Outdated] Nucleosome-free peak calling from single-end ATAC-seq data
As we do not know the length of fragments, we typically shift the cutting ends (5') of a read by -100 in 3'->5' direction and then extend reads in 5'->3' direction by 200, which gives us a wider region to increase the chance of calling the peaks in nucleosome-free open regions.
```{r}
# Reference:
# https://pypi.org/project/MACS2/
# To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both directions to smooth the pileup signals. If the wanted smoothing window is 200bps, then use --nomodel --shift -100 --extsize 200.
# For certain nucleosome-seq data, we need to pile up the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: --nomodel --shift 37 --extsize 73.
# https://hbctraining.github.io/Intro-to-ChIPseq-flipped/lessons/06_peak_calling_macs.html
with_CondaEnv("atac",
system2(command = "macs2",
args = c("callpeak",
"-t", "single_end.bam",
"--nomodel",
"--shift", "-100",
"--extsize", "200",
"--format", "BAM",
"-g", "hs"),
stdout = T))
# -g: mappable genome size which is defined as the genome size which can be sequenced; some precompiled values provided. The default hs -- 2.7e9 is recommended for human genome. Here are all precompiled parameters for effective genome size: hs: 2.7e9 | mm: 1.87e9 | ce: 9e7 | dm: 1.2e8
# stdout: where output to ‘stdout’ should be sent. Possible values are "", to the R console (the default), NULL or FALSE (discard output), TRUE (capture the output in a character vector) or a character string naming a file.
```
## [Outdated] Nucleosome occupied peak calling from single-end ATAC-seq data
As we do not know the length of fragments, we typically shift the cutting ends (5') of a read by 37 in 3'->5' direction and then extend reads in 5'->3' direction by 73.
```{r}
with_CondaEnv("atac",
system2(command = "macs2",
args = c("callpeak",
"-t", "single_end.bam",
"--nomodel",
"--shift", "37", # Notice
"--extsize", "73", # Notice
"--format", "BAM",
"-g", "hs"),
stdout = T))
```
## Nucleosome-free peak calling from paired-end sequencing - no concerns on the correct specifications of `shift` and `extsize`
```{bash}
# Reference: https://pypi.org/project/MACS2/
# https://biohpc.cornell.edu/lab/doc/Chip-seq_workshop_lecture1.pdf
# -N/--NAME
# The name string of the experiment. MACS will use this string NAME to create output files like NAME_peaks.xls, NAME_negative_peaks.xls, NAME_peaks.bed , NAME_summits.bed, NAME_model.r and so on. So please avoid any confliction between these filenames and your existing files.
# -F/--FORMAT FORMAT
# Format of tag file can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. Default is AUTO which will allow MACS to decide the format automatically. AUTO is also useful when you combine different formats of files. Note that MACS can't detect BAMPE or BEDPE format with AUTO, and you have to implicitly specify the format for BAMPE and BEDPE.
# Nowadays, the most common formats are BED or BAM (including BEDPE and BAMPE). Our recommendation is to convert your data to BED or BAM first.
# Also, MACS2 can detect and read gzipped file. For example, .bed.gz file can be directly used without being uncompressed with --format BED.
# BEDPE or BAMPE
# A special mode will be triggered while the format is specified as BAMPE or BEDPE. In this way, MACS2 will process the BAM or BED files as paired-end data. Instead of building a bimodal distribution of plus and minus strand reads to predict fragment size, MACS2 will use actual insert sizes of pairs of reads to build fragment pileup.
# The BAMPE format is just a BAM format containing paired-end alignment information, such as those from BWA or BOWTIE.
/Users/your_name/Library/r-miniconda-arm64/bin/macs2 callpeak -t Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free.bam --outdir Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free_narrowpeak --format BAMPE --name female_lung -g hs
```
```{r suspend}
with_CondaEnv("atac",
system2(command = "macs2",
args = c("callpeak",
"-t", "Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free.bam", # Nucleosome-free open region BMA file
"--outdir", "Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free_narrowpeak/",
"--format", "BAMPE",
"--name", "female_lung",
"-g", "hs"),
stdout = T))
```
## Use MACS3-based MACSr to call peaks in the nucleosome-free open regions from paired-end sequencing
```{r}
# Reference: https://github.com/macs3-project/MACS/blob/master/docs/callpeak.md
library(MACSr)
cp1 <- callpeak(tfile="Sorted_ATAC_female_lung_bowtie2_mainchr_free.bam",
gsize = "hs",
format = "BAMPE",
name = "chr1_narrow",
outdir = "Sorted_ATAC_female_lung_bowtie2_mainchr_free_peak",
cutoff_analysis = T)
```
# Quality control - evaluate quality of reads in peaks (e.g., duplication rate, low quality reads, and reads in artifact region)
## Create an object containing quality metrics computed for a ATAC-seq sample
```{r}
# Blacklisting prevents the problematic regions (e.g., centromeres, telomeres, mitochondrial DNA, dense repeat regions, resiongs with high GC content or segmental duplications)
# Blacklist for hg19: https://www.encodeproject.org/files/ENCFF001TDO/
# Blacklist for hg38: https://www.encodeproject.org/files/ENCFF356LFX/
library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)
(black_hg19 <- import.bed("ENCFF001TDO.bed.gz"))
# GRanges object with 411 ranges and 2 metadata columns:
# seqnames ranges strand | name score
# <Rle> <IRanges> <Rle> | <character> <numeric>
# [1] chr1 564450-570371 * | High_Mappability_isl.. 1000
# [2] chr1 724137-727043 * | Satellite_repeat 1000
# [3] chr1 825007-825115 * | BSR/Beta 1000
# [4] chr1 2583335-2634374 * | Low_mappability_island 1000
# [5] chr1 4363065-4363242 * | (CATTC)n 1000
# ... ... ... ... . ... ...
# [407] chrY 28555027-28555353 * | TAR1 1000
# [408] chrY 28784130-28819695 * | Satellite_repeat 1000
# [409] chrY 58819368-58917648 * | (CATTC)n 1000
# [410] chrY 58971914-58997782 * | (CATTC)n 1000
# [411] chrY 59361268-59362785 * | TAR1 1000
# -------
# seqinfo: 25 sequences from an unspecified genome; no seqlengths
peak_open_region <- "Sorted_ATAC_peak.narrowPeak"
qc_res <- ChIPQCsample("Sorted_ATAC_open_region.bam",
peaks = peak_open_region,
annotation = "hg19",
chromosomes = "chr20",
blacklist = black_hg19,
verboseT = FALSE)
# https://support.bioconductor.org/p/64185/
```
```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
(black_hg38 <- import.bed("ENCFF356LFX.bed.gz"))
# GRanges object with 910 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 628904-635104 *
# [2] chr1 5850088-5850571 *
# [3] chr1 8909611-8910014 *
# [4] chr1 9574581-9574997 *
# [5] chr1 32043824-32044203 *
# ... ... ... ...
# [906] chrY 11290798-11334278 *
# [907] chrY 11493054-11592850 *
# [908] chrY 11671015-11671046 *
# [909] chrY 11721529-11749472 *
# [910] chrY 56694633-56889743 *
# -------
# seqinfo: 24 sequences from an unspecified genome; no seqlengths
peak_all_region <- "Sorted_ATAC_female_lung_bowtie2_peak/female_lung_peaks.Peak"
qc_res <- ChIPQCsample("Sorted_ATAC_female_lung_bowtie2.bam",
peaks = peak_all_region,
annotation = "hg38",
chromosomes = "chr17",
blacklist = black_hg38,
verboseT = FALSE)
```
## Compute QC metrices
```{r}
# Calculate percentage of reads in the peaks and reads in the blacklist
met_res <- QCmetrics(qc_res)
met_res[c("RiP%", "RiBL%")]
# RiP% RiBL%
# 35.0000 0.0245
# Calculate the percentage of duplicated reads among mapped reads
(flag_count <- flagtagcounts(qc_res))
# UnMapped Mapped Duplicates MapQPass MapQPassAndDup DuplicateByChIPQC
# 0 2175486 0 1928324 0 325481
(duplicate_rate <- flag_count["DuplicateByChIPQC"] / flag_count["Mapped"] * 100)
# DuplicateByChIPQC
# 14.9613
```
## Remove blacklisted peaks
Notice that NOT remove blacklisted peaks earlier, as such removal will hide the QC issues
```{r}
(target_call <- granges(qc_res[seqnames(qc_res) %in% "chr17"]))
# GRanges object with 4366 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr17 69625-69763 *
# [2] chr17 76711-76993 *
# [3] chr17 111994-112405 *
# [4] chr17 117170-118055 *
# [5] chr17 122955-123297 *
# ... ... ... ...
# [4362] chr17 83203172-83203459 *
# [4363] chr17 83205621-83205771 *
# [4364] chr17 83208395-83208833 *
# [4365] chr17 83229113-83229258 *
# [4366] chr17 83247185-83247332 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
data.frame(Peak_blacklisted = sum(target_call %over% black_hg38), # %over%: Finding overlapping ranges
Peak = sum(!target_call %over% black_hg38))
(peak_final <- target_call[!target_call %over% black_hg38])
# GRanges object with 4334 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr17 69625-69763 *
# [2] chr17 76711-76993 *
# [3] chr17 111994-112405 *
# [4] chr17 117170-118055 *
# [5] chr17 122955-123297 *
# ... ... ... ...
# [4330] chr17 83203172-83203459 *
# [4331] chr17 83205621-83205771 *
# [4332] chr17 83208395-83208833 *
# [4333] chr17 83229113-83229258 *
# [4334] chr17 83247185-83247332 *
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths
saveRDS(peak_final, file = "peak_final.RData")
```
# Peak annotation
## Associate peaks with genomic features (e.g., genes, promoters, enhancers)
```{r}
# Annotate each peak region with its closest gene
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
(peak_final_anno <- annotatePeak(peak = peak_final, TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene))
# Display the annotation information - where the peaks locate in the genome
plotAnnoPie(peak_final_anno)
# Promoter (<=1kb) 31.7951084
# 10 Promoter (1-2kb) 6.0913706
# 11 Promoter (2-3kb) 5.1453623
```
## Annotate peaks in TSS regions with genomic features
```{r}
peak_final_anno_2 <- as.GRanges(peak_final_anno)
# Select peaks landing in TSS regions (+/-500)
(peak_final_anno_tss <- peak_final_anno_2[abs(peak_final_anno_2$distanceToTSS) < 500])
# GRanges object with 1177 ranges and 9 metadata columns:
# seqnames ranges strand | annotation geneChr geneStart geneEnd geneLength
# <Rle> <IRanges> <Rle> | <character> <integer> <integer> <integer> <integer>
# [1] chr17 76711-76993 * | Promoter (<=1kb) 17 64099 76866 12768
# [2] chr17 215963-216247 * | Promoter (<=1kb) 17 213324 215933 2610
# [3] chr17 334101-334582 * | Promoter (<=1kb) 17 319522 334066 14545
# [4] chr17 409886-410730 * | Promoter (<=1kb) 17 404541 410031 5491
# [5] chr17 560021-560327 * | Promoter (<=1kb) 17 551602 559898 8297
# ... ... ... ... . ... ... ... ... ...
# [1173] chr17 82931632-82932035 * | Promoter (<=1kb) 17 82931949 82942571 10623
# [1174] chr17 82932766-82933033 * | Promoter (<=1kb) 17 82932687 82937847 5161
# [1175] chr17 83048574-83048763 * | Promoter (<=1kb) 17 82943791 83048782 104992
# [1176] chr17 83051487-83052188 * | Promoter (<=1kb) 17 82942149 83051770 109622
# [1177] chr17 83083363-83084219 * | Promoter (<=1kb) 17 83084073 83094844 10772
# geneStrand geneId transcriptId distanceToTSS
# <integer> <character> <character> <numeric>
# [1] 2 105377826 ENST00000623180.1 0
# [2] 2 9501 ENST00000572965.1 -30
# [3] 2 9501 ENST00000573588.5 -35
# [4] 2 105371430 ENST00000599026.1 0
# [5] 2 55275 ENST00000679978.1 -123
# ... ... ... ... ...
# [1173] 1 6904 ENST00000576603.5 0
# [1174] 1 6904 ENST00000573364.1 79
# [1175] 2 146712 ENST00000570947.5 19
# [1176] 2 146712 ENST00000320865.4 0
# [1177] 1 284207 ENST00000571814.1 0
# -------
# seqinfo: 1 sequence from hg38 genome
# Use the geneId to examine if the genes associated with the peaks in TSS regions show some interesting functional enrichment
peak_final_anno_tss$geneId
```