ATACSeq/CallPeak.Rmd

---
title: "ATAC-Seq Data Analysis with Human Data - Call Peaks, Perform Quality Control, and Annotate Peaks"
author: "Anni Liu"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output: 
  html_document:
    code_folding: show
---

```{r, shorcut, include=FALSE}
## RStudio keyboard shortcut
# Cursor at the beginning of a command line: Ctrl+A
# Cursor at the end of a command line: Ctrl+E
# Clear all the code from your console: Ctrl+L
# Create a pipe operator %>%: Ctrl+Shift+M (Windows) or Cmd+Shift+M (Mac)
# Create an assignment operator <-: Alt+- (Windows) or Option+-(Mac) 
# Knit a document (knitr): Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Comment or uncomment current selection: Ctrl+Shift+C (Windows) or Cmd+Shift+C (Mac)
```

# Peak calling
## Install MACS2 using `pip`
```{bash}
 /Users/your_name/Library/r-miniconda-arm64/bin/pip install macs2
 # macs2 and pip are in the same virtual environment
```

## Install MACS2 using `conda`
```{bash}
# Reference: https://docs.conda.io/projects/conda/en/latest/commands/install.html
conda install -p /Users/your_name/Library/r-miniconda-arm64/envs/atac -c "bioconda/label/cf201901" macs2
```

## Install MACS2 using `Herper`
```{r}
# Reference: 
# https://www.bioconductor.org/packages/release/bioc/vignettes/Herper/inst/doc/QuickStart.html
# https://anaconda.org/bioconda/macs2
library(Herper)
tool_dir <- install_CondaTools(tools = "macs3", env = "atac", updateEnv = TRUE)
# pathToMiniConda: NULL Path to miniconda installation
# updateEnv	Update existing package's conda environment if already installed.

# Retured outcomes:
# * Using Miniconda at: /Users/your_name/Library/r-miniconda-arm64
# The environment 'atac' already exists but the tools were not installed because the 'updateEnv' argument was set to FALSE. 
# 
# Conda and Environment Information
# pathToConda : /Users/your_name/Library/r-miniconda-arm64/bin/conda
# environment : atac
# pathToEnvBin : /Users/your_name/Library/r-miniconda-arm64/envs/atac/bin

tool_dir
# $pathToConda
# [1] "/Users/your_name/Library/r-miniconda-arm64/bin/conda"
# 
# $environment
# [1] "atac"
# 
# $pathToEnvBin
# [1] "/Users/your_name/Library/r-miniconda-arm64/envs/atac/bin"
```

## [Outdated] Nucleosome-free peak calling from single-end ATAC-seq data
As we do not know the length of fragments, we typically shift the cutting ends (5') of a read by -100 in 3'->5' direction and then extend reads in 5'->3' direction by 200, which gives us a wider region to increase the chance of calling the peaks in nucleosome-free open regions.
```{r}
# Reference: 
# https://pypi.org/project/MACS2/
# To find enriched cutting sites such as some DNAse-Seq datasets. In this case, all 5' ends of sequenced reads should be extended in both directions to smooth the pileup signals. If the wanted smoothing window is 200bps, then use --nomodel --shift -100 --extsize 200.
# For certain nucleosome-seq data, we need to pile up the centers of nucleosomes using a half-nucleosome size for wavelet analysis (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about 147bps, this option can be used: --nomodel --shift 37 --extsize 73.

# https://hbctraining.github.io/Intro-to-ChIPseq-flipped/lessons/06_peak_calling_macs.html
with_CondaEnv("atac",
              system2(command = "macs2",
                      args = c("callpeak",
                               "-t", "single_end.bam",
                               "--nomodel",
                               "--shift", "-100",
                               "--extsize", "200",
                               "--format", "BAM",
                               "-g", "hs"),
                      stdout = T))
# -g: mappable genome size which is defined as the genome size which can be sequenced; some precompiled values provided. The default hs -- 2.7e9 is recommended for human genome. Here are all precompiled parameters for effective genome size: hs: 2.7e9 | mm: 1.87e9 | ce: 9e7 | dm: 1.2e8
# stdout: where output to ‘stdout’ should be sent. Possible values are "", to the R console (the default), NULL or FALSE (discard output), TRUE (capture the output in a character vector) or a character string naming a file.
```

## [Outdated] Nucleosome occupied peak calling from single-end ATAC-seq data
As we do not know the length of fragments, we typically shift the cutting ends (5') of a read by 37 in 3'->5' direction and then extend reads in 5'->3' direction by 73.
```{r}
with_CondaEnv("atac",
              system2(command = "macs2", 
                      args = c("callpeak", 
                               "-t", "single_end.bam",
                               "--nomodel",
                               "--shift", "37", # Notice
                               "--extsize", "73", # Notice
                               "--format", "BAM",
                               "-g", "hs"),
                      stdout = T))
```

## Nucleosome-free peak calling from paired-end sequencing - no concerns on the correct specifications of `shift` and `extsize`
```{bash}
# Reference: https://pypi.org/project/MACS2/
# https://biohpc.cornell.edu/lab/doc/Chip-seq_workshop_lecture1.pdf
# -N/--NAME
# The name string of the experiment. MACS will use this string NAME to create output files like NAME_peaks.xls, NAME_negative_peaks.xls, NAME_peaks.bed , NAME_summits.bed, NAME_model.r and so on. So please avoid any confliction between these filenames and your existing files.

# -F/--FORMAT FORMAT
# Format of tag file can be ELAND, BED, ELANDMULTI, ELANDEXPORT, SAM, BAM, BOWTIE, BAMPE, or BEDPE. Default is AUTO which will allow MACS to decide the format automatically. AUTO is also useful when you combine different formats of files. Note that MACS can't detect BAMPE or BEDPE format with AUTO, and you have to implicitly specify the format for BAMPE and BEDPE.
# Nowadays, the most common formats are BED or BAM (including BEDPE and BAMPE). Our recommendation is to convert your data to BED or BAM first.
# Also, MACS2 can detect and read gzipped file. For example, .bed.gz file can be directly used without being uncompressed with --format BED.

# BEDPE or BAMPE
# A special mode will be triggered while the format is specified as BAMPE or BEDPE. In this way, MACS2 will process the BAM or BED files as paired-end data. Instead of building a bimodal distribution of plus and minus strand reads to predict fragment size, MACS2 will use actual insert sizes of pairs of reads to build fragment pileup.
# The BAMPE format is just a BAM format containing paired-end alignment information, such as those from BWA or BOWTIE.

/Users/your_name/Library/r-miniconda-arm64/bin/macs2 callpeak -t Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free.bam --outdir Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free_narrowpeak --format BAMPE --name female_lung -g hs
```


```{r suspend}
with_CondaEnv("atac",
              system2(command = "macs2",
                      args = c("callpeak",
                               "-t", "Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free.bam", # Nucleosome-free open region BMA file
                               "--outdir", "Documents/WCM_Project/ATAC_seq/Sorted_ATAC_female_lung_bowtie2_chr1X_free_narrowpeak/",
                               "--format", "BAMPE",
                               "--name", "female_lung",
                               "-g", "hs"),
                      stdout = T))
```

## Use MACS3-based MACSr to call peaks in the nucleosome-free open regions from paired-end sequencing
```{r}
# Reference: https://github.com/macs3-project/MACS/blob/master/docs/callpeak.md
library(MACSr)
cp1 <- callpeak(tfile="Sorted_ATAC_female_lung_bowtie2_mainchr_free.bam", 
                gsize = "hs", 
                format = "BAMPE",
                name = "chr1_narrow", 
                outdir = "Sorted_ATAC_female_lung_bowtie2_mainchr_free_peak",
                cutoff_analysis = T)
```

# Quality control - evaluate quality of reads in peaks (e.g., duplication rate, low quality reads, and reads in artifact region)
## Create an object containing quality metrics computed for a ATAC-seq sample
```{r}
# Blacklisting prevents the problematic regions (e.g., centromeres, telomeres, mitochondrial DNA, dense repeat regions, resiongs with high GC content or segmental duplications)
# Blacklist for hg19: https://www.encodeproject.org/files/ENCFF001TDO/
# Blacklist for hg38: https://www.encodeproject.org/files/ENCFF356LFX/
library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)

(black_hg19 <- import.bed("ENCFF001TDO.bed.gz"))
# GRanges object with 411 ranges and 2 metadata columns:
#         seqnames            ranges strand |                   name     score
#            <Rle>         <IRanges>  <Rle> |            <character> <numeric>
#     [1]     chr1     564450-570371      * | High_Mappability_isl..      1000
#     [2]     chr1     724137-727043      * |       Satellite_repeat      1000
#     [3]     chr1     825007-825115      * |               BSR/Beta      1000
#     [4]     chr1   2583335-2634374      * | Low_mappability_island      1000
#     [5]     chr1   4363065-4363242      * |               (CATTC)n      1000
#     ...      ...               ...    ... .                    ...       ...
#   [407]     chrY 28555027-28555353      * |                   TAR1      1000
#   [408]     chrY 28784130-28819695      * |       Satellite_repeat      1000
#   [409]     chrY 58819368-58917648      * |               (CATTC)n      1000
#   [410]     chrY 58971914-58997782      * |               (CATTC)n      1000
#   [411]     chrY 59361268-59362785      * |                   TAR1      1000
#   -------
#   seqinfo: 25 sequences from an unspecified genome; no seqlengths
peak_open_region <- "Sorted_ATAC_peak.narrowPeak"
qc_res <- ChIPQCsample("Sorted_ATAC_open_region.bam", 
                       peaks = peak_open_region,
                       annotation = "hg19",
                       chromosomes = "chr20",
                       blacklist = black_hg19,
                       verboseT = FALSE)
# https://support.bioconductor.org/p/64185/
```


```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
(black_hg38 <- import.bed("ENCFF356LFX.bed.gz"))
# GRanges object with 910 ranges and 0 metadata columns:
#         seqnames            ranges strand
#            <Rle>         <IRanges>  <Rle>
#     [1]     chr1     628904-635104      *
#     [2]     chr1   5850088-5850571      *
#     [3]     chr1   8909611-8910014      *
#     [4]     chr1   9574581-9574997      *
#     [5]     chr1 32043824-32044203      *
#     ...      ...               ...    ...
#   [906]     chrY 11290798-11334278      *
#   [907]     chrY 11493054-11592850      *
#   [908]     chrY 11671015-11671046      *
#   [909]     chrY 11721529-11749472      *
#   [910]     chrY 56694633-56889743      *
#   -------
#   seqinfo: 24 sequences from an unspecified genome; no seqlengths
peak_all_region <- "Sorted_ATAC_female_lung_bowtie2_peak/female_lung_peaks.Peak"
qc_res <- ChIPQCsample("Sorted_ATAC_female_lung_bowtie2.bam", 
                       peaks = peak_all_region,
                       annotation = "hg38",
                       chromosomes = "chr17",
                       blacklist = black_hg38,
                       verboseT = FALSE)
```


## Compute QC metrices 
```{r}
# Calculate percentage of reads in the peaks and reads in the blacklist
met_res <- QCmetrics(qc_res)
met_res[c("RiP%", "RiBL%")]
#    RiP%   RiBL% 
# 35.0000  0.0245 
   
# Calculate the percentage of duplicated reads among mapped reads
(flag_count <- flagtagcounts(qc_res))
         # UnMapped            Mapped        Duplicates          MapQPass    MapQPassAndDup DuplicateByChIPQC 
         #        0           2175486                 0           1928324                 0            325481 

(duplicate_rate <- flag_count["DuplicateByChIPQC"] / flag_count["Mapped"] * 100)
# DuplicateByChIPQC 
#           14.9613 
```


## Remove blacklisted peaks
Notice that NOT remove blacklisted peaks earlier, as such removal will hide the QC issues
```{r}
(target_call <- granges(qc_res[seqnames(qc_res) %in% "chr17"]))
# GRanges object with 4366 ranges and 0 metadata columns:
#          seqnames            ranges strand
#             <Rle>         <IRanges>  <Rle>
#      [1]    chr17       69625-69763      *
#      [2]    chr17       76711-76993      *
#      [3]    chr17     111994-112405      *
#      [4]    chr17     117170-118055      *
#      [5]    chr17     122955-123297      *
#      ...      ...               ...    ...
#   [4362]    chr17 83203172-83203459      *
#   [4363]    chr17 83205621-83205771      *
#   [4364]    chr17 83208395-83208833      *
#   [4365]    chr17 83229113-83229258      *
#   [4366]    chr17 83247185-83247332      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

data.frame(Peak_blacklisted = sum(target_call %over% black_hg38), # %over%: Finding overlapping ranges
           Peak = sum(!target_call %over% black_hg38))

(peak_final <- target_call[!target_call %over% black_hg38])
# GRanges object with 4334 ranges and 0 metadata columns:
#          seqnames            ranges strand
#             <Rle>         <IRanges>  <Rle>
#      [1]    chr17       69625-69763      *
#      [2]    chr17       76711-76993      *
#      [3]    chr17     111994-112405      *
#      [4]    chr17     117170-118055      *
#      [5]    chr17     122955-123297      *
#      ...      ...               ...    ...
#   [4330]    chr17 83203172-83203459      *
#   [4331]    chr17 83205621-83205771      *
#   [4332]    chr17 83208395-83208833      *
#   [4333]    chr17 83229113-83229258      *
#   [4334]    chr17 83247185-83247332      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

saveRDS(peak_final, file = "peak_final.RData")
```

# Peak annotation 
## Associate peaks with genomic features (e.g., genes, promoters, enhancers)
```{r}
# Annotate each peak region with its closest gene
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
(peak_final_anno <- annotatePeak(peak = peak_final, TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene))

# Display the annotation information - where the peaks locate in the genome
plotAnnoPie(peak_final_anno)
# Promoter (<=1kb)	31.7951084		
# 10	Promoter (1-2kb)	6.0913706		
# 11	Promoter (2-3kb)	5.1453623	
```


## Annotate peaks in TSS regions with genomic features
```{r}
peak_final_anno_2 <- as.GRanges(peak_final_anno)

# Select peaks landing in TSS regions (+/-500)
(peak_final_anno_tss <- peak_final_anno_2[abs(peak_final_anno_2$distanceToTSS) < 500])

# GRanges object with 1177 ranges and 9 metadata columns:
#          seqnames            ranges strand |       annotation   geneChr geneStart   geneEnd geneLength
#             <Rle>         <IRanges>  <Rle> |      <character> <integer> <integer> <integer>  <integer>
#      [1]    chr17       76711-76993      * | Promoter (<=1kb)        17     64099     76866      12768
#      [2]    chr17     215963-216247      * | Promoter (<=1kb)        17    213324    215933       2610
#      [3]    chr17     334101-334582      * | Promoter (<=1kb)        17    319522    334066      14545
#      [4]    chr17     409886-410730      * | Promoter (<=1kb)        17    404541    410031       5491
#      [5]    chr17     560021-560327      * | Promoter (<=1kb)        17    551602    559898       8297
#      ...      ...               ...    ... .              ...       ...       ...       ...        ...
#   [1173]    chr17 82931632-82932035      * | Promoter (<=1kb)        17  82931949  82942571      10623
#   [1174]    chr17 82932766-82933033      * | Promoter (<=1kb)        17  82932687  82937847       5161
#   [1175]    chr17 83048574-83048763      * | Promoter (<=1kb)        17  82943791  83048782     104992
#   [1176]    chr17 83051487-83052188      * | Promoter (<=1kb)        17  82942149  83051770     109622
#   [1177]    chr17 83083363-83084219      * | Promoter (<=1kb)        17  83084073  83094844      10772
#          geneStrand      geneId      transcriptId distanceToTSS
#           <integer> <character>       <character>     <numeric>
#      [1]          2   105377826 ENST00000623180.1             0
#      [2]          2        9501 ENST00000572965.1           -30
#      [3]          2        9501 ENST00000573588.5           -35
#      [4]          2   105371430 ENST00000599026.1             0
#      [5]          2       55275 ENST00000679978.1          -123
#      ...        ...         ...               ...           ...
#   [1173]          1        6904 ENST00000576603.5             0
#   [1174]          1        6904 ENST00000573364.1            79
#   [1175]          2      146712 ENST00000570947.5            19
#   [1176]          2      146712 ENST00000320865.4             0
#   [1177]          1      284207 ENST00000571814.1             0
#   -------
#   seqinfo: 1 sequence from hg38 genome

# Use the geneId to examine if the genes associated with the peaks in TSS regions show some interesting functional enrichment
peak_final_anno_tss$geneId 
```