Glaucoma-variant-calling/Fayrouz at master · SamahElmahdi/Glaucoma-variant-calling · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159

### Variant calling for Glaucoma (primary closed angle glucoma PCAG)####

For our project we have used the reads from a bioproject for studing the Homo sapiens gemonics of Glaucoma the study was performed to identify the possible candidate gene variations involved in glaucoma pathology. We had sequenced both familial and single samples for the study.
https://www.ncbi.nlm.nih.gov/bioproject/394051

#creat enviroment#
conda activate ngs1

### Creat Directory###
cd workdir
mkdir Glaucomaproject && cd Glaucomaproject

### Get Sample reads (paired end reads by Ilumina)
###Select only 8 million readsby usind SRAtoolkit

***Install SRAtoolkit***

wget "http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz"
tar -xzf sratoolkit.current-centos_linux64.tar.gz
cd sratoolkit/ cd bin
./fastq-dump
sudo apt sratoolkit (it was asked to install)
./vdb-config -i # (enable remote access- enter- save- exit)

### https://ncbi.github.io/sra-tools/fastq-dump.html (Fastq-dump tool link)

fastq-dump -I --split-files -X 8000000 SRR5858157
### 8 million reads were selected to represent a sample reads (SRR5858157) with read length 93

*****### Fastqc for the two read sequences*****
mkdir fastqc && cd fastqc
for f in ~/workdir/Glaucomaproject/samples*.fastq.gz;do fastqc -t 1 -f fastq -noextract $f;done

##the results showed that Sequences flagged as poor quality are zero, all qc were very good
No need for trimming.
##Basic statstics
Measure                                    Value
Filename                               SRR5858157_1.fastq.gz
File type                              Conventional base calls
Encoding                               Sanger / Illumina 1.9
Total Sequences                        8000000
Sequences flagged as poor quality      0
Sequence length                        93
%GC                                    45


Filename                              SRR5858157_2.fastq.gz
File type                             Conventional base calls
Encoding                              Sanger / Illumina 1.9
Total Sequences                       8000000
Sequences flagged as poor quality     0
Sequence length                       93
%GC                                   46


*****Alignement by splicing aware aligner Hisat2****

##for alignement with hisat2 as the reads represents exome
## Reference chromosome 20
 ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
## annotated whole genome
ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz
##select chromosome 20
gunzip Homo_sapiens.GRCh38.99.chr.gtf.gz -k
grep "^#" Homo_sapiens.GRCh38.99.chr.gtf > Homo_sapiens.GRCh38.99.chr_20.gtf
grep "^20" Homo_sapiens.GRCh38.99.chr.gtf >> Homo_sapiens.GRCh38.99.chr_20.gtf


##***Indexing step***##

mkdir hisatindex && cd hisatindex
gunzip ~/workdir/Glaucomaproject/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz -k
hisat2_extract_splice_sites.py Glaucomaproject/Homo_sapiens.GRCh38.99.chr_20.gtf > splicesites.tsv
hisat2_extract_exons.py Glaucomaproject/Homo_sapiens.GRCh38.99.chr_20.gtf > exons.tsv
hisat2-build -p 1 --ss splicesites.tsv --exon exon.tsv Glaucomaproject/Homo_sapiens.GRCh38.dna.chromosome.20.fa Homo_sapiens.GRCh38.dna.chr.20

## **Creat read group and Alignment**##

R1="$HOME/workdir/Glaucomaproject/sample/SRR5858157_1.fastq"
R2="$HOME/workdir/Glaucomaproject/sample/SRR5858157_2.fastq"
hisat2 -p 1 --rg ID:SRR5858157_F8_III.4 --rg SM:SRR5858157_F8_III.4 --rg PL:ILLUMINA --rg LB:SRR5858157_F8_III.4 --rg PU:SRR5858157_F8_III.4 -x hisatindex/Homo_sapiens.GRCh38.chr20 -1 $R1 -2 $R2 -S SRR5858157.sam

###--rg LB:SRR5858157_F8_III.4 --rg PU:SRR5858157_F8_III.4 -x hisatindex/Homo_sapiens.GRCh38.chr20 -1 $R1 -2 $R2 -S SRR5858157.sam
8000000 reads; of these:
  8000000 (100.00%) were paired; of these:
    7833912 (97.92%) aligned concordantly 0 times
    160589 (2.01%) aligned concordantly exactly 1 time
    5499 (0.07%) aligned concordantly >1 times
    ----
    7833912 pairs aligned concordantly 0 times; of these:
      3505 (0.04%) aligned discordantly 1 time
    ----
    7830407 pairs aligned 0 times concordantly or discordantly; of these:
      15660814 mates make up the pairs; of these:
        15585822 (99.52%) aligned 0 times
        66968 (0.43%) aligned exactly 1 time
        8024 (0.05%) aligned >1 times
2.59% overall alignment rate


### GATK Directory##
mkdir GATk && cd GATk
### **Generate & sort BAM file** ####
samtools view -hbo SRR5858157.bam SRR5858157.sam
samtools sort SRR5858157.bam -o SRR5858157.sorted.bam

### **Mapping QC** ###
samtools depth SRR5858157.sorted.bam | awk '{{sum+=$3}} END {{print "Average = ",sum/NR, "No of covered Nuc = ", NR}}' > SRR5858157.cov
samtools flagstat SRR5858157.sorted.bam > SRR5858157.stat

### Average =  0.713272 No of covered Nuc =  51627763

##***Mark duplicate***##

picard_path=$CONDA_PREFIX/share/picard-* ## 2.21.7-0

for sample in *.sorted.bam;do
  name=${sample%.sorted.bam}
  java  -Xmx2g -jar $picard_path/picard.jar MarkDuplicates INPUT=$sample OUTPUT=$name.dedup.bam METRICS_FILE=$name.metrics.txt;
done

## METRICS CLASS        picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED     SECONDARY_OR_SUPPLEMENTARY_RDS  UNMAPPED_READS  U
SRR5858157_F8_III.4     69064   172557  61372   15585822        34402   0       0       0.083061

##**Indexing**##
# sample indexing
picard_path=$CONDA_PREFIX/share/picard-* ## 2.21.7-0
  java -Xmx2g -jar $picard_path/picard.jar BuildBamIndex VALIDATION_STRINGENCY=LENIENT INPUT=SRR5858157.dedup.bam OUTPUT=SRR5858157
  #Reference Indexing
  ln -s ~/workdir/Glaucomaproject/Homo_sapiens.GRCh38.dna.chromosome.20.fa
  java -Xmx2g -jar $picard_path/picard.jar CreateSequenceDictionary R=Homo_sapiens.GRCh38.dna.chromosome.20.fa O=Homo_sapiens.GRCh38.dna.chromosome.20.dict
samtools faidx Homo_sapiens.GRCh38.dna.chromosome.20.fa
#### output file Homo_sapiens.GRCh38.dna.chromosome.20.fa.fai####
##**Download known varinats##**
wget ftp://ftp.ensembl.org/pub/release-99/variation/vcf/homo_sapiens/homo_sapiens-chr20.vcf.gz
gunzip homo_sapiens-chr20.vcf.gz -k

## we don't need to change the chromosome from "20" to "chr20" as our files already donates the chromosome as number "20" only.
## Index features for the known variants##
gatk IndexFeatureFile -I homo_sapiens-chr20.vcf

##**Recalibrate Bases BQSR##**

## The base recalibration process involves two key steps: first one tool (BaseRecalibrator) builds a model of covariation based on the data and a set of known variants.
then Apply BQSR that adjusts the base quality scores in the data based on the model.

gatk --java-options "-Xmx2G" BaseRecalibrator \
-R Homo_sapiens.GRCh38.dna.chromosome.20.fa -I SRR5858157.dedup.bam --known-sites homo_sapiens-chr20.vcf \
-O SRR5858157.report
gatk --java-options "-Xmx2G" ApplyBQSR \
-R Homo_sapiens.GRCh38.dna.chromosome.20.fa -I SRR5858157.dedup.bam -bqsr SRR5858157.report \
-O SRR5858157.bqsr.bam --add-output-sam-program-record --emit-original-quals


###**Joint variant calling using HaplotypeCaller**###
## assess genotype likelihood per-sample
gatk --java-options "-Xmx2G" HaplotypeCaller -R Homo_sapiens.GRCh38.dna.chromosome.20.fa -I SRR5858157.bqsr.bam --emit-ref-confidence GVCF --pcr-indel-model NONE -O SRR5858157.gvcf