RNA-seq-data-analysis-workflow/episode_2.qmd at main · GenomicsAotearoa/RNA-seq-data-analysis-workflow · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
---
title: ""
code-overflow: wrap
---

# Quantifying Gene Expression

![](theme_images/alignment.png){width=80% fig-align="center"}


In order to draw conclusions about gene expression from our reads, we need to first identify which region of the genome reads originated from and then quantify the number of reads from that region. There are many tools available to carry out quantification, and we will not cover the arguments for and against each tool. We encourage you to carry out a literature search, and discuss with experienced individuals in your field, before committing to a particular quantification procedure. Here we will give a brief overview of two approaches with considerably different underlying methodology, give an example tool for each, and then demonstrate one tool for the purpose of progressing to the next stage of our workflow.

The first approach to read quantification involves aligning reads to the genome first, and then counting the number of reads that are aligned to certain genomic features (*e.g.,* exons). This method may generally be referred to as the 'count' method, but it is still a form of quantification, and the output is by default at the gene level.

The second approach, which we will not do today, involves creating transcript expression estimates via pseudoalignment, often referred to informally as 'pseudocounts'. In this method, reads are *not* aligned to the genome. Instead, they are pseudoaligned to kmers within the transcriptome and their abundance is estimated. While the output for these tools (e.g., Kallisto, Salmon) are by default at a transcript level, they can be converted to a gene level expression estimate.

In today's workflow we will demonstrate genome alignment and counting using HISAT2 and featureCounts.

## Alignment with HISAT2

### Preparation of the genome

When carrying out alignment, the first requirement is a genome which has been indexed. Indexing is a process to organise the genome so that our alignment algorithms can match reads to the genome easily, without having to scan the entire genome.

Navigate to the Genome directory and view the contents. You should see two files, a .gtf and a .fa file. We will use the .gtf file towards the end of this episode.

```bash
cd ~/RNA_seq/Genome

ls
```

The .gtf file (General Transfer Format) contains information about the genome, such as gene names *etc.,* all stored in a one-line-per-feature format, while the .fa (FASTA) file contains sequence.

We will now use hisat2 to index the genome using the `hisat2-build` command. We will specify the number of threads (`-p 4`), and specify the input file (`-f` for FASTA file type). The last argument is the prefix for the name of our .ht2 output file. You can use any name for this, but we recommend using something that specifies which version of the genome you used, for good scientific reproducibility.

```bash
# index file:
hisat2-build -p 4 -f Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

ls
```

We can now see many new .ht2 files, which make up the indexed version of the genome.

### Alignment on the genome

Now that we have an indexed genome, we can align our sequences. To do this we will need to know:

-   Where the sequence information is stored (*e.g.,* fastq files).

-   What kind of sequencing file we have (*e.g.,* Single end or Paired end).

-   Where the indexes and genome are stored.

-   Where the mapping files will be stored.

Once we have that information we are ready to align our sequences. First, navigate to the RNA_seq directory, and run `hisat2` on the first sample. We will use the `-x` flag to specify the basename of the index, the `-U` flag to specify the comma-separated list of files containing unpaired reads to be aligned, and the `-S` flag to write SAM alignment output files.

```bash
cd ~/RNA_seq

hisat2 -x Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U Trimmed/SRR014335-chr1.fastq -S SRR014335.sam
```

Output should look something like this:

```
125090 reads; of these:

  125090 (100.00%) were unpaired; of these:

    15108 (12.08%) aligned 0 times

    88991 (71.14%) aligned exactly 1 time

    20991 (16.78%) aligned >1 times

87.92% overall alignment rate
```

Now that we have confirmed this code works for our first sample, we will use a loop to align the rest of the samples. We will make a new directory called Mapping, which we will use to store all of our mapping output files. Then navigate to the Trimmed directory and execute the for loop to align all samples.

```bash
mkdir Mapping

cd ~/RNA_seq/Trimmed

for filename in *
do
base=$(basename ${filename} .trimmed.fastq)
hisat2 -p 4 -x ../Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U $filename -S ../Mapping/${base}.sam --summary-file ../Mapping/${base}_summary.txt
done
```

Check the output directory (Mapping) and see the different files that have been produced. You should see both .sam and _summary.txt files for each sample.

Let's look at the SAM file format.
```bash
less SRR014335-chr1.sam
```

The file begins with an optional header which is used to include human-readable metadata such as the source of the data, reference sequence, method of alignment *etc.,*. Following the header is the alignment section. Each line contains information corresponding to the alignment of a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information.


### Converting SAM to BAM

SAM files are tab-delimited text files containing information for each individual read and its alignment to the genome. SAM files are large, so the first thing we do with them is convert them to BAM files: compressed, binary versions of the file which reduce time and can themselves be indexed for efficiency when we interact with them.

We will convert the SAM file to the BAM format using the samtools program with the view command. Use the `-S` flag to specify the input is a SAM file, and the `-b` flag to specify BAM as the output.

Note that you will see a warning "fail to read the header from...", this is normal.
```bash
for filename in *.sam
do
base=$(basename ${filename} .sam)
samtools view -S -b ${filename} -o ${base}.bam
done

ls
```

Next we will sort the BAM files, which will mean our future steps are more efficient. From the samtools program we will use the sort command, with the `-o` flag to specify where the output file should go.

```bash
for filename in *.bam
do
base=$(basename ${filename} .bam)
samtools sort -o ${base}_sorted.bam ${filename}
done
```

### Mapping statistics

Using the sorted BAM file, you can now easily calculate some basic mapping statistics using the samtools flagstat command.

```bash
samtools flagstat SRR014335-chr1_sorted.bam
```

The output will look something like this:

```bash
161251 + 0 in total (QC-passed reads + QC-failed reads)
125090 + 0 primary
36161 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
146362 + 0 mapped (90.77% : N/A)
110201 + 0 primary mapped (88.10% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```
*Note: the many "0s" here are normal for single end sequencing. The second column designated by x + 0 is always 0, as we do not have a pair. As there are no pairs or mates in SE sequencing, the outputs specific to those parameters are shown as 0 + 0.*

Basic statistics shown by flagstat will be slightly different from those in the summary file generated by HISAT2 due to different "totals" that are used for comparisons. `flagstat` compares the number of alignments while HISAT2 compares the number of reads mapped. This is because reads can be mapped/aligned to more than one reference location, and these reads have a "primary" and "secondary" alignment (see section 1.2 of the [SAM specifications](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf)). For example, the percent overall alignment in the HISAT2 summary will be equivalent to the percent primary mapped evaluated by flagstat. To get the number of reads that aligned 0 times (summary file), the equivalent statistic from flagstat would be subtracting the number of mapped reads from the number of total alignments.

### MultiQC

HISAT2 output data can be incorporated into the MultiQC report. Copy the summary files generated by HISAT2 into the MultiQC directory and re-run multiqc.

Navigate to the updated report and view it in your browser.
```bash
cd ~/RNA_seq/MultiQC
cp ../Mapping/*summary* ./
multiqc .
```
![MultiQC after alignment](images/MQC_hisat2.png)

You might notice that the report labels the summary files as "Bowtie2". This is because the outputs are identical to those generated by Bowtie2, and MultiQC recognises them as Bowtie2.


## Read summarisation

Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome. This process is often called read summarisation or read quantification. Read summarisation is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarisation is a count table, in which the number of reads assigned to each feature in each library is recorded. In our case each feature is an exon. To carry out counting we will use the featureCounts tool from the Subread package.

Navigate to the RNA_seq directory, create a new directory called Counts and `cd` into that directory. From there run the `featureCounts` command which will require five different flags. We will use the `-a` flag to specify the annotation file, in this case the .gtf file we noticed earlier. The `-o` flag names the output file (which we will call `yeast_counts.txt`), while the `-T` flag specifies the number of threads/CPUs used for mapping. The `-t` flag is used to control the feature type in the GTF annotation file, we will use exon (which is also the default). Finally, the `-g` flag specifies the attribute type in the GTF file. We will use the `gene_id` option (again, the default).
```bash
cd ~/RNA_seq

mkdir Counts

cd Counts

featureCounts -a ../Genome/Saccharomyces_cerevisiae.R64-1-1.99.gtf -o ./yeast_counts.txt -T 2 -t exon -g gene_id ../Mapping/*sorted.bam
```


Update the MultiQC report by copying over the output data in Counts.

```bash
cd ~/RNA_seq/MultiQC

cp ../Counts/* .

multiqc .
```


We have now generated counts! We can now take these through to the next stage of our analysis: preparing to identify differentially expressed genes.


## Extra: Automating and optimising your workflow

Here we have been running each step of fastqc, trimming, indexing, aligning and sorting directly on the command line. This is not always best practise, for a few reasons, and what you will likely need to do is translate this code into a **script**. For one, it is easier to reproduce and automate your workflow when everything you do is in a script. But perhaps more importantly, when working on a HPC environment, your direct command line inputs are operating on the head node or log in node. Many of these software used require more CPUs/memory than can be allocated on the head node and therefore you need to submit your job to the HPC job scheduler (*e.g.,* SLURM or PBS) to run the job more efficiently on allocated compute resources.

See here our GA workshop on [Bash Scripting and HPC Job Scheduler](https://genomicsaotearoa.github.io/BioinformaticsTrainingProgramme/portfolio.html#bash-script-hpc-job).

As a quick example, this below is what a slurm script may look like for running the index and align steps. Note the use of things like relative paths -- you need to adjust the paths to work from the directory in which your script is submitted from.

Slurm scripts are text files, and use the .sl extension by convention. You can use `nano` to create a script.


```bash
nano index-align.sl
```

```bash
#!/bin/bash

#SBATCH --job-name      index-and-align
#SBATCH --account       nesi02659
#SBATCH --time          00:10:00                 #Format is DD-HH:MM:SS
#SBATCH --cpus-per-task 8
#SBATCH --mem           10G
#SBATCH --output        slurmjob.%j.out
#SBATCH --error         slurmjob.%j.err

# Load required modules
module load HISAT2/2.2.0-gimkl-2020a


# Index genome
hisat2-build \
-p 2 \
-f ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \
ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

# Align to indexed genome
for filename in trimmed_reads/*.fastq
do
      # Extract base name
      base=$(basename ${filename} .fastq)

      # Align to the reference genome
      hisat2 \
      -p 2 \
      -x ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel \
      -U $filename \
      -S results/sam/${base}.sam \
      --summary-file results/sam/${base}.summary.txt

done


```

This script only includes indexing and aligning, but you could include every step in the workflow in one script if you wish. Note that you can use a backslash (\\) to separate options/flags in your code across multiple lines to improve readability.

To submit the script, from your working directory you can type:

```bash
sbatch index-align.sl
```

To monitor the progress of the job, you can type:

```bash
squeue --me
```