Skip to content

Commit 410c600

Browse files
committed
Add new file for preparing fastq reads
1 parent 8520c93 commit 410c600

File tree

1 file changed

+114
-0
lines changed

1 file changed

+114
-0
lines changed

fastq.Rmd

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
---
2+
title: "Retrieving fastq files, Quality Assessment and counting"
3+
author: "Mark Dunning"
4+
date: "25 July 2018"
5+
output: html_document
6+
---
7+
8+
```{r setup, include=FALSE}
9+
knitr::opts_chunk$set(echo = TRUE)
10+
```
11+
12+
# Command-line analysis
13+
14+
15+
## Retrieve the fastq file
16+
17+
We can download a fastq file from the Short Read Archive, provided we know it's location, using a `wget` unix command
18+
19+
```{bash eval=FALSE}
20+
21+
### DO NOT RUN
22+
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045534/SRR1552444/SRR1552444.sra
23+
```
24+
25+
26+
The ftp site for Sequencing read archive can be accessed at `ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/`
27+
28+
from there you can navigate to the folder containing a particular sequencing run
29+
30+
31+
## Extract the fastq
32+
33+
The `sra-toolkit` provides various utilities for dealing with files in this format, including converting to the more popular `fastq` format.
34+
35+
```{bash eval=FALSE}
36+
## DO NOT RUN - it will take too long
37+
fastq-dump SRR1552444.sra
38+
```
39+
40+
## Run the fastqc tool
41+
42+
```{bash eval=FALSE}
43+
fastqc SRR1552444.fastq
44+
```
45+
46+
47+
## Download reference transcripts
48+
49+
```{bash eval=FALSE}
50+
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.transcripts.fa.gz
51+
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.annotation.gtf.gz
52+
```
53+
54+
## Creating a salmon index
55+
56+
```{bash eval=FALSE}
57+
salmon index -i gencode_18 -t gencode.vM18.transcripts.fa.gz
58+
```
59+
60+
## Salmon quantification
61+
62+
```{bash eval=FALSE}
63+
salmon quant -i gencode_18 -p 6 --libType A \
64+
--gcBias --biasSpeedSamp 5 \
65+
-r SRR1552444.fastq -o SRR1552444
66+
67+
```
68+
69+
```{bash eval=FALSE}
70+
ls SRR1552444
71+
head SRR1552444/quant.sf
72+
73+
```
74+
75+
# Analysis in Rstudio
76+
77+
```{r}
78+
library(tximport)
79+
library(GenomicFeatures)
80+
```
81+
82+
## Find the quant files
83+
84+
```{r}
85+
86+
quant_files <- "SRR1552444/quant.sf"
87+
```
88+
89+
## Make a transcript mapping file
90+
91+
```{r}
92+
library(stringr)
93+
tmp <- read.delim("SRR1552444/quant.sf",stringsAsFactors = FALSE)
94+
txMap <- str_split_fixed(tmp$Name, pattern = "\\|", n=8)
95+
tx2gene <- data.frame(TXNAME=tmp$Name,GENE=txMap[,2])
96+
head(tx2gene)
97+
```
98+
99+
## Import the transcripts
100+
101+
```{r}
102+
txi <- tximport(quant_files, type="salmon", tx2gene=tx2gene)
103+
names(txi)
104+
```
105+
106+
```{r}
107+
head(txi$abundance)
108+
head(txi$counts)
109+
```
110+
111+
112+
## References
113+
114+
[https://bioconductor.github.io/BiocWorkshops/rna-seq-data-analysis-with-deseq2.html](DESeq2 tutorial from Bioconductor 2018 conference)

0 commit comments

Comments
 (0)