|
| 1 | +--- |
| 2 | +title: "Retrieving fastq files, Quality Assessment and counting" |
| 3 | +author: "Mark Dunning" |
| 4 | +date: "25 July 2018" |
| 5 | +output: html_document |
| 6 | +--- |
| 7 | + |
| 8 | +```{r setup, include=FALSE} |
| 9 | +knitr::opts_chunk$set(echo = TRUE) |
| 10 | +``` |
| 11 | + |
| 12 | +# Command-line analysis |
| 13 | + |
| 14 | + |
| 15 | +## Retrieve the fastq file |
| 16 | + |
| 17 | +We can download a fastq file from the Short Read Archive, provided we know it's location, using a `wget` unix command |
| 18 | + |
| 19 | +```{bash eval=FALSE} |
| 20 | +
|
| 21 | +### DO NOT RUN |
| 22 | +wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP045/SRP045534/SRR1552444/SRR1552444.sra |
| 23 | +``` |
| 24 | + |
| 25 | + |
| 26 | +The ftp site for Sequencing read archive can be accessed at `ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/` |
| 27 | + |
| 28 | +from there you can navigate to the folder containing a particular sequencing run |
| 29 | + |
| 30 | + |
| 31 | +## Extract the fastq |
| 32 | + |
| 33 | +The `sra-toolkit` provides various utilities for dealing with files in this format, including converting to the more popular `fastq` format. |
| 34 | + |
| 35 | +```{bash eval=FALSE} |
| 36 | +## DO NOT RUN - it will take too long |
| 37 | +fastq-dump SRR1552444.sra |
| 38 | +``` |
| 39 | + |
| 40 | +## Run the fastqc tool |
| 41 | + |
| 42 | +```{bash eval=FALSE} |
| 43 | +fastqc SRR1552444.fastq |
| 44 | +``` |
| 45 | + |
| 46 | + |
| 47 | +## Download reference transcripts |
| 48 | + |
| 49 | +```{bash eval=FALSE} |
| 50 | +wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.transcripts.fa.gz |
| 51 | +wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.annotation.gtf.gz |
| 52 | +``` |
| 53 | + |
| 54 | +## Creating a salmon index |
| 55 | + |
| 56 | +```{bash eval=FALSE} |
| 57 | +salmon index -i gencode_18 -t gencode.vM18.transcripts.fa.gz |
| 58 | +``` |
| 59 | + |
| 60 | +## Salmon quantification |
| 61 | + |
| 62 | +```{bash eval=FALSE} |
| 63 | +salmon quant -i gencode_18 -p 6 --libType A \ |
| 64 | +--gcBias --biasSpeedSamp 5 \ |
| 65 | +-r SRR1552444.fastq -o SRR1552444 |
| 66 | +
|
| 67 | +``` |
| 68 | + |
| 69 | +```{bash eval=FALSE} |
| 70 | +ls SRR1552444 |
| 71 | +head SRR1552444/quant.sf |
| 72 | +
|
| 73 | +``` |
| 74 | + |
| 75 | +# Analysis in Rstudio |
| 76 | + |
| 77 | +```{r} |
| 78 | +library(tximport) |
| 79 | +library(GenomicFeatures) |
| 80 | +``` |
| 81 | + |
| 82 | +## Find the quant files |
| 83 | + |
| 84 | +```{r} |
| 85 | +
|
| 86 | +quant_files <- "SRR1552444/quant.sf" |
| 87 | +``` |
| 88 | + |
| 89 | +## Make a transcript mapping file |
| 90 | + |
| 91 | +```{r} |
| 92 | +library(stringr) |
| 93 | +tmp <- read.delim("SRR1552444/quant.sf",stringsAsFactors = FALSE) |
| 94 | +txMap <- str_split_fixed(tmp$Name, pattern = "\\|", n=8) |
| 95 | +tx2gene <- data.frame(TXNAME=tmp$Name,GENE=txMap[,2]) |
| 96 | +head(tx2gene) |
| 97 | +``` |
| 98 | + |
| 99 | +## Import the transcripts |
| 100 | + |
| 101 | +```{r} |
| 102 | +txi <- tximport(quant_files, type="salmon", tx2gene=tx2gene) |
| 103 | +names(txi) |
| 104 | +``` |
| 105 | + |
| 106 | +```{r} |
| 107 | +head(txi$abundance) |
| 108 | +head(txi$counts) |
| 109 | +``` |
| 110 | + |
| 111 | + |
| 112 | +## References |
| 113 | + |
| 114 | +[https://bioconductor.github.io/BiocWorkshops/rna-seq-data-analysis-with-deseq2.html](DESeq2 tutorial from Bioconductor 2018 conference) |
0 commit comments