-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path08-PacBioAssembly.Rmd
More file actions
139 lines (91 loc) · 4.24 KB
/
08-PacBioAssembly.Rmd
File metadata and controls
139 lines (91 loc) · 4.24 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# Pacbio Shotgun Assembly and Binning
## Data
We'll use one of the PacBio shotgun metagenome samples obtained from European beech (Fagus sylvatica L.) deadwood. See the "Additional Data" chapter to see more information about the study and additional samples and sequence datatypes that are available for practice.
The data for SRR28211699 has been downloaded (/home/data/metagenomics/pacbio_shotgun/SRR28211699.fastq.gz). It is a sample collected from deadwood from a tree that died between 20 and 41 years before collection. It has ~5.4Gb of data.
## Assembly
Open up a screen. We can just reconnect to the assembly screen we made earlier.
```{bash,eval=FALSE}
screen -dr assembly
```
Set up a workspace.
```{bash,eval=FALSE}
mkdir -p ~/deadwood-assembly/SRR28211699
cd ~/deadwood-assembly/SRR28211699
```
Activate the environment.
```{bash,eval=FALSE}
conda activate flye
```
Assemble the sample with metaFlye.
```{bash,eval=FALSE}
flye --pacbio-hifi /home/data/metagenomics/pacbio_shotgun/SRR28211699.fastq.gz \
--iterations 1 --meta --threads 20 -o .
```
NOTE: Flye can do scaffolding using information in the assembly graph even though we don't have paired end reads, genetic maps, or other scaffolding datatypes.
Compress the output. This will create a file called assembly.fasta.gz
```{bash,eval=FALSE}
gzip assembly.fasta
```
Check the number of sequences.
```{bash,eval=FALSE}
zgrep -c '>' assembly.fasta.gz
```
At this point, you could check the assembly with checkm as we did in the previous chapter. But let's go ahead and do the binning, then we'll run checkm on the bins.
## Binning
We will use MetaBAT2 (Metagenome Binning based on Abundance and Tetranucleotide frequency) to bin our assembly sequences by organism. It is in the flye environment, which should already be activated. MetaBAT2 used tetra-nucleotide frequencies and abundance (coverage) to bin sequences from the same organism.
The paper:
https://pmc.ncbi.nlm.nih.gov/articles/PMC6662567/
Best binning practices:
https://bitbucket.org/berkeleylab/metabat/wiki/Best%20Binning%20Practices
<!--
NOTE: Minimap2 didn't align well. Most of the contigs have 0 read support. We'll use the flye coverage estimates. Flye does use minimap2 to align reads to the assembly so I'm not sure what is going on. I tried hifi, pb, and asm20 presets.
First, we need to align the reads used to generate the assembly back to the assembly. Metabat2 will use this to calculate how much coverage (or read support) each contig had in our read dataset. It will use coverage to help group the contigs by organism, given that contigs from the same organism should have similar coverage. We will use minimap2, which is installed in the flye environment. We will also samtools to convert from sam to bam and then to sort the bam file.
Github:
https://github.com/lh3/minimap2
Paper:
https://academic.oup.com/bioinformatics/article/34/18/3094/4994778
**Parameters**
minimap2
-a   output SAM format
-x   preset (we'll use parameters preset for pacbio hifi = map-hifi)
samtools (samtools view and sort)
-   STDIN
-b   output BAM format
-o   output file
```{bash,eval=FALSE}
# Create index
minimap2 -x map-hifi -d assembly.mmi assembly.fasta.gz
# Run the alignment
minimap2 -a assembly.mmi \
/home/data/metagenomics/pacbio_shotgun/SRR28211699.fastq.gz | \
samtools view -b - | samtools sort -o readsxassembly.sort.bam -
```
-->
<!--samtools is installed in /usr/bin/-->
<!--
Make a directory for the binning output.
```{bash,eval=FALSE}
mkdir metabat2_bins
```
-->
Bin the assembly contigs.
**Parameters**
--inFile Assembly contigs (gzipped fasta)
--outFile Basename and path for output bins
--minContig Minimum contig size for binning (default: 2500; must be >=1500)
--verbose Gives more output information
<!--
```{bash,eval=FALSE}
runMetaBat.sh --minContig 1500 \
assembly.fasta.gz readsxassembly.sort.bam
```
-->
Get the first 3 columns from the assembly_info.txt file.
```{bash,eval=FALSE}
cut -f 1-3 assembly_info.txt > depth.txt
```
```{bash,eval=FALSE}
metabat2 -i assembly.fasta.gz -o metabat2_bins/bin --cvExt -a depth.txt --minContig 1500 --verbose
```
## Quality Check
Now try to run checkm on the bins.