The aim of this tutorial is to provide an introduction to the basic steps in an assembly based metagenomics analysis. We will perform taxonomic annotation of reads, perform a coassembly, generate MAGs and place them in a phylogenetic tree, using the command line.
The workflow is quite typical and will involve a collection of different software programs:
- kraken2: a k-mer based read profiler
- megahit: a highly efficient metagenomics assembler, currently our default for most studies
- bwa: necessary for mapping reads onto contigs
- samtools: utilities for processing mapped files
- Metabat2: an automatic binning algorithm
- checkm: tools to assess bin quality
- gtdb-tk: a toolkit to place MAGs on a reference phylogenetic tree and use the placement for taxonomic classification
Note: this tutorial is configured for the Birmingham VMs. On the Warwick VMs (INIVM1-*) the kraken database was placed in '~/Databases', so simply change the path in the Kraken commands, replacing 'repos' with 'Databases'.
Please ssh to your VM using the -Y option so that X forwarding works.
ssh -Y ubuntu@xxx.xxx.xxx.xxx
We use a conda env to install all dependencies: you don't need to install anything yourself, but the tools are only available inside that environment.
Try checking the command-line help of megahit:
megahit -h
Not working?
Conda environments are created independently of everything else: you need to "activate" an environment to access the set of tools installed inside it.
conda env list
conda activate Intro
megahit -h
Let's create a Projects directory and work inside:
mkdir -p ~/Projects/AD_binning
The dataset is an anaerobic digester metagenomic time series, subsampled for this tutorial so that reads map only to a few bins of interest. Please download and extract the dataset using this link:
cd ~/Data
wget http://seb.s3.climb.ac.uk/strain_practial_data.tar.gz
tar -xvf strain_practial_data.tar.gz
Our first task is to profile one of the samples with Kraken2. The Kraken2 database is at ~/Databases/MiniKraken. Can you figure out how to run sample1's forward reads through Kraken2 and generate a report using 8 threads? The '--use-names' flag is also useful. Do this in a new directory, Kraken.
Hint
If you don't know how to use Kraken2, type it in the terminal and it will show you which arguments are needed. If you can't make sense of what's written, have a look on the internet for the full documentation.
spoiler
cd ~/Projects/AD_binning
mkdir Kraken
cd Kraken
kraken2 --db ~/repos/MiniKraken ~/Data/AD_small/sample1/sample1_R1.fastq --threads 8 --use-names --report kraken_report.txt --output kraken_sample1
We can visualise the kraken report as a Krona plot. Before doing that you will need to run this step, just once (on the Birmingham VMs only):
cd ~/repos/miniconda3/envs/Intro/opt/krona
./updateTaxonomy.sh
cd ~/Projects/AD_binning/Kraken
ktImportTaxonomy -q 1 -t 5 kraken_report.txt -o kraken_krona_report.html
This html output will have to be downloaded onto your own computer if you want to open it in a browser:
scp ubuntu@xxx.yyy.zzz.vvv:~/Projects/AD_binning/Kraken/kraken_krona_report.html .
We are going to use megahit for assembly. It is a fast, memory-efficient metagenomic assembler, particularly useful for handling large coassemblies or large datasets.
Why would you want to do an assembly or a coassembly?
Megahit is installed, and the reads are at ~/Data/AD_small.
Bioinformatics is mostly about reading documentation and looking up how to do things on the internet. Use the -h flag on megahit and try to craft a command line to launch the assembly.
spoiler
cd ~/Projects/AD_binning
ls ~/Data/AD_small/*/*R1.fastq | tr "\n" "," | sed 's/,$//' > R1.csv
ls ~/Data/AD_small/*/*R2.fastq | tr "\n" "," | sed 's/,$//' > R2.csv
megahit -1 $(<R1.csv) -2 $(<R2.csv) -t 8 -o Assembly --k-step 24
It should take about 10 mins.
What is the output of an assembler? How good is the assembly? How would we estimate the number of organisms in the assembly?
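Basic assembly statistics (contig count, total size) can be computed straight from the fasta with standard tools. A minimal sketch on a toy file; on the VM you would run the same commands on Assembly/final.contigs.fa:

```shell
# Toy stand-in for Assembly/final.contigs.fa (swap in the real path on the VM)
printf '>c1\nACGTACGT\n>c2\nACGT\n' > toy.contigs.fa
# Number of contigs = number of fasta headers
n_contigs=$(grep -c ">" toy.contigs.fa)
# Total assembly size = all non-header characters, newlines removed
total_bp=$(grep -v ">" toy.contigs.fa | tr -d '\n' | wc -c | tr -d ' ')
echo "$n_contigs contigs, $total_bp bp"
```

For a fuller picture (N50, longest contig) a dedicated tool is more convenient, but these one-liners are often all you need for a quick sanity check.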
We now have an assembly with contigs considerably longer than the initial reads. Let's try to use kraken2 for taxonomic classification. Please take some time to modify the previous command line to launch Kraken2 on the assembly.
spoiler
echo "Come on, I believe in you: you can replace ~/Data/AD_small/sample1/sample1_R1.fastq with the correct path to the assembly (final.contigs.fa)"
spoiler
echo "also be sure to change the name of the report and the name of the output, otherwise you're going to erase previous results"
spoiler
cd ~/Projects/AD_binning/Kraken
kraken2 --db ~/repos/MiniKraken ~/Projects/AD_binning/Assembly/final.contigs.fa --threads 8 --use-names --report kraken_assembly_report.txt --output kraken_assembly
ktImportTaxonomy -q 1 -t 5 kraken_assembly_report.txt -o kraken_krona_assembly_report.html
How many contigs have been classified? Why did this happen?
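Kraken2's per-sequence output makes this easy to check: the first column is "C" (classified) or "U" (unclassified). A sketch on a toy file; run the same awk/wc on the real kraken_assembly output:

```shell
# Toy stand-in for the kraken output: column 1 is C/U, column 2 the sequence id
printf 'C\tk141_1\nU\tk141_2\nC\tk141_3\n' > toy.kraken
classified=$(awk -F'\t' '$1=="C"' toy.kraken | wc -l | tr -d ' ')
total=$(wc -l < toy.kraken | tr -d ' ')
echo "$classified / $total sequences classified"
```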

What kind of information can be used to bin contigs?
We use bwa mem to map reads to the assembly. As a preliminary step we need to index the assembly:
cd ~/Projects/AD_binning/Assembly
bwa index final.contigs.fa
cd ..
Then I want you to try mapping the reads from the first sample to the contigs using 'bwa mem' inside a subdirectory Map, producing a sam file 'Map/sample1.sam':
the correct command is:
mkdir Map
bwa mem -t 4 Assembly/final.contigs.fa ~/Data/AD_small/sample1/sample1_R1.fastq ~/Data/AD_small/sample1/sample1_R2.fastq > Map/sample1.sam
You can look at the sam file:
tail Map/sample1.sam
It is quite a complex format.
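The first SAM columns are fixed by the format: read name, bitwise flag, reference (contig) name, position, and mapping quality. A toy alignment line illustrates how you can pull them out with awk:

```shell
# One toy alignment line; columns: 1 read name, 2 flag, 3 contig, 4 pos, 5 MAPQ
printf 'r1\t0\tk141_1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n' > toy.sam
fields=$(awk -F'\t' '{print $1, $3, $4, $5}' toy.sam)
echo "$fields"
```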
The sam file is a bit bulky, so we never store alignments in this format; instead we convert it into its binary version, bam. Can you convert this file using the command
samtools view
Convert sam to bam command
cd Map
samtools view -h -b -S sample1.sam > sample1.bam
Using samtools we can filter only those reads which are mapped to the assembly:
samtools view -b -F 4 sample1.bam > sample1.mapped.bam
For downstream analysis we need the bam file to be sorted:
samtools sort sample1.mapped.bam -o sample1.mapped.sorted.bam
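The -F 4 filter above drops reads whose SAM flag has bit 4 ("read unmapped") set. A toy illustration of that bit test, written with integer division since POSIX awk has no and() function:

```shell
# Flags 0 and 16 are mapped; 4 and 77 have the unmapped bit set
printf '0\n4\n16\n77\n' > toy.flags
# Keep flags whose bit 4 is NOT set, i.e. mapped reads
mapped=$(awk 'int($1/4)%2==0' toy.flags | wc -l | tr -d ' ')
echo "$mapped of 4 reads mapped"
```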
To run all samples we would place these steps in a shell script:
cd ~/Projects/AD_binning
rm ~/Projects/AD_binning/Map/*
for file in ~/Data/AD_small/*/*R1.fastq
do
stub=${file%_R1.fastq}
name=${stub##*/}
echo $name
file2=${stub}_R2.fastq
bwa mem -t 4 Assembly/final.contigs.fa $file $file2 | samtools view -b -F 4 - | samtools sort - > Map/$name.mapped.sorted.bam
done
The for loop must be pasted as one chunk of text into the terminal, or stored in a small shell script. Can you make sense of what that script does?
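The variable juggling in the loop is bash parameter expansion; here is what it does on one example path:

```shell
file=~/Data/AD_small/sample1/sample1_R1.fastq
stub=${file%_R1.fastq}   # strip the trailing "_R1.fastq"
name=${stub##*/}         # strip everything up to the last "/"
file2=${stub}_R2.fastq   # rebuild the reverse-read path
echo "$name"
```

So each iteration derives the sample name and the matching R2 file from the R1 path, then maps, filters, and sorts in a single pipeline without writing the intermediate sam to disk.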
The first step is to derive coverage from the bam files. For this we can use a metabat2 script: it takes bam files as input and produces a table of the mean coverage depth and the corresponding standard deviation for each contig in each sample.
cd ~/Projects/AD_binning/Map
jgi_summarize_bam_contig_depths --outputDepth depth.txt *.bam
Make a new subfolder Binning. Move the coverage file into it and look into crafting the metabat2 command line. Either use the help flag or a quick google search.
Solution
cd ~/Projects/AD_binning
mkdir Binning
mv Map/depth.txt Binning/depth.txt
metabat2 -i Assembly/final.contigs.fa -a Binning/depth.txt -t 4 -o Binning/Bins/Bin
How many contigs were clustered?
cd ~/Projects/AD_binning/Binning/Bins
grep -c ">" *.fa | awk -F: '{ s+=$2 } END { print s }'
How many nucleotides were clustered?
grep -v ">" *.fa | wc -m
A bin is a group of contigs put together by looking at coverage/composition. How do you assess bin quality?
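Before turning to a dedicated tool, the per-bin counts above can be scripted directly. A sketch on toy bins standing in for Binning/Bins/Bin.*.fa:

```shell
# Two toy bins instead of the real Bin.*.fa files
printf '>a\nACGT\n>b\nAC\n' > Bin.1.fa
printf '>c\nACGTACGT\n' > Bin.2.fa
for f in Bin.*.fa; do
  n=$(grep -c ">" "$f")
  bp=$(grep -v ">" "$f" | tr -d '\n' | wc -c | tr -d ' ')
  echo "$f: $n contigs, $bp bp"
done
```

Bin size alone says nothing about quality, though: a big bin can still be a chimera of several organisms, which is what the marker-gene checks below are for.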
Checkm is a handy automated pipeline which uses marker sets specific to Bacteria/Archaea to assess contamination/completion.
cd ~/Projects/AD_binning/Binning
checkm lineage_wf Bins/ checkm -x .fa
After launching checkm, are you having an issue?
Unfortunately the VMs are a bit short on RAM, and pplacer, used by checkm to identify SCG taxonomy, is extremely RAM-greedy. Instead you will need to import output pre-generated for this tutorial.
rm -r checkm
ln -s ~/repos/strain_resolution_practical/Intro_prerun/checkm.tsv .
What does each column mean?
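Once you know which columns hold completeness and contamination, ranking bins is a one-liner. This sketch uses a hypothetical three-column table (bin id, completeness, contamination); check the real header of checkm.tsv and adjust the column numbers accordingly:

```shell
# Toy table standing in for checkm.tsv (columns assumed, not verified)
printf 'Bin.1\t98.5\t1.2\nBin.2\t60.0\t5.0\nBin.3\t85.0\t0.5\n' > toy_checkm.tsv
# Rank bins by completeness, highest first, and report the best one
best=$(sort -t$'\t' -k2,2nr toy_checkm.tsv | head -n 1 | cut -f1)
echo "most complete bin: $best"
```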
When doing metagenomics, it often happens that the MAGs you obtain are not in databases, or are only distantly related to references. Using single-copy core genes we can place these MAGs in a phylogenetic tree filled with known species representatives. The gtdb toolkit does that for you:
cd ~/Projects/AD_binning/Binning
gtdbtk classify_wf --cpus 4 --genome_dir Bins --out_dir gtdb --extension .fa --scratch_dir gtdb/scratch
That will take at least XXXX min.
So instead let's have a look at the prerun results.
mkdir -p ~/Projects/AD_binning/Binning/checkm
cd ~/Projects/AD_binning/Binning/checkm
ln -s ~/repos/strain_resolution_practical/Intro_prerun/*.summary.tsv .
We obtain multiple files; what are they?
Look at the summary files: what information can we obtain?
cut -f1,2,14,19 ~/Projects/AD_binning/Binning/checkm/*.summary.tsv
What is the RED, from gtdb?
