Comparative Genomics Exercise 5: Pangenomics
# Pangenome Analysis and Phylogenetic Classification
This exercise guides you through analyzing the pangenome of six bacterial genomes and determining their phylogenetic relationships.
- Log in to the GDAV server:
  $ ssh youruser@IP
- Activate the conda environment:
  $ conda activate gdav23
- Create and navigate to a working directory:
  $ mkdir ~/pangenomics
  $ cd ~/pangenomics
- Verify input files: all required files are available in `/home/compgenomics/pangenomics/`. List the contents of this directory to familiarize yourself with the files:
  $ ls /home/compgenomics/pangenomics/
The following tools are pre-installed on the GDAV server:
- DIAMOND
- MMseqs2:
  $ eval "$(/home/miniforge3/bin/conda shell.bash hook)" && conda activate gdav23
- Prokka:
  $ eval "$(/home/root/miniforge3/bin/conda shell.bash hook)" && conda activate prokka
Task: Identify gene families that are part of the core-pangenome of six provided genomes.
Steps:
- Use the ANI Calculator tool available at https://www.ezbiocloud.net/tools/ani.
- Compare two genomes from `/home/compgenomics/pangenomics/genome_assemblies/`.
  Question: Are they the same species?
- Compare any genome in `genome_assemblies` with the genome at `/home/compgenomics/pangenomics/other_genomes/another_genome.fna`.
  Question: Do these genomes come from microorganisms of the same species, genus, or family?
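When reading the ANI values, a commonly cited rule of thumb places the species boundary at roughly 95 to 96% ANI; there is no comparably sharp ANI cutoff for genus or family. The sketch below only illustrates that comparison. The function name and the 95% default are assumptions for illustration, not part of the exercise.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare an ANI value (in percent) against a species cutoff.
# The 95% default is a commonly cited rule of thumb, not a value from this exercise.
same_species_by_ani() {
    local ani="$1" cutoff="${2:-95}"
    awk -v a="$ani" -v c="$cutoff" 'BEGIN {
        if (a + 0 >= c + 0) print "likely same species"
        else                print "likely different species"
    }'
}

# Example call with a made-up ANI value:
same_species_by_ani 98.2
```

Values close to the cutoff are best treated as ambiguous rather than as a firm call.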
Command (annotates each genome assembly with Prokka):
$ for x in /home/compgenomics/pangenomics/genome_assemblies/*.fna; do \
    prokka --kingdom Bacteria --prefix "$(basename "$x")" \
    --outdir prokka/"$(basename "$x")".prokka "$x"; \
  done
- Prokka may take several minutes to complete.
- Precomputed results are available at `/home/compgenomics/pangenomics/prokka`.
  Question: How many proteins (coding sequences) are detected in each genome?
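One way to answer the question is to count the FASTA header lines in each `.faa` file, since Prokka writes one protein record per predicted coding sequence. This is a minimal sketch; the function name `count_proteins` is made up for illustration, and the glob follows the directory layout used elsewhere in this exercise.

```shell
#!/usr/bin/env bash
# Count the sequences in each protein FASTA file by counting header lines.
count_proteins() {
    local faa
    for faa in "$@"; do
        # grep -c prints the number of lines that start with ">"
        printf '%s\t%s\n' "$(basename "$faa")" "$(grep -c '^>' "$faa")"
    done
}

# Run only where the precomputed results exist (path taken from the exercise).
results=/home/compgenomics/pangenomics/prokka
if [ -d "$results" ]; then
    count_proteins "$results"/GCA*/*.faa
fi
```

The output has one line per file: the file name, a tab, and the number of protein records.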
- Concatenate protein sequences into a single file:
  $ cat /home/compgenomics/pangenomics/prokka/GCA*/*.faa > all_proteins.faa
- Create an MMseqs2 database:
  $ mmseqs createdb all_proteins.faa all_proteins.db
- Cluster proteins with a minimum coverage of 30% and a minimum sequence identity of 20%:
  $ mmseqs cluster -c 0.3 --min-seq-id 0.2 all_proteins.db clustering.db tmp
- Convert the clustering results to TSV format:
  $ mmseqs createtsv all_proteins.db all_proteins.db clustering.db clustering.tsv
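To make it easy to repeat the clustering with other thresholds, the three MMseqs2 commands above can be wrapped in a small function. This is a sketch under assumptions: the function name, the argument order, and the output file names built from the prefix are all hypothetical. Only the `mmseqs` subcommands and flags come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the MMseqs2 commands used in this exercise.
# Arguments: coverage, minimum identity, output prefix (names are illustrative).
cluster_proteins() {
    local cov="$1" ident="$2" prefix="$3"
    mmseqs createdb all_proteins.faa "${prefix}_seqs.db" &&
    mmseqs cluster -c "$cov" --min-seq-id "$ident" \
        "${prefix}_seqs.db" "${prefix}_clu.db" "${prefix}_tmp" &&
    mmseqs createtsv "${prefix}_seqs.db" "${prefix}_seqs.db" \
        "${prefix}_clu.db" "${prefix}.tsv"
}

# Usage (requires mmseqs and all_proteins.faa in the working directory):
#   cluster_proteins 0.3 0.2  loose
#   cluster_proteins 0.3 0.75 strict
```

Using a separate prefix per run keeps the results of different parameter sets side by side for comparison.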
Questions:
- How many clusters did you find?
- How many singletons are there?
- How many families could be considered part of the core genome?
- What is the effect of varying identity and coverage thresholds? Try a stricter parameter set (e.g., 75% identity).
Discussion: Refer to the MMseqs2 documentation. How would you modify the approach to cluster metagenomic sequences?
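The first three questions can be approached by summarizing `clustering.tsv`, where column 1 is the cluster representative and column 2 is a member. The sketch below counts clusters, singletons, and clusters that contain proteins from every genome. It assumes that the genome of a protein can be read from the part of its ID before the last underscore, as in Prokka locus tags of the form `ABCDEFGH_00001`. That assumption holds only if each genome received a distinct locus-tag prefix, so check the FASTA headers first. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Summarize an MMseqs2 cluster TSV (column 1: representative, column 2: member).
# Assumption: the genome of a protein is the part of its ID before the last "_".
summarize_clusters() {
    local tsv="$1" n_genomes="$2"
    awk -F'\t' -v n="$n_genomes" '
    {
        size[$1]++                               # members per cluster
        genome = $2
        sub(/_[^_]*$/, "", genome)               # strip the trailing _NNNNN counter
        if (!seen[$1, genome]++) genomes[$1]++   # distinct genomes per cluster
    }
    END {
        for (c in size) {
            clusters++
            if (size[c] == 1) singletons++
            if (genomes[c] == n) core++
        }
        printf "clusters\t%d\nsingletons\t%d\ncore\t%d\n", clusters, singletons, core
    }' "$tsv"
}

# Run only if the clustering output is present in the working directory.
if [ -f clustering.tsv ]; then
    summarize_clusters clustering.tsv 6
fi
```

Here a cluster counts as core when it has at least one member from each of the six genomes; requiring exactly one member per genome would give a stricter, single-copy definition.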
Objective: Build a phylogenetic tree to determine whether the problem genomes belong to the same species and to identify that species.
Steps:
- Identify one or more clusters (your choice) of conserved core homologous proteins among the six genomes analyzed in the previous section.
- Use BLAST to compare these sequences against the reference proteomes at `/home/compgenomics/pangenomics/other_proteomes.faa`.
- Retrieve homologous sequences from the reference proteomes (you can use the `extract_seqs_from_blast_result.py` script from previous exercises).
- Build a multiple sequence alignment and construct a phylogenetic tree for each of the selected clusters.
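As a starting point for these steps, the sequences of one chosen cluster can be pulled out of `all_proteins.faa` using the cluster TSV. The sketch below is one possible way to do that; the function name is hypothetical. The commented commands afterwards assume that `blastp`, `mafft`, and `FastTree` are installed. The exercise only names BLAST, so the aligner and tree builder are examples, and `REP_ID` and `cluster_plus_refs.faa` are placeholder names.

```shell
#!/usr/bin/env bash
# Write the sequences of one cluster to stdout.
# $1: representative ID (column 1 of the TSV), $2: cluster TSV, $3: protein FASTA.
extract_cluster_fasta() {
    local rep="$1" tsv="$2" faa="$3"
    awk -v rep="$rep" '
    NR == FNR { if ($1 == rep) keep[$2] = 1; next }   # first file: collect member IDs
    /^>/      { id = substr($1, 2); printing = (id in keep) }
    printing' "$tsv" "$faa"
}

# Possible downstream steps (tool availability is an assumption):
#   extract_cluster_fasta REP_ID clustering.tsv all_proteins.faa > cluster.faa
#   blastp -query cluster.faa \
#          -subject /home/compgenomics/pangenomics/other_proteomes.faa \
#          -outfmt 6 -evalue 1e-5 -out cluster_vs_refs.tsv
#   # cluster_plus_refs.faa: cluster.faa plus the retrieved reference homologs
#   mafft --auto cluster_plus_refs.faa > cluster.aln
#   FastTree cluster.aln > cluster.nwk
```

The resulting Newick file can be opened in any tree viewer for the interpretation tasks.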
Tasks:
- Visualize the tree.
- Interpret the results to determine:
  - Whether all problem genomes belong to the same species.
  - The species or genus these genomes belong to.