Comparative Genomics Exercise 5: Pangenomics

# Pangenome Analysis and Phylogenetic Classification

This exercise guides you through analyzing the pangenome of six bacterial genomes and determining their phylogenetic relationships.

Environment Setup

Log in to the GDAV server:
```
$ ssh youruser@IP
```
Activate the conda environment:
```
$ conda activate gdav23
```

Create and navigate to a working directory:

```
$ mkdir ~/pangenomics
$ cd ~/pangenomics
```

Verify input files:

All required files are available in /home/compgenomics/pangenomics/. List the contents of this directory to familiarize yourself with the files:
```
$ ls /home/compgenomics/pangenomics/
```

Tools Required

The following tools are pre-installed on the GDAV server:

- MMseqs2, in the gdav23 environment: eval "$(/home/miniforge3/bin/conda shell.bash hook)" && conda activate gdav23
- Prokka, in the prokka environment: eval "$(/home/root/miniforge3/bin/conda shell.bash hook)" && conda activate prokka

Guided Exercise

1. Core-Pangenome Analysis

Task: Identify the gene families that belong to the core pangenome of the six provided genomes, i.e., the families with at least one member in every genome.

2. Calculate Average Nucleotide Identity (ANI)

Steps:

- Use the ANI Calculator tool available at https://www.ezbiocloud.net/tools/ani.
- Compare two genomes from /home/compgenomics/pangenomics/genome_assemblies/. Question: Are they the same species?
- Compare any genome in genome_assemblies with the genome at /home/compgenomics/pangenomics/other_genomes/another_genome.fna. Question: Do these genomes come from microorganisms of the same species, genus, or family?
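
To help interpret the ANI values, here is a minimal sketch of the usual rule of thumb. It assumes the commonly cited species boundary of roughly 95 to 96% ANI. The function name and the example values are hypothetical and are not results for the exercise genomes. There is no comparably accepted ANI cutoff for genus or family, so the second question usually needs additional evidence, such as the phylogeny in section 5.

```python
def same_species_by_ani(ani_percent, threshold=95.0):
    """Return True when ANI meets the assumed species-level cutoff (default 95%)."""
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    return ani_percent >= threshold


# Hypothetical ANI values, for illustration only.
for label, ani in [("pair A", 98.7), ("pair B", 82.3)]:
    verdict = "likely same species" if same_species_by_ani(ani) else "likely different species"
    print(f"{label}: ANI = {ani}% -> {verdict}")
```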

3. Identify Open Reading Frames (ORFs) with Prokka

Command (run it with the prokka environment active):

```
$ for x in /home/compgenomics/pangenomics/genome_assemblies/*.fna; do
      name=$(basename "$x")
      prokka --kingdom Bacteria --prefix "$name" \
             --outdir "prokka/$name.prokka" "$x"
  done
```

Prokka may take several minutes to complete.
Precomputed results are available at /home/compgenomics/pangenomics/prokka.

Question: How many proteins (coding sequences) are detected in each genome?
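
One way to answer this is to count the FASTA header lines in each Prokka .faa file. The sketch below assumes the layout used in this exercise, with one subdirectory per genome that contains a single .faa file. The function names are illustrative and are not part of Prokka.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting its '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def proteins_per_genome(prokka_dir):
    """Map each .faa file name found one level below prokka_dir to its protein count."""
    return {
        faa.name: count_fasta_records(faa)
        for faa in sorted(Path(prokka_dir).glob("*/*.faa"))
    }
```

For example, `proteins_per_genome("/home/compgenomics/pangenomics/prokka")` should return one entry per genome when run on the precomputed results.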

4. Cluster Proteins into Families

Concatenate protein sequences into a single file:

```
$ cat /home/compgenomics/pangenomics/prokka/GCA*/*.faa > all_proteins.faa
```

Create an MMseqs2 database:

```
$ mmseqs createdb all_proteins.faa all_proteins.db
```

Cluster the proteins with a minimum coverage of 30% and a minimum sequence identity of 20%:

```
$ mmseqs cluster -c 0.3 --min-seq-id 0.2 all_proteins.db clustering.db tmp
```

Convert the clustering results to TSV format (each line holds a cluster representative followed by one member of that cluster):
```
$ mmseqs createtsv all_proteins.db all_proteins.db clustering.db clustering.tsv
```
Questions:
- How many clusters did you find?
- How many singletons are there?
- How many families could be considered part of the core pangenome?
- What is the effect of varying identity and coverage thresholds? Try a stricter parameter set (e.g., 75% identity).

Discussion: Refer to the MMseqs2 documentation. How would you modify the approach to cluster metagenomic sequences?
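
The questions about cluster, singleton, and core-family counts can be answered directly from clustering.tsv. The sketch below rests on two assumptions. First, the file has two tab-separated columns, representative and member. Second, the Prokka protein IDs have the form PREFIX_00001, with a different PREFIX (locus tag) for each genome. Check the second assumption against your .faa headers before trusting the core count. All function names are illustrative.

```python
from collections import defaultdict


def read_clusters(tsv_path):
    """Return {representative: [members]} from a two-column cluster TSV."""
    clusters = defaultdict(list)
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            clusters[fields[0]].append(fields[1])
    return dict(clusters)


def genome_of(protein_id):
    """ASSUMPTION: IDs look like PREFIX_00001 and PREFIX is unique per genome."""
    return protein_id.rsplit("_", 1)[0]


def summarize(clusters, n_genomes=6):
    """Count all clusters, singletons, and clusters with members from every genome."""
    singletons = sum(1 for members in clusters.values() if len(members) == 1)
    core = sum(
        1
        for members in clusters.values()
        if len({genome_of(m) for m in members}) == n_genomes
    )
    return {"clusters": len(clusters), "singletons": singletons, "core": core}
```

Calling `summarize(read_clusters("clustering.tsv"))` after each clustering run makes it easy to compare the effect of different identity and coverage thresholds.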

5. Phylogenetic Classification (optional)

Objective: Build a phylogenetic tree to determine if the problem genomes belong to the same species and identify the species.

Steps:

- Choose one or more clusters of conserved core homologous proteins from the six genomes analyzed in the previous section.
- Use BLAST to compare these sequences against the reference proteomes at /home/compgenomics/pangenomics/other_proteomes.faa.
- Retrieve the homologous sequences from the reference proteomes (you can use the extract_seqs_from_blast_result.py script from previous exercises).
- Build a multiple sequence alignment and construct a phylogenetic tree for each of the selected clusters.
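
For the first step, one common choice (not required by the exercise) is to keep only single-copy core families, which have exactly one protein from each genome, so that paralogs do not complicate the tree. The sketch below assumes a {representative: [members]} dictionary built from clustering.tsv and Prokka-style IDs of the form PREFIX_00001, with one PREFIX per genome. All names are illustrative.

```python
def single_copy_core(clusters, n_genomes=6):
    """Return the clusters that hold exactly one member from each of n_genomes genomes."""
    selected = {}
    for rep, members in clusters.items():
        # ASSUMPTION: the text before the last underscore identifies the genome.
        genomes = {m.rsplit("_", 1)[0] for m in members}
        if len(members) == n_genomes and len(genomes) == n_genomes:
            selected[rep] = members
    return selected


def extract_fasta(fasta_path, wanted_ids):
    """Return {id: sequence} for the requested IDs found in a FASTA file."""
    wanted = set(wanted_ids)
    records, current = {}, None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                parts = line[1:].split()
                seq_id = parts[0] if parts else ""
                current = seq_id if seq_id in wanted else None
                if current is not None:
                    records[current] = []
            elif current is not None and line:
                records[current].append(line)
    return {sid: "".join(chunks) for sid, chunks in records.items()}


# Toy data with two genomes (A and B), for illustration only.
toy = {"A_1": ["A_1", "B_1"], "A_2": ["A_2", "A_3", "B_2"], "B_9": ["B_9"]}
print(single_copy_core(toy, n_genomes=2))  # {'A_1': ['A_1', 'B_1']}
```

The sequences of a selected family can then be pulled from all_proteins.faa with `extract_fasta` and used as BLAST queries.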

Tasks:

Visualize the tree.
Interpret the results to determine:
- Whether all problem genomes belong to the same species.
- Which species or genus these genomes belong to.
