Comparative Genomics Exercise 5: Pangenomics

Jaime Huerta-Cepas edited this page Dec 9, 2025 · 4 revisions

Pangenome Analysis and Phylogenetic Classification

This exercise guides you through analyzing the pangenome of six bacterial genomes and determining their phylogenetic relationships.


Environment Setup

  1. Login to the GDAV server:

    $ ssh youruser@IP
  2. Activate the conda environment:

    $ conda activate gdav23
  3. Create and navigate to a working directory:

    $ mkdir ~/pangenomics
    $ cd ~/pangenomics
  4. Verify input files:

    All required files are available in /home/compgenomics/pangenomics/. List the contents of this directory to familiarize yourself with the files:

    $ ls /home/compgenomics/pangenomics/

Tools Required

The following tools are pre-installed on the GDAV server. Activate the corresponding conda environment before using each one:

  • MMSeqs2: eval "$(/home/miniforge3/bin/conda shell.bash hook)" && conda activate gdav23
  • Prokka: eval "$(/home/root/miniforge3/bin/conda shell.bash hook)" && conda activate prokka

Guided Exercise

1. Core-Pangenome Analysis

Task: Identify the gene families that form the core pangenome of the six provided genomes, that is, the families present in all six.


2. Calculate Average Nucleotide Identity (ANI)

Steps:

  • Use the ANI Calculator tool available at https://www.ezbiocloud.net/tools/ani.

  • Compare two genomes from /home/compgenomics/pangenomics/genome_assemblies/.

    Question: Are they the same species?

  • Compare any genome in genome_assemblies with the genome at /home/compgenomics/pangenomics/other_genomes/another_genome.fna.

    Question: Do these genomes come from microorganisms of the same species, genus, or family?
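
A commonly used rule of thumb is that two genomes sharing roughly 95% ANI or more belong to the same species. The cutoff is a convention rather than a hard boundary, and ANI alone does not settle genus or family membership. A minimal sketch of applying the rule to the value reported by the ANI Calculator follows; the function name and the default cutoff of 95 are assumptions made for illustration, not part of the exercise material.

```shell
# Hypothetical helper: compare an ANI value (in percent) against a species cutoff.
# ASSUMPTION: the default cutoff of 95 is a widely used convention, not a value
# prescribed by this exercise.
ani_same_species() {
    local ani="$1"
    local cutoff="${2:-95}"
    # awk performs the floating-point comparison that shell arithmetic cannot
    awk -v a="$ani" -v c="$cutoff" 'BEGIN {
        if (a + 0 >= c + 0) print "same species (ANI >= " c "%)"
        else                print "different species (ANI < " c "%)"
    }'
}
```

For example, `ani_same_species 98.7` applies the default cutoff, while `ani_same_species 95.4 96` uses a stricter one.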


3. Identify Open Reading Frames (ORFs) with Prokka

Command:

$ for x in /home/compgenomics/pangenomics/genome_assemblies/*.fna; do
      name=$(basename "$x")
      prokka --kingdom Bacteria --prefix "$name" \
             --outdir "prokka/$name.prokka" "$x"
  done
  • Prokka may take several minutes to complete.

  • Precomputed results are available at /home/compgenomics/pangenomics/prokka.

    Question: How many proteins (coding sequences) are detected in each genome?
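
One way to answer the question above is to count the FASTA headers in each protein file written by Prokka, since every predicted protein starts with a `>` line. The sketch below assumes the one-subdirectory-per-genome layout produced by the command in this step; the function name is hypothetical.

```shell
# Count protein sequences (FASTA headers) in every .faa file under a Prokka
# results directory that contains one subdirectory per genome.
count_proteins() {
    local prokka_dir="$1"
    local faa
    for faa in "$prokka_dir"/*/*.faa; do
        # each protein record starts with a ">" header line
        printf '%s\t%s\n' "$(basename "$faa")" "$(grep -c '^>' "$faa")"
    done
}
```

With the precomputed results, this could be called as `count_proteins /home/compgenomics/pangenomics/prokka`, printing one file name and count per line.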


4. Cluster Proteins into Families

  1. Concatenate protein sequences into a single file:

    $ cat /home/compgenomics/pangenomics/prokka/GCA*/*.faa > all_proteins.faa
  2. Create an MMSeqs2 database:

    $ mmseqs createdb all_proteins.faa all_proteins.db
  3. Cluster proteins with minimum coverage of 30% and identity of 20%:

    $ mmseqs cluster -c 0.3 --min-seq-id 0.2 all_proteins.db clustering.db tmp
  4. Convert clustering results into a TSV format:

    $ mmseqs createtsv all_proteins.db all_proteins.db clustering.db clustering.tsv

    Questions:

    • How many clusters did you find?
    • How many singletons are there?
    • How many families could be considered part of the core pangenome?
    • What is the effect of varying identity and coverage thresholds? Try a stricter parameter set (e.g., 75% identity).
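
The first three questions can be approached by summarising clustering.tsv, in which column 1 is the cluster representative and column 2 is a cluster member. The sketch below makes two assumptions that you should check against your own data: that member IDs are Prokka locus tags of the form PREFIX_NNNNN, with the prefix identifying the genome, and that a "core" family is one with at least one member in every genome (paralogs allowed). The function name is hypothetical.

```shell
# Summarise an MMseqs2 createtsv file (column 1: representative, column 2: member).
# ASSUMPTION: member IDs look like ABCDEFGH_00012 and the part before the last
# underscore identifies the genome the protein comes from.
summarise_clusters() {
    local tsv="$1"
    local n_genomes="${2:-6}"
    awk -F '\t' -v n="$n_genomes" '
        {
            genome = $2
            sub(/_[^_]*$/, "", genome)        # strip the trailing _NNNNN gene counter
            size[$1]++                         # members per cluster
            if (!(($1, genome) in seen)) {     # first member of this genome in the cluster
                seen[$1, genome] = 1
                genomes[$1]++
            }
        }
        END {
            for (c in size) {
                clusters++
                if (size[c] == 1) singletons++
                if (genomes[c] == n) core++    # present in every genome
            }
            printf "clusters\t%d\nsingletons\t%d\ncore\t%d\n", clusters, singletons, core
        }
    ' "$tsv"
}
```

Running `summarise_clusters clustering.tsv 6` after each clustering run (for example, once with the original thresholds and once with 75% identity) gives numbers that can be compared directly.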

    Discussion: Refer to the MMSeqs2 documentation. How would you modify the approach to cluster metagenomic sequences?


5. Phylogenetic Classification (optional)

Objective: Build a phylogenetic tree to determine if the problem genomes belong to the same species and identify the species.

Steps:

  1. Choose one or more clusters of conserved core homologous proteins from the six genomes analyzed in the previous section.
  2. Use BLAST to compare these sequences against the reference proteomes in /home/compgenomics/pangenomics/other_proteomes.faa.
  3. Retrieve the homologous sequences from the reference proteomes (you can use the extract_seqs_from_blast_result.py script from previous exercises).
  4. Build a multiple sequence alignment and construct a phylogenetic tree for each of the selected clusters.
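
Before running BLAST in step 2, the sequences of each chosen cluster need to be collected into a FASTA file. A minimal sketch of one way to do this from the files produced in the clustering section is shown below; the function name is hypothetical, and it assumes that the sequence ID is the first word of each FASTA header, as in Prokka output.

```shell
# Print the FASTA records belonging to one cluster.
# Arguments: clustering TSV, representative ID of the chosen cluster, protein FASTA.
extract_cluster() {
    local tsv="$1" rep="$2" fasta="$3"
    awk -v rep="$rep" '
        NR == FNR {                    # first file: the clustering TSV
            if ($1 == rep) wanted[$2] = 1
            next
        }
        /^>/ {                         # second file: a FASTA header line
            id = substr($1, 2)         # ID is the first word after ">"
            keep = (id in wanted)
        }
        keep                           # print header and sequence lines of wanted records
    ' "$tsv" "$fasta"
}
```

A call such as `extract_cluster clustering.tsv REPRESENTATIVE_ID all_proteins.faa > cluster.faa`, where REPRESENTATIVE_ID is a placeholder for a value from column 1 of clustering.tsv, writes a file that can serve as the BLAST query.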

Tasks:

  • Visualize the tree.
  • Interpret the results to determine:
    • Whether all problem genomes belong to the same species.
    • Which species or genus these genomes belong to.