Skip to content

Comparative Genomics Exercise 3.5: Species tree reconstruction

Jaime Huerta-Cepas edited this page Nov 26, 2024 · 1 revision

Background

A tertiary care hospital in Madrid has reported a sudden spike in severe lung infections among immunocompromised patients. Despite standard antibiotics and antifungal therapies, the infections persist, suggesting an unusual pathogen is at play. A mycology research unit within the hospital’s pathology department has isolated a fungal specimen from several patients that, while resembling Penicillium species in morphology, exhibits atypical pathogenic behaviors.

Traditional diagnostic methods, including microscopic examination, culture characteristics, and basic serological tests, have proved inadequate in classifying this isolate. Due to the urgency of the situation, the hospital's genetics lab uses rapid nanopore sequencing to analyze the genome of the fungus in hopes of identifying its origins and potential virulence factors, but they lack the expertise to perform a thorough phylogenomic analysis.

The Hospital just contacted you as leading expert in Phylogenomics. You are in the middle of a course, but who cares!, this looks important. You open your laptop, get a cup of coffee, and start looking at the data that the Hospital has sent you.

The inferred proteome of the fungal isolate is at isolate/isolate.pep.fa. You pick some random entries and perform some online BLAST searches, confirming that the specimen has close relatives in the genus Talaromyces. But sequences appear to be quite divergent from all known species, with about 85-90% protein identity. Based on these clues, your team has prepared the ground for a phylogenomic survey of this new fungal species by performing the following two preliminary steps for you:

Full source: https://sugary-stem-206.notion.site/Main-exercise-b1c68c4133f6402abd1e1c7abe218c53

Building a Species Tree Pipeline

Introduction

This exercise guides you through constructing a species tree from genomic data. The pipeline involves clustering sequences, identifying single-copy universal genes, building gene trees, and combining them into a supertree. The steps use tools such as MMseqs2, MAFFT, FastTree, ASTRAL, and ete3.

Steps

Step 1: Cluster Sequences with MMseqs2

Run the following command to cluster sequences into protein families:

mmseqs easy-cluster all_proteomes.fa all_proteomes tmp --min-seq-id 0.3 -c 0.5 --cov-mode 1
  • Input: A FASTA file (all_proteomes.fa) containing all species' sequences.
  • Output: Cluster assignments (all_proteomes_cluster.tsv) mapping each sequence to a cluster.
  • Key Parameters: - --min-seq-id 0.3: Clusters sequences with ≥30% identity. - -c 0.5: Requires ≥50% alignment coverage. - --cov-mode 1: Shorter sequence used as the reference.

Step 2: Aggregate Clusters

Aggregate the clusters into a readable TSV format where each line lists all members of a cluster.

cat all_proteomes_cluster.tsv | datamash -g1 collapse 2 > all_proteomes_cluster.folded.tsv
  • Tool: datamash groups rows by cluster and concatenates members into a single line.
  • Output: A consolidated TSV file (all_proteomes_cluster.folded.tsv).

Step 3: Identify Single-Copy Universal Clusters

Use the following commands to identify clusters with exactly one sequence per species across all species:

python calculate_sp_and_seqs_in_cluster.py all_proteomes_cluster.folded.tsv > all_proteomes_cluster.folded.nspecies.tsv
awk '$3==104 && $3==$4' all_proteomes_cluster.folded.nspecies.tsv > single_copy_universal_families.tsv
  • Script: calculate_sp_and_seqs_in_cluster.py - Input: Aggregated cluster file (all_proteomes_cluster.folded.tsv). - Output: Cluster table with species and sequence counts.
  • Filter: Use awk to extract clusters containing exactly 104 species, each with a single sequence.

Step 4: Extract FASTA Sequences

Extract FASTA sequences for each universal cluster using the following command:

python extract_seqs.py -s ../all_proteomes.fa -c single_copy_universal_families.tsv
  • Script: extract_seqs.py - Input: Universal cluster list and the master FASTA file. - Output: Individual FASTA files for each cluster.

Step 5: Build Gene Trees

Generate and execute commands to align sequences and build gene trees for each cluster:

python build_gene_tree.py -c single_copy_universal_families.tsv > cmds
grep mafft cmds | parallel -j40
grep fasttree cmds | parallel -j40
  • Script: build_gene_tree.py generates commands for: - MAFFT: Align sequences within each cluster. - FastTree: Build phylogenetic trees from aligned sequences.
  • Parallelization: Run commands concurrently using parallel for faster processing.

Step 6: Convert Gene Trees to Species Trees

Convert gene trees into species trees by replacing sequence IDs with species codes:

python convert_gene_tree_to_species_tree.py trees/*.nw > all_sp_trees.txt
  • Script: convert_gene_tree_to_species_tree.py - Input: Gene trees (Newick format). - Output: Species trees with only species codes.

Step 7: Build a Species Super Tree

Combine all species trees into a single consensus tree using ASTRAL:

astral --input all_sp_trees.txt --output astral_tree.nw
  • Tool: ASTRAL constructs a consensus tree by integrating multiple species trees.

Step 8: Compare Gene Trees to the Final Species Tree

Compare individual gene trees to the final species tree to identify discrepancies:

ete3 compare -r astral_tree.nw --src_tree_list all_sp_trees.txt --unrooted
  • Tool: ete3 computes tree distances and reports differences between individual gene trees and the final supertree.

Conclusion

By completing this pipeline, you have generated a species tree from genomic data. This process integrates clustering, alignment, phylogenetics, and supertree construction to derive insights about evolutionary relationships across species.

Clone this wiki locally