Comparative Genomics Exercise 2: Orthology inference using Phylogeny

Jordi edited this page Jan 4, 2024 · 27 revisions

Environment

Log in to the GDAV server

$ ssh youruser@IP

Create a directory in your home folder called compgenomics_ex2

$ mkdir compgenomics_ex2

and enter the directory

$ cd compgenomics_ex2

All the files needed for this exercise are available on the GDAV server at /home/compgenomics/4proteomes/. Make sure you can see them and take a few seconds to understand what they contain:

$ ls /home/compgenomics/4proteomes/

They are:

4proteomes.faa    -> all protein sequences from 4 species: Human, Elephant, Zebrafish and Ciona intestinalis
G3T0S8_LOXAF.faa  -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa   -> protein sequence of the Zebrafish gene called TPH1A
TPH2_human.faa    -> protein sequence of Human TPH2
scripts/          -> a directory with ad-hoc programs and scripts

Tools needed (already installed on the GDAV server)

  • BLAST+
  • IQ-Tree
  • MAFFT
  • BioPython
  • ete3
  • extract_seqs_from_blast_result.py
  • midpoint_rooting.py

Exercise

Goal 1

Reconstruct the phylogeny of all TPH2 homologs in the 4 target proteomes, and interpret the tree to identify orthologs.

1. Identify TPH homologs

As in exercise 1, use BLAST to identify all significant hits of the query protein TPH2_human.faa in the 4proteomes dataset.

Tip: You can reuse the BLAST database of exercise 1 and use a command similar to:

$ blastp \
      -task blastp \
      -query /home/compgenomics/4proteomes/TPH2_human.faa \
      -db ~/compgenomics_ex1/4proteomes.blastdb \
      -outfmt 6 \
      -evalue 0.001 > TPH2_homologs.blastout
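Before moving on, it can help to check which subject sequences were actually hit. The sketch below is only an illustration, not part of the course scripts: it assumes the default 12-column layout of -outfmt 6, and the function name best_hits and the toy input lines are made up for the example.

```python
import csv
import io

# Column names of BLAST's default tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def best_hits(handle, max_evalue=0.001):
    """Return {subject_id: lowest e-value} for hits passing the threshold."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(hit["evalue"])
        if evalue <= max_evalue:
            sid = hit["sseqid"]
            # A subject can appear several times (one line per HSP).
            best[sid] = min(evalue, best.get(sid, evalue))
    return best

# Made-up lines for illustration only (not real BLAST output).
example = (
    "query1\tprotA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400\n"
    "query1\tprotB\t45.0\t180\t90\t4\t10\t190\t5\t185\t2e-30\t120\n"
    "query1\tprotB\t40.0\t50\t30\t0\t1\t50\t200\t250\t5e-05\t40\n"
)
hits = best_hits(io.StringIO(example))
for sid, evalue in sorted(hits.items(), key=lambda kv: kv[1]):
    print(sid, evalue)
```

To try it on your own result, pass open("TPH2_homologs.blastout") instead of the toy io.StringIO object.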

2. Extract all homologs in FASTA format

Extract the sequences of all TPH2 homologs using the script provided in /home/compgenomics/4proteomes/scripts/extract_seqs_from_blast_result.py.

$ python /home/compgenomics/4proteomes/scripts/extract_seqs_from_blast_result.py \
      TPH2_homologs.blastout \
      /home/compgenomics/4proteomes/4proteomes.faa > TPH2_homologs.faa
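The source of extract_seqs_from_blast_result.py is not shown here, so the following is only a guess at the general idea, written with the standard library: collect the subject IDs from the second column of the BLAST table and keep the FASTA records whose ID (first word of the header) is among them. The names read_fasta and extract_hits and the toy data are hypothetical.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def extract_hits(blast_handle, fasta_handle):
    """Return the FASTA records whose ID appears as a subject in the BLAST table."""
    wanted = {line.split("\t")[1] for line in blast_handle if line.strip()}
    selected = []
    for header, seq in read_fasta(fasta_handle):
        seq_id = (header.split() or [""])[0]  # ID = first word of the header
        if seq_id in wanted:
            selected.append((header, seq))
    return selected

# Toy inputs for illustration only.
toy_blast = io.StringIO("query1\tseq2\t45.0\t180\n")
toy_fasta = io.StringIO(">seq1 first record\nMKV\n>seq2 second record\nMAA\nGGT\n")
records = extract_hits(toy_blast, toy_fasta)
for header, seq in records:
    print(">" + header)
    print(seq)
```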

3. Multiple Sequence Alignment (MSA)

Before inferring a phylogenetic tree, homologous sequences need to be aligned. Several programs can do this: ClustalOmega, MAFFT, MUSCLE, etc. Here, we will use MAFFT, which has a very simple command line.

$ mafft TPH2_homologs.faa > TPH2_homologs.alg

Check the content of the output (saved in TPH2_homologs.alg). What's the main difference compared to the input FASTA file?
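If comparing the two files by eye is difficult, a small helper can summarize them. This is a minimal sketch using only the standard library; the function name fasta_lengths and the toy records are made up for the example.

```python
def fasta_lengths(text):
    """Map each record ID (first word of the header) to its sequence length."""
    lengths = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            lengths[name] = 0
        elif name is not None:
            # Sequences may be wrapped over several lines; add them up.
            lengths[name] += len(line.strip())
    return lengths

# Toy records for illustration only.
toy = ">seqA first toy record\nMKVL\nAA\n>seqB second toy record\nMK\n"
print(fasta_lengths(toy))
```

Running it on the content of TPH2_homologs.faa and then on TPH2_homologs.alg (for example with open("TPH2_homologs.alg").read()) lets you compare the per-sequence lengths in the two files.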

4. Phylogenetic Reconstruction

As with MSA programs, there are many programs for building phylogenetic trees: RAxML, IQ-TREE, PhyML, FastTree, MrBayes, PhyloBayes, etc. Here we will use IQ-Tree, which uses a Maximum Likelihood approach.

You only need to provide the MSA file as input, plus some parameters defining how exhaustive the inference should be. To get a fast result, the following arguments are recommended: fixing the substitution model with -m LG skips the model-testing step, which is very slow.

$ iqtree -s TPH2_homologs.alg -m LG

5. Visualize tree

The main IQ-Tree output is the file ending with the .treefile extension, which contains the tree in Newick format.
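Newick is a plain-text format in which nested parentheses describe the tree, with optional branch lengths after a colon. As a rough illustration, the sketch below pulls the leaf names out of a Newick string with a regular expression. It assumes unquoted labels, and the function name and toy tree are invented for the example; for real work a proper parser such as the one in ete3 is a safer choice.

```python
import re

def newick_leaf_names(newick):
    """Return the leaf labels of a Newick string (unquoted labels assumed)."""
    # A leaf label directly follows "(" or ","; internal node labels and
    # support values follow ")" and are therefore not matched.
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)

# Toy tree for illustration only: two leaves grouped with support 95, plus a third leaf.
toy = "((A_HUMAN_x:0.1,B_DANRE_y:0.2)95:0.05,C_CIOIN_z:0.3);"
print(newick_leaf_names(toy))
```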

You can use the command line tool ete3 to display the tree directly in the terminal:

$ ete3 view --text -t TPH2_homologs.alg.treefile

   /-A0A0R4ILE6_DANRE_tph2
  |
  |                  /-A0A2R8RPJ0_DANRE_th
  |               /-|
  |              |  |   /-TY3H_HUMAN_TH
  |            /-|   \-|
  |           |  |      \-G3U1E7_LOXAF_TH
  |         /-|  |
  |        |  |   \-Q1LWZ5_DANRE_th2
  ...

6. Root the tree

By default, the phylogenetic trees returned by almost all programs are unrooted. There are many methods to root a tree, but a common one is midpoint rooting, which places the root halfway along the path between the two most distant leaves. For convenience, an ad hoc script to root Newick trees is provided in /home/compgenomics/4proteomes/scripts/midpoint_rooting.py

You can use it to root your tree before visualizing it:

$ python /home/compgenomics/4proteomes/scripts/midpoint_rooting.py \
      TPH2_homologs.alg.treefile | ete3 view --text

            /-A0A2R8RPJ0_DANRE_th
         /-|
        |  |   /-TY3H_HUMAN_TH
      /-|   \-|
     |  |      \-G3U1E7_LOXAF_TH
   /-|  |
  |  |   \-Q1LWZ5_DANRE_th2
  |  |
  |   \-F6Y7Q5_CIOIN_th
  |
  |         /-Q7SYH6_DANRE_pah
--|      /-|
  |     |  |   /-PH4H_HUMAN_PAH
  |   /-|   \-|
...

Alternatively, you can print the content of the tree file in the terminal and paste it into any online tree visualization server (or transfer the file and upload it).
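To make the idea of midpoint rooting more concrete, here is a small sketch of the principle: find the two most distant leaves and locate the point halfway along the path between them. It is not the code of midpoint_rooting.py. The tree is represented as a plain dictionary of neighbours and branch lengths, and all names and values are invented for the example.

```python
def distances_from(tree, start):
    """Path length and predecessor of every node, walking out from `start`."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        node = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                prev[neighbour] = node
                stack.append(neighbour)
    return dist, prev

def midpoint(tree):
    """Return (leaf_a, leaf_b, edge, offset): the root lies on `edge`,
    `offset` away from its first node, halfway between leaf_a and leaf_b."""
    leaves = [n for n, neighbours in tree.items() if len(neighbours) == 1]
    # Two sweeps find the most distant pair of leaves in a tree.
    d0, _ = distances_from(tree, leaves[0])
    a = max(leaves, key=d0.get)
    da, prev = distances_from(tree, a)
    b = max(leaves, key=da.get)
    half = da[b] / 2.0
    # Walk back from b towards a until reaching the edge containing the midpoint.
    node = b
    while da[prev[node]] > half:
        node = prev[node]
    parent = prev[node]
    return a, b, (parent, node), half - da[parent]

# Toy unrooted tree: leaves A, B, C, D and internal nodes X, Y.
toy_tree = {
    "A": {"X": 1.0}, "B": {"X": 2.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 1.0},
    "Y": {"X": 1.0, "C": 4.0, "D": 1.0},
    "C": {"Y": 4.0}, "D": {"Y": 1.0},
}
leaf_a, leaf_b, edge, offset = midpoint(toy_tree)
print(leaf_a, leaf_b, edge, offset)
```

In this toy tree the most distant leaves are B and C (path length 7), so the root falls on the branch between C and Y, 3.5 units away from C. Note that midpoint rooting assumes roughly equal evolutionary rates across the tree, so the result is a working hypothesis rather than a guarantee.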

Questions

  • What's the evolutionary relationship between F1R1D3_DANRE_tph1a and Q6IWP4_DANRE_tph1b?
  • What's the evolutionary relationship between TPH2_HUMAN_TPH2 and TPH1_HUMAN_TPH1?
  • Which Zebrafish (Danio rerio) sequence(s) are orthologs of the human sequence TPH1_HUMAN_TPH1?
  • How many duplication events can you identify?
  • How many putative orthologous groups can you identify?
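One common heuristic for labelling the internal nodes of a rooted gene tree is the species overlap rule: if the subtrees below a node share at least one species, the node is treated as a duplication; otherwise it is treated as a speciation. The sketch below illustrates the rule on a toy tree written as nested tuples. It assumes that the species code is the second underscore-separated field of each leaf name (as in TPH2_HUMAN_TPH2); the function names and the toy gene names are hypothetical, and the outcome depends on where the tree is rooted.

```python
def species_of(leaf):
    """Species code, assumed to be the second '_'-separated field of the name."""
    return leaf.split("_")[1]

def classify(node, events=None):
    """Return (species below node, list of (event, species) in post-order)."""
    if events is None:
        events = []
    if isinstance(node, str):  # a leaf
        return {species_of(node)}, events
    seen, overlap = set(), False
    for child in node:
        child_species, _ = classify(child, events)
        if seen & child_species:
            overlap = True  # the same species occurs in two child subtrees
        seen |= child_species
    events.append(("duplication" if overlap else "speciation", sorted(seen)))
    return seen, events

# Toy rooted tree with made-up gene names: two paralogous clades (a and b).
toy_tree = (
    ("g1_HUMAN_a", "g2_DANRE_a"),
    ("g3_HUMAN_b", "g4_DANRE_b"),
)
_, toy_events = classify(toy_tree)
for event, species in toy_events:
    print(event, species)
```

Here the two inner nodes are speciations (each contains one human and one zebrafish gene) and the root is a duplication, so the toy tree contains two putative orthologous groups.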

Goal 2

Repeat the same protocol to find all homologs of the human P53 sequence below in the 4 target proteomes, build a phylogeny, and identify orthologs.

>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD

Questions

  1. Could you identify duplication and speciation events?
  2. How many putative orthologous groups can you identify?
  3. Is there anything unusual in the evolution of this gene family?
  4. Is the tree rooted? Where would you root it?
  5. Upload the tree to http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?
