Comparative Genomics Exercise 2: Orthology inference using Phylogeny

Jaime Huerta-Cepas edited this page Oct 15, 2020 · 27 revisions

Environment:

Log in to the GDAV server

ssh youruser@IP

Create a directory in your home folder called compgenomics_ex2

mkdir compgenomics_ex2

and enter the directory

cd compgenomics_ex2

All the files needed for this exercise are available on the GDAV server at /home/compgenomics/4proteomes/. Make sure you can see them and take a few moments to understand what they contain:

$ ls /home/compgenomics/4proteomes/
4proteomes.faa    -> all protein sequences from 4 species: human, elephant, zebrafish and Ciona intestinalis
G3T0S8_LOXAF.faa  -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa   -> protein sequence of the zebrafish gene called TPH1A
TPH2_human.faa    -> protein sequence of human TPH2
scripts/          -> a directory with ad hoc programs and scripts

Tools needed

(already installed in the GDAV server)

  • BLAST+
  • IQ-Tree
  • MAFFT
  • BioPython
  • scripts/extract_seqs_from_blast_result.py

Exercise

Goal 1

Search for homologs of the human TPH1 sequence with BLAST

$ blastp -task blastp -query TPH1_HUMAN.faa -db all_species.parsed.fasta -outfmt 6 -evalue 0.001 > blast_result
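To inspect the hits before moving on, the tabular output can be read with a few lines of Python. This is a minimal sketch, assuming the result uses the 12 default columns of `-outfmt 6`; the function name `read_blast6` and the two example hits are made up for illustration and are not real BLAST output.

```python
import csv
import io

# Column names of the default BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast6(handle, max_evalue=0.001):
    """Yield one dict per hit whose E-value is at most max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Two invented hits (hypothetical IDs and scores), only to exercise the parser.
example = (
    "query1\tsubjA\t98.5\t440\t6\t0\t1\t440\t1\t440\t0.0\t900\n"
    "query1\tsubjB\t31.0\t80\t50\t3\t10\t89\t5\t84\t0.5\t30.1\n"
)
hits = list(read_blast6(io.StringIO(example)))
print([h["sseqid"] for h in hits])
```

With a real result, `open("blast_result")` would take the place of the `io.StringIO` example. Lowering `max_evalue` is a quick way to see how the set of candidate homologs shrinks as the threshold becomes stricter.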

Extract the sequences of all TPH1 homologs using the provided script extract_seqs_from_blast_result.py

$ python extract_seqs_from_blast_result.py blast_result all_species.parsed.fasta > all_homologs.faa
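The provided script is not shown on this page, so the following is only a guess at the kind of work it does, not its actual code: collect the subject IDs from the second column of the BLAST table and print the matching FASTA records. All function names and example sequences below are hypothetical.

```python
import io


def first_word(header):
    """Return the identifier part of a FASTA header (text before the first space)."""
    words = header.split()
    return words[0] if words else ""


def read_fasta(handle):
    """Yield (header, sequence) for each record in a FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hit_ids(blast_handle):
    """Collect subject IDs (second column) from BLAST tabular output."""
    ids = set()
    for line in blast_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            ids.add(fields[1])
    return ids


def extract(blast_handle, fasta_handle):
    """Return the FASTA records whose identifier appears among the BLAST hits."""
    wanted = hit_ids(blast_handle)
    return [(h, s) for h, s in read_fasta(fasta_handle) if first_word(h) in wanted]


# Invented data: three short sequences, two of which are "hits".
blast_text = "query1\tsp1_geneA\t90.0\nquery1\tsp3_geneC\t45.0\n"
fasta_text = (
    ">sp1_geneA some description\nMKTAYIAK\nQRQISFVK\n"
    ">sp2_geneB\nMSTNPKPQ\n"
    ">sp3_geneC\nMADEEKLP\n"
)
records = extract(io.StringIO(blast_text), io.StringIO(fasta_text))
for header, seq in records:
    print(f">{header}\n{seq}")
```

The sketch assumes that the IDs in the BLAST table match the first word of each FASTA header, which is how BLAST+ normally reports subject IDs for a database built from that file.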

Align the sequences into a multiple sequence alignment

$ mafft all_homologs.faa > all_homologs.alg
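Before building the tree, it can help to check how much of each aligned sequence is gaps, since very gappy rows often point to fragmentary sequences. The sketch below assumes the alignment is in aligned FASTA format (MAFFT's default output) and that every header starts with an identifier; the helper names and the toy alignment are made up for illustration.

```python
import io


def read_alignment(handle):
    """Return {identifier: aligned sequence} from an aligned FASTA file."""
    parts, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif line and name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def gap_fractions(alignment):
    """Return the fraction of '-' characters in each aligned sequence."""
    return {n: s.count("-") / len(s) for n, s in alignment.items() if s}


# Toy alignment with invented sequences: seqB is mostly gaps.
toy = ">seqA\nMKT-LLV\n>seqB\nM-----V\n"
fractions = gap_fractions(read_alignment(io.StringIO(toy)))
for name, frac in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{frac:.2f}")
```

On the real data, `open("all_homologs.alg")` would replace the toy input.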

Reconstruct a phylogenetic tree

$ ./iqtree -s all_homologs.alg

Visualize the tree (*.treefile) at http://etetoolkit.org/treeview/. Could you explain the co-orthology scenario of the Danio rerio and Loxodonta sequences?

Visualize the tree and alignment. Could you guess what’s happening with the Loxodonta homologs?

Comparative Genomics Exercise 2 (Fine-grained phylogenetic detection of orthologs and paralogs)

Goal 2

Find homologs of the human protein TP53 in the same 4 species from exercise 1 (human, D. rerio, C. intestinalis, L. africana). Tip: BLAST the sequence and filter by E-value.

>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD

Extract the sequences of all homologs and build a phylogenetic tree with them. Visualize the tree using http://etetoolkit.org/treeview or http://itol.embl.de. Could you identify duplication and speciation events? How many putative orthologous groups can you identify?
Is there anything unusual in the evolution of this gene family?

Is the tree rooted? Where would you root it?

Upload the tree into http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?
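One common heuristic for labelling the internal nodes of a rooted gene tree is species overlap: a node is called a duplication if the species found under its two children overlap, and a speciation otherwise. The toy sketch below applies that rule to two rootings of the same unrooted four-gene tree to show why the root position matters. It is an illustration only: the nested-tuple tree format, the `species|gene` leaf names and the function names are assumptions made for this example, not part of the exercise data.

```python
def species_of(leaf):
    """Hypothetical naming convention: leaves are written as 'species|gene'."""
    return leaf.split("|")[0]


def label_events(tree, events=None):
    """Label internal nodes of a rooted binary tree given as nested 2-tuples.

    Returns (species under this node, list of events in post-order), where
    each event is 'D' (duplication) or 'S' (speciation).
    """
    if events is None:
        events = []
    if isinstance(tree, str):  # a leaf
        return {species_of(tree)}, events
    left, right = tree
    left_species, _ = label_events(left, events)
    right_species, _ = label_events(right, events)
    # Species overlap rule: shared species between the two children => duplication.
    events.append("D" if left_species & right_species else "S")
    return left_species | right_species, events


def count_duplications(tree):
    """Count the nodes labelled as duplications in a rooted tree."""
    return label_events(tree)[1].count("D")


# Same unrooted topology (human|a with fish|a, human|b with fish|b), rooted twice.
rooted_on_inner_branch = (("human|a", "fish|a"), ("human|b", "fish|b"))
rooted_on_leaf_branch = ("human|a", ("fish|a", ("human|b", "fish|b")))

print(count_duplications(rooted_on_inner_branch))
print(count_duplications(rooted_on_leaf_branch))
```

In this toy case, rooting on the inner branch yields one duplication followed by two speciations, while rooting on the branch leading to `human|a` yields two inferred duplications. The topology is the same in both cases, and only the root placement changes the inferred history.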

Advanced tip: Handling phylogenetic trees programmatically using ete3
