Comparative Genomics Exercise 2: Orthology inference using Phylogeny
Log in to the GDAV server
$ ssh youruser@IP
Create a directory in your home folder called compgenomics_ex2
$ mkdir compgenomics_ex2
and enter the directory
$ cd compgenomics_ex2
All the files needed for this exercise are available on the GDAV server
at /home/compgenomics/4proteomes/. Make sure you can see them and
take a few seconds to understand what they contain:
$ ls /home/compgenomics/4proteomes/
They are:
4proteomes.faa -> all protein sequences from 4 species: Human, Elephant, Zebrafish and Ciona intestinalis
G3T0S8_LOXAF.faa -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa -> protein sequence of the Zebrafish gene called TPH1A
TPH2_human.faa -> protein sequence of Human TPH2
scripts/ -> a directory with ad-hoc programs and scripts
- BLAST+
- IQ-Tree
- MAFFT
- BioPython
- ete3
- extract_seqs_from_blast_result.py
- midpoint_rooting.py
Reconstruct the phylogeny of all TPH2 proteins in the 4 target proteomes, and interpret the tree to identify orthologs.
As in exercise 1, use BLAST to identify all significant hits of the
query protein TPH2_human.faa in the 4proteomes dataset.
Tip: You can reuse the BLAST database of exercise 1 and use a command similar to:
$ blastp \
-task blastp \
-query /home/compgenomics/4proteomes/TPH2_human.faa \
-db ~/compgenomics_ex1/4proteomes.blastdb \
-outfmt 6 \
-evalue 0.001 > TPH2_homologs.blastout
Extract all homologs of the TPH2 sequences using the script provided
in
/home/compgenomics/4proteomes/scripts/extract_seqs_from_blast_result.py.
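The source of the provided script is not reproduced here. As a rough illustration of what this kind of extraction step involves, the sketch below collects the subject IDs from the second column of a BLAST tabular (-outfmt 6) file and returns the matching FASTA records. It is not the provided script: the function names (read_hit_ids, iter_fasta, extract_hits) are hypothetical, and it assumes that the first word of each FASTA header is identical to the BLAST subject ID.

```python
def read_hit_ids(blast_path):
    """Return the set of subject IDs (column 2) from BLAST -outfmt 6 output."""
    hits = set()
    with open(blast_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                hits.add(fields[1])
    return hits


def iter_fasta(fasta_path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_hits(blast_path, fasta_path):
    """Return the FASTA records whose ID appears among the BLAST hits."""
    hits = read_hit_ids(blast_path)
    # Assumption: the sequence ID is the first word of the FASTA header.
    return [
        (header, seq)
        for header, seq in iter_fasta(fasta_path)
        if (header.split() or [""])[0] in hits
    ]
```

A set is used for the hit IDs because the same subject sequence can appear in several BLAST rows (one per local alignment), while each sequence should be written only once.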
$ python /home/compgenomics/4proteomes/scripts/extract_seqs_from_blast_result.py \
TPH2_homologs.blastout \
/home/compgenomics/4proteomes/4proteomes.faa > TPH2_homologs.faa
Before inferring a phylogenetic tree, homologous sequences need to be aligned. There are multiple programs for this: Clustal Omega, MAFFT, MUSCLE, etc. Here, we will use MAFFT, which has a very simple command line.
$ mafft TPH2_homologs.faa > TPH2_homologs.alg
Check the content of the output (saved in TPH2_homologs.alg). What's the main difference compared to the input FASTA file?
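One way to compare the two files is to summarize them programmatically. The sketch below reports how many records a FASTA file contains and the range of record lengths, so the same summary can be produced for TPH2_homologs.faa and TPH2_homologs.alg. The helper names (fasta_lengths, summarize) are hypothetical, and only standard-library Python is assumed.

```python
def fasta_lengths(path):
    """Return {sequence_id: number_of_characters} for a FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the sequence ID.
                current = (line[1:].split() or [""])[0]
                lengths[current] = 0
            elif line and current is not None:
                # Every character counts, including any gap symbols.
                lengths[current] += len(line)
    return lengths


def summarize(path):
    """Print the record count and the shortest/longest record length."""
    lengths = fasta_lengths(path)
    if not lengths:
        print(f"{path}: no records")
        return
    shortest, longest = min(lengths.values()), max(lengths.values())
    print(f"{path}: {len(lengths)} records, lengths {shortest}-{longest}")
```

Calling summarize("TPH2_homologs.faa") and then summarize("TPH2_homologs.alg") prints one line per file for a side-by-side comparison.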
As with MSA programs, there are many programs for building phylogenetic trees: RAxML, IQ-TREE, PhyML, FastTree, MrBayes, PhyloBayes, etc. Here we will use IQ-Tree, which uses a Maximum Likelihood approach.
You only need to provide the MSA file as input, plus some parameters defining how exhaustive the inference should be. To get a fast result, the following arguments are recommended; fixing the substitution model to LG with -m LG avoids the model testing step, which is very slow.
$ iqtree -s TPH2_homologs.alg -m LG
The main IQ-Tree output is the file ending with the .treefile
extension. The tree file is in Newick
format.
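As a brief illustration of the format: a Newick string encodes a tree as nested parentheses, with each leaf typically written as name:branch_length and internal nodes optionally followed by a support value. The sketch below pulls the leaf names out of such a string with a regular expression. The example tree and its names are made up, and the pattern assumes unquoted leaf names, so this is not a general Newick parser; a library such as ete3 would be the more robust choice.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a Newick string (unquoted names only).

    A leaf name is the token that directly follows "(" or ",";
    support values follow ")" and are therefore skipped.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


# Hypothetical four-leaf tree with one support value (95).
example = "((geneA_HUMAN:0.1,geneA_DANRE:0.2)95:0.05,geneB_HUMAN:0.3,geneB_DANRE:0.4);"
print(newick_leaf_names(example))
# ['geneA_HUMAN', 'geneA_DANRE', 'geneB_HUMAN', 'geneB_DANRE']
```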
You can use the command line tool ete3 to display the tree directly in
the terminal:
$ ete3 view --text -t TPH2_homologs.alg.treefile
/-A0A0R4ILE6_DANRE_tph2
|
| /-A0A2R8RPJ0_DANRE_th
| /-|
| | | /-TY3H_HUMAN_TH
| /-| \-|
| | | \-G3U1E7_LOXAF_TH
| /-| |
| | | \-Q1LWZ5_DANRE_th2
...
By default, phylogenetic trees returned by almost all programs are
unrooted. There are many methods to root a tree, but a common one is
midpoint rooting. For convenience, an ad hoc script to root
Newick trees is provided in
/home/compgenomics/4proteomes/scripts/midpoint_rooting.py
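Midpoint rooting places the root halfway along the longest path between any two leaves. The provided script's implementation is not shown here; the sketch below only illustrates the idea. It is a simplification under stated assumptions: the tree is given as a list of (node, node, branch_length) edges rather than parsed from Newick, all branch lengths are positive, and the function names (farthest, midpoint) and the example tree are hypothetical.

```python
from collections import defaultdict


def farthest(adj, start):
    """Return (farthest_node, distances, parents) measured from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, length in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + length
                parent[nxt] = node
                stack.append(nxt)
    return max(dist, key=dist.get), dist, parent


def midpoint(edges):
    """Locate the midpoint of the longest leaf-to-leaf path.

    Returns (node_a, node_b, offset): the root belongs on the branch
    between node_a and node_b, at distance offset from node_a.
    """
    adj = defaultdict(list)
    for a, b, length in edges:
        adj[a].append((b, length))
        adj[b].append((a, length))

    # Two passes find the two most distant tips of the tree.
    tip1, _, _ = farthest(adj, next(iter(adj)))
    tip2, dist, parent = farthest(adj, tip1)
    half = dist[tip2] / 2.0

    # Walk from tip2 towards tip1 until the halfway point is crossed.
    node = tip2
    while dist[parent[node]] > half:
        node = parent[node]
    return node, parent[node], dist[node] - half


# Hypothetical unrooted tree: tips A-D, internal nodes X and Y.
edges = [("A", "X", 1.0), ("B", "X", 3.0), ("X", "Y", 1.0),
         ("C", "Y", 2.0), ("D", "Y", 6.0)]
print(midpoint(edges))  # ('Y', 'D', 1.0)
```

In this example the longest path runs from B to D (total length 10), so the root falls 5 units from each of them: on the branch between Y and D, one unit away from Y.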
You can use the provided script to root your tree before visualizing it:
$ python /home/compgenomics/4proteomes/scripts/midpoint_rooting.py \
TPH2_homologs.alg.treefile | ete3 view --text
/-A0A2R8RPJ0_DANRE_th
/-|
| | /-TY3H_HUMAN_TH
/-| \-|
| | \-G3U1E7_LOXAF_TH
/-| |
| | \-Q1LWZ5_DANRE_th2
| |
| \-F6Y7Q5_CIOIN_th
|
| /-Q7SYH6_DANRE_pah
--| /-|
| | | /-PH4H_HUMAN_PAH
| /-| \-|
...
Alternatively, you can print the content of the file directly in the terminal and paste it into any of the online tree visualization servers (or transfer the file and upload it).
- What's the evolutionary relationship between F1R1D3_DANRE_tph1a and Q6IWP4_DANRE_tph1b?
- What's the evolutionary relationship between TPH2_HUMAN_TPH2 and TPH1_HUMAN_TPH1?
- What are the zebrafish (Danio rerio) ortholog(s) of the human sequence TPH1_HUMAN_TPH1?
- How many duplication events can you identify?
- How many putative orthologous groups can you identify?
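One common way to make such calls systematically on a rooted tree is the species-overlap rule: an internal node is treated as a duplication when the species found under its child branches overlap, and as a speciation otherwise. The sketch below applies this rule to a toy tree written as nested tuples. The tree, the sequence names, and the function names are made up for illustration, and the species code is assumed to be the second underscore-separated field of each name (as in F1R1D3_DANRE_tph1a).

```python
def species_of(leaf_name):
    """Assumption: names look like ID_SPECIES_gene, e.g. x1_DANRE_geneA."""
    return leaf_name.split("_")[1]


def annotate(node, events):
    """Return (species, leaves) under node; record one event per internal node.

    Leaves are strings, internal nodes are tuples of child nodes.
    """
    if isinstance(node, str):
        return {species_of(node)}, [node]
    seen, leaves, overlap = set(), [], False
    for child in node:
        child_species, child_leaves = annotate(child, events)
        if seen & child_species:
            # The same species occurs under two different child branches.
            overlap = True
        seen |= child_species
        leaves += child_leaves
    events.append(("duplication" if overlap else "speciation", leaves))
    return seen, leaves


# Hypothetical rooted tree with two gene copies (A and B).
toy_tree = (
    ("a1_HUMAN_geneA", "a2_DANRE_geneA"),
    ("b1_HUMAN_geneB", ("b2_DANRE_geneB1", "b3_DANRE_geneB2")),
)
events = []
annotate(toy_tree, events)
for kind, leaves in events:
    print(kind, leaves)
```

Because the rule depends on which leaves sit below each node, the calls can change when the root is moved, so the tree should be rooted before applying it.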
Repeat the same protocol to find all homologs of the P53 sequence in the 4 target proteomes, build a phylogeny, and identify orthologs.
>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD
- Could you identify duplication and speciation events?
- How many putative orthologous groups can you identify?
- Is there anything unusual in the evolution of this gene family?
- Is the tree rooted? Where would you root it?
- Upload the tree to http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?