Comparative Genomics Exercise 2: Orthology inference using Phylogeny
Log in to the GDAV server
ssh youruser@IP
Create a directory in your home folder called compgenomics_ex2
mkdir compgenomics_ex2
and enter the directory
cd compgenomics_ex2
All the files needed for this exercise are available on the GDAV server at /home/compgenomics/4proteomes/. Make sure you can see them and take a few seconds to understand what they contain:
$ ls /home/compgenomics/4proteomes/
4proteomes.faa -> all protein sequences from 4 species: human, elephant, zebrafish and Ciona intestinalis.
G3T0S8_LOXAF.faa -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa -> protein sequence of the zebrafish gene called TPH1A
TPH2_human.faa -> protein sequence of human TPH2
scripts/ -> a directory with ad hoc programs and scripts
Tools used in this exercise (already installed on the GDAV server):
- BLAST+
- IQ-Tree
- MAFFT
- BioPython
- ete3
- scripts/extract_seqs_from_blast_result.py
- scripts/midpoint_rooting.py
Reconstruct the phylogeny of all TPH2 proteins in the 4 target proteomes, and interpret the tree to identify orthologs.
As in exercise 1, use BLAST to identify all significant hits of the query protein TPH2_human.faa in the 4proteomes dataset.
Tip: You can reuse the BLAST database of exercise 1 and use a command similar to:
$ blastp -task blastp -query /home/compgenomics/4proteomes/TPH2_human.faa -db ~/compgenomics_ex1/4proteomes.blastdb -outfmt 6 -evalue 0.001 > TPH2_homologs.blastout
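The tabular output produced by `-outfmt 6` has twelve tab-separated columns in a fixed default order (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score). As a minimal sketch of how such a file could be inspected, the following Python snippet collects the unique subject IDs that pass an e-value threshold. The function name `read_blast_hits` and the example rows are invented for illustration; they are not real BLAST results.

```python
# Default column names of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def read_blast_hits(lines, max_evalue=0.001):
    """Return subject IDs (first-seen order) of hits with evalue <= max_evalue."""
    hit_ids = []
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = dict(zip(BLAST_COLUMNS, line.split("\t")))
        subject = fields["sseqid"]
        if float(fields["evalue"]) <= max_evalue and subject not in seen:
            seen.add(subject)
            hit_ids.append(subject)
    return hit_ids

# Made-up rows for illustration only (hypothetical IDs and scores).
example = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t400",
    "query1\tsubjB\t30.0\t150\t90\t5\t10\t160\t20\t170\t0.5\t25.0",
    "query1\tsubjA\t40.0\t50\t30\t0\t1\t50\t300\t350\t1e-5\t50.0",
]
print(read_blast_hits(example))
```

To apply it to a real result, pass an open file handle instead of the list, for example `read_blast_hits(open("TPH2_homologs.blastout"))`. A subject can appear on several rows (one per local alignment), which is why the IDs are de-duplicated.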
Extract the sequences of all TPH2 homologs using the script provided in /home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py.
$ python /home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py TPH2_homologs.blastout /home/compgenomics/ex1/4proteomes.faa > TPH2_homologs.faa
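The internals of the provided script are not shown here, but the general idea of pulling a subset of records out of a FASTA file can be sketched as follows. This is an assumption about the approach, not the script's actual code; `parse_fasta` and `select_sequences` are hypothetical helper names, and the sequences are toy data.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def select_sequences(fasta_lines, wanted_ids):
    """Keep records whose ID (first word of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    selected = []
    for header, seq in parse_fasta(fasta_lines):
        words = header.split()
        if words and words[0] in wanted:
            selected.append((header, seq))
    return selected

# Toy FASTA content, invented for illustration.
fasta = [">seqA first protein", "MKV", "LLT", ">seqB", "MAA", ">seqC", "MGG"]
for header, seq in select_sequences(fasta, ["seqA", "seqC"]):
    print(">" + header)
    print(seq)
```

The sketch assumes that the IDs in the BLAST output match the first word of each FASTA header exactly; if the database was built with differently formatted identifiers, the matching rule would need to be adapted.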
Before inferring a phylogenetic tree, homologous sequences need to be aligned. There are multiple programs for this: ClustalOmega, MAFFT, MUSCLE, etc. Here, we will use MAFFT, which has a very simple command line.
$ mafft TPH2_homologs.faa > TPH2_homologs.alg
Check the content of the output (saved in TPH2_homologs.alg). What's the main difference compared to the input FASTA file?
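Besides reading both files by eye, one way to compare them is to tabulate the length of every sequence and the number of gap characters (`-`) it contains. The snippet below is a small sketch of that check on toy data; `read_fasta` and `summarize` are hypothetical helper names, and the two example records are invented.

```python
def read_fasta(lines):
    """Return a dict mapping sequence ID to sequence string."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def summarize(records):
    """Return {ID: (sequence length, number of '-' characters)}."""
    return {name: (len(seq), seq.count("-")) for name, seq in records.items()}

# Toy input and output, invented for illustration.
unaligned = [">a", "MKVL", ">b", "MKL"]
aligned = [">a", "MKVL", ">b", "MK-L"]

print("input: ", summarize(read_fasta(unaligned)))
print("output:", summarize(read_fasta(aligned)))
```

On the real data, the same functions can be called with `open("TPH2_homologs.faa")` and `open("TPH2_homologs.alg")` in place of the toy lists.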
As with MSA programs, there are many programs for building phylogenetic trees: RAxML, IQ-TREE, PhyML, FastTree, MrBayes, PhyloBayes, etc. Here we will use IQ-TREE, which uses a maximum likelihood approach.
You only need to provide the MSA file as input, plus some parameters defining how exhaustive the inference should be. To get a fast result, the following arguments are recommended (they skip the model testing step, which is very slow).
$ iqtree -s TPH2_homologs.alg -m LG
The main IQ-TREE output is the file ending with the .treefile extension. The tree file is in [Newick format](https://en.wikipedia.org/wiki/Newick_format).
You can use the command line tool ete3 to display the tree directly in the terminal:
$ ete3 view --text -t TPH2_homologs.alg.treefile
/-A0A0R4ILE6_DANRE_tph2
|
| /-A0A2R8RPJ0_DANRE_th
| /-|
| | | /-TY3H_HUMAN_TH
| /-| \-|
| | | \-G3U1E7_LOXAF_TH
| /-| |
| | | \-Q1LWZ5_DANRE_th2
...
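The leaf names in this tree follow the pattern ACCESSION_SPECIES_gene (for example, TY3H_HUMAN_TH), so it can help to list which sequences each species contributes before interpreting the topology. The following is a minimal sketch that pulls leaf names out of a Newick string with a regular expression and groups them by the species field. It assumes unquoted leaf names without special characters; the toy tree below reuses names from the output above, but its topology and branch lengths are invented.

```python
import re
from collections import defaultdict

def leaf_names(newick):
    """Return leaf names from a Newick string (assumes unquoted names)."""
    # A leaf name directly follows "(" or "," and ends at ":", ",", ")" or ";".
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]

def group_by_species(names):
    """Group names of the form ACCESSION_SPECIES_gene by their SPECIES field."""
    groups = defaultdict(list)
    for name in names:
        parts = name.split("_")
        species = parts[1] if len(parts) > 1 else "unknown"
        groups[species].append(name)
    return dict(groups)

# Toy tree: real leaf names, made-up topology and branch lengths.
toy = ("((TY3H_HUMAN_TH:0.1,G3U1E7_LOXAF_TH:0.1):0.2,"
       "(A0A2R8RPJ0_DANRE_th:0.3,Q1LWZ5_DANRE_th2:0.4):0.1,"
       "F6Y7Q5_CIOIN_th:0.5);")

for species, names in group_by_species(leaf_names(toy)).items():
    print(species, len(names), names)
```

For the real tree, read the file content first, for example `leaf_names(open("TPH2_homologs.alg.treefile").read())`.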
By default, phylogenetic trees returned by almost all programs are UNROOTED. There are many methods to root a tree, but a common one is midpoint rooting. For convenience, an ad hoc script to root Newick trees is provided in /home/compgenomics/ex1/scripts/midpoint_rooting.py
You can use it to root your tree before visualizing it:
$ python /home/compgenomics/ex1/scripts/midpoint_rooting.py TPH2_homologs.alg.treefile | ete3 view --text
/-A0A2R8RPJ0_DANRE_th
/-|
| | /-TY3H_HUMAN_TH
/-| \-|
| | \-G3U1E7_LOXAF_TH
/-| |
| | \-Q1LWZ5_DANRE_th2
| |
| \-F6Y7Q5_CIOIN_th
|
| /-Q7SYH6_DANRE_pah
--| /-|
| | | /-PH4H_HUMAN_PAH
| /-| \-|
...
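Midpoint rooting places the root halfway along the longest leaf-to-leaf path in the tree. The sketch below illustrates that idea on a small hand-built tree stored as a dictionary of branch lengths. It is not the code of the provided script, whose implementation is not shown here; `farthest` and `midpoint_edge` are hypothetical names, and the toy tree is invented.

```python
def farthest(adj, start):
    """Return (node, distance, parent map) for the node farthest from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in adj[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                parent[neighbour] = node
                stack.append(neighbour)
    far = max(dist, key=dist.get)
    return far, dist[far], parent

def midpoint_edge(adj):
    """Return (a, b, offset): the root lies on branch a-b, `offset` away from a."""
    # Two passes find the two most distant nodes of a tree (its longest path).
    end1, _, _ = farthest(adj, next(iter(adj)))
    end2, longest, parent = farthest(adj, end1)
    half = longest / 2
    walked = 0.0
    node = end2
    # Walk from end2 back towards end1 until half the path length is covered.
    while parent[node] is not None:
        length = adj[node][parent[node]]
        if walked + length >= half:
            return node, parent[node], half - walked
        walked += length
        node = parent[node]
    return None

# Toy unrooted tree: leaves A-D, internal nodes X and Y, made-up branch lengths.
adj = {
    "A": {"X": 1.0},
    "B": {"X": 0.5},
    "C": {"Y": 1.0},
    "D": {"Y": 5.0},
    "X": {"A": 1.0, "B": 0.5, "Y": 1.0},
    "Y": {"X": 1.0, "C": 1.0, "D": 5.0},
}
print(midpoint_edge(adj))  # ('Y', 'D', 1.5)
```

In this toy tree the longest path runs from A to D (length 7), so the root falls on the long branch leading to D, 3.5 away from both ends. In ete3 the same operation is available through `Tree.get_midpoint_outgroup()` followed by `Tree.set_outgroup()`.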
Alternatively, you can print the content of the file directly in the terminal and paste it into any online tree visualization server (or transfer the file and upload it).
- What's the evolutionary relationship between F1R1D3_DANRE_tph1a and Q6IWP4_DANRE_tph1b ?
- What's the evolutionary relationship between TPH2_HUMAN_TPH2 and TPH1_HUMAN_TPH1 ?
- What are the zebrafish (Danio rerio) ortholog(s) of the human sequence TPH1_HUMAN_TPH1?
- How many duplication events can you identify?
- How many putative orthologous groups can you identify?
Repeat the same protocol to find all homologs of the human P53 sequence given below in the 4 target proteomes, build a phylogeny, and identify orthologs.
>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD
- Could you identify duplication and speciation events?
- How many putative orthologous groups can you identify?
- Is there anything unusual in the evolution of this gene family?
- Is the tree rooted? Where would you root it?
- Upload the tree into http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?