Comparative Genomics Exercise 2: Orthology inference using Phylogeny
Log in to the GDAV server
ssh youruser@IP
Create a directory in your home folder called compgenomics_ex2
mkdir compgenomics_ex2
and enter the directory
cd compgenomics_ex2
All the files needed for this exercise are available on the GDAV server in /home/compgenomics/4proteomes/. Make sure you can see them and take a few seconds to understand what they contain:
$ ls /home/compgenomics/4proteomes/
4proteomes.faa -> all protein sequences from 4 species: human, elephant, Zebrafish and Ciona intestinalis.
G3T0S8_LOXAF.faa -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa -> protein sequence of the Zebrafish gene called TPH1A
TPH2_human.faa -> protein sequence of Human TPH2
scripts/ -> a directory with ad hoc programs and scripts
Required software (already installed on the GDAV server):
- BLAST+
- IQ-Tree
- MAFFT
- BioPython
- scripts/extract_seqs_from_blast_result.py
Reconstruct the phylogeny of all TPH2 proteins in the 4 target proteomes, and interpret the tree to identify orthologs.
As in exercise 1, use BLAST to identify all significant hits of the query protein TPH2_human.faa in the 4proteomes dataset.
Tip: You can reuse the BLAST database of exercise 1 and use a command similar to:
$ blastp -task blastp -query /home/compgenomics/ex1/TPH2_human.faa -db ~/compgenomics_ex1/4proteomes.blastdb -outfmt 6 -evalue 0.001 > TPH2_homologs.blastout
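The tabular output produced by `-outfmt 6` can also be inspected programmatically. The following is a minimal sketch, assuming the default 12-column layout of BLAST+ tabular output; the function name `best_hits` and the example rows are made up for illustration and do not come from the exercise data.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(handle, max_evalue=1e-3):
    """Return {subject_id: lowest E-value} for hits at or below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        if evalue <= max_evalue:
            sid = hit["sseqid"]
            # A subject can appear in several rows (one per HSP); keep the best.
            best[sid] = min(evalue, best.get(sid, evalue))
    return best

# Made-up rows, for illustration only.
example = (
    "query1\tsubjA\t98.5\t490\t7\t0\t1\t490\t1\t490\t0.0\t1001\n"
    "query1\tsubjB\t65.2\t440\t150\t3\t40\t478\t5\t442\t2e-180\t512\n"
    "query1\tsubjB\t40.0\t50\t30\t0\t1\t50\t1\t50\t1e-05\t45.1\n"
    "query1\tsubjC\t28.0\t90\t60\t2\t10\t99\t3\t90\t0.5\t30.2\n"
)
hits = best_hits(io.StringIO(example))
for sid, evalue in sorted(hits.items(), key=lambda kv: kv[1]):
    print(sid, evalue)
```

To run it on a real result, replace the `io.StringIO(example)` handle with `open("TPH2_homologs.blastout")`.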
Extract the sequences of all TPH2 homologs using the script provided in /home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py.
$ python /home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py TPH2_homologs.blastout /home/compgenomics/ex1/4proteomes.faa > TPH2_homologs.faa
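To make the extraction step less of a black box, here is a rough sketch of how sequences could be pulled out of a FASTA file given a set of hit identifiers. This is not the course script: `read_fasta`, `extract` and the toy records are hypothetical, and the sketch assumes the sequence ID is the first word of each FASTA header.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def extract(handle, wanted_ids):
    """Keep records whose ID (first word of the header) is in wanted_ids."""
    selected = []
    for header, seq in read_fasta(handle):
        seq_id = (header.split() or [""])[0]
        if seq_id in wanted_ids:
            selected.append((header, seq))
    return selected

# Toy records, for illustration only.
example_fasta = ">seqA some description\nMKV\nLLA\n>seqB\nMTT\n>seqC\nMPP\n"
selected = extract(io.StringIO(example_fasta), {"seqA", "seqC"})
for header, seq in selected:
    print(f">{header}\n{seq}")
```

In practice the set of wanted IDs would come from the second column of the BLAST tabular output.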
Align the homologous sequences in a multiple sequence alignment
$ mafft TPH2_homologs.faa > TPH2_homologs.alg
Reconstruct a phylogenetic tree
$ ./iqtree -s TPH2_homologs.alg
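Before interpreting the tree, it can help to check how much of each aligned sequence consists of gaps, since a sequence that is mostly gaps is often a fragment or a partial gene model and may sit on an odd branch. The sketch below assumes the alignment is in aligned FASTA format (MAFFT's default output); the function names and the toy alignment are illustrative only.

```python
import io

def parse_alignment(handle):
    """Read an aligned FASTA file into {sequence_id: aligned_sequence}."""
    seqs, current = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            current = (line[1:].split() or [""])[0]
            seqs[current] = ""
        elif line and current is not None:
            seqs[current] += line
    return seqs

def gap_fractions(alignment):
    """Return {sequence_id: fraction of alignment columns that are gaps}."""
    return {name: seq.count("-") / len(seq)
            for name, seq in alignment.items() if seq}

# Toy alignment, for illustration only.
toy = ">full_length\nMKTAYIAKQR\n>fragment\n------AKQR\n"
fractions = gap_fractions(parse_alignment(io.StringIO(toy)))
for name, frac in fractions.items():
    print(f"{name}\t{frac:.2f}")
```

Sorting the result by gap fraction gives a quick list of candidates to inspect by eye in an alignment viewer.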
Visualize the tree (*.treefile) at http://etetoolkit.org/treeview/. Can you explain the co-orthology scenario of the Danio rerio and Loxodonta sequences?
Visualize the tree together with the alignment. Can you guess what is happening with the Loxodonta homologs?
Comparative Genomics Exercise 2 (Fine-grained phylogenetic detection of orthologs and paralogs)
Find homologs of the human protein TP53 in the same 4 species from exercise 1 (Human, D. rerio, C. intestinalis, L. africana). Tip: BLAST the sequence and filter by E-value.
>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD
Extract the sequences of all homologs and build a phylogenetic tree with them. Visualize the tree using http://etetoolkit.org/treeview or http://itol.embl.de
Could you identify duplication and speciation events?
How many putative orthologous groups can you identify?
Is there anything unusual in the evolution of this gene family?
Is the tree rooted? Where would you root it?
Upload the tree into http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?
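One common way to reason about the questions above is the species overlap heuristic: an internal node is called a duplication if the subtrees below it share at least one species, and a speciation otherwise. The sketch below is a plain-Python illustration of that idea, not the method prescribed by the exercise. It assumes unquoted Newick labels and a hypothetical leaf naming scheme `GENE_SPECIES` (species code after the last underscore); the two toy trees are the same unrooted topology with different roots.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts {'name': str, 'children': list}.
    Branch lengths are dropped; quoted labels are not supported."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                          # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                      # consume ","
                children.append(node())
            pos += 1                          # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label = text[start:pos].split(":")[0]  # leaf name or support value
        return {"name": label, "children": children}

    return node()

def species_of(leaf_name):
    """Assumed convention: species code follows the last underscore."""
    return leaf_name.rsplit("_", 1)[-1]

def classify(node, events):
    """Append (event, sorted leaf names) per internal node, children first.
    Returns (leaf names, species set) of the subtree."""
    if not node["children"]:
        return [node["name"]], {species_of(node["name"])}
    leaves, seen, overlap = [], set(), False
    for child in node["children"]:
        child_leaves, child_species = classify(child, events)
        if seen & child_species:
            overlap = True                    # same species on both sides
        seen |= child_species
        leaves += child_leaves
    events.append(("duplication" if overlap else "speciation", sorted(leaves)))
    return leaves, seen

# Toy trees: identical unrooted topology, two different root positions.
rooted_a = "((A1_HUMAN:0.1,A1_DANRE:0.2)100:0.1,(A2_HUMAN:0.1,A2_DANRE:0.2)98:0.1);"
rooted_b = "(A1_HUMAN:0.1,(A1_DANRE:0.2,(A2_HUMAN:0.1,A2_DANRE:0.2)98:0.1)100:0.1);"

events_a, events_b = [], []
classify(parse_newick(rooted_a), events_a)
classify(parse_newick(rooted_b), events_b)
for label, events in (("root A", events_a), ("root B", events_b)):
    print(label, [kind for kind, _ in events])
```

With the first root the toy tree needs one duplication; with the second root the same topology needs two, which illustrates why the choice of root matters when counting duplication events.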
Advanced tip: Handling phylogenetic trees programmatically using ete3