Comparative Genomics Exercise 1: Orthology inference using BLAST

Jordi edited this page Jan 29, 2024 · 10 revisions

Environment

Log in to the GDAV server

$ ssh youruser@IP

Create a directory in your home folder called compgenomics_ex1

$ mkdir compgenomics_ex1

and enter the directory

$ cd compgenomics_ex1

All the files needed for this exercise are available on the GDAV server at /home/compgenomics/4proteomes/. Make sure you can see them and take a few seconds to understand what they contain:

$ ls /home/compgenomics/4proteomes/

They are:

4proteomes.faa    -> all protein sequences from 4 species: Human, Elephant, Zebrafish and Ciona intestinalis
G3T0S8_LOXAF.faa  -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa   -> protein sequence of the Zebrafish gene called TPH1A
TPH2_human.faa    -> protein sequence of Human TPH2
scripts/          -> a directory with ad-hoc programs and scripts

Tools needed (already installed on the GDAV server)

  • BLAST+

Exercise

Goal 1 (Loxodonta orthologs)

Using only reciprocal BLAST searches, can you identify the human ortholog of the Loxodonta protein G3T0S8? (Remember that the protein sequence of G3T0S8 is available in /home/compgenomics/4proteomes/G3T0S8_LOXAF.faa.)

Protocol

1. Create a BLAST database

Make a BLAST database containing the 4 input proteomes (Human, Danio rerio, Ciona intestinalis, and Loxodonta africana) and name it 4proteomes.blastdb. All proteomes are already merged into a single FASTA file, /home/compgenomics/4proteomes/4proteomes.faa:

$ makeblastdb \
      -dbtype prot \
      -in /home/compgenomics/4proteomes/4proteomes.faa \
      -out 4proteomes.blastdb

You should now see something like this in your exercise folder:

$ ls -l
total 48604
-rw-rw-r--. 1 test test  7434160 oct 15 15:58 4proteomes.blastdb.phr
-rw-rw-r--. 1 test test   683920 oct 15 15:58 4proteomes.blastdb.pin
-rw-rw-r--. 1 test test 41649539 oct 15 15:58 4proteomes.blastdb.psq

2. Find homologs of G3T0S8_LOXAF

Use the blastp command to search for all homologs of the G3T0S8 sequence. Use an E-value threshold of 0.001.

$ blastp \
      -task blastp \
      -query /home/compgenomics/4proteomes/G3T0S8_LOXAF.faa \
      -db 4proteomes.blastdb \
      -outfmt 6 \
      -evalue 0.001
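
The tabular output of -outfmt 6 has twelve default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The sketch below shows one way to pick out the human hits from such a table. It runs on made-up rows rather than real BLAST results, and it assumes that sequence identifiers end with a species tag such as _HUMAN (check the actual headers in 4proteomes.faa). The row helper and all identifiers are hypothetical.

```shell
# Toy -outfmt 6 rows (made-up identifiers and scores, NOT real BLAST results).
# Default columns: qseqid sseqid pident length mismatch gapopen
#                  qstart qend sstart send evalue bitscore
tmpdir=$(mktemp -d)
tab=$(printf '\t')

row() {  # hypothetical helper: row QUERY SUBJECT BITSCORE
    printf '%s\t%s\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-20\t%s\n' "$1" "$2" "$3"
}

{
    row queryA_LOXAF queryA_LOXAF 400   # self hit
    row queryA_LOXAF P1_HUMAN     200
    row queryA_LOXAF P3_DANRE     180
    row queryA_LOXAF P2_HUMAN     120
    row queryA_LOXAF P2_HUMAN      35   # second alignment to the same protein
} > "$tmpdir/hits.tsv"

# Keep rows whose subject ID (column 2) ends with the assumed species tag,
# sorted by bit score (column 12), highest first.
human_hits=$(awk -F'\t' '$2 ~ /_HUMAN$/' "$tmpdir/hits.tsv" | sort -t "$tab" -k12,12nr)

# Count distinct human proteins, not rows.
n_human=$(printf '%s\n' "$human_hits" | cut -f2 | sort -u | awk 'END { print NR }')

# Subject ID of the highest-scoring human hit.
best_human=$(printf '%s\n' "$human_hits" | head -n 1 | cut -f2)

echo "distinct human hits: $n_human; best: $best_human"
```

BLAST can report several alignments for the same subject sequence, which is why the count is taken over distinct subject IDs rather than over rows. To apply the same filtering to a real search, redirect the blastp output to a file and use that file in place of the toy table.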

3. Answer these questions

  1. How many homologs of G3T0S8 are in human?
  2. Which is the closest one?
  3. Are they orthologs?

4. Save homologous sequences

Extract the sequence of the closest homolog of G3T0S8_LOXAF in human and save it in a new file called G3T0S8_best_human_hit.faa.

There are many different ways to do this. You can open the FASTA file containing all proteomes, search for the sequence name and extract it manually, or you can do it from the command line. Note that grep -A 1 prints only one line after the matching header, so the command below returns the full sequence only if each sequence is stored on a single line.

$ grep -A 1 \
      [HUMAN_homolog_in_blast_result] \
      /home/compgenomics/4proteomes/4proteomes.faa > G3T0S8_best_human_hit.faa
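
If the sequences in a FASTA file are wrapped over several lines, a small awk filter can extract a whole record instead. This is a minimal sketch on a toy file with made-up sequences; the function name extract_fasta is hypothetical, and it assumes the identifier is the first word after ">" in the header.

```shell
# Toy FASTA file with wrapped sequences (made-up data for illustration).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.faa" <<'EOF'
>seqA_HUMAN toy protein A
MKT
LLV
>seqB_HUMAN toy protein B
MAA
EOF

# Print one record: its header line and every sequence line up to the
# next header. The ID must equal the first word of the header exactly.
extract_fasta() {  # usage: extract_fasta SEQ_ID FASTA_FILE
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }   # decide at each header
        keep                                    # print lines while keep is true
    ' "$2"
}

extracted=$(extract_fasta seqA_HUMAN "$tmpdir/toy.faa")
printf '%s\n' "$extracted"
```

Matching the identifier exactly, rather than as a substring, avoids pulling out a second record whose name merely contains the one you searched for.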

5. BLAST G3T0S8_best_human_hit.faa against the same database

$ blastp \
      -task blastp \
      -query [HUMAN_seq_file] \
      -db 4proteomes.blastdb \
      -outfmt 6 \
      -evalue 0.001

6. Answer these questions

  1. Are they reciprocal hits?
  2. Are they orthologous with each other?
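
The reciprocal check can also be scripted once both searches are saved as tabular files. The sketch below uses made-up rows in place of real BLAST output; the helpers row and best_hit, the file names, and the species tags _HUMAN and _LOXAF are all assumptions for illustration. It treats two sequences as reciprocal best hits when the best elephant hit of the human sequence is the original elephant query.

```shell
# Toy forward and reverse searches (made-up identifiers and scores).
tmpdir=$(mktemp -d)
tab=$(printf '\t')

row() {  # hypothetical helper: row QUERY SUBJECT BITSCORE
    printf '%s\t%s\t50.0\t100\t50\t0\t1\t100\t1\t100\t1e-20\t%s\n' "$1" "$2" "$3"
}

# Forward: elephant query against all proteomes.
{
    row E1_LOXAF E1_LOXAF 500
    row E1_LOXAF H1_HUMAN 300
    row E1_LOXAF H2_HUMAN 250
} > "$tmpdir/forward.tsv"

# Reverse: best human hit used as the query.
{
    row H1_HUMAN H1_HUMAN 520
    row H1_HUMAN E1_LOXAF 300
    row H1_HUMAN E2_LOXAF 200
} > "$tmpdir/reverse.tsv"

# Highest-scoring subject (column 2) whose ID ends with the species tag,
# ranked by bit score (column 12).
best_hit() {  # usage: best_hit FILE SPECIES_TAG
    awk -F'\t' -v tag="$2" '$2 ~ (tag "$")' "$1" |
        sort -t "$tab" -k12,12nr | head -n 1 | cut -f2
}

query=E1_LOXAF
forward_best=$(best_hit "$tmpdir/forward.tsv" _HUMAN)
reverse_best=$(best_hit "$tmpdir/reverse.tsv" _LOXAF)

if [ "$reverse_best" = "$query" ]; then
    verdict="reciprocal best hits"
else
    verdict="not reciprocal"
fi
echo "$query -> $forward_best -> $reverse_best: $verdict"
```

Reciprocal best hits are a heuristic for orthology rather than proof of it, so the verdict is a starting point for the questions above, not a final answer.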

Goal 2 (Zebrafish orthologs)

Repeat the previous protocol with the Zebrafish sequence found in the file TPH1A_rerio.faa (Danio rerio homolog).

Answer these questions

  1. Which human proteins are homologs of the Zebrafish sequence?
  2. Are they the same as in the Loxodonta example?
  3. What is the ortholog in human (based on reciprocal BLAST)?
  4. What can you say about the gene TPH1B in Danio rerio?
