
Tutorial: Benchmarking Protein Subcellular Localization Prediction


Tutorial on Benchmarking Performance of Protein Subcellular Localization Prediction by Different Methods

This tutorial assumes that the binary files are installed as instructed in the README.md file. We will use the MC_pipeline tool to run cross-validation on the different configurations illustrated in this tutorial. The help dialog of MC_pipeline lists the available options, which are explained in the following sections:

$ ./MC_pipeline -h
usage:
  MC_pipeline [<input>] options

where options are:
  -T, --test <test>                         test file
  -m, --model <bmc|gmc|mc|vmc|zymc>         Markov Chains model, default:bmc
  -G, --grouping <diamond11|nogrouping      grouping method, default:
  |ofer15|ofer8>                            diamond11
  -f, --fformat <deeploc_loc|deeploc_sol    input file processor, default:
  |files1|psort|targetp|uniref>             deeploc_loc
  -c, --criteria <bhat|chi|cos|dot|dpd1     Similarity Criteria, default:bhat
  |dpd2|dpd3|dwcos|euclidean|gaussian
  |hell|intersection|itakura-saitu|kl
  |mahalanobis|max_intersection>
  -s, --strategy <acc|discretized|kmers     Classification Strategy, default:
  |knn_mcp|knn_mcs|knn_stack|propensity     acc
  |rf_mcp|rf_mcs|rf_mcs_sp|segmentation
  |svm_mcp|svm_mcs|svm_stack|voting>
  -o, --order <MC order>                    Specify MC of higher order o,
                                            default:3
  -k, --k-fold <k-fold>                     cross validation k-fold, default:
                                            10
  -?, -h, --help                            display usage information

Downloading Datasets

The following repository contains important datasets that are used extensively in the literature:

git clone https://github.com/A-Alaa/protein-localization-datasets.git

DeepLoc

In this tutorial, we will benchmark the different methods using the DeepLoc dataset. The data consists of 13,858 protein entries categorized into 10 subcellular locations, distributed as follows:

Location            #Sequences    Location                  #Sequences
Nucleus             4043          Cytoplasm                 2542
Extracellular       1973          Mitochondrion             1510
Cell membrane       1340          Endoplasmic reticulum     862
Plastid             757           Golgi apparatus           356
Lysosome/Vacuole    321           Peroxisome                154

In order to use this dataset in the pipeline, we need to specify a format processor that retrieves the sequences and their associated labels from the FASTA file, so the command should now contain the following parameters:

./MC_pipeline (other options..) -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta 

Markov Model Selection

There are two possible Markov chain models that can be used:

  1. zymc, which implements the Markov model with the Zheng Yuan assumption when generalizing the model to higher orders; see: Prediction of protein subcellular locations using Markov chain models. For example, for a second-order Markov model, the probability P(s1s2s3) is decomposed using this simplifying assumption (the exact decomposition is given in the referenced paper; a sketch of the standard chain-rule factorizations is shown after this list).
  2. mc, without the above simplification.
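
For orientation, the exact chain-rule factorizations below are standard Markov chain identities and not specific to this codebase; the zymc approximation of the higher-order terms is the one described in the referenced paper:

P(s_1 s_2 \dots s_n) = P(s_1) \prod_{i=2}^{n} P(s_i \mid s_{i-1})            (first order)

P(s_1 s_2 s_3) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_1 s_2)              (second order, no simplification)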

For a first-order Markov model, zymc and mc are equivalent. In this tutorial, we will assume first order, so choosing either of the two models makes no difference. Our command line now becomes:

./MC_pipeline (other options..) -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta 

Similarity/Dissimilarity Metric Selection

Similarity/Dissimilarity metrics are used to produce the latent representation. The following table lists some of the metrics supported by the project:

Parameter name   Interpretation
cos              Cosine similarity
kl               Kullback-Leibler Divergence
euclidean        Euclidean Distance
chi              Chi-squared Distance
gaussian         Gaussian Radial Basis Function
mahalanobis      Mahalanobis Distance
hell             Hellinger Distance
intersection     Intersection Similarity

For implementation details and other metrics that are not listed in the table, see: src/include/SimilarityMetrics.hpp.

In this tutorial, we will use the cos similarity function, which is very stable in performance.
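
For intuition only, the following is a minimal sketch of cosine similarity between two latent/feature vectors; it is not the project's implementation, which lives in src/include/SimilarityMetrics.hpp:

#include <cmath>
#include <numeric>
#include <vector>

// Cosine similarity between two equally sized, non-zero feature vectors:
// cos(a, b) = (a . b) / (|a| * |b|); values near 1 mean the vectors point
// in nearly the same direction.
double cosineSimilarity(const std::vector<double> &a, const std::vector<double> &b)
{
    const double dot   = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    const double normA = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0));
    const double normB = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0));
    return dot / (normA * normB);
}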

Our command line now becomes:

./MC_pipeline (other options..) -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta 

Classification Method

Parameter name   Interpretation
propensity       Traditional inference of Markov chains using maximum propensity.
svm_mcs          Classification using an ensemble of binary Support Vector Machines. Each sequence is represented by its latent vector as a feature vector.
rf_mcs           Classification using an ensemble of binary Random Forests. Each sequence is represented by its latent vector as a feature vector.
knn_mcs          Classification using K-nearest neighbors. Each sequence is represented by its latent vector as a feature vector.

For this tutorial, we can specify a set of classifiers in a single run to benchmark their performance.

For example, we can use [propensity] alone if we are interested only in maximum-propensity-based inference, or [propensity,rf_mcs] if we are interested in both the propensity inference and the classification using the random forest method described in the table.
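
To make the latent-vector strategies concrete, here is a minimal, hypothetical sketch of what a k-nearest-neighbours classifier over such vectors does (plain Euclidean distance and majority vote); it is for illustration only and is not the pipeline's knn_mcs implementation:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A labelled training example: the latent vector of a sequence plus its
// subcellular-location label (hypothetical struct, for illustration only).
struct LabelledVector {
    std::vector<double> features;
    std::string label;
};

// Plain k-nearest-neighbours: find the k training vectors closest to the
// query (Euclidean distance) and return the majority label among them.
// Assumes a non-empty training set with vectors of equal length.
std::string knnPredict(const std::vector<LabelledVector> &train,
                       const std::vector<double> &query, std::size_t k)
{
    std::vector<std::pair<double, const LabelledVector *>> scored;
    for (const auto &example : train) {
        double squared = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i)
            squared += (example.features[i] - query[i]) * (example.features[i] - query[i]);
        scored.emplace_back(std::sqrt(squared), &example);
    }
    const std::size_t kept = std::min<std::size_t>(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + kept, scored.end());
    std::map<std::string, int> votes;
    for (std::size_t i = 0; i < kept; ++i)
        ++votes[scored[i].second->label];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto &a, const auto &b) { return a.second < b.second; })
        ->first;
}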

So, now our command line should look like this:

./MC_pipeline (other options..) -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta 

Amino Acids Grouping

This option allows us to reduce the amino acid alphabet. For example, we can map every occurrence of K or R to a single character, so the peptide is then sampled from 19 symbols instead of 20. In this tutorial, we will use no grouping; the amino acids are used as they are.
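
For illustration only (the project's actual grouping schemes such as ofer8 or diamond11 define their own tables), reducing the alphabet simply amounts to mapping each residue to its group representative before any model is trained:

#include <string>
#include <unordered_map>

// Hypothetical example grouping: merge K and R into one symbol ('K'),
// leaving every other residue unchanged, so the alphabet shrinks from 20 to 19.
std::string applyGrouping(const std::string &sequence)
{
    static const std::unordered_map<char, char> groupOf = {{'R', 'K'}};
    std::string reduced = sequence;
    for (char &aa : reduced) {
        const auto it = groupOf.find(aa);
        if (it != groupOf.end())
            aa = it->second;
    }
    return reduced;
}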

The command line should become:

./MC_pipeline (other options..) --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta 

Cross-Validation Settings

Finally, we specify the k parameter as the number of folds of our stratified k-fold cross-validation. So the final command line becomes:

./MC_pipeline -k 10 --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
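
As a final note, stratified k-fold means every fold preserves roughly the same label proportions as the full dataset. The following is a hypothetical round-robin sketch of that idea, not the pipeline's internal splitting code:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assign a fold index in [0, k) to each example so that the members of each
// label are spread as evenly as possible across the k folds (the data are
// assumed to be shuffled beforehand).
std::vector<int> stratifiedFolds(const std::vector<std::string> &labels, int k)
{
    std::map<std::string, int> counterPerLabel;   // round-robin counter per label
    std::vector<int> fold(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i)
        fold[i] = counterPerLabel[labels[i]]++ % k;
    return fold;
}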