Tutorial: Benchmarking Protein Subcellular Localization Prediction
This tutorial assumes that the binary files are installed as instructed in the README.md file. We will use the MC_pipeline tool to run cross-validation on the different configurations illustrated in this tutorial.
The help dialog of MC_pipeline lists the available options, which are explained in the following sections:
```
$ ./MC_pipeline -h
usage:
  MC_pipeline [<input>] options
where options are:
  -T, --test <test>                      test file
  -m, --model <bmc|gmc|mc|vmc|zymc>      Markov Chains model, default:bmc
  -G, --grouping <diamond11|nogrouping   grouping method, default:
  |ofer15|ofer8>                         diamond11
  -f, --fformat <deeploc_loc|deeploc_sol input file processor, default:
  |files1|psort|targetp|uniref>          deeploc_loc
  -c, --criteria <bhat|chi|cos|dot|dpd1  Similarity Criteria, default:bhat
  |dpd2|dpd3|dwcos|euclidean|gaussian
  |hell|intersection|itakura-saitu|kl
  |mahalanobis|max_intersection>
  -s, --strategy <acc|discretized|kmers  Classification Strategy, default:
  |knn_mcp|knn_mcs|knn_stack|propensity  acc
  |rf_mcp|rf_mcs|rf_mcs_sp|segmentation
  |svm_mcp|svm_mcs|svm_stack|voting>
  -o, --order <MC order>                 Specify MC of higher order o,
                                         default:3
  -k, --k-fold <k-fold>                  cross validation k-fold, default:
                                         10
  -?, -h, --help                         display usage information
```
The following repository contains important datasets that are used extensively in literature:
```
git clone https://github.com/A-Alaa/protein-localization-datasets.git
```
In this tutorial, we will benchmark different methods using a dataset generated in this work. The data consists of 13,858 protein entries categorized into 10 subcellular locations, distributed as follows:
| Location | #Sequences | Location | #Sequences |
|---|---|---|---|
| Nucleus | 4043 | Cytoplasm | 2542 |
| Extracellular | 1973 | Mitochondrion | 1510 |
| Cell membrane | 1340 | Endoplasmic reticulum | 862 |
| Plastid | 757 | Golgi apparatus | 356 |
| Lysosome/Vacuole | 321 | Peroxisome | 154 |
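As a quick sanity check on the table above, the per-location counts can be tallied and turned into fractions (a pure-Python sketch; the counts are copied from the table):

```python
# Per-location sequence counts, copied from the table above.
counts = {
    "Nucleus": 4043, "Cytoplasm": 2542,
    "Extracellular": 1973, "Mitochondrion": 1510,
    "Cell membrane": 1340, "Endoplasmic reticulum": 862,
    "Plastid": 757, "Golgi apparatus": 356,
    "Lysosome/Vacuole": 321, "Peroxisome": 154,
}

total = sum(counts.values())  # 13858 protein entries in total

# Fraction of the dataset per location, e.g. to gauge class imbalance.
fractions = {loc: n / total for loc, n in counts.items()}
```

The imbalance (Nucleus accounts for roughly 29% of entries, Peroxisome for about 1%) is one reason the cross-validation used later in this tutorial is stratified, so that each fold preserves the class proportions.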
In order to use this dataset in the pipeline, we need to specify a format processor that retrieves the sequences and their associated labels from the FASTA file, so the command should now contain the following parameters:
```
./MC_pipeline (other options..) -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
Two Markov chain models can be used:
- zymc, which implements the Markov model with the Zheng Yuan assumption when generalizing the model to higher orders; see: Prediction of protein subcellular locations using Markov chain models. For example, in a second-order Markov model, the probability P(s1 s2 s3) is decomposed under this assumption.
- mc, without the above simplification.

For a first-order Markov model, zymc and mc are equivalent.
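To make the first-order case concrete, here is a minimal, hypothetical sketch (not the project's implementation, and the function names are illustrative): transition probabilities are estimated from the training sequences of one class, and a new sequence is scored by its log-likelihood under that chain.

```python
from collections import defaultdict
from math import log

def train_first_order(sequences, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudo=1.0):
    """Estimate first-order transition probabilities P(b | a)
    from training sequences, with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        row_total = sum(counts[a].values()) + pseudo * len(alphabet)
        model[a] = {b: (counts[a][b] + pseudo) / row_total for b in alphabet}
    return model

def log_likelihood(model, seq):
    """Log-probability of seq under the chain (transition terms only)."""
    return sum(log(model[a][b]) for a, b in zip(seq, seq[1:]))
```

A sequence would then be assigned to the location whose chain gives it the highest log-likelihood, which is the idea behind maximum-propensity inference (the propensity strategy below).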
For this tutorial, we will assume first order, so choosing either of the two models makes no difference. Our command line now becomes:
```
./MC_pipeline (other options..) -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
Similarity/Dissimilarity metrics are used to produce the latent representation. The following table lists part of the supported metrics in the project:
| Parameter name | Interpretation |
|---|---|
| cos | Cosine similarity |
| kl | Kullback-Leibler divergence |
| euclidean | Euclidean distance |
| chi | Chi-squared distance |
| gaussian | Gaussian radial basis function |
| mahalanobis | Mahalanobis distance |
| hell | Hellinger distance |
| intersection | Intersection similarity |
For implementation details and other metrics that are not listed in the table, see: src/include/SimilarityMetrics.hpp.
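As a rough, self-contained illustration of a few of the metrics listed above (a Python sketch, not the project's C++ implementations in src/include/SimilarityMetrics.hpp):

```python
from math import sqrt, log

def cosine(p, q):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) between two distributions."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger(p, q):
    """Hellinger distance between two probability distributions."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)
```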
In this tutorial, we will use the cos similarity function, which gives consistently stable performance.
Our command line now becomes:
```
./MC_pipeline (other options..) -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
The classification strategy is selected with the -s option. The following table lists part of the supported strategies:
| Parameter name | Interpretation |
|---|---|
| propensity | Traditional inference of Markov chains using maximum propensity. |
| svm_mcs | Classification using an ensemble of binary Support Vector Machines; each sequence is represented by its latent vector as a feature vector. |
| rf_mcs | Classification using an ensemble of binary Random Forests; each sequence is represented by its latent vector as a feature vector. |
| knn_mcs | Classification using K-nearest neighbors; each sequence is represented by its latent vector as a feature vector. |
For this tutorial, we can specify a set of classifiers in a single run and benchmark their performance.
For example, we can pass [propensity] alone if we are interested only in maximum-propensity inference, or [propensity,rf_mcs] if we are interested in both the propensity method and classification with the random forest method described in the table.
So, now our command line should look like this:
```
./MC_pipeline (other options..) -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
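To illustrate the knn_mcs-style idea from the table, here is a hypothetical sketch (the function name and distance choice are assumptions, assuming each sequence is already mapped to a latent feature vector): a query is labeled by majority vote among its k nearest training vectors.

```python
from collections import Counter
from math import sqrt

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote of its k nearest neighbors,
    using Euclidean distance on the latent representations."""
    dists = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(vec, query))), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```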
This option allows us to reduce the amino acid alphabet. For example, we can reduce every instance of K or R to a single character, so that peptides are sampled from 19 symbols instead of 20. In this tutorial, we will apply no grouping; amino acids are used as they are.
The command line should become:
```
./MC_pipeline (other options..) --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
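The grouping idea described above can be sketched as follows (a hypothetical illustration using the K/R example, not the tool's grouping schemes such as diamond11):

```python
# A grouping is just a reduction map that collapses selected residues
# into one symbol; here R is mapped onto K, shrinking the alphabet to 19.
REDUCE_KR = {"R": "K"}

def reduce_sequence(seq, mapping=REDUCE_KR):
    """Apply a grouping map to a sequence; unmapped residues pass through."""
    return "".join(mapping.get(aa, aa) for aa in seq)
```

With nogrouping, the map is empty and every residue passes through unchanged.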
Finally, we specify the k parameter as the number of folds of our stratified k-fold cross-validation.
So the final command line becomes:
```
./MC_pipeline -k 10 --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
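The stratified fold assignment mentioned above can be sketched as follows (a hypothetical illustration, not the tool's implementation): samples of each class are dealt round-robin across the k folds, so every fold keeps roughly the same class proportions.

```python
from collections import defaultdict

def stratified_fold_ids(labels, k=10):
    """Assign each sample a fold id in 0..k-1, spreading every class
    evenly across folds via round-robin within each class."""
    per_class = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(per_class[label] % k)
        per_class[label] += 1
    return folds
```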