Tutorial: Benchmarking Protein Subcellular Localization Prediction
This tutorial assumes that the binary files are installed as instructed in the README.md file. We will use the MC_pipeline tool to run cross-validation on the different configurations illustrated in this tutorial.
The help dialog of MC_pipeline lists the available options, which are explained in the following sections:
```
$ ./MC_pipeline -h
usage:
  MC_pipeline [<input>] options
where options are:
  -T, --test <test>                      test file
  -m, --model <bmc|gmc|mc|vmc|zymc>      Markov Chains model, default:bmc
  -G, --grouping <diamond11|nogrouping   grouping method, default:
  |ofer15|ofer8>                         diamond11
  -f, --fformat <deeploc_loc|deeploc_sol input file processor, default:
  |files1|psort|targetp|uniref>          deeploc_loc
  -c, --criteria <bhat|chi|cos|dot|dpd1  Similarity Criteria, default:bhat
  |dpd2|dpd3|dwcos|euclidean|gaussian
  |hell|intersection|itakura-saitu|kl
  |mahalanobis|max_intersection>
  -s, --strategy <acc|discretized|kmers  Classification Strategy, default:
  |knn_mcp|knn_mcs|knn_stack|propensity  acc
  |rf_mcp|rf_mcs|rf_mcs_sp|segmentation
  |svm_mcp|svm_mcs|svm_stack|voting>
  -o, --order <MC order>                 Specify MC of higher order o,
                                         default:3
  -k, --k-fold <k-fold>                  cross validation k-fold, default:
                                         10
  -?, -h, --help                         display usage information
```
The following repository contains important datasets that are used extensively in literature:
```
git clone https://github.com/A-Alaa/protein-localization-datasets.git
```
In this tutorial, we will benchmark different methods using a dataset generated in this work. The data consists of 13,858 protein entries categorized into 10 subcellular locations, distributed as follows:
| Location | #Sequences | Location | #Sequences |
|---|---|---|---|
| Nucleus | 4043 | Cytoplasm | 2542 |
| Extracellular | 1973 | Mitochondrion | 1510 |
| Cell membrane | 1340 | Endoplasmic reticulum | 862 |
| Plastid | 757 | Golgi apparatus | 356 |
| Lysosome/Vacuole | 321 | Peroxisome | 154 |
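As a quick sanity check on the table above, the per-location counts can be tallied and turned into fractions (a pure-Python sketch; the counts are copied from the table):

```python
# Per-location sequence counts, copied from the table above.
counts = {
    "Nucleus": 4043, "Cytoplasm": 2542,
    "Extracellular": 1973, "Mitochondrion": 1510,
    "Cell membrane": 1340, "Endoplasmic reticulum": 862,
    "Plastid": 757, "Golgi apparatus": 356,
    "Lysosome/Vacuole": 321, "Peroxisome": 154,
}

total = sum(counts.values())  # 13858 protein entries in total

# Fraction of the dataset per location, e.g. to gauge class imbalance.
fractions = {loc: n / total for loc, n in counts.items()}
```

The imbalance (Nucleus accounts for roughly 29% of entries, Peroxisome for about 1%) is one reason the cross-validation used later in this tutorial is stratified, so that each fold preserves the class proportions.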
In order to use this dataset in the pipeline, we need to specify a format processor that retrieves the sequences and their associated labels from the FASTA file, so the command should now contain the following parameters:
```
./MC_pipeline (other options..) -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
Two Markov chain models can be used:
- zymc, which implements the Markov model with the Zheng Yuan assumption when generalizing the model to higher orders; see: Prediction of protein subcellular locations using Markov chain models. For example, in a second-order Markov model, the probability P(s1 s2 s3) is decomposed under this assumption.
- mc, without the above simplification.

For a first-order Markov model, zymc and mc are equivalent.
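To make the first-order case concrete, here is a minimal, hypothetical sketch (not the project's implementation, and the function names are illustrative): transition probabilities are estimated from the training sequences of one class, and a new sequence is scored by its log-likelihood under that chain.

```python
from collections import defaultdict
from math import log

def train_first_order(sequences, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudo=1.0):
    """Estimate first-order transition probabilities P(b | a)
    from training sequences, with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        row_total = sum(counts[a].values()) + pseudo * len(alphabet)
        model[a] = {b: (counts[a][b] + pseudo) / row_total for b in alphabet}
    return model

def log_likelihood(model, seq):
    """Log-probability of seq under the chain (transition terms only)."""
    return sum(log(model[a][b]) for a, b in zip(seq, seq[1:]))
```

A sequence would then be assigned to the location whose chain gives it the highest log-likelihood, which is the idea behind maximum-propensity inference (the propensity strategy below).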
For this tutorial, we will assume first order, so choosing either of the two models makes no difference. Our command line now becomes:
```
./MC_pipeline (other options..) -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
Similarity/Dissimilarity metrics are used to produce the latent representation. The following table lists part of the supported metrics in the project:
| Parameter name | Interpretation |
|---|---|
| cos | Cosine similarity |
| kl | Kullback-Leibler divergence |
| euclidean | Euclidean distance |
| chi | Chi-squared distance |
| gaussian | Gaussian radial basis function |
| mahalanobis | Mahalanobis distance |
| hell | Hellinger distance |
| intersection | Intersection similarity |
For implementation details and other metrics that are not listed in the table, see: src/include/SimilarityMetrics.hpp.
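As a rough, self-contained illustration of a few of the metrics listed above (a Python sketch, not the project's C++ implementations in src/include/SimilarityMetrics.hpp):

```python
from math import sqrt, log

def cosine(p, q):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q)))

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) between two distributions."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger(p, q):
    """Hellinger distance between two probability distributions."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)
```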
In this tutorial, we will use the cos similarity function, which gives consistently stable performance.
Our command line now becomes:
```
./MC_pipeline (other options..) -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
The classification strategy is selected with the -s option. The following table lists part of the supported strategies:
| Parameter name | Interpretation |
|---|---|
| propensity | Traditional inference of Markov chains using maximum propensity. |
| svm_mcs | Classification using an ensemble of binary Support Vector Machines; each sequence is represented by its latent vector as a feature vector. |
| rf_mcs | Classification using an ensemble of binary Random Forests; each sequence is represented by its latent vector as a feature vector. |
| knn_mcs | Classification using K-nearest neighbors; each sequence is represented by its latent vector as a feature vector. |
For this tutorial, we can specify a set of classifiers in a single run and benchmark their performance.
For example, we can pass [propensity] alone if we are interested only in maximum-propensity inference, or [propensity,rf_mcs] if we are interested in both the propensity method and classification with the random forest method described in the table.
So, now our command line should look like this:
```
./MC_pipeline (other options..) -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
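To illustrate the knn_mcs-style idea from the table, here is a hypothetical sketch (the function name and distance choice are assumptions, assuming each sequence is already mapped to a latent feature vector): a query is labeled by majority vote among its k nearest training vectors.

```python
from collections import Counter
from math import sqrt

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote of its k nearest neighbors,
    using Euclidean distance on the latent representations."""
    dists = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(vec, query))), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```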
This option allows us to reduce the amino acid alphabet. For example, we can reduce every instance of K or R to a single character, so that peptides are sampled from 19 symbols instead of 20. In this tutorial, we will apply no grouping; amino acids are used as they are.
The command line should become:
```
./MC_pipeline (other options..) --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
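The grouping idea described above can be sketched as follows (a hypothetical illustration using the K/R example, not the tool's grouping schemes such as diamond11):

```python
# A grouping is just a reduction map that collapses selected residues
# into one symbol; here R is mapped onto K, shrinking the alphabet to 19.
REDUCE_KR = {"R": "K"}

def reduce_sequence(seq, mapping=REDUCE_KR):
    """Apply a grouping map to a sequence; unmapped residues pass through."""
    return "".join(mapping.get(aa, aa) for aa in seq)
```

With nogrouping, the map is empty and every residue passes through unchanged.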
Finally, we specify the k parameter as the number of folds of our stratified k-fold cross-validation.
So the final command line becomes:
```
./MC_pipeline -k 10 --grouping nogrouping -s [propensity,rf_mcs,knn_mcs] -c cos -m mc -o 1 -f deeploc_loc DATASET_FOLDER/deeploc/deeploc_data.fasta
```
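The stratified fold assignment mentioned above can be sketched as follows (a hypothetical illustration, not the tool's implementation): samples of each class are dealt round-robin across the k folds, so every fold keeps roughly the same class proportions.

```python
from collections import defaultdict

def stratified_fold_ids(labels, k=10):
    """Assign each sample a fold id in 0..k-1, spreading every class
    evenly across folds via round-robin within each class."""
    per_class = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(per_class[label] % k)
        per_class[label] += 1
    return folds
```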