BackusLab/Ensemble-Prediction-Pipeline-To-Identify-Neoantigens

 
 

Ensemble Prediction Pipeline to Identify Neoantigens

Overview

HLA molecules bind peptides derived from inside and outside the cell. Once a peptide binds, the HLA-peptide complex is presented on the cell surface, where it provides information to T-cells. These complexes can serve as antigens that T-cells recognize and target. Cancer cells possess unique peptides that can be presented by HLA; these are known as neoantigens. T-cells can selectively target neoantigens, but cancers often suppress the immune system. Nevertheless, there are ways to make a neoantigen elicit an immune response, so identifying neoantigens is key to cancer immunotherapies.

Identifying neoantigens solely through experimental methods is powerful but demanding. Machine learning models have therefore been developed to predict whether a peptide sequence will bind to HLA or elicit an immune response. However, these models can suffer from bias and are only modestly accurate. To reduce single-model bias and improve accuracy, I present an ensemble prediction pipeline that runs wild type (WT) and mutant (MUT) sequences through two binding affinity prediction models, netMHCpan (NMP) and Pick Pocket (PP), and two immunogenicity prediction models, PRIME and DeepImmuno (DI), to obtain a final score that signals the mutant sequence's potential to be a neoantigen.

Required File:
sequences.xlsx

Required Columns:

  • wt_peptide: The WT sequences you want predictions on
  • mut_peptide: The MUT counterparts of the WT sequences. Note that only missense mutations are supported
  • id: Numeric values used to pair up WT and MUT sequences
  • allele: If validating the model on experimentally determined neoantigens/non-neoantigens, include the allele the neoantigen/non-neoantigen was tested on. If not validating, leave this column blank
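
Before building sequences.xlsx, it can help to sanity-check the rows against the column rules above. The following is a minimal Python sketch for illustration only, not part of the repository's R scripts. The peptides are made-up placeholders, `check_row` is a hypothetical helper, and it assumes that a missense pair means WT and MUT peptides of equal length with at least one substituted residue.

```python
# Hypothetical rows mirroring the required columns of sequences.xlsx.
# The peptide strings are made-up placeholders, not real data.
rows = [
    {"id": 1, "wt_peptide": "AAAAKAAAA", "mut_peptide": "AAAAEAAAA", "allele": ""},
    {"id": 2, "wt_peptide": "GGGGLGGGG", "mut_peptide": "GGGGVGGGG", "allele": "HLA-A*02:01"},
]

REQUIRED_COLUMNS = ("wt_peptide", "mut_peptide", "id", "allele")


def check_row(row):
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    if problems:
        return problems
    wt, mut = row["wt_peptide"], row["mut_peptide"]
    # Assumption: a missense pair has equal length and differs somewhere.
    if len(wt) != len(mut):
        problems.append("WT and MUT differ in length")
    elif wt == mut:
        problems.append("WT and MUT are identical")
    if not isinstance(row["id"], (int, float)):
        problems.append("id is not numeric")
    return problems


for row in rows:
    print(row["id"], check_row(row))
```

Rows that return a non-empty list would need fixing before they go into the spreadsheet.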

Output

File Name | Description
--- | ---
sequences.fasta | A FASTA file containing all of the sequences in sequences.xlsx. This is used for the NMP, PP, and PRIME predictions
DeepImmunoInput.csv | All of the sequences in sequences.xlsx, formatted for DeepImmuno
predictionsProcessedDI.xlsx | Processed DeepImmuno predictions
predictionsProcessedPrime.xlsx | Processed PRIME predictions
predictionsProcessedPP.xlsx | Processed Pick Pocket predictions
predictionsProcessedNMP.xlsx | Processed netMHCpan predictions
combinedScores.xlsx | The ensemble score for every WT, MUT, and HLA trio
results.xlsx | For every WT and MUT pair, the allele that yielded the highest ensemble score. If validating the model, also contains the ensemble score for the allele most similar to the validated allele
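
The step from combinedScores.xlsx to results.xlsx (keeping, for each WT and MUT pair, the allele with the highest ensemble score) can be sketched as follows. This is an illustrative Python sketch, not the logic of results.R itself. The field names (`id`, `allele`, `final_score`), the score values, and the function `best_allele_per_pair` are all hypothetical.

```python
# Hypothetical combined scores: one entry per (pair id, allele) combination.
# The score values are placeholders chosen only to illustrate the selection.
combined = [
    {"id": 1, "allele": "HLA-A*02:01", "final_score": 0.25},
    {"id": 1, "allele": "HLA-B*07:02", "final_score": 0.75},
    {"id": 2, "allele": "HLA-A*02:01", "final_score": 0.5},
    {"id": 2, "allele": "HLA-B*07:02", "final_score": 0.125},
]


def best_allele_per_pair(rows):
    """Map each pair id to the row with the highest final score for that id."""
    best = {}
    for row in rows:
        current = best.get(row["id"])
        # Keep the first row seen for an id, then replace it only when a
        # strictly higher score appears (ties keep the earlier allele).
        if current is None or row["final_score"] > current["final_score"]:
            best[row["id"]] = row
    return best


for pair_id, row in sorted(best_allele_per_pair(combined).items()):
    print(pair_id, row["allele"], row["final_score"])
```

The validation case (reporting the score for the allele most similar to the validated allele) is not covered here, because the source does not say how allele similarity is measured.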

How to Run the Pipeline

Links to all models

1. Obtaining Predictions

  • Begin by creating a sequences.xlsx file with the columns described above. Run this file through sequencesToFasta.R
  • Take the output (sequences.fasta) and run it through netMHCpan, PRIME, and Pick Pocket. For general predictions, I suggest running the predictions across all of the HLA supertype representatives
  • Copy and paste the output of each of those models into its own txt file
  • For DeepImmuno, run sequences.xlsx through DeepImmunoInput.R to get the csv needed for DeepImmuno predictions (DeepImmunoInput.csv). Make sure the alleles in DeepImmunoInput.R match the ones you made predictions with for the other models
  • Run DeepImmunoInput.csv through DeepImmuno, and copy and paste the output into a txt file
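
The two input-preparation steps above (a FASTA file for netMHCpan, PRIME, and Pick Pocket, and a peptide-by-allele table for DeepImmuno) can be sketched as below. This is a Python illustration only and does not reproduce sequencesToFasta.R or DeepImmunoInput.R. The FASTA header scheme (`>id_wt`, `>id_mut`), the `peptide,allele` row layout, the allele notation, and both function names are assumptions; check each tool's documentation for the exact format it expects.

```python
from itertools import product

# Made-up placeholder peptides; only the column names come from the README.
rows = [
    {"id": 1, "wt_peptide": "AAAAKAAAA", "mut_peptide": "AAAAEAAAA"},
    {"id": 2, "wt_peptide": "GGGGLGGGG", "mut_peptide": "GGGGVGGGG"},
]

# Example alleles only. Use the same set for every model, as noted above.
alleles = ["HLA-A*02:01", "HLA-B*07:02"]


def to_fasta(rows):
    """Build FASTA text with one WT and one MUT record per row.

    The header scheme ">id_wt" / ">id_mut" is an assumption for this sketch.
    """
    lines = []
    for row in rows:
        for kind in ("wt", "mut"):
            lines.append(f">{row['id']}_{kind}")
            lines.append(row[f"{kind}_peptide"])
    return "\n".join(lines) + "\n"


def peptide_allele_rows(rows, alleles):
    """Pair every WT and MUT peptide with every allele as 'peptide,allele'."""
    peptides = [row[key] for row in rows for key in ("wt_peptide", "mut_peptide")]
    return [f"{peptide},{allele}" for peptide, allele in product(peptides, alleles)]


print(to_fasta(rows))
print("\n".join(peptide_allele_rows(rows, alleles)))
```

Generating the peptide-by-allele rows from a single allele list is one way to keep the DeepImmuno alleles in step with the alleles used for the other three models.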

2. Processing Predictions

  • Run each of your prediction txt files through its associated modelNameProcessing.R file (e.g. predictionsNMP.txt goes through NMPProcessing.R)
  • Currently, you have to edit these processed prediction files manually before moving on, by adding some columns in Excel
    • For netMHCpan and PRIME, the WT and MUT prediction ranks must be normalized. I suggest the following min-max normalization formula, shown here for the WT rank: =(MAX(rank column) - [@WTRank])/(MAX(rank column) - MIN(rank column)), where "rank column" is the column of the rank you're normalizing. Apply the same formula to the MT rank. This form maps the lowest rank in the column to 1 and the highest to 0
    • For all models, add a Model Score column. The model score formula is =(absolute reference to the cell containing the alpha weight)*([@NormalMTRank])+(absolute reference to the cell containing the beta weight)*([@NormalMTRank]-[@NormalWTRank]). For Pick Pocket and DeepImmuno, use @MTRank and @WTRank directly. @NormalMTRank and @NormalWTRank should be the names of the columns containing your normalized ranks. The weights let you decide whether to place more importance on mutants whose predicted scores are high (alpha) or on mutants whose predicted scores are much higher than their WT counterparts (beta)
  • Run combinedScores.R on the processed prediction files. You only need to run the script once for all four files
    • Open the combined scores file and add a Final Score column that contains the average of all four model scores for each WT, MT, and HLA trio
  • Run the output of combineScores.R through results.R to obtain results.xlsx
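
The spreadsheet formulas described above can be written out as plain arithmetic. The following Python sketch mirrors the min-max normalization, the weighted model score, and the averaged final score. It is an illustration, not the repository's code: the function names and the example rank values are hypothetical, and returning 0.0 when every rank in a column is equal is an assumption made here to avoid dividing by zero.

```python
def normalize_rank(value, column):
    """Inverted min-max normalization: lowest rank -> 1, highest rank -> 0."""
    highest, lowest = max(column), min(column)
    if highest == lowest:
        return 0.0  # assumption: no spread in the column, so nothing to rank
    return (highest - value) / (highest - lowest)


def model_score(mt, wt, alpha, beta):
    """alpha weights the mutant score, beta weights its margin over the WT."""
    return alpha * mt + beta * (mt - wt)


def final_score(model_scores):
    """Average of the per-model scores for one WT, MT, and HLA trio."""
    return sum(model_scores) / len(model_scores)


# Placeholder ranks for three peptides under one model (not real predictions).
wt_ranks = [4.0, 8.0, 2.0]
mt_ranks = [0.5, 6.0, 3.0]

normal_wt = [normalize_rank(r, wt_ranks) for r in wt_ranks]
normal_mt = [normalize_rank(r, mt_ranks) for r in mt_ranks]

# Model score per peptide, weighting both terms equally in this example.
scores = [model_score(mt, wt, alpha=0.5, beta=0.5)
          for mt, wt in zip(normal_mt, normal_wt)]
print(scores)
```

Each rank is normalized against its own column, as in the formula above, so the WT and MT ranks are scaled independently before the model score compares them. In the pipeline, `final_score` would receive the four model scores (netMHCpan, Pick Pocket, PRIME, DeepImmuno) for one trio.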
