HLA molecules bind to peptides found within and outside the cell. Once a peptide binds, the HLA-peptide complex is presented on the cell surface to provide information to T-cells. These complexes can serve as antigens that T-cells recognize and target. Cancer cells possess unique peptides that can be presented by HLA; these are known as neoantigens. T-cells can selectively target neoantigens, but cancers often suppress the immune system. Nevertheless, there are ways to make a neoantigen elicit an immune response, so identifying neoantigens is key to cancer immunotherapies.
Identifying neoantigens solely through experimental methods is a powerful but demanding approach. Thus, machine learning models have been developed to help predict whether a peptide sequence will bind to HLA or elicit an immune response. However, these models can suffer from bias and are only modestly accurate. To reduce single-model bias and improve accuracy, I present an ensemble prediction pipeline that runs wild type (WT) and mutant (MUT) sequences through two binding affinity prediction models (netMHCpan and Pick Pocket) and two immunogenicity prediction models (PRIME and DeepImmuno) to obtain a final score that signals the mutant sequence's potential to be a neoantigen.
Required File:
sequences.xlsx
Required Columns:
- wt_peptide: WT sequences you want predictions on
- mut_peptide: The MUT counterparts of the WT sequences. Note that only missense mutations are supported
- id: Use numeric values to pair up WT and MUT sequences
- allele: If validating the model on experimentally determined neoantigens/non-neoantigens, include the allele the neoantigen/non-neoantigen was tested on. If not validating, leave this column blank
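Before building sequences.xlsx, it can help to sanity-check the WT/MUT pairs. The sketch below is in Python rather than the repository's R, and it is not part of the pipeline. It assumes that "missense only" means each pair has equal length and differs by substitution. The function name `check_pair` and the toy peptides are hypothetical.

```python
# Column names come from the README; everything else here is illustrative.
REQUIRED_COLUMNS = ("wt_peptide", "mut_peptide", "id", "allele")


def check_pair(wt, mut):
    """Return the 0-based positions where WT and MUT differ.

    Assumption: a supported (missense) pair has equal length and differs
    only by amino acid substitution. Raises ValueError otherwise.
    """
    if len(wt) != len(mut):
        raise ValueError(f"length mismatch: {wt!r} vs {mut!r}")
    diffs = [i for i, (a, b) in enumerate(zip(wt, mut)) if a != b]
    if not diffs:
        raise ValueError(f"WT and MUT are identical: {wt!r}")
    return diffs


# Toy rows (made-up peptides, not real neoantigen data).
rows = [
    {"wt_peptide": "AAAAKAAAA", "mut_peptide": "AAAARAAAA", "id": 1, "allele": ""},
    {"wt_peptide": "GGGLGGGGG", "mut_peptide": "GGGVGGGGG", "id": 2, "allele": ""},
]

for row in rows:
    # Every required column must be present, even if allele is left blank.
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"row {row.get('id')} is missing columns: {missing}")
    positions = check_pair(row["wt_peptide"], row["mut_peptide"])
    print(row["id"], positions)
```

A pair that fails the check (for example an insertion or deletion) would need to be removed or corrected before running sequencesToFasta.R.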
| File Name | Description |
|---|---|
| sequences.fasta | A FASTA file containing all of the sequences in sequences.xlsx. This is used for the netMHCpan (NMP), Pick Pocket (PP), and PRIME predictions |
| DeepImmunoInput.csv | All of the sequences in sequences.xlsx formatted for DeepImmuno |
| predictionsProcessedDI.xlsx | Processed DeepImmuno predictions |
| predictionsProcessedPrime.xlsx | Processed PRIME predictions |
| predictionsProcessedPP.xlsx | Processed Pick Pocket predictions |
| predictionsProcessedNMP.xlsx | Processed netMHCpan predictions |
| combinedScores.xlsx | The ensemble score for every WT, MUT, and HLA trio |
| results.xlsx | For every WT and MUT pair, the allele that yielded the highest ensemble score. When validating the model, also contains the ensemble score for the allele most similar to the validated allele |
- Begin by creating a sequences.xlsx file with the columns mentioned above. Run this file through sequencesToFasta.R
- Take the output (sequences.fasta), and run it through netMHCpan, PRIME and Pick Pocket. For general predictions, I would suggest running the predictions across all of the HLA supertype representatives
- Copy and paste the outputs of each of those models into separate txt files
- For DeepImmuno, run sequences.xlsx through DeepImmunoInput.R to get the csv needed for DeepImmuno predictions. Make sure the alleles in DeepImmunoInput.R match the ones you made predictions with for the other models
- Run DeepImmunoInput.csv through DeepImmuno, and copy and paste the output into a txt file
- Run each of your prediction.txt files through their associated modelNameProcessing.R file (e.g. predictionsNMP.txt goes through NMPProcessing.R)
- Currently, you have to edit the processed prediction files by hand before moving on, adding the columns described in the next two steps
- For netMHCpan and PRIME, normalize the prediction ranks for both the WT and the MUT. I suggest the following min-max normalization formula, shown here for the WT rank: `=(MAX(rank column) - [@WTRank])/(MAX(rank column) - MIN(rank column))`, where "rank column" is the column of the rank you are normalizing. This maps the smallest rank in the column to 1 and the largest to 0
- For all models, add a Model Score column. The model score formula is `=(alpha)*([@NormalMTRank]) + (beta)*([@NormalMTRank] - [@NormalWTRank])`, where `alpha` and `beta` are absolute references to the cells containing the alpha and beta weights, and `@NormalMTRank` and `@NormalWTRank` are the names of the columns containing your normalized ranks. For Pick Pocket and DeepImmuno, use `@MTRank` and `@WTRank` directly. The weights let you decide whether to place more importance on mutants whose predicted scores are high (alpha) or on mutants whose predicted scores are much higher than those of their WT counterparts (beta)
- Run combinedScores.R on the processed predictions files. You only need to run the script once for all four files
- Open the combined scores file and add a Final Score column that contains the average of all four model scores for each WT, MUT, and HLA trio
- Run the output of combinedScores.R through results.R to obtain results.xlsx
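
The manual spreadsheet steps above (normalization, model score, final score, and picking the best allele) can be sketched in a few lines. This is a Python illustration of the arithmetic only, not the repository's R scripts. The function names, the weights, and the toy ranks are hypothetical, and the handling of a constant column is an assumption the README does not address.

```python
def inverted_min_max(ranks):
    """Map each rank r to (max - r) / (max - min), as in the spreadsheet formula.

    The smallest rank in the column becomes 1.0 and the largest becomes 0.0.
    """
    hi, lo = max(ranks), min(ranks)
    if hi == lo:
        # Assumption: a constant column carries no ranking information.
        return [0.0] * len(ranks)
    return [(hi - r) / (hi - lo) for r in ranks]


def model_score(mut, wt, alpha, beta):
    """alpha * MUT + beta * (MUT - WT).

    For netMHCpan and PRIME pass normalized ranks; for Pick Pocket and
    DeepImmuno the README uses the raw values.
    """
    return alpha * mut + beta * (mut - wt)


def final_score(model_scores):
    """Average of the per-model scores for one WT, MUT, and HLA trio."""
    return sum(model_scores) / len(model_scores)


# Toy ranks for one WT/MUT pair across three alleles (illustrative values only).
alleles = ["HLA-A*01:01", "HLA-A*02:01", "HLA-B*07:02"]
wt_rank = [10.0, 2.0, 6.0]
mt_rank = [0.5, 4.5, 8.5]

# Each column is normalized against its own max and min.
norm_wt = inverted_min_max(wt_rank)
norm_mt = inverted_min_max(mt_rank)

ALPHA, BETA = 0.5, 0.5  # hypothetical weights
scores = [model_score(m, w, ALPHA, BETA) for m, w in zip(norm_mt, norm_wt)]

# results.R reports the allele with the highest ensemble score; a single
# model's scores stand in for the ensemble here to keep the example short.
best_allele, best_score = max(zip(alleles, scores), key=lambda pair: pair[1])
print(best_allele, best_score)

# Final Score for one trio: the mean of its four model scores (toy values).
print(final_score([1.0, 0.5, 0.75, 0.25]))
```

Raising `ALPHA` relative to `BETA` favors mutants with a strong absolute prediction, while raising `BETA` favors mutants that improve the most over their WT counterpart.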