PREpiBind: Protein Representation-integrated Epitope-MHC Class II Binding Prediction

PREpiBind: Protein Language Model-based MHC Class II Epitope Binding Prediction. Hyunyoo Jang et al., bioRxiv (2025). Paper (coming soon) · HuggingFace Models

PREpiBind predicts MHC class II–peptide binding by leveraging pre-trained protein language model (PLM) representations. It encodes epitope sequences on the fly with ESMC 300M, uses pre-computed HLA embeddings for the alpha and beta chains, and feeds both into a lightweight cross-attention architecture that produces a binding score.
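To make the term "cross-attention" concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention in which one sequence attends over another. It is an illustration of the general technique only: the real PREpiBind layers (learned projections, number of heads, pooling, and which side supplies the queries) are not described in this README. The toy vectors, and the choice to let epitope positions query HLA positions, are assumptions.

```python
import math


def softmax(xs):
    """Softmax with max-subtraction for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Each query row attends over all key rows and returns a weighted
    average of the value rows.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy embeddings (made up): 2 epitope positions attend over 3 HLA positions.
epitope = [[1.0, 0.0], [0.0, 1.0]]
hla = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = cross_attention(epitope, hla, hla)
print(len(mixed), len(mixed[0]))
```

The output has one row per query position, so an epitope of any length yields an HLA-aware representation of the same length.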

Quick Start

Option A — Local Jupyter Notebook

git clone https://github.com/daylight-00/PREpiBind
cd PREpiBind
uv sync

Open demo/run.ipynb and run all cells. Cell 1 automatically downloads model weights from HuggingFace. Cell 2 runs inference and saves results to outputs/.

Option B — Google Colab

The Colab notebook provides an interactive widget interface for entering MHC II allele pairs and epitope sequences. No local setup required.

Option C — CLI

1. Install dependencies

git clone https://github.com/daylight-00/PREpiBind
cd PREpiBind
uv sync

2. Download model weights

The ESMC backbone is always required. Of the PREpiBind checkpoints, download only the one(s) you intend to use; the default config, config_demo.py, uses the qualitative model. See Available Models for the full list.

from huggingface_hub import hf_hub_download

# Required: ESMC backbone
hf_hub_download(repo_id="daylight00/esmc-300m-2024-12",   filename="esmc_300m_2024_12_v0_fp16.pth",                  local_dir="models")

# PREpiBind checkpoint — download the one(s) you need
hf_hub_download(repo_id="daylight00/prepibind-esmc-300m", filename="prepi_esmc_small_e5_s128_f4_fp16.pth",           local_dir="models")  # qualitative (default)
hf_hub_download(repo_id="daylight00/prepibind-esmc-300m", filename="prepi_esmc_small_ms_e5_s100_f0_fp16.pth",        local_dir="models")  # mass spectrometry
hf_hub_download(repo_id="daylight00/prepibind-esmc-300m", filename="prepi_esmc_small_ic50_500_e5_s128_f4_fp16.pth",  local_dir="models")  # IC50 < 500 nM
hf_hub_download(repo_id="daylight00/prepibind-esmc-300m", filename="prepi_esmc_small_ic50_1000_e5_s128_f1_fp16.pth", local_dir="models")  # IC50 < 1000 nM

3. Run inference

cd demo
python ../code/inference.py config_demo.py --plot

Pass a different config file to select the model (see Available Models):

python ../code/inference.py config_ms.py --plot

Results are saved to demo/outputs/prediction.csv. Use --help for all options:

python ../code/inference.py --help

positional arguments:
  config_path           Path to the config.py file.

options:
  --batch_size          Batch size (default: 512)
  --test_path           Path to input data CSV
  --hla_path            Path to HLA mapping CSV
  --hla_emb_path        Path to HLA embedding HDF5 file
  --chkp_path           Path to PREpiBind model checkpoint
  --esm_chkp_path       Path to ESMC model checkpoint
  --out_path            Path to save output files
  --num_workers         Number of DataLoader workers
  --use_compile         Enable torch.compile
  --plot                Save KDE plot of prediction scores
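When inference is scripted from Python rather than typed at the shell, the options above can be assembled into an argument list. The following is a minimal sketch, assuming the flags behave as the help text suggests. `build_inference_command` is a hypothetical helper, not part of PREpiBind, and treating `--use_compile` as a value-less switch like `--plot` is an assumption.

```python
import sys

# Flag names copied from the --help listing above.
# Which flags are value-less switches is an assumption.
VALUE_FLAGS = {"batch_size", "test_path", "hla_path", "hla_emb_path",
               "chkp_path", "esm_chkp_path", "out_path", "num_workers"}
SWITCH_FLAGS = {"use_compile", "plot"}


def build_inference_command(config_path, script="../code/inference.py", **options):
    """Return an argv list for inference.py (hypothetical helper)."""
    cmd = [sys.executable, script, config_path]
    for name, value in options.items():
        if name in SWITCH_FLAGS:
            if value:  # switches are emitted only when truthy
                cmd.append(f"--{name}")
        elif name in VALUE_FLAGS:
            cmd += [f"--{name}", str(value)]
        else:
            raise ValueError(f"unknown option: {name}")
    return cmd


cmd = build_inference_command("config_ms.py", batch_size=256, plot=True)
print(" ".join(cmd[1:]))
# To execute from the demo/ directory: subprocess.run(cmd, check=True)
```

Rejecting unknown option names early keeps typos from being silently dropped before the run starts.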

Available Models

PREpiBind provides four models trained on different measurement types. Pass the corresponding config file to the CLI or set config_path in the notebook.

Config file	Description	Checkpoint
`config_demo.py` (default)	Trained on qualitative binding assay data	`prepi_esmc_small_e5_s128_f4_fp16.pth`
`config_ms.py`	Trained on mass spectrometry eluted ligand data	`prepi_esmc_small_ms_e5_s100_f0_fp16.pth`
`config_ic50_500.py`	Trained on IC50 data, binder threshold < 500 nM	`prepi_esmc_small_ic50_500_e5_s128_f4_fp16.pth`
`config_ic50_1000.py`	Trained on IC50 data, binder threshold < 1000 nM	`prepi_esmc_small_ic50_1000_e5_s128_f1_fp16.pth`

Note: The Colab notebook uses the qualitative model by default for simplicity. To use a different model, edit the chkp_path in Cell 2 directly.
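Before switching models it can help to check that the matching weights are already on disk. The sketch below copies the config-to-checkpoint pairs from the table above; `missing_weights` is a hypothetical helper, and the assumption that the configs look for weights in a `models` directory is inferred from the download snippet, not confirmed by the config files themselves.

```python
from pathlib import Path

# Config file -> checkpoint filename, copied from the Available Models table.
CHECKPOINTS = {
    "config_demo.py": "prepi_esmc_small_e5_s128_f4_fp16.pth",
    "config_ms.py": "prepi_esmc_small_ms_e5_s100_f0_fp16.pth",
    "config_ic50_500.py": "prepi_esmc_small_ic50_500_e5_s128_f4_fp16.pth",
    "config_ic50_1000.py": "prepi_esmc_small_ic50_1000_e5_s128_f1_fp16.pth",
}
ESMC_BACKBONE = "esmc_300m_2024_12_v0_fp16.pth"  # always required


def missing_weights(config_name, models_dir="models"):
    """List weight files needed by `config_name` that are absent from `models_dir`."""
    needed = [ESMC_BACKBONE, CHECKPOINTS[config_name]]
    return [name for name in needed if not (Path(models_dir) / name).is_file()]


for name in missing_weights("config_ms.py"):
    print("still to download:", name)
```

An empty list means both the backbone and the selected checkpoint were found.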

Input Format

The input CSV must contain at least two columns: an MHC allele pair column and an epitope sequence column. The MHC allele pair is two allele names joined by an underscore (e.g., HLA-DRA*01:01_HLA-DRB1*01:01). Allele names must exactly match the keys in the HLA mapping file (data/mhc_mapping.csv).

MHC,Epitope
HLA-DRA*01:01_HLA-DRB1*01:01,PKYVKQNTLKLAT
HLA-DRA*01:01_HLA-DRB5*01:01,AYSAVTTLAEEMK

See demo/data/dataset_demo.csv for a full example (48,352 samples).
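A quick pre-flight check of the input file can catch formatting mistakes before a long run. This is a minimal sketch using only the standard library. `validate_rows` is a hypothetical helper; the column names `MHC` and `Epitope` are taken from the example above, and restricting epitopes to the 20 standard residues is an assumption. Pass `known_alleles` (for instance, names read from the HLA mapping file, whose layout is not documented here) to also check allele names.

```python
import csv
import io

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues (assumed alphabet)


def validate_rows(csv_text, known_alleles=None):
    """Return a list of (line_number, message) problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"MHC", "Epitope"} - set(reader.fieldnames or [])
    if missing:
        return [(1, f"missing column(s): {sorted(missing)}")]
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        alleles = (row["MHC"] or "").split("_")
        if len(alleles) != 2:
            problems.append((line_no, "MHC must be two allele names joined by '_'"))
        elif known_alleles is not None:
            for allele in alleles:
                if allele not in known_alleles:
                    problems.append((line_no, f"unknown allele: {allele}"))
        epitope = row["Epitope"] or ""
        if not epitope or set(epitope) - AMINO_ACIDS:
            problems.append((line_no, "Epitope must be a non-empty string of standard residues"))
    return problems


sample = (
    "MHC,Epitope\n"
    "HLA-DRA*01:01_HLA-DRB1*01:01,PKYVKQNTLKLAT\n"
    "HLA-DRA*01:01,AYSAVTTLAEEMK\n"  # second allele missing on purpose
)
for line_no, message in validate_rows(sample):
    print(f"line {line_no}: {message}")
```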

HLA Allele Coverage

Set	Alleles	File	Notes
Light (bundled)	134	`demo/data/emb_hla_esmc_small_light_0601_fp16.h5`	Included in repo (Colab default)
Full	7,282	`emb_hla_esmc_small_0601_fp16.h5`	~1 GB, download from HuggingFace

To use the full allele set, download the embedding file and update hla_emb_path and hla_path in your config:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="daylight00/prepibind-esmc-300m", filename="emb_hla_esmc_small_0601_fp16.h5", local_dir="data")

Requirements

Component	Minimum	Recommended
Python	3.10+	3.11+
GPU VRAM	4 GB	8 GB+
RAM	8 GB	16 GB+

CPU-only inference is supported but significantly slower for large datasets. A CUDA-capable GPU is strongly recommended.

Output

File	Description
`outputs/prediction.csv`	Input data with appended `Logits` and `Score` (sigmoid) columns
`outputs/plot.png`	KDE distribution of prediction scores (generated with `--plot`)
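Since `Score` is the sigmoid of `Logits`, the two columns carry the same ranking, and a score can be recomputed from a logit when only one is at hand. The sketch below shows that relationship and a simple binder call. The 0.5 cutoff and the helper names are assumptions for illustration, and the sample rows use made-up logits, not model output.

```python
import csv
import io
import math


def sigmoid(x):
    """Numerically stable logistic function: 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def call_binders(csv_text, threshold=0.5):
    """Yield (MHC, Epitope, score, is_binder) for each row of prediction CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = sigmoid(float(row["Logits"]))
        yield row["MHC"], row["Epitope"], score, score >= threshold


# Made-up logits for illustration only.
sample = (
    "MHC,Epitope,Logits\n"
    "HLA-DRA*01:01_HLA-DRB1*01:01,PKYVKQNTLKLAT,2.0\n"
    "HLA-DRA*01:01_HLA-DRB5*01:01,AYSAVTTLAEEMK,-2.0\n"
)
for mhc, epitope, score, is_binder in call_binders(sample):
    print(f"{epitope}\t{score:.3f}\t{'binder' if is_binder else 'non-binder'}")
```

A threshold of 0.5 on the score is equivalent to a threshold of 0 on the logit; a different cutoff may suit a given model or application better.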

Citation

If you use PREpiBind in your work, please cite:

@article{jang2025prepibind,
  title   = {PREpiBind: Protein Language Model-based MHC Class II Epitope Binding Prediction},
  author  = {Jang, Hyunyoo and others},
  journal = {bioRxiv},
  year    = {2025},
  doi     = {10.1101/XXXX}
}

License

This project is licensed under the MIT License. See LICENSE for details.
