This is the official repository of our paper Peptide design through binding interface mimicry with PepMimic. PepMimic is a deep-learning framework for generating short peptides that mimic the binding interface of known protein binders (e.g. natural receptors, antibodies, nanobodies), enabling high-affinity target binding. For targets with no existing binders (e.g. TROP2), our framework can also rely on an in silico pipeline that first generates protein binders with RFDiffusion and then uses PepMimic to mimic their interfaces to generate high-affinity peptides. We validate our method across multiple drug targets (PD-L1, CD38, HER2, BCMA, CD4, TROP2) and demonstrate both in vitro binding affinity and in vivo efficacy in tumor models. Thanks for your interest in our work!
We provide conda environment files for CUDA 11.7 (env_cu117.yaml) and CUDA 12.4 (env_cu124.yaml).
conda env create -f env_cu117.yaml # use env_cu124.yaml for CUDA 12.4
conda activate pepmimic
# install pyrosetta
pip install pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()'

FoldX is also needed for the binding energy (dG) evaluation. Please obtain FoldX 5 with an academic license and organize its files as follows under evaluation/dG/foldx5:
evaluation/dG/foldx5/
├── foldx_20251231
├── molecules
└── yasaraPlugin.zip

The suffix "20251231" denotes the last valid day of usage (2025/12/31), since FoldX only provides a 1-year license for academic usage and thus needs yearly renewal. After renewal, the path in globals.py also needs to be updated according to the new suffix.
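As an illustration, the renewal could be applied with a small helper like the sketch below; the new suffix foldx_20261231 is only a placeholder, and the exact way globals.py stores the FoldX path may differ, so please check the actual entry before editing:

# Hypothetical helper (not part of the repository): after downloading the renewed FoldX
# binary, point globals.py at the new suffix. 'foldx_20261231' is a placeholder name.
OLD_NAME, NEW_NAME = 'foldx_20251231', 'foldx_20261231'
with open('globals.py') as f:
    content = f.read()
with open('globals.py', 'w') as f:
    f.write(content.replace(OLD_NAME, NEW_NAME))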
The model weights can be downloaded at the release page.
wget https://github.com/kxz18/PepMimic/releases/download/v1.0/checkpoints.zip
unzip checkpoints.zip

Prepare an input folder with reference complexes in PDB format, an index file containing the chain ids of the target protein and the ligand, as well as a configuration file indicating parameters like peptide length and the number of generations. We have prepared an example folder under example_data/CD38:
example_data/
└── CD38
    ├── 4cmh.pdb
    ├── 5f1o.pdb
    ├── config.yaml
    └── index.txt

Here we also illustrate the meaning of each entry in config.yaml:
dataset:
  test:
    class: MimicryDataset
    ref_dir: ./example_data/CD38 # The directory of all reference complexes, either a relative path rooted at the project folder or an absolute path
    n_sample_per_cplx: 20 # The number of generations for each reference complex. This is just a toy example for a quick tour. For practical usage, we recommend generating a total of above 100,000 candidates before ranking to select the top-scoring candidates for wet-lab tests. For example, here we have two reference complexes, so n_sample_per_cplx should be set to at least 50,000 so that the total number of generations reaches at least 100,000.
    length_lb: 10 # lower bound of peptide length (inclusive)
    length_ub: 12 # upper bound of peptide length (inclusive, should be <= 25 since the model is trained on peptides of at most 25 residues)

dataloader:
  num_workers: 4 # Number of CPUs for data processing. Usually 4 is enough.
  batch_size: 32 # If the GPU runs out of memory, try reducing the batch size

Each line of index.txt describes the filename (without .pdb), the target chains, the reference ligand chains, and custom annotations for a reference complex, separated by \t. The custom annotations only serve as reminders for yourself; the program will not read this content. For example, the line for 4cmh.pdb looks like:
4cmh A B,C HEAVY CHAIN OF SAR650984-FAB FRAGMENT,LIGHT CHAIN OF SAR650984-FAB FRAGMENT
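To make the format concrete, here is a minimal Python sketch (not part of the repository) that reads such an index file; it assumes the annotation column is optional, as described above:

# Minimal sketch (not part of the repository): parse a tab-separated index.txt into
# file name, target chains, reference ligand chains, and the optional annotation.
with open('example_data/CD38/index.txt') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        name = fields[0]
        target_chains = fields[1].split(',')   # e.g. ['A']
        ligand_chains = fields[2].split(',')   # e.g. ['B', 'C']
        note = fields[3] if len(fields) > 3 else ''  # only a reminder, ignored by the program
        print(name, target_chains, ligand_chains, note)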
After preparing these input files, you can run the generation with the following script:
# The last number 10 indicates we will finally select the best 10 candidates as the output
GPU=0 bash scripts/mimic.sh example_data/CD38 10

The results will be saved under example_data/CD38/final_output.
For practical usage, remember to set n_sample_per_cplx high enough that the total number of generations reaches at least 100,000, so that the model explores enough space before the final selection. Also, in our experience, the Rosetta and FoldX energies do not always rank the true binders at the top, so we recommend synthesizing the top 100-300 peptides for wet-lab validation.
We have also prepared an online version for users who prefer Google Colab. However, we still recommend the local version due to various constraints on Google Colab (e.g. running time restrictions).
Here we provide an example of training our model on the PepBench datasets, following the same setting as in our paper. The benchmark includes two datasets:
- train_valid: Training/validation sets of protein-peptide complexes with peptides of 4-25 residues, extracted from the PDB.
- ProtFrag: An augmentation dataset derived from pocket-peptide-like local contexts of monomer structures.
Taking the ProtFrag dataset as an example, each dataset has the following structure:
ProtFrag/
├── all.txt # an index file
└── pdbs    # all protein-peptide complexes as *.pdb files
    ├── xxxxxx.pdb
    ├── yyyyyy.pdb
    ├── zzzzzz.pdb
    └── ...

Each line of the index file all.txt records the information of one complex, including the file name (without extension), target chain ids, peptide chain id, and whether the peptide has non-standard amino acids (optional), separated by \t:
pdb7wtk_0_11 R L 0 # file name, target chain ids, peptide chain id, has non-standard amino acid, separated by \t. Our data processing does not read the last column, so it can be dropped when composing your own all.txt.

You can also provide custom splits by putting the information of complexes into different index files, just like train.txt and valid.txt in the train_valid dataset. Clustering files are also supported, like train.cluster and valid.cluster in the train_valid dataset, following the format of file name, cluster name, and cluster size separated by \t:
# contents of train_valid/train.cluster
A_K_pdb4wjv A_K_pdb4wjv 2
D_J_pdb4wjv A_K_pdb4wjv 2
A_B_pdb5yay A_B_pdb5yay 2
B_D_pdb5ybv A_B_pdb5yay 2
B_C_pdb2drn B_C_pdb2drn 4
D_F_pdb2hwn B_C_pdb2drn 4
B_C_pdb2izx B_C_pdb2drn 4
A_E_pdb2hwn B_C_pdb2drn 4
...

During training, if a cluster file is provided, the dataset will resample complexes according to cluster size so that every cluster is sampled with equal frequency.
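To make the resampling concrete, here is a minimal Python sketch of one way to weight complexes by inverse cluster size so that every cluster is drawn with equal probability; this is an illustration under our own assumptions, not the repository's implementation:

# Minimal sketch (assumption, not the repository's code): resample complexes so that
# each cluster is picked with equal frequency by weighting entries with 1 / cluster size.
import random
from collections import defaultdict

clusters = defaultdict(list)
with open('datasets/train_valid/train.cluster') as f:
    for line in f:
        name, cluster, _size = line.rstrip('\n').split('\t')
        clusters[cluster].append(name)

entries, weights = [], []
for members in clusters.values():
    for name in members:
        entries.append(name)
        weights.append(1.0 / len(members))  # larger clusters get smaller per-entry weights

print(random.choices(entries, weights=weights, k=8))  # a resampled mini-batch of file names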
Download and decompress the datasets from Zenodo:

mkdir datasets # all datasets will be put into this directory
# download
wget https://zenodo.org/records/13373108/files/train_valid.tar.gz?download=1 -O ./datasets/train_valid.tar.gz # training/validation
wget https://zenodo.org/records/13373108/files/ProtFrag.tar.gz?download=1 -O ./datasets/ProtFrag.tar.gz # augmentation dataset
# decompress
tar zxvf ./datasets/train_valid.tar.gz -C ./datasets
tar zxvf ./datasets/ProtFrag.tar.gz -C ./datasets

Next, we process the datasets into the data structure (mmap) recognized by our training framework:
python -m scripts.data_process.process --index ./datasets/train_valid/all.txt --out_dir ./datasets/train_valid/processed # train/validation set
python -m scripts.data_process.process --index ./datasets/ProtFrag/all.txt --out_dir ./datasets/ProtFrag/processed # augmentation dataset

For the training and validation sets, we also need to generate the split index for the mmap data, which will produce datasets/train_valid/processed/train_index.txt and datasets/train_valid/processed/valid_index.txt:
python -m scripts.data_process.split --train_index datasets/train_valid/train.txt --valid_index datasets/train_valid/valid.txt --processed_dir datasets/train_valid/processed/

We have provided a script pipelining each training stage in ./scripts/run_exp_pipe.sh, which can be used as follows:
GPU=0 bash scripts/run_exp_pipe.sh \ # set the GPU id with the environment variable GPU; in our experiments we use a 24G GPU for training
demo_pepmimic \ # <name>: name of your experiment. Results will be saved to ./exps/<name>
./configs/train_autoencoder.yaml \ # <AE config>: Config for training the all-atom variational autoencoder
./configs/pretrain_ldm.yaml \ # <LDM pretrain config>: Config for pretraining latent diffusion model on augmentation datasets
./configs/train_ldm.yaml \ # <LDM config>: Config for finetuning the latent diffusion model on high-quality datasets
./configs/train_ifencoder.yaml \ # <interface encoder config>: Config for training the latent interface encoder, which will be used as mimicry guidance
1111 # <mode>: four digits indicating whether to train the AE / pretrain the LDM / finetune the LDM / train the interface encoder. Usually it should be 1111.

In the example above, we train the all-atom autoencoder on both train_valid and ProtFrag, then pretrain the latent diffusion model on the augmentation dataset (ProtFrag) and finetune it on the high-quality dataset (train_valid). Finally, we train the latent interface encoder on both train_valid and ProtFrag. The finished checkpoint will be located at ./exps/demo_pepmimic/model.ckpt.
Thank you for your interest in our work!
Please feel free to ask any questions about the algorithms or code, or to report problems encountered while running them, so that we can make the project clearer and better. You can either create an issue in the GitHub repo or contact us at [email protected].
If you find the work useful, please cite our paper as:
@article{kong2025peptide,
title={Peptide design through binding interface mimicry with PepMimic},
author={Kong, Xiangzhe and Jiao, Rui and Lin, Haowei and Guo, Ruihan and Huang, Wenbing and Ma, Wei-Ying and Wang, Zihua and Liu, Yang and Ma, Jianzhu},
journal={Nature Biomedical Engineering},
pages={1--16},
year={2025},
doi={10.1038/s41551-025-01507-4}
}