Official repository for:
"Sequence Display: Generating Large-Scale Sequence–Activity Datasets to Advance Universal Protein Evolution."
Sequence Display is an experimental–computational platform enabling, for the first time, the large‑scale generation of protein sequence–activity datasets. By coupling these datasets with pre‑trained protein language models (pLMs), the platform reconstructs fine‑grained, variant‑level activity landscapes and accelerates discovery of high‑performance protein variants.
We demonstrate the platform by engineering Staphylococcus lugdunensis Cas9 (SlugCas9) toward broadened PAM recognition.
Create and configure the environment:
bash ./env/conda_setup.bash
Refer to: ENV_README.
Sequence Display outputs (a) mutated sequence fragments (5 NNK positions) and (b) the corresponding activity values (the average mutation number under each of four PAM contexts).
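For readers unfamiliar with the NNK scheme: N is any base and K is G or T, so each NNK position draws from 32 codons that cover all 20 amino acids plus a single stop codon (TAG). The sketch below enumerates them with the standard genetic code. It is illustrative only and is not taken from this repository's pipeline.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNK: N = any base, K = G or T in the third position.
nnk_codons = ["".join(c) for c in product(BASES, BASES, "GT")]
stop_codons = [c for c in nnk_codons if CODON_TABLE[c] == "*"]
amino_acids = sorted({CODON_TABLE[c] for c in nnk_codons} - {"*"})

print(len(nnk_codons), len(amino_acids), stop_codons)
# prints: 32 20 ['TAG']
```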
Processed data file:
./data/processed/5nnk/5nnk_nng_mut_num.csv
Format:
nnk1,nnk2,nnk3,nnk4,nnk5,count,NNGA,NNGT,NNGC,NNGG
Asn,Asn,Met,Glu,Lys,265,0.7849,0.0981,0.4415,0.9283
Asn,Gln,Leu,Ala,Glu,1725,0.7455,0.1426,0.4046,0.6857
Field description:
- Columns 1–5: Amino acids observed at the five NNK‑mutated positions (translated form).
- Column 6 (count): Number of times that 5‑tuple was observed in Sequence Display.
- Columns 7–10: Average mutation numbers under PAMs NNGA, NNGT, NNGC, NNGG.
Quality filter: Only entries with count > 100 are retained to ensure statistical reliability.
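A minimal sketch of reading this format with the Python standard library, using the two example rows shown above. The function name `load_variants` and its `min_count` parameter are illustrative and do not come from the repository. Because the processed file is already filtered, the default threshold should leave it unchanged.

```python
import csv
import io

# The header and two example rows from this README.
SAMPLE = """nnk1,nnk2,nnk3,nnk4,nnk5,count,NNGA,NNGT,NNGC,NNGG
Asn,Asn,Met,Glu,Lys,265,0.7849,0.0981,0.4415,0.9283
Asn,Gln,Leu,Ala,Glu,1725,0.7455,0.1426,0.4046,0.6857
"""

PAMS = ("NNGA", "NNGT", "NNGC", "NNGG")


def load_variants(handle, min_count=100):
    """Yield (residues, count, activity) for rows with count > min_count."""
    for row in csv.DictReader(handle):
        count = int(row["count"])
        if count <= min_count:
            continue  # same rule as the quality filter described above
        residues = tuple(row[f"nnk{i}"] for i in range(1, 6))
        activity = {pam: float(row[pam]) for pam in PAMS}
        yield residues, count, activity


variants = list(load_variants(io.StringIO(SAMPLE)))
for residues, count, activity in variants:
    print("-".join(residues), count, activity["NNGG"])
```

To read the real file, replace the `io.StringIO(SAMPLE)` argument with a handle from `open("./data/processed/5nnk/5nnk_nng_mut_num.csv", newline="")`.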
Two pLM backbones are supported: ESM-2 and SaProt.
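Both models score sequences rather than 5‑tuples, so each data row has to be mapped back onto a parent sequence before it can be embedded. The sketch below shows one way this could look. The parent string and the five positions are placeholders, not the real SlugCas9 sequence or mutation sites, and `build_variant` is a hypothetical helper rather than a function from this repository. SaProt additionally expects structure‑aware tokens, which this sketch does not cover.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def build_variant(parent, positions, residues):
    """Return `parent` with the 0-based `positions` set to the three-letter `residues`."""
    if len(positions) != len(residues):
        raise ValueError("positions and residues must have the same length")
    seq = list(parent)
    for pos, res in zip(positions, residues):
        seq[pos] = THREE_TO_ONE[res]
    return "".join(seq)


# Placeholder parent and positions (assumptions for illustration only).
PARENT = "MKTAYIAKQRQISFVKSHFSRQ"
POSITIONS = (2, 5, 9, 13, 17)

variant = build_variant(PARENT, POSITIONS, ("Asn", "Asn", "Met", "Glu", "Lys"))
print(variant)
```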
Download required pre-trained weights from:
https://drive.google.com/drive/folders/1e6dtjGo7jNfAdiSCkvkubD48l42Vkyax?usp=drive_link
Place files under: ./data/params
Resource guidance:
- Recommended: ≥ 40 GB GPU memory.
- Optional tracking: Weights & Biases (wandb) integration (configure in YAML).
Hyperparameters: ./config/config_esm2_train.yaml
Run:
python train_esm.py
Hyperparameters: ./config/config_saprot_train.yaml
Run:
python train_saprot.py
Purpose: Improve robustness and enable virtual screening over unobserved 5NNK combinations.
Procedure: 5-fold split; for each fold, train on 4 folds, evaluate on the held‑out fold.
Total models: 10 (5 ESM-2 + 5 SaProt).
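The split procedure can be sketched as follows. This is a generic shuffled 5‑fold partition in plain Python. The seed, the shuffling, and the function name `k_fold_indices` are assumptions, and the ensemble scripts may split the data differently.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, held_out_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [
            idx
            for f, fold in enumerate(folds)
            if f != held_out
            for idx in fold
        ]
        yield train, folds[held_out]


# One model per fold and per backbone gives 5 + 5 = 10 models.
splits = list(k_fold_indices(n_items=23, k=5))
for fold_id, (train, held_out) in enumerate(splits):
    print(fold_id, len(train), len(held_out))
```

Every item is held out exactly once, so each of the five models per backbone is evaluated on data it did not see during training.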
Config: ./config/config_esm2_ensemble.yaml
Run:
python train_esm_ensemble.py
Config: ./config/config_saprot_ensemble.yaml
Run:
python train_saprot_ensemble.py
After ensemble training, screen the remaining (unseen) 5NNK sequence space.
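To give a sense of scale: with 20 amino acids at each of 5 positions there are 20^5 = 3,200,000 combinations. The sketch below enumerates those not yet observed. It assumes stop codons are excluded and that all 20 residues are allowed at every position. The screening scripts may define the space differently, and `unseen_variants` is an illustrative name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def unseen_variants(observed, alphabet=AMINO_ACIDS, n_positions=5):
    """Yield every residue tuple over `alphabet` that is not in `observed`."""
    observed = set(observed)
    for combo in product(alphabet, repeat=n_positions):
        if combo not in observed:
            yield combo


# The two example rows from the data section, in one-letter code.
observed = {("N", "N", "M", "E", "K"), ("N", "Q", "L", "A", "E")}

n_total = len(AMINO_ACIDS) ** 5  # 3,200,000
first_unseen = next(unseen_variants(observed))
print(n_total, first_unseen)
```

The generator yields candidates lazily, so they can be batched for inference without holding the full space in memory.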
Config: ./config/config_esm2_vs.yaml
Run:
python esm_vs.py
Pre-tokenize to accelerate inference:
python saprot_vs_batch_conv.py
Then run inference:
python saprot_vs.py
Licensed under Apache 2.0 (see LICENSE).
If you use the code, models, or datasets, cite the Sequence Display manuscript (update with final publication details).
If you redistribute modified files, include a notice stating that they were changed.
This repository is for research use. Performance on additional proteins or mutation regimes may require retraining or adaptation.
