Official codebase for Sequence Display: a platform that integrates large-scale sequence–activity datasets with protein language models to map activity landscapes and identify high-performance protein variants.

SophieSarceau/SequenceDisplay-ML

SequenceDisplay-ML

Official repository for:
"Sequence Display: Generating Large-Scale Sequence–Activity Datasets to Advance Universal Protein Evolution."


Overview

Sequence Display is an experimental–computational platform enabling, for the first time, the large‑scale generation of protein sequence–activity datasets. By coupling these datasets with pre‑trained protein language models (pLMs), the platform reconstructs fine‑grained, variant‑level activity landscapes and accelerates discovery of high‑performance protein variants.
We demonstrate the platform by engineering Staphylococcus lugdunensis Cas9 (SlugCas9) toward broadened PAM recognition.


1. Environment Setup

1.1 Conda Environment

Create and configure the environment:

bash ./env/conda_setup.bash

1.2 Source Code Adjustments

Refer to: ENV_README.


2. Data Preparation

Sequence Display outputs (a) mutated sequence fragments covering five NNK positions and (b) the corresponding activity values, reported as the average mutation number under each of four PAM contexts.
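As background on the NNK notation: in standard degenerate-codon notation, N is any base and K is G or T. The sketch below enumerates the NNK codons and translates them with the standard genetic code, to show how large the five-position space is. It is not taken from this repository and assumes the library is translated with the standard code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for (i, a), (j, b), (k, c) in product(enumerate(BASES), repeat=3)
}


def nnk_codons():
    """Return every codon matching NNK (N = A/C/G/T, K = G/T)."""
    return [a + b + c for a, b, c in product("ACGT", "ACGT", "GT")]


codons = nnk_codons()
amino_acids = sorted({CODON_TABLE[c] for c in codons} - {"*"})
stop_codons = [c for c in codons if CODON_TABLE[c] == "*"]

print(len(codons))            # 32 codons per NNK position
print(len(amino_acids))       # 20 amino acids are reachable
print(stop_codons)            # ['TAG'] is the only stop codon
print(len(amino_acids) ** 5)  # 3200000 amino-acid 5-tuples
```

With five NNK positions, the amino-acid space therefore contains 20^5 = 3,200,000 possible 5-tuples, excluding fragments that contain the TAG stop codon.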

Processed data file:
./data/processed/5nnk/5nnk_nng_mut_num.csv

Format:

nnk1,nnk2,nnk3,nnk4,nnk5,count,NNGA,NNGT,NNGC,NNGG
Asn,Asn,Met,Glu,Lys,265,0.7849,0.0981,0.4415,0.9283
Asn,Gln,Leu,Ala,Glu,1725,0.7455,0.1426,0.4046,0.6857

Field description:

  • Columns 1–5: Amino acids observed at the five NNK‑mutated positions (translated form).
  • Column 6 (count): Number of times that 5‑tuple was observed in Sequence Display.
  • Columns 7–10: Average mutation numbers under PAMs NNGA, NNGT, NNGC, NNGG.

Quality filter: Only entries with count > 100 are retained to ensure statistical reliability.
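A minimal sketch of reading this format and applying the count threshold, using only the Python standard library. The function name `load_records` and the record layout are illustrative assumptions, not the repository's own loader. The third sample row is made up to show the filter at work.

```python
import csv
import io

POSITION_COLUMNS = ["nnk1", "nnk2", "nnk3", "nnk4", "nnk5"]
PAM_COLUMNS = ["NNGA", "NNGT", "NNGC", "NNGG"]


def load_records(handle, min_count=100):
    """Parse the CSV and keep only rows whose count exceeds min_count."""
    records = []
    for row in csv.DictReader(handle):
        count = int(row["count"])
        if count <= min_count:
            continue  # too few observations to trust the averages
        records.append({
            "variant": tuple(row[col] for col in POSITION_COLUMNS),
            "count": count,
            "activity": {pam: float(row[pam]) for pam in PAM_COLUMNS},
        })
    return records


# The two rows shown above plus one hypothetical low-count row (illustration only).
SAMPLE = """nnk1,nnk2,nnk3,nnk4,nnk5,count,NNGA,NNGT,NNGC,NNGG
Asn,Asn,Met,Glu,Lys,265,0.7849,0.0981,0.4415,0.9283
Asn,Gln,Leu,Ala,Glu,1725,0.7455,0.1426,0.4046,0.6857
Ala,Ala,Ala,Ala,Ala,12,0.0,0.0,0.0,0.0
"""

records = load_records(io.StringIO(SAMPLE))
print(len(records))                    # 2
print(records[0]["variant"])           # ('Asn', 'Asn', 'Met', 'Glu', 'Lys')
print(records[1]["activity"]["NNGG"])  # 0.6857
```

To read the real file, pass `open("./data/processed/5nnk/5nnk_nng_mut_num.csv", newline="")` in place of the `io.StringIO` object.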


3. Single-Model Training

Two pLM backbones are supported: ESM-2 and SaProt.
Download required pre-trained weights from:
https://drive.google.com/drive/folders/1e6dtjGo7jNfAdiSCkvkubD48l42Vkyax?usp=drive_link
Place files under: ./data/params

Resource guidance:

  • Recommended: ≥ 40 GB GPU memory.
  • Optional tracking: Weights & Biases (wandb) integration (configure in YAML).

3.1 ESM-2

Hyperparameters: ./config/config_esm2_train.yaml
Run:

python train_esm.py

3.2 SaProt

Hyperparameters: ./config/config_saprot_train.yaml
Run:

python train_saprot.py

4. Ensemble Training

Purpose: Improve robustness and enable virtual screening over unobserved 5NNK combinations.
Procedure: split the data into 5 folds; for each fold, train on the other 4 folds and evaluate on the held‑out fold.
Total models: 10 (5 ESM-2 + 5 SaProt).
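The 5-fold procedure can be sketched as follows. This is a generic shuffled k-fold split written with the standard library. The function name, the seed, and the way rows are assigned to folds are assumptions, and the repository's training scripts may partition the data differently.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for a given seed
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for held_out in range(k):
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, folds[held_out]


splits = list(k_fold_splits(n_items=23, k=5))
for fold_id, (train, held_out) in enumerate(splits):
    # One model per backbone would be trained on `train` and evaluated on `held_out`.
    print(fold_id, len(train), len(held_out))
```

Each data point is held out exactly once, so every model is evaluated on data it did not see during training.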

4.1 ESM-2 Ensemble

Config: ./config/config_esm2_ensemble.yaml
Run:

python train_esm_ensemble.py

4.2 SaProt Ensemble

Config: ./config/config_saprot_ensemble.yaml
Run:

python train_saprot_ensemble.py

5. Virtual Screening

After ensemble training, screen the remaining (unseen) 5NNK sequence space.
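One way to enumerate the unseen space is sketched below. It assumes each of the five positions can take any of the 20 standard amino acids and that variants are written as tuples of three-letter codes, as in the processed CSV. The function name `unseen_variants` is hypothetical, and the screening scripts may build their candidate list differently.

```python
from itertools import product

AMINO_ACIDS_3 = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]


def unseen_variants(observed, alphabet=AMINO_ACIDS_3, n_positions=5):
    """Lazily yield every variant tuple that is not in the observed set."""
    observed = set(observed)
    for variant in product(alphabet, repeat=n_positions):
        if variant not in observed:
            yield variant


# Tiny demonstration with a two-letter alphabet and two positions.
demo = list(unseen_variants({("Asn", "Gln")}, alphabet=["Asn", "Gln"], n_positions=2))
print(demo)  # [('Asn', 'Asn'), ('Gln', 'Asn'), ('Gln', 'Gln')]

# Size of the full five-position amino-acid space.
print(len(AMINO_ACIDS_3) ** 5)  # 3200000
```

The generator avoids holding all 3.2 million candidates in memory at once, so they can be scored in batches.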

5.1 ESM-2 Virtual Screening

Config: ./config/config_esm2_vs.yaml
Run:

python esm_vs.py

5.2 SaProt Virtual Screening

Pre-tokenize to accelerate inference:

python saprot_vs_batch_conv.py

Then run inference:

python saprot_vs.py
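This README does not say how the predictions of the individual models are combined. A common choice, shown here purely as an assumption, is to average the scores per variant and use the spread across models as a rough measure of disagreement. The model names and scores below are made up for illustration.

```python
from statistics import mean, pstdev


def aggregate_predictions(per_model_scores):
    """Combine {model: {variant: score}} into a ranked list of (variant, (mean, std))."""
    # Keep only variants that every model has scored.
    shared = set.intersection(*(set(scores) for scores in per_model_scores.values()))
    summary = {}
    for variant in shared:
        values = [scores[variant] for scores in per_model_scores.values()]
        summary[variant] = (mean(values), pstdev(values))
    # Highest mean predicted activity first.
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)


# Hypothetical scores from two models (illustration only).
v1 = ("Asn", "Asn", "Met", "Glu", "Lys")
v2 = ("Asn", "Gln", "Leu", "Ala", "Glu")
ranked = aggregate_predictions({
    "esm2_fold0": {v1: 0.5, v2: 0.25},
    "saprot_fold0": {v1: 0.75, v2: 0.25},
})
for variant, (avg, spread) in ranked:
    print(variant, avg, spread)
```

Variants with a high mean and a low spread are the ones the models agree on most, which makes them reasonable first candidates for experimental follow-up.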

6. License and Attribution

Licensed under Apache 2.0 (see LICENSE).
If you use the code, models, or datasets, please cite the Sequence Display manuscript (citation details to be updated on publication).
If you redistribute modified files, include a notice stating that you changed them, as the Apache 2.0 license requires.


7. Disclaimer

This repository is for research use. Performance on additional proteins or mutation regimes may require retraining or adaptation.

