Venus-ProSST

Code for ProSST: A Pre-trained Protein Sequence and Structure Transformer with Disentangled Attention. (NeurIPS 2024)

News

  • Our MSA-enhanced model VenusREM achieves a Spearman's rho of 0.518 on the ProteinGym benchmark.

1 Install

git clone https://github.com/ai4protein/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)

2 Structure quantizer

from prosst.structure.get_sst_seq import SSTPredictor
predictor = SSTPredictor(structure_vocab_size=2048) # can be 20, 128, 512, 1024, 2048, 4096
result = predictor.predict_from_pdb('example_data/p1.pdb')

Output:

[407, 998, 1841, 1421, 653, 450, 117, 822, ...]
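The quantizer emits one structure token per residue, so the token list aligns index-for-index with the amino-acid sequence. A minimal sketch of that alignment (the helper name is ours, not part of the ProSST API):

```python
def pair_residues_with_tokens(seq, sst_tokens):
    """Pair each residue with its structure token; lengths must match 1:1."""
    if len(seq) != len(sst_tokens):
        raise ValueError("sequence and structure token lengths differ")
    return list(zip(seq, sst_tokens))

# Toy example: a 4-residue fragment and four quantized structure tokens.
print(pair_residues_with_tokens("MKVL", [407, 998, 1841, 1421]))
# → [('M', 407), ('K', 998), ('V', 1841), ('L', 1421)]
```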

3 ProSST models are available on Hugging Face 🤗 Transformers

from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)

See AI4Protein/ProSST-* for more models.

4 Zero-shot mutant effect prediction

4.1 Example notebook

Zero-shot mutant effect prediction
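The core idea behind zero-shot scoring can be sketched independently of the model weights: a mutant's score is the log-odds of the mutant versus wild-type residue at the mutated position, read from the masked-LM output. A self-contained sketch on toy logits (the real ProSST forward pass additionally consumes quantized structure tokens, and the function names here are illustrative, not the repository's API):

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one position's logits."""
    m = max(row)
    z = math.log(sum(math.exp(x - m) for x in row)) + m
    return [x - z for x in row]

def mutation_log_odds(logits, pos, wt_id, mut_id):
    """Zero-shot score: log P(mutant) - log P(wild type) at the mutated position."""
    lp = log_softmax(logits[pos])
    return lp[mut_id] - lp[wt_id]

# Toy logits for a 3-residue protein over a 5-token vocabulary.
logits = [[0.1, 0.2, 0.0, 0.5, 0.1],
          [0.0, 1.5, 0.2, 0.1, 0.3],
          [0.4, 0.1, 0.9, 0.2, 0.0]]
print(mutation_log_odds(logits, pos=1, wt_id=0, mut_id=1))  # positive → mutation favored
```

In practice the logits would come from the model loaded in section 3, with the wild-type sequence (and its structure tokens) as input.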

4.2 Run the ProteinGym benchmark

Download the dataset from Google Drive (this file contains the quantized structures for ProteinGym).

The original PDB dataset is the same as that used by ProtSSN and can be downloaded from Hugging Face.

cd example_data
unzip proteingym_benchmark.zip
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048
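The benchmark ultimately reports Spearman's rho between predicted scores and experimental DMS measurements. A self-contained sketch of that metric without tie handling (the actual script may rely on scipy.stats.spearmanr instead):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Perfectly monotone predictions give rho = 1.0.
print(spearman_rho([0.1, 0.4, 0.9, 1.3], [2.0, 3.5, 7.1, 9.9]))  # → 1.0
```

Here `xs` would be the model's zero-shot scores and `ys` the assay values from ProteinGym.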

Citation

If you use ProSST in your research, please cite the following paper:

@inproceedings{li2024prosst,
    title={{ProSST}: Protein Language Modeling with Quantized Structure and Disentangled Attention},
    author={Mingchen Li and Yang Tan and Xinzhu Ma and Bozitao Zhong and Huiqun Yu and Ziyi Zhou and Wanli Ouyang and Bingxin Zhou and Pan Tan and Liang Hong},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024}
}

This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.
