# ProSST

Code for **ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention** (NeurIPS 2024).

- Our MSA-enhanced model, VenusREM, achieves a Spearman's ρ of 0.518 on the ProteinGym benchmark.
## Installation

```shell
git clone https://github.com/ai4protein/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)
```

## Structure quantization

```python
from prosst.structure.get_sst_seq import SSTPredictor

predictor = SSTPredictor(structure_vocab_size=2048)  # structure_vocab_size can be 20, 128, 512, 1024, 2048, or 4096
result = predictor.predict_from_pdb('example_data/p1.pdb')
```

Output:

```python
[407, 998, 1841, 1421, 653, 450, 117, 822, ...]
```
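To quantize many structures at once, the predictor can simply be looped over a directory of PDB files. A minimal sketch, assuming `predict_from_pdb` returns a flat list of token ids as shown above; the directory layout and the comma-separated output format here are illustrative, not something the repository prescribes:

```python
from pathlib import Path

from prosst.structure.get_sst_seq import SSTPredictor

predictor = SSTPredictor(structure_vocab_size=2048)

pdb_dir = Path("example_data")      # hypothetical input directory
out_dir = Path("structure_tokens")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

for pdb_path in sorted(pdb_dir.glob("*.pdb")):
    tokens = predictor.predict_from_pdb(str(pdb_path))
    # One file per structure; tokens stored as a single comma-separated line.
    (out_dir / f"{pdb_path.stem}.txt").write_text(",".join(map(str, tokens)))
```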
## Loading the pre-trained models

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
```

See AI4Protein/ProSST-* on Hugging Face for more models.
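The model consumes the amino-acid sequence and the quantized structure sequence as two parallel tracks. Below is a minimal sketch of a forward pass; it assumes the remote code accepts the structure track through an `ss_input_ids` keyword and that structure tokens are offset to leave room for special tokens, so check `zero_shot/proteingym_benchmark.py` in the repository for the authoritative usage:

```python
import torch

sequence = "MKTAYIAKQR"  # toy sequence
structure_tokens = [407, 998, 1841, 1421, 653, 450, 117, 822, 33, 5]  # illustrative SSTPredictor output

inputs = tokenizer(sequence, return_tensors="pt")

# ASSUMPTION: the structure track mirrors the amino-acid track, with raw token
# ids offset by +3 and wrapped in special-token positions. Verify this against
# zero_shot/proteingym_benchmark.py before relying on it.
ss_input_ids = torch.tensor([[1] + [t + 3 for t in structure_tokens] + [2]])

with torch.no_grad():
    logits = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        ss_input_ids=ss_input_ids,
    ).logits  # per-residue amino-acid logits
```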
## Zero-shot mutant effect prediction

Download the dataset from Google Drive (this archive contains the quantized structures within ProteinGym).
The original PDB dataset is the same as ProtSSN's and can be downloaded from Hugging Face.

```shell
cd example_data
unzip proteingym_benchmark.zip
```

Then run the benchmark script:

```shell
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048
```
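The benchmark script handles batching and scoring across the ProteinGym assays. For orientation, the usual wild-type-marginal recipe for scoring a single substitution looks roughly like the sketch below; the helper name and indexing convention are illustrative, and `ss_input_ids` is built as in the loading example above:

```python
import torch

def score_mutant(model, tokenizer, sequence, ss_input_ids, mutant):
    """Log-odds score for a substitution such as 'A24G' (1-based position)."""
    wt, pos, mt = mutant[0], int(mutant[1:-1]), mutant[-1]
    assert sequence[pos - 1] == wt, "mutant string disagrees with the sequence"

    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        logits = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            ss_input_ids=ss_input_ids,
        ).logits
    log_probs = torch.log_softmax(logits[0], dim=-1)

    # With a leading special token, the 1-based residue position maps to
    # index `pos` in the token sequence (assuming ESM-style tokenization).
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mt_id = tokenizer.convert_tokens_to_ids(mt)
    return (log_probs[pos, mt_id] - log_probs[pos, wt_id]).item()
```

Higher scores mean the model prefers the mutant residue at that site; for multi-mutants, single-site log-odds are commonly summed.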
## Citation

If you use ProSST in your research, please cite the following paper:

```bibtex
@inproceedings{li2024prosst,
  title={{ProSST}: Protein Language Modeling with Quantized Structure and Disentangled Attention},
  author={Mingchen Li and Yang Tan and Xinzhu Ma and Bozitao Zhong and Huiqun Yu and Ziyi Zhou and Wanli Ouyang and Bingxin Zhou and Pan Tan and Liang Hong},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```
## License

This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.
