Update (March 2024): We provide a demo of USPNet-fast, which takes raw amino acid sequences as input. A tutorial video (in Chinese) is also available.
This repository contains the code for the paper Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model, published in Nature Computational Science.
The full text of the paper can also be accessed via the view-only link.
You can use either USPNet or USPNet-fast to predict the signal peptide of a protein sequence.
First, download the repository and create the environment.
requirement
git clone https://github.com/ml4bio/USPNet.git
cd ./USPNet
conda env create -f ./environment.yml
Similarity-reduced test set table.
All the data mentioned above can also be obtained from our OSF project.
(Place the downloaded model files in the root directory, or pass a different location to the prediction scripts with the --model_dir argument.)
USPNet prediction head (without organism group information).
USPNet-fast prediction head (without organism group information).
Specialized model trained for higher accuracy on the major class (Sec/SPI). It emphasizes this class by assigning it a larger weight in the objective function.
USPNet-fast prediction head (focuses on Sec/SPI, requires organism group information).
Put all the downloaded files into the root directory.
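The Sec/SPI-focused model above is described as weighting the major class more heavily in the objective function. The sketch below illustrates that general idea with a class-weighted cross-entropy in plain Python. The label list, the weight of 2.0, and the function names are illustrative assumptions, not the values or code used to train USPNet.

```python
import math

# Hypothetical label set and weights, for illustration only. The actual
# classes and weighting used to train USPNet are not stated in this README.
CLASSES = ["NO_SP", "Sec/SPI", "Sec/SPII", "Tat/SPI"]
CLASS_WEIGHTS = {"NO_SP": 1.0, "Sec/SPI": 2.0, "Sec/SPII": 1.0, "Tat/SPI": 1.0}


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, true_class, weights=CLASS_WEIGHTS):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    probs = softmax(logits)
    index = CLASSES.index(true_class)
    return -weights[true_class] * math.log(probs[index])


# With identical predictions, a Sec/SPI sample contributes more loss than a
# sample from a class with weight 1.0, so errors on Sec/SPI cost more.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
loss_major = weighted_cross_entropy(uniform_logits, "Sec/SPI")
loss_other = weighted_cross_entropy(uniform_logits, "Tat/SPI")
print(loss_major / loss_other)
```

Raising the weight of one class trades some accuracy on the other classes for fewer errors on the weighted one, which matches the stated purpose of this checkpoint.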
If you want to use USPNet on our benchmark set, please run:
# data processing, data_processed/ folder is created by default
python data_processing.py
# Place the MSA embeddings in the data_processed/ folder
python predict.py
# To use a custom model directory:
# python predict.py --model_dir /path/to/model_directory
# categorical benchmark data
unzip test_data.zip
python test.py
# To use a custom model directory:
# python test.py --model_dir /path/to/model_directory
Demo of USPNet on benchmark data without organism group information:
python predict.py --group_info no_group_info
# Custom model directory:
# python predict.py --group_info no_group_info --model_dir /path/to/model_directory
python test.py no_group_info
# Custom model directory:
# python test.py no_group_info --model_dir /path/to/model_directory
Demo of USPNet-fast on benchmark data:
python predict_fast.py
# Custom model directory:
# python predict_fast.py --model_dir /path/to/model_directory
python test_fast.py
# Custom model directory:
# python test_fast.py --model_dir /path/to/model_directory
Demo of USPNet-fast on benchmark data without organism group information:
python predict_fast.py --group_info no_group_info
# Custom model directory:
# python predict_fast.py --group_info no_group_info --model_dir /path/to/model_directory
python test_fast.py no_group_info
# Custom model directory:
# python test_fast.py no_group_info --model_dir /path/to/model_directory
To generate MSA embeddings for your own protein sequences and use USPNet to perform signal peptide prediction, please run:
# MSA embedding generation. <data_directory_path>: Directory where the processed data will be saved. <msa_directory_path>: Directory for storing MSA files (.a3m).
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path> --msa_dir <msa_directory_path>
# Prediction. Use '--group_info no_group_info' if organism group info is unavailable.
# Use '--model_dir /path/to/model_dir' to specify a custom model directory.
python predict.py --data_dir <data_directory_path>
# Optional:
# python predict.py --data_dir <data_directory_path> --group_info no_group_info
# python predict.py --data_dir <data_directory_path> --model_dir /path/to/model_dir
If you want to use USPNet-fast to perform signal peptide prediction on your own protein sequences, please run:
# Data processing. Processed data is saved in data_processed/ by default.
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path>
# Prediction. Use '--group_info no_group_info' if organism group information is unavailable.
# Use '--model_dir /path/to/model_dir' to specify a custom model directory.
python predict_fast.py --data_dir <data_directory_path>
# Optional:
# python predict_fast.py --data_dir <data_directory_path> --group_info no_group_info
# python predict_fast.py --data_dir <data_directory_path> --model_dir /path/to/model_dir
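The USPNet-fast commands above start from a FASTA file. As a minimal sketch, the helper below writes raw amino acid sequences to a FASTA file and assembles the two commands as argument lists, using only the flags shown above. The function names, the example sequence, the file names, and the residue check are assumptions for illustration. The helper does not encode organism group information, since the expected format for it is not described here.

```python
from pathlib import Path

# The 20 standard amino acids plus X for unknown residues. Whether the
# USPNet scripts accept other symbols is not stated here, so this check
# is a conservative assumption.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def write_fasta(sequences, fasta_path):
    """Write {identifier: sequence} pairs to a FASTA file after basic checks."""
    lines = []
    for identifier, sequence in sequences.items():
        cleaned = sequence.strip().upper()
        invalid = set(cleaned) - ALLOWED_RESIDUES
        if not cleaned or invalid:
            raise ValueError(
                f"{identifier}: empty sequence or invalid residues {sorted(invalid)}"
            )
        lines.append(f">{identifier}")
        lines.append(cleaned)
    path = Path(fasta_path)
    path.write_text("\n".join(lines) + "\n")
    return path


def uspnet_fast_commands(fasta_path, data_dir, use_group_info=True, model_dir=None):
    """Build the data-processing and prediction commands as argument lists."""
    process = ["python", "data_processing.py",
               "--fasta_file", str(fasta_path),
               "--data_processed_dir", str(data_dir)]
    predict = ["python", "predict_fast.py", "--data_dir", str(data_dir)]
    if not use_group_info:
        predict += ["--group_info", "no_group_info"]
    if model_dir is not None:
        predict += ["--model_dir", str(model_dir)]
    return process, predict


if __name__ == "__main__":
    # Hypothetical example sequence and file names.
    fasta = write_fasta({"example_protein": "MKKTAIAIAVALAGFATVAQA"},
                        "my_sequences.fasta")
    for command in uspnet_fast_commands(fasta, "data_processed",
                                        use_group_info=False):
        print(" ".join(command))
```

The returned lists can be passed to `subprocess.run(command, check=True)` from the repository root to execute each step in order.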
If you find the models useful in your research, please kindly cite our paper:
@article{shen2024unbiased,
title={Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model},
author={Shen, Junbo and Yu, Qinze and Chen, Shenyang and Tan, Qingxiong and Li, Jingchen and Li, Yu},
journal={Nature Computational Science},
volume={4},
number={1},
pages={29--42},
year={2024},
publisher={Nature Publishing Group US New York}
}