This repository contains code used for utilizing machine-learning models to design enzyme sequences with improved activity. The models rely on upper confidence bound (UCB) optimization, a strategy that balances exploiting regions of the sequence-function landscape that are highly active, and exploring new regions of the landscape.
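The UCB acquisition rule itself is compact: each candidate is scored by its predicted mean plus a scaled predictive uncertainty, and the top scorers are selected. A minimal illustration (the arrays and the `beta` value below are invented for this sketch, not taken from the repository):

```python
import numpy as np

def ucb_select(mu, sigma, beta=4.0, n_select=1, exclude=()):
    """Return indices of the n_select sequences with the highest UCB score.

    mu, sigma : model-predicted means and standard deviations per sequence
    beta      : exploration weight; larger values favor uncertain sequences
    exclude   : indices of already-tested sequences to skip
    """
    score = mu + np.sqrt(beta) * sigma      # exploit (mean) + explore (uncertainty)
    score[list(exclude)] = -np.inf          # never re-select tested sequences
    return np.argsort(score)[::-1][:n_select]

mu = np.array([1.0, 0.8, 0.5])      # predicted activity
sigma = np.array([0.1, 0.5, 0.9])   # predictive uncertainty
selected = ucb_select(mu, sigma, beta=4.0, n_select=2)
```

In this toy example, sequence 2 scores highest despite its low predicted mean, because its large uncertainty makes it worth exploring.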
- Data_Analysis_and_Protabank_File_Prep
  - Contains code (as a Jupyter notebook) used to update the nomenclature of enzyme names and prepare data in a format suitable for deposition in the ProtaBank database (Relabeling_and_Protabank_Prep folder)
  - Also contains a workbook for analyzing the raw data and preparing plots for figures in the manuscript (Analysis folder)
  - The code to generate figures comparing the cross-validation results in each round is in a separate folder (UCB_Regularization_and_Cross_Val_by_round_Supplementary_Figure_6)
- GC_Analysis_Demo
  - Contains raw GC data from one representative experiment, the scripts used to parse and analyze it, and the output file
- Mutual_Information_Algorithm
  - Inputs and scripts used to design sequences in the initial test set
- UCB_Optimization_Data
  - Inputs for UCB optimization and required files for running the analysis script
  - Analysis scripts
  - Guide_For_UCB_Rounds.xlsx provides the filenames of the analysis script and the corresponding inputs for each round
- UCB_Optimization_Demo
  - `UCB_tools.py` module, which includes a variety of functions that:
    - Train and fit a Gaussian Process (GP) regression model (`UCB_tools.GPfit`)
    - Cross-validate GP models (`UCB_tools.GP_cross_val`, `UCB_tools.GP_cross_val_scan`)
    - Employ scikit-learn's Gaussian Naive Bayes (GNB) classification package to classify sequences (`UCB_tools.get_active_predictions`, `UCB_tools.gnb_fit`)
    - Evaluate performance of GNB models (`UCB_tools.gnb_cross_val`)
    - Carry out batch-mode UCB optimization to select sequences (`UCB_tools.select_new_sequences`)
  - Analysis script from one round, as well as the input file and the output files generated by the analysis script
The algorithms in these scripts have been tested on the following systems:
- macOS: High Sierra (10.13)
- macOS: Catalina (10.15.7)

Python package versions:

- numpy 1.17.2
- scikit-learn 0.21.3
- scipy 1.2.1
- matplotlib 3.1.3
- pandas 1.0.3
- seaborn 0.10.0
- The contents are available on GitHub at https://github.com/RomeroLab/ML-Guided-Acyl-ACP-Reductase-Engineering
- Downloading the scripts and installing the necessary Python dependencies should take only a few minutes.
- Package the raw chromatograph data using either `package_gc_data.py` or `package_gc_data_v2.py`. `package_gc_data.py` must be run from within an environment such as a Jupyter notebook or Spyder; `package_gc_data_v2.py` can be run directly from the command line. The input for the function is the name of the folder containing the raw text files of the chromatographs. A command such as `python package_gc_data_v2.py gc_07-23-2020` would package all of the chromatographs into a datafile called `gc_07-23-2020.p`.
- The `gc_tools` package can then be used to process the data. This can be done in a script within Spyder (as was done in `gc_07-23-2020.py`), in a Jupyter notebook, or by running the script via the terminal (`python gc_07-23-2020.py`, though this command won't produce the plots used in the course of the analysis). Briefly:
  - The pickled data file generated by the packaging function is loaded
  - Parameters needed to convert peak areas to concentrations (such as extraction volumes, retention-time boundaries, etc.) are specified
  - The peaks are retrieved (using a function like `gc_tools.get_peaks_manual`)
  - The retention-time windows are adjusted manually as needed
  - The peaks are integrated
  - Concentrations are calculated using internal standards
- This script should produce a csv file containing concentrations of various fatty alcohols, with their corresponding labels
- The script takes 30-60 seconds to run. The analysis may take longer if manual adjustment of retention times is necessary
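The integration and internal-standard steps can be illustrated on a synthetic chromatogram. The `peak_area` helper, peak positions, and window boundaries below are invented for this sketch; the repository's `gc_tools` functions handle the real data:

```python
import numpy as np

# Synthetic chromatogram: analyte and internal-standard peaks on a time axis (min)
t = np.linspace(0, 10, 2001)
signal = (1.5 * np.exp(-((t - 3.0) / 0.05) ** 2)      # analyte peak at 3.0 min
          + 2.0 * np.exp(-((t - 7.0) / 0.05) ** 2))   # internal standard at 7.0 min

def peak_area(t, y, window):
    """Trapezoid-rule integral of the signal inside a retention-time window."""
    mask = (t >= window[0]) & (t <= window[1])
    ts, ys = t[mask], y[mask]
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(ts)))

analyte_area = peak_area(t, signal, (2.8, 3.2))
standard_area = peak_area(t, signal, (6.8, 7.2))

# Concentration via the internal standard: known standard concentration scaled
# by the peak-area ratio (detector response factors omitted for simplicity)
standard_conc_mM = 1.0
analyte_conc_mM = standard_conc_mM * analyte_area / standard_area
```

Adjusting the retention-time windows by hand corresponds to changing the `window` tuples until each one brackets a single peak.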
- If multiple batches of data are available (i.e. multiple GC runs), these must be combined.
- Any replicates must be averaged. This can be done either in Python or in Microsoft Excel (or a similar program).
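Averaging replicates in Python takes only a couple of lines with pandas (the column names here are illustrative, not the repository's):

```python
import pandas as pd

# Toy table of titers with replicate measurements per sequence
df = pd.DataFrame({
    "sequence_id": ["seq1", "seq1", "seq2", "seq2", "seq2"],
    "titer_mg_L":  [10.0, 12.0, 3.0, 4.0, 5.0],
})

# Collapse replicates so each sequence appears once, with its mean titer
averaged = df.groupby("sequence_id", as_index=False)["titer_mg_L"].mean()
```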
- A block alignment file must be supplied as a pickled file. A csv block alignment is included.
- If using a structural representation, a list of contacting residues must also be included. This file should be a pickled dictionary with each entry in the format `{(resi_x, resi_y): weight}`, where `resi_x` and `resi_y` are residue numbers, and `weight` is the importance of the contact, or its abundance in an ensemble of structural models.
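A contact dictionary in this format can be created and pickled as follows (the residue numbers, weights, and filename are invented for the example):

```python
import pickle

# Hypothetical contact map: residue-pair keys mapped to contact weights,
# e.g. the fraction of structural models in which the contact appears
contact_dict = {
    (12, 45): 1.0,     # contact present in every model
    (12, 88): 0.6,
    (45, 102): 0.25,
}

with open("contacts.p", "wb") as f:
    pickle.dump(contact_dict, f)

# Reload to confirm the round trip
with open("contacts.p", "rb") as f:
    reloaded = pickle.load(f)
```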
- A representation should be selected (either structural, sequence, or block), and the appropriate function from `UCB_tools` should be selected to load that representation.
  - For example, in the demo script, the command is `UCB_tools.structure_representation('acr_block_aln.p', contact_dict, csvfile)`, where `csvfile` is the name of the file containing the processed results from GC analysis. This step generates a sequence encoding (X), transforms the titers (act) into the label vector Y (also called activity), and generates a few other intermediary variables.
- An optional distance matrix can be generated to limit the number of allowed block substitutions for a chimeric sequence (using the `calc_hamming_distance` function provided in the demo script).
- The function `UCB_tools.get_active_sequence` can be used to transform the continuous labels into a binary vector suitable for classification.
- The Gaussian Naive Bayes classifier is trained using the binary data and the sequence encoding, and then applied to predict the activity of all untested sequences.
  - `UCB_tools.get_active_predictions` is used to identify the block sequences that are predicted to be active.
  - `UCB_tools.gnb_cross_val` utilizes leave-one-out cross-validation to validate the classifier. This function also produces a ROC curve and a confusion matrix.
- Gaussian Process (GP) regression models are then used to map the encodings of active sequences to their activity (Y). Leave-one-out cross-validation is used to tune the models (`UCB_tools.GP_cross_val_scan`).
- The hyperparameter, `lambda`, is selected by either 1) maximizing the correlation coefficient, 2) minimizing the squared error of the model, or 3) selecting a point that comes close to either or both of these objectives without fully meeting them.
- New sequences are designed using `UCB_tools.select_new_sequences`.
- An optional final step, `UCB_tools.check_for_old_assemblies`, can be employed to check the designed sequences against a list of previously attempted sequences (i.e. sequences that were previously designed, but unsuccessfully assembled).
- This process will produce a list of sequences that maximize the UCB for the GP models.
- Depending on the number of data points and the encoding method used, the script can take 1-2 minutes to run.
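The classification and regularization-scan steps above can be sketched with synthetic data. This is a generic illustration, not the `UCB_tools` implementation: the encodings, threshold, and candidate `lambda` values are made up, and the doc's `lambda` is assumed to play the role of scikit-learn's `alpha` noise/regularization term in `GaussianProcessRegressor`:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                        # stand-in sequence encodings
titer = X[:, 0] + rng.normal(scale=0.2, size=30)    # stand-in continuous titers

# 1) Binarize the titers and train the activity classifier
y_binary = (titer > 0.0).astype(int)   # "active" above a chosen threshold
gnb = GaussianNB().fit(X, y_binary)
X_untested = rng.normal(size=(5, 4))
predicted_active = gnb.predict(X_untested)

# 2) Fit GP regression on the active subset only
active = y_binary == 1
X_act, y_act = X[active], titer[active]

def loo_corr(lam):
    """Leave-one-out predictions vs. truth, summarized as a correlation."""
    preds = np.empty_like(y_act)
    for train, test in LeaveOneOut().split(X_act):
        gp = GaussianProcessRegressor(alpha=lam, normalize_y=True)
        gp.fit(X_act[train], y_act[train])
        preds[test] = gp.predict(X_act[test])
    return np.corrcoef(preds, y_act)[0, 1]

# 3) Scan regularization values and keep the best-correlating one
lambdas = (1e-3, 1e-1, 1.0)
best_lam = max(lambdas, key=loo_corr)
```

Minimizing the LOO squared error instead of maximizing the correlation (option 2 above) would just swap the `key` function in the final scan.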