This project aims to use an active learning algorithm and Thermo Fisher's integrated robotics system to fully automate the optimization of qPCR parameters, such as the GC content of the primers and the amplification length, evaluated on the basis of cycle threshold (Ct) values. A lower Ct value signifies a more efficient qPCR. Active learning is appropriate in this scenario because the labels of the samples are not known a priori, and the model itself selects the experiments that would most improve its predictions. This automates the whole experimental process, including experimental design, which is traditionally a manual step.
The active learning algorithm and data analysis are written in Python. The liquid handling script is written in CyBio Scripting and operates the CyBio Felix liquid handler. qPCR is carried out on the QuantStudio 7. Plate transfer is done by the Spinnaker robotic arm, and scheduling is implemented in the Momentum software.
We use qPCR to amplify the chicken (Gallus gallus) cytochrome B sequence. The forward and reverse primers are recommended by Primer-BLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/). We select the top ten pairs suggested by Primer-BLAST, and create the experimental space by
- pairing each forward primer with each reverse primer, excluding any pairs that have a negative amplification length (i.e. no overlap between the amplified sequences from the forward versus the reverse primer) or a positive amplification length of less than 30 nucleotides, leaving us with 7 different forward primers and 10 different reverse primers to purchase;
- creating, for each valid forward and reverse primer pair, experiments with four different magnesium concentrations (1.3, 2.0, 2.7, 3.4 mM).
There are a total of 245 unique sets of experimental conditions, meaning that the experimental space contains 245 unlabeled instances before any experiments.
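The construction of the experimental space described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the primer names, positions and GC fractions are made up, and the amplification length is simplified to the difference between two hypothetical position values. Only the magnesium concentrations and the 30-nucleotide cutoff come from the text.

```python
from itertools import product

# Hypothetical primer records: (name, position on the template, GC fraction).
# All names, positions and GC values here are invented for illustration.
forward_primers = [("F1", 100, 0.50), ("F2", 130, 0.55)]
reverse_primers = [("R1", 250, 0.60), ("R2", 120, 0.50), ("R3", 90, 0.55)]

MG_CONCENTRATIONS_MM = [1.3, 2.0, 2.7, 3.4]  # from the text
MIN_LENGTH = 30  # minimum amplification length in nucleotides, from the text


def build_experimental_space(forward, reverse, mg_levels, min_length=MIN_LENGTH):
    """Return one dict per candidate experiment (one unlabeled instance each)."""
    space = []
    for (f_name, f_pos, f_gc), (r_name, r_pos, r_gc) in product(forward, reverse):
        dist = r_pos - f_pos  # simplified amplification length (assumption)
        if dist < min_length:
            continue  # drops negative lengths and lengths below the cutoff
        for mg in mg_levels:
            space.append({
                "forward": f_name,
                "reverse": r_name,
                "gc_forward": f_gc,
                "gc_reverse": r_gc,
                "dist": dist,
                "mg_mM": mg,
            })
    return space


space = build_experimental_space(forward_primers, reverse_primers, MG_CONCENTRATIONS_MM)
```

Each dict carries the four features the model is later trained on (GC content of each primer, amplification length, magnesium concentration), so the list can be turned into a feature table directly.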
The primers differ in GC content, which could affect the Ct value. Using different pairs of forward and reverse primers also results in different amplification lengths, termed dist (the distance between the forward and reverse primers) in the results section. Different magnesium concentrations can affect DNA polymerase reactivity. Therefore, an active learning model is trained on the GC content of the forward primer, the GC content of the reverse primer, the amplification length, and the magnesium concentration to predict Ct values, and after each iteration it produces the next set of instances to query (twelve queries per iteration). The active learner queries the experimental space using uncertainty sampling, relying on a Random Forest Regressor as its base learner. In effect, the learner selects the unobserved/unlabeled samples with the largest standard deviation among the predictions made by the individual estimators (i.e. trees) in the random forest. The number of trees in the random forest is 3, and the random state is set to 0 whenever a random process is involved. At iteration 0, the learner randomly selects twelve samples from the unobserved data (unlabeled instances) and uses these to train a random forest regressor.
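The query step can be sketched as below. This is a minimal, library-free illustration rather than the project's code: it assumes the per-tree predictions are already available as plain lists (with scikit-learn, for example, they could be collected by calling `predict` on each tree in a fitted forest's `estimators_`), and the use of the population standard deviation is an assumption. The function name `select_most_uncertain` is hypothetical.

```python
from statistics import pstdev


def select_most_uncertain(per_tree_predictions, n_queries=12):
    """Pick the unlabeled instances on which the trees disagree most.

    per_tree_predictions[t][i] is the Ct value predicted by tree t for
    unlabeled instance i. Returns the indices of the n_queries instances
    with the largest standard deviation across trees, most uncertain first.
    """
    n_instances = len(per_tree_predictions[0])
    spread = [
        pstdev([tree[i] for tree in per_tree_predictions])
        for i in range(n_instances)
    ]
    ranked = sorted(range(n_instances), key=lambda i: spread[i], reverse=True)
    return ranked[:n_queries]


# Toy example: 3 trees (as in the text) and 4 unlabeled instances.
toy_predictions = [
    [20.0, 25.0, 30.0, 22.0],  # tree 0
    [20.0, 27.0, 24.0, 22.5],  # tree 1
    [20.0, 26.0, 27.0, 21.5],  # tree 2
]
queries = select_most_uncertain(toy_predictions, n_queries=2)
```

With only three trees, the standard deviation is a coarse uncertainty estimate, and ties between instances are likely; here ties are broken by original index order.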
A 96-well plate is used to carry out each iteration of the experiment. Each column of the plate contains one unique set of experimental conditions, resulting in 12 sets per plate and 8 replicates per set. After the experiments are plated by the CyBio Felix, the QuantStudio 7 carries out qPCR and generates Ct values for all experiments. These Ct values are the labels for the selected experiments, which become the observed samples. The updated set of observed samples is then used to retrain the Random Forest Regressor, and the active learner produces the next round of experiments accordingly.
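The plate layout described above (one condition set per column, eight replicates down the rows) can be sketched as follows. The well-naming scheme (A1 to H12) is the usual 96-well convention. How replicate Ct values are combined into a single label is not stated here, so the averaging helper `mean_measurable_ct` is an assumption, as are both function names.

```python
ROWS = "ABCDEFGH"  # 8 rows -> 8 replicates per condition set
N_COLUMNS = 12     # 12 columns -> 12 condition sets per plate


def plate_layout(conditions):
    """Map each well name (e.g. 'A1') to the condition plated in its column."""
    if len(conditions) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} condition sets, got {len(conditions)}")
    return {
        f"{row}{col}": cond
        for col, cond in enumerate(conditions, start=1)
        for row in ROWS
    }


def mean_measurable_ct(well_ct, column):
    """Average the Ct values of one column, skipping wells without a Ct.

    ASSUMPTION: averaging the measurable replicates is one plausible way to
    form a label; it is not confirmed as the method used in this project.
    well_ct maps well names to a Ct value, or None if no Ct was measured.
    """
    values = [
        well_ct[f"{row}{column}"]
        for row in ROWS
        if well_ct.get(f"{row}{column}") is not None
    ]
    return sum(values) / len(values) if values else None


layout = plate_layout([f"condition_{i}" for i in range(12)])
```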
Given the time and resource constraints, we were able to carry out four rounds of experiments (iterations 0-3). Because we do not have ground-truth labels for all experiments, we analyzed the improvement of the model by using the experiments done in the last iteration (i.e. its observed samples) as the test dataset, and evaluated the Random Forest Regressor obtained after each of the first three iterations against it. The mean squared error (MSE) between the true and predicted Ct values of the test dataset is calculated, and the accuracy improvement is illustrated in figure 1 (see figures section).
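For reference, the MSE used in this evaluation is the average of the squared differences between true and predicted Ct values. A minimal sketch with made-up numbers (the Ct values below are placeholders, not experimental results):

```python
def mean_squared_error(true_ct, predicted_ct):
    """Mean of squared differences between true and predicted Ct values."""
    if not true_ct or len(true_ct) != len(predicted_ct):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(true_ct, predicted_ct)) / len(true_ct)


# Placeholder values for illustration only.
example_mse = mean_squared_error([20.0, 22.0], [21.0, 20.0])
```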
We then used our model trained on three iterations (iterations 0-2) to predict the Ct values of all instances, and selected the five combinations with the lowest predicted Ct values, shown in figure 2 (see figures section).
As shown in figure 1, the MSE decreases with more training instances, suggesting model improvement. It is important to note that the test dataset consists of the results from iteration 3, meaning that the experiments selected for iteration 3 are the most uncertain unlabeled instances for the model in iteration 2. This means that we are making the model predict on instances it is unsure about, hence very likely overestimating the mean squared error at iteration 2. Therefore, our model could be more accurate in general than what is shown in the graph.
If our model is accurate, then the top five conditions in figure 2 suggest that the optimal parameters of a qPCR experiment include a magnesium concentration of 2.7 mM. Of all the GC content values (of either forward or reverse primers) in the experimental space (0.5, 0.55, 0.6), the higher levels (0.55 or 0.6) are favored. Surprisingly, the optimal amplification lengths are in the high range (see figures 2 and 3). This could be an artifact of having limited labeled samples. Even in a successful run, many of our replicates failed to produce good curves with measurable Ct values. The poor experimental results could have confounded our predictions.
We mainly focused on building a more accurate model, so that if it predicts low Ct values for certain experiments, these predictions are likely to be true and the corresponding parameters are more likely to be the optimal ones. This is an exploration-only method. We could modify the selection method to balance exploration and exploitation, selecting uncertain experiments while also favoring regions of the space that are more likely to lead us to the minimum Ct value. One way of doing this is to calculate Expected Improvement, with the lowest Ct value seen thus far as the threshold. See https://papers.nips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
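A minimal sketch of Expected Improvement for a minimization problem is given below. It assumes the model's prediction for each candidate can be summarized by a mean and a standard deviation (for instance, taken across the trees of the forest) and treated as Gaussian. Both are assumptions, and this was not implemented in the project.

```python
import math


def expected_improvement(mu, sigma, best_ct):
    """Expected Improvement over best_ct when lower Ct is better.

    mu, sigma: predicted mean and standard deviation of Ct for a candidate
    (ASSUMPTION: treated as a Gaussian predictive distribution).
    best_ct: the lowest Ct value observed so far (the threshold).
    """
    if sigma <= 0:
        # No uncertainty: the improvement is known exactly.
        return max(best_ct - mu, 0.0)
    z = (best_ct - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (best_ct - mu) * cdf + sigma * pdf
```

Candidates would then be ranked by this score instead of by standard deviation alone: a candidate scores highly if its predicted Ct is low (exploitation), if its prediction is uncertain (exploration), or both.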
Had we been able to do more experiments, we could have improved the experimental design by running several sets of experiments, each initializing iteration 0 with a different random seed, and hence obtaining multiple Random Forest Regressors after each iteration. That way, instead of reporting a single MSE value, we could show the mean and standard deviation, making the results more generalizable and more robust.
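The multi-seed summary proposed here could be computed along these lines. The MSE values below are placeholders rather than results, and the function name `summarize_mse` is hypothetical.

```python
from statistics import mean, stdev


def summarize_mse(mse_by_seed):
    """Return a (mean, standard deviation) pair of the MSE for each iteration.

    mse_by_seed maps a random seed to the list of MSE values obtained after
    iteration 0, 1, 2, ... of the run initialized with that seed.
    At least two seeds are needed for the sample standard deviation.
    """
    per_iteration = zip(*mse_by_seed.values())
    return [(mean(values), stdev(values)) for values in per_iteration]


# Placeholder MSE values for three hypothetical seeds and two iterations.
summary = summarize_mse({0: [4.0, 2.0], 1: [6.0, 2.0], 2: [5.0, 2.0]})
```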
https://docs.google.com/document/d/1sP_vgnYW408Uj3hsdFtRiA0Pp10eM8J18MR6luEgzpk/edit?usp=sharing