---
id: intro-to-bo
title: Introduction to Bayesian Optimization
---

# Introduction to Bayesian Optimization

Bayesian optimization (BO) is a highly effective adaptive experimentation method
that excels at balancing exploration (learning how new parameterizations
perform) and exploitation (refining parameterizations previously observed to be
good). This method is the backbone of Ax's optimization.

BO has seen widespread use across a variety of domains. Notable examples include
its use in
[tuning the hyperparameters of AlphaGo](https://www.nature.com/articles/nature16961),
a landmark model that defeated world champions in the board game Go. In
materials science, researchers used BO to accelerate the curing process,
increase the overall strength, and reduce the CO2 emissions of
[concrete formulations](https://arxiv.org/abs/2310.18288), the most abundant
human-made material in history. In chemistry, researchers used it to
[discover 21 new, state-of-the-art molecules for tunable dye lasers](https://www.science.org/doi/10.1126/science.adk9227)
(frequently used in quantum physics research), including the world’s brightest
molecule, whereas only about a dozen such molecules had been discovered over the
preceding decades.

Ax relies on [BoTorch](https://botorch.org/) for its implementation of
state-of-the-art Bayesian optimization components.

## Bayesian Optimization

Bayesian optimization begins by building a smooth surrogate model of the
outcomes using a statistical model. The surrogate model makes predictions at
unobserved parameterizations and quantifies the uncertainty around them. The
predictions and uncertainty estimates are combined into an acquisition
function, which quantifies the value of observing a particular
parameterization. Optimizing the acquisition function yields the best candidate
parameterizations to evaluate next. The process then repeats: the newly
observed outcomes are used to refit the surrogate model, and the acquisition
function is optimized again to select the next configuration to observe. The
entire process is adaptive in the sense that the predictions and uncertainty
estimates are updated as new observations are made.

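The loop just described can be sketched end to end. Note this is a toy
illustration, not Ax's API: the "surrogate" here is a stand-in
(nearest-observation prediction, with distance to the nearest observation as a
crude uncertainty estimate) rather than a real statistical model, and the
acquisition function is a simple UCB-style score.

```python
import math

def objective(x):
    # Hypothetical black-box function to maximize (peak at x = 0.3).
    return math.exp(-((x - 0.3) ** 2) / 0.05)

def surrogate(x, xs, ys):
    """Toy surrogate: predict the value of the nearest observed point and
    use the distance to it as a crude stand-in for uncertainty."""
    dist, y = min((abs(x - xi), yi) for xi, yi in zip(xs, ys))
    return y, dist

def acquisition(x, xs, ys, beta=2.0):
    # UCB-style score: predicted value plus a bonus for uncertainty.
    mean, uncertainty = surrogate(x, xs, ys)
    return mean + beta * uncertainty

xs = [0.0, 1.0]                        # initial observations
ys = [objective(x) for x in xs]
grid = [i / 200 for i in range(201)]   # candidate parameterizations

for _ in range(20):                    # stopping rule: a fixed budget
    # Optimize the acquisition function to pick the next point...
    x_next = max(grid, key=lambda x: acquisition(x, xs, ys))
    # ...evaluate it, and "refit" the surrogate on the enlarged data set.
    xs.append(x_next)
    ys.append(objective(x_next))

best_x = xs[max(range(len(xs)), key=lambda i: ys[i])]
print(best_x)  # lands near the true optimum at x = 0.3
```

Early iterations are dominated by the uncertainty bonus (exploration); once the
space is covered, the predicted value dominates and evaluations concentrate
near the optimum (exploitation).
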
The strategy of relying on successive surrogate models to update knowledge of
the objective allows BO to strike a balance between the conflicting goals of
exploration (trying out parameterizations with high uncertainty in their
outcomes) and exploitation (converging on configurations that are likely to be
good). As a result, BO is able to find better configurations with fewer
evaluations than is generally possible with grid search or other global
optimization techniques. This makes it a good choice for applications where a
limited number of function evaluations can be made.

## Surrogate Models

Because the objective function is a black box process, we treat it as a random
function and place a prior over it. This prior captures beliefs about the
objective, and it is updated as data is observed to form the posterior.

This is typically done using a Gaussian process (GP), a probabilistic model that
defines a probability distribution over possible functions that fit a set of
points. Importantly for Bayesian optimization, GPs can be used to map points in
input space (the parameters we wish to tune) to distributions in output space
(the objectives we wish to optimize).

In the one-dimensional example below, a surrogate model is fitted to five noisy
observations using GPs to predict the objective (solid line) and place
uncertainty estimates (proportional to the width of the shaded bands) over the
entire x-axis, which represents the range of possible parameter values.
Importantly, the model is able to predict the outcome and quantify the
uncertainty of configurations that have not yet been tested. Intuitively, the
uncertainty bands are tight in regions that are well-explored and become wider
as we move away from them.

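As a rough sketch of how such a surrogate produces predictions and uncertainty
estimates, the following implements GP posterior inference with an RBF kernel
in NumPy. The five training points, the sine objective, and the kernel
lengthscale are illustrative assumptions, not Ax or BoTorch defaults.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.1):
    # Squared-exponential covariance between two sets of 1-D points.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    mean = k_ts.T @ np.linalg.solve(k_tt, y_train)
    cov = k_ss - k_ts.T @ np.linalg.solve(k_tt, k_ts)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Five noisy observations of a hypothetical objective.
rng = np.random.default_rng(0)
x_train = np.array([0.05, 0.3, 0.5, 0.7, 0.95])
y_train = np.sin(2 * np.pi * x_train) + 0.01 * rng.standard_normal(5)
x_test = np.linspace(0.0, 1.0, 101)
mean, std = gp_posterior(x_train, y_train, x_test)

# Uncertainty is small at observed points and grows away from them.
print(std[30], std[82])  # x = 0.3 (observed) vs. x = 0.82 (unobserved)
```

The posterior mean interpolates the observations while the posterior standard
deviation plays the role of the shaded bands in the figure.
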
## Acquisition Functions

The acquisition function is a mathematical function that quantifies the utility
of observing a given point in the domain. Ax supports the most commonly used
acquisition functions in BO, including:

- **Expected Improvement (EI)**, which captures the expected amount by which a
  point's observation will exceed the current best value.
- **Probability of Improvement (PI)**, which captures the probability of a point
  producing an observation better than the current best value.
- **Upper Confidence Bound (UCB)**, which sums the predicted mean and a multiple
  of the predicted standard deviation.

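Under a Gaussian posterior, all three have simple closed forms in the posterior
mean and standard deviation. A small standard-library sketch (the candidate
values are made up, and a maximization problem is assumed):

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, for its pdf and cdf

def expected_improvement(mean, std, best):
    # Analytic EI for maximization: sigma * (z * Phi(z) + phi(z)).
    z = (mean - best) / std
    return std * (z * STD_NORMAL.cdf(z) + STD_NORMAL.pdf(z))

def probability_of_improvement(mean, std, best):
    return STD_NORMAL.cdf((mean - best) / std)

def upper_confidence_bound(mean, std, beta=2.0):
    return mean + beta * std

# Two hypothetical candidates with the same predicted mean but different
# uncertainty, compared against an incumbent best value of 1.0:
best = 1.0
for mean, std in [(1.0, 0.1), (1.0, 0.5)]:
    print(expected_improvement(mean, std, best),
          probability_of_improvement(mean, std, best),
          upper_confidence_bound(mean, std))
```

With equal means, EI and UCB both favor the more uncertain candidate, while PI
scores the two identically (0.5 each); this difference is one commonly cited
reason the three lead to different exploration behavior.
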
Each of these acquisition functions will lead to different behavior during the
optimization. Expected improvement is a popular acquisition function owing to
its natural tendency to both explore regions of high uncertainty and exploit
regions known to be good, an analytic form that is easy to compute, and overall
good practical performance. As the name suggests, it rewards evaluation of the
objective $f$ based on the expected improvement relative to the current best.
If $f^* = \max_i y_i$ is the current best observed outcome and our goal is to
maximize $f$, then EI is defined as:

$$
\text{EI}(x) = \mathbb{E}\bigl[\max(f(x) - f^*, 0)\bigr]
$$

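This expectation has a closed form under the Gaussian posterior of a GP
surrogate, which is part of what makes EI cheap to compute. A quick sanity
check with the standard library, using made-up values for the posterior mean,
standard deviation, and incumbent best:

```python
import random
from statistics import NormalDist

mu, sigma, f_star = 0.8, 0.3, 1.0  # hypothetical posterior and incumbent

# Closed form for maximization: EI = sigma * (z * Phi(z) + phi(z)),
# with z = (mu - f*) / sigma.
std_normal = NormalDist()
z = (mu - f_star) / sigma
ei_analytic = sigma * (z * std_normal.cdf(z) + std_normal.pdf(z))

# Monte Carlo estimate of E[max(f(x) - f*, 0)] under the same posterior.
rng = random.Random(0)
n = 200_000
ei_mc = sum(max(rng.gauss(mu, sigma) - f_star, 0.0) for _ in range(n)) / n

print(ei_analytic, ei_mc)  # the two estimates agree closely
```

Note that EI is positive even though the posterior mean (0.8) is below the
incumbent (1.0): the upside of the uncertain outcome still carries value.
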
A visualization of the expected improvement based on the surrogate model
predictions is shown below, where the next suggestion is where the expected
improvement is at its maximum.

Once the point with the highest EI is selected and evaluated, the surrogate
model is retrained and a new suggestion is made. This process continues in a
loop until a stopping condition set by the user is reached.

Using an acquisition function like EI to sample new points initially promotes
quick exploration because its values, like the uncertainty estimates, are higher
in unexplored regions. Once the parameter space is adequately explored, EI
naturally narrows in on locations where there is a high likelihood of a good
objective value.

While the combination of a Gaussian process surrogate model and the expected
improvement acquisition function is shown above, other combinations of
surrogate models and acquisition functions can be used. Different surrogates
(either GPs with different behaviors or entirely different probabilistic
models) and different acquisition functions present various tradeoffs in
optimization performance, computational load, and more.