diff --git a/docs/assets/ei.png b/docs/assets/ei.png
new file mode 100644
index 00000000000..5a45ff74ab1
Binary files /dev/null and b/docs/assets/ei.png differ
diff --git a/docs/assets/gpei.gif b/docs/assets/gpei.gif
new file mode 100644
index 00000000000..78bc83e68f5
Binary files /dev/null and b/docs/assets/gpei.gif differ
diff --git a/docs/assets/surrogate.png b/docs/assets/surrogate.png
new file mode 100644
index 00000000000..4b255908346
Binary files /dev/null and b/docs/assets/surrogate.png differ
diff --git a/docs/intro-to-bo.md b/docs/intro-to-bo.md
new file mode 100644
index 00000000000..023a677b2a8
--- /dev/null
+++ b/docs/intro-to-bo.md
@@ -0,0 +1,129 @@
+---
+id: intro-to-bo
+title: Introduction to Bayesian Optimization
+---
+
+# Introduction to Bayesian Optimization
+
+Bayesian optimization (BO) is a highly effective adaptive experimentation
+method that excels at balancing exploration (learning how new parameterizations
+perform) and exploitation (refining parameterizations previously observed to be
+good). This method is the foundation of Ax's optimization.
+
+BO has seen widespread use across a variety of domains. Notable examples
+include its use in
+[tuning the hyperparameters of AlphaGo](https://www.nature.com/articles/nature16961),
+a landmark model that defeated world champions in the board game Go. In
+materials science, researchers used BO to accelerate the curing process,
+increase the overall strength, and reduce the CO2 emissions of
+[concrete formulations](https://arxiv.org/abs/2310.18288), the most abundant
+human-made material in history. In chemistry, researchers used it to
+[discover 21 new, state-of-the-art molecules for tunable dye lasers](https://www.science.org/doi/10.1126/science.adk9227)
+(frequently used in quantum physics research), including the world’s brightest
+molecule, while only a dozen or so had been discovered over the course of
+decades.
+
+Ax relies on [BoTorch](https://botorch.org/) for its implementation of
+state-of-the-art Bayesian optimization components.
+
+## Bayesian Optimization
+
+Bayesian optimization begins by building a smooth surrogate model of the
+outcomes using a statistical model. This surrogate model makes predictions at
+unobserved parameterizations and estimates the uncertainty around them. The
+predictions and the uncertainty estimates are combined to derive an acquisition
+function, which quantifies the value of observing a particular parameterization.
+By optimizing the acquisition function, we identify the best candidate
+parameterizations for evaluation. The process is iterative: we fit the
+surrogate model to the data observed so far, optimize the acquisition function
+to identify the next configuration to observe, evaluate that configuration, and
+then refit the surrogate model with the new observation. The entire process is
+adaptive: the predictions and uncertainty estimates are updated as new
+observations are made.
+
+The strategy of relying on successive surrogate models to update knowledge of
+the objective allows BO to strike a balance between the conflicting goals of
+exploration (trying out parameterizations with high uncertainty in their
+outcomes) and exploitation (converging on configurations that are likely to be
+good). As a result, BO is able to find better configurations with fewer
+evaluations than is generally possible with grid search or other global
+optimization techniques. Leveraging BO, as is done in Ax, is therefore
+particularly impactful for applications where the evaluation process is
+expensive and only a limited number of evaluations can be run.
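+
+To make this loop concrete, below is a minimal sketch of one way to implement
+it directly with BoTorch, the library Ax builds on. The toy objective `f`, the
+budget of 10 iterations, and the optimizer settings are illustrative choices
+rather than recommendations, and API details may vary across BoTorch versions;
+in practice, Ax manages this loop for you.
+
+```python
+import torch
+from botorch.acquisition import ExpectedImprovement
+from botorch.fit import fit_gpytorch_mll
+from botorch.models import SingleTaskGP
+from botorch.optim import optimize_acqf
+from gpytorch.mlls import ExactMarginalLogLikelihood
+
+
+def f(x):
+    # Toy objective to maximize; in practice this is an expensive experiment.
+    return torch.sin(6 * x)
+
+
+bounds = torch.tensor([[0.0], [1.0]], dtype=torch.double)
+train_x = torch.rand(5, 1, dtype=torch.double)  # initial random design
+train_y = f(train_x)
+
+for _ in range(10):
+    # Fit the surrogate model to all data observed so far.
+    gp = SingleTaskGP(train_x, train_y)
+    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))
+
+    # Optimize the acquisition function to pick the next point to evaluate.
+    acqf = ExpectedImprovement(model=gp, best_f=train_y.max())
+    candidate, _ = optimize_acqf(
+        acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=64
+    )
+
+    # Evaluate the new point and append it to the training data.
+    train_x = torch.cat([train_x, candidate])
+    train_y = torch.cat([train_y, f(candidate)])
+
+best_x = train_x[train_y.argmax()]
+```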
+
+## Surrogate Models
+
+Because the objective function is a black-box process, we treat it as a random
+function and place a prior over it. This prior captures our beliefs about the
+objective, and it is updated as data is observed to form the posterior.
+
+This is typically done using a Gaussian process (GP), a probabilistic model
+that defines a probability distribution over possible functions that fit a set
+of points. Importantly for Bayesian optimization, GPs can be used to map points
+in input space (the parameters we wish to tune) to distributions in output
+space (the objectives we wish to optimize).
+
+In the one-dimensional example below, a GP surrogate model is fit to five noisy
+observations to predict the objective (solid line) and place uncertainty
+estimates (proportional to the width of the shaded bands) over the entire
+x-axis, which represents the range of possible parameter values. Importantly,
+the model is able to predict the outcome and quantify the uncertainty of
+configurations that have not yet been tested. Intuitively, the uncertainty
+bands are tight in well-explored regions and become wider as we move away from
+them.
+
+![GP surrogate model](assets/surrogate.png)
+
+## Acquisition Functions
+
+The acquisition function is a mathematical function that quantifies the utility
+of observing a given point in the domain. Ax supports the most commonly used
+acquisition functions in BO, including:
+
+- **Expected Improvement (EI)**, which captures the expected value of a point
+  above the current best value.
+- **Probability of Improvement (PI)**, which captures the probability of a
+  point producing an observation better than the current best value.
+- **Upper Confidence Bound (UCB)**, which sums the predicted mean and a scaled
+  predicted standard deviation.
+
+Each of these acquisition functions leads to different behavior during the
+optimization. Additionally, many of them have been extended to perform well in
+constrained, noisy, multi-objective, and/or batched settings.
+
+Expected improvement is a popular acquisition function owing to its natural
+tendency to both explore regions of high uncertainty and exploit regions known
+to be good, an analytic form that is easy to compute, and overall good
+practical performance. As the name suggests, it rewards evaluation of the
+objective $f$ based on the expected improvement relative to the current best.
+If $f^* = \max_i y_i$ is the current best observed outcome and our goal is to
+maximize $f$, then EI is defined as:
+
+$$
+\text{EI}(x) = \mathbb{E}\bigl[\max(f(x) - f^*, 0)\bigr]
+$$
+
+A visualization of the expected improvement based on the surrogate model
+predictions is shown below; the next suggestion is the point at which the
+expected improvement is at its maximum.
+
+![Expected Improvement (EI) acquisition function](assets/ei.png)
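+
+For a GP surrogate, the posterior at any point $x$ is Gaussian with mean
+$\mu(x)$ and standard deviation $\sigma(x)$, and the expectation above has a
+well-known closed form:
+$\text{EI}(x) = (\mu(x) - f^*)\,\Phi(z) + \sigma(x)\,\varphi(z)$ with
+$z = (\mu(x) - f^*)/\sigma(x)$, where $\Phi$ and $\varphi$ are the standard
+normal CDF and PDF. The short sketch below evaluates this closed form; the
+means, standard deviations, and best observed value are made-up numbers for
+illustration, not outputs of a fitted model.
+
+```python
+import torch
+from torch.distributions import Normal
+
+
+def expected_improvement(mean, std, best_f):
+    # EI of a Gaussian posterior N(mean, std^2) when maximizing:
+    # (mean - best_f) * Phi(z) + std * phi(z), with z = (mean - best_f) / std.
+    normal = Normal(0.0, 1.0)
+    z = (mean - best_f) / std
+    return (mean - best_f) * normal.cdf(z) + std * torch.exp(normal.log_prob(z))
+
+
+best_f = torch.tensor(1.0)  # current best observed value
+
+# A point predicted slightly below the current best but with high uncertainty
+# still has noticeable EI, which is what drives exploration...
+print(expected_improvement(torch.tensor(0.9), torch.tensor(0.30), best_f))  # ~0.076
+
+# ...while a confidently mediocre point has essentially zero EI.
+print(expected_improvement(torch.tensor(0.9), torch.tensor(0.01), best_f))  # ~0.0
+```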
+
+Once the point with the highest EI is selected and evaluated, the surrogate
+model is retrained and a new suggestion is made. This process continues in a
+loop until a stopping condition set by the user is reached.
+
+![Full Bayesian optimization loop](assets/gpei.gif)
+
+Using an acquisition function like EI to sample new points initially promotes
+quick exploration because its values, like the uncertainty estimates, are
+higher in unexplored regions. Once the parameter space is adequately explored,
+EI naturally narrows in on locations where there is a high likelihood of a
+good objective value.
+
+While the combination of a Gaussian process surrogate model and the expected
+improvement acquisition function is shown above, other combinations of
+surrogate models and acquisition functions can be used. Different surrogates
+(GPs with different behaviors, or entirely different probabilistic models) and
+different acquisition functions present different tradeoffs in terms of
+optimization performance, computational load, and more.
diff --git a/website/sidebars.js b/website/sidebars.js
index ff93a53b95e..8d7d88454e1 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -45,7 +45,7 @@ const tutorials = () => {
 
 export default {
   docs: {
-    Introduction: ['why-ax', 'intro-to-ae'],
+    Introduction: ['why-ax', 'intro-to-ae', 'intro-to-bo'],
   },
   tutorials: tutorials(),
 };