diff --git a/docs/assets/ei.png b/docs/assets/ei.png
new file mode 100644
index 00000000000..5a45ff74ab1
Binary files /dev/null and b/docs/assets/ei.png differ
diff --git a/docs/assets/gpei.gif b/docs/assets/gpei.gif
new file mode 100644
index 00000000000..78bc83e68f5
Binary files /dev/null and b/docs/assets/gpei.gif differ
diff --git a/docs/assets/surrogate.png b/docs/assets/surrogate.png
new file mode 100644
index 00000000000..4b255908346
Binary files /dev/null and b/docs/assets/surrogate.png differ
diff --git a/docs/intro-to-bo.md b/docs/intro-to-bo.md
new file mode 100644
index 00000000000..7ccfeffb2e7
--- /dev/null
+++ b/docs/intro-to-bo.md
@@ -0,0 +1,128 @@
+---
+id: intro-to-bo
+title: Introduction to Bayesian Optimization
+---
+
+# Introduction to Bayesian Optimization
+
+Bayesian optimization (BO) is a highly effective adaptive experimentation
+method that excels at balancing exploration (learning how new
+parameterizations perform) and exploitation (refining parameterizations
+previously observed to be good). This method is the foundation of Ax's
+optimization.
+
+BO has seen widespread use across a variety of domains. Notable examples
+include its use in
+[tuning the hyperparameters of AlphaGo](https://www.nature.com/articles/nature16961),
+a landmark model that defeated world champions in the board game Go. In
+materials science, researchers used BO to accelerate the curing process,
+increase the overall strength, and reduce the CO2 emissions of
+[concrete formulations](https://arxiv.org/abs/2310.18288), the most abundant
+human-made material in history. In chemistry, researchers used it to
+[discover 21 new, state-of-the-art molecules for tunable dye lasers](https://www.science.org/doi/10.1126/science.adk9227)
+(frequently used in quantum physics research), including the world’s
+brightest molecule, whereas only a dozen or so such molecules had been
+discovered over the course of decades.
+
+Ax relies on [BoTorch](https://botorch.org/) for its implementation of
+state-of-the-art Bayesian optimization components.
+
+## Bayesian Optimization
+
+Bayesian optimization begins by building a smooth surrogate model of the
+outcomes using a statistical model. This surrogate model makes predictions at
+unobserved parameterizations and estimates the uncertainty around them. The
+predictions and the uncertainty estimates are combined to derive an
+acquisition function, which quantifies the value of observing a particular
+parameterization. By optimizing the acquisition function, we identify the
+best candidate parameterizations for evaluation. In an iterative process, we
+fit the surrogate model to the observed data, optimize the acquisition
+function to identify the best configuration to evaluate next, and then refit
+the surrogate model once the new outcomes are observed. The entire process is
+adaptive: the predictions and uncertainty estimates are updated as new
+observations are made.
+
+The strategy of relying on successive surrogate models to update knowledge of
+the objective allows BO to strike a balance between the conflicting goals of
+exploration (trying out parameterizations with high uncertainty in their
+outcomes) and exploitation (converging on configurations that are likely to
+be good). As a result, BO is able to find better configurations with fewer
+evaluations than is generally possible with grid search or other global
+optimization techniques. Therefore, leveraging BO, as is done in Ax, is
+particularly impactful for applications where the evaluation process is
+expensive and only a limited number of evaluations are possible.
+
+## Surrogate Models
+
+Because the objective function is a black-box process, we treat it as a
+random function and place a prior over it. This prior captures beliefs about
+the objective, and it is updated as data is observed to form the posterior.
+
+This is typically done using a Gaussian process (GP), a probabilistic model
+that defines a probability distribution over possible functions that fit a
+set of points. Importantly for Bayesian optimization, GPs can be used to map
+points in input space (the parameters we wish to tune) to distributions in
+output space (the objectives we wish to optimize).
+
+In the one-dimensional example below, a GP surrogate model is fit to five
+noisy observations. It predicts the objective, depicted by the solid line,
+along with uncertainty estimates, illustrated by the width of the shaded
+bands. The objective is predicted for the entire range of possible parameter
+values, corresponding to the full x-axis. Importantly, the model is able to
+predict the outcome and quantify the uncertainty of configurations that have
+not yet been tested. Intuitively, the uncertainty bands are tight in regions
+that are well-explored and become wider as we move away from them.
+
+![GP surrogate model](assets/surrogate.png)
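+
+As a rough illustration of this step, the sketch below fits a GP surrogate
+with BoTorch (the library Ax builds on) and queries the posterior mean and
+uncertainty at unobserved points. The toy objective `f`, the noise level, and
+the five random observations are assumptions made up for this example rather
+than Ax defaults or internals.
+
+```python
+import torch
+from botorch.fit import fit_gpytorch_mll
+from botorch.models import SingleTaskGP
+from gpytorch.mlls import ExactMarginalLogLikelihood
+
+
+def f(x: torch.Tensor) -> torch.Tensor:
+    """Hypothetical black-box objective, used only for illustration."""
+    return torch.sin(6 * x)
+
+
+# Five noisy observations of the objective.
+train_X = torch.rand(5, 1, dtype=torch.double)
+train_Y = f(train_X) + 0.1 * torch.randn_like(train_X)
+
+# Fit a GP surrogate to the observed data.
+model = SingleTaskGP(train_X, train_Y)
+fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
+
+# Predict the objective and its uncertainty at unobserved parameter values.
+test_X = torch.linspace(0, 1, 101, dtype=torch.double).unsqueeze(-1)
+with torch.no_grad():
+    posterior = model.posterior(test_X)
+    mean = posterior.mean.squeeze(-1)            # predicted objective (solid line)
+    std = posterior.variance.sqrt().squeeze(-1)  # uncertainty (shaded band)
+```
+
+Plotting `mean` with a band of roughly `mean ± 2 * std` over `test_X`
+produces a figure analogous to the one above.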
+
+## Acquisition Functions
+
+The acquisition function is a mathematical function that quantifies the
+utility of observing a given point in the domain. Ax supports the most
+commonly used acquisition functions in BO, including:
+
+- **Expected Improvement (EI)**, which captures the expected amount by which
+  a point will improve on the current best value.
+- **Probability of Improvement (PI)**, which captures the probability of a
+  point producing an observation better than the current best value.
+- **Upper Confidence Bound (UCB)**, which sums the predicted mean and a
+  scaled predicted standard deviation.
+
+Each of these acquisition functions will lead to different behavior during
+the optimization. Additionally, many of these acquisition functions have been
+extended to perform well in constrained, noisy, multi-objective, and/or
+batched settings.
+
+Expected Improvement is a popular acquisition function owing to its
+well-balanced exploitation vs. exploration, a straightforward analytic form,
+and overall good practical performance. As the name suggests, it rewards
+evaluation of the objective $$f$$ based on the expected improvement relative
+to the current best. If $$f^* = \max_i y_i$$ is the current best observed
+outcome and our goal is to maximize $$f$$, then EI is defined as:
+
+$$
+\text{EI}(x) = \mathbb{E}\bigl[\max(f(x) - f^*, 0)\bigr]
+$$
+
+A visualization of the expected improvement based on the surrogate model
+predictions is shown below; the next suggested point is where the expected
+improvement is at its maximum.
+
+![Expected Improvement (EI) acquisition function](assets/ei.png)
+
+Once the point with the highest EI is selected and evaluated, the surrogate
+model is retrained and a new suggestion is made. As described above, this
+process continues iteratively until a stopping condition, set by the user, is
+reached.
+
+![Full Bayesian optimization loop](assets/gpei.gif)
+
+Using an acquisition function like EI to sample new points initially promotes
+quick exploration because the expected values, informed by the uncertainty
+estimates, are higher in unexplored regions. Once the parameter space is
+adequately explored, EI naturally narrows its focus to regions where there is
+a high likelihood of a good objective value (i.e., exploitation).
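+
+To make the loop concrete, the following is a minimal, self-contained sketch
+of this iteration written directly against BoTorch, which Ax builds on. It
+reuses the same toy objective as the surrogate sketch above and uses the
+numerically stable log variant of EI; the search bounds, evaluation budget,
+and optimizer settings are illustrative assumptions rather than anything Ax
+prescribes.
+
+```python
+import torch
+from botorch.acquisition import LogExpectedImprovement
+from botorch.fit import fit_gpytorch_mll
+from botorch.models import SingleTaskGP
+from botorch.optim import optimize_acqf
+from gpytorch.mlls import ExactMarginalLogLikelihood
+
+
+def f(x: torch.Tensor) -> torch.Tensor:
+    """Hypothetical black-box objective, used only for illustration."""
+    return torch.sin(6 * x)
+
+
+bounds = torch.tensor([[0.0], [1.0]], dtype=torch.double)  # assumed search space
+train_X = torch.rand(5, 1, dtype=torch.double)             # initial observations
+train_Y = f(train_X) + 0.1 * torch.randn_like(train_X)
+
+for _ in range(10):  # illustrative budget; the stopping condition is up to the user
+    # Refit the surrogate to all data observed so far.
+    model = SingleTaskGP(train_X, train_Y)
+    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
+
+    # Build the acquisition function (log-EI) around the current best observation.
+    acqf = LogExpectedImprovement(model=model, best_f=train_Y.max())
+
+    # Optimize the acquisition function to select the next point to evaluate.
+    candidate, _ = optimize_acqf(
+        acqf, bounds=bounds, q=1, num_restarts=5, raw_samples=64
+    )
+
+    # Evaluate the objective at the suggested point and record the observation.
+    new_Y = f(candidate) + 0.1 * torch.randn_like(candidate)
+    train_X = torch.cat([train_X, candidate])
+    train_Y = torch.cat([train_Y, new_Y])
+
+best_parameterization = train_X[train_Y.argmax()]
+```
+
+In Ax, these steps are orchestrated for you, so this loop is shown only to
+illustrate what happens under the hood.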
+
+While the combination of a Gaussian process surrogate model and the expected
+improvement acquisition function is shown above, other combinations of
+surrogate models and acquisition functions can be used. Different surrogates
+(differently configured GPs or entirely different probabilistic models) and
+different acquisition functions present various tradeoffs in terms of
+optimization performance, computational load, and more.
diff --git a/website/sidebars.js b/website/sidebars.js
index ff93a53b95e..8d7d88454e1 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -45,7 +45,7 @@ const tutorials = () => {
 export default {
   docs: {
-    Introduction: ['why-ax', 'intro-to-ae'],
+    Introduction: ['why-ax', 'intro-to-ae', 'intro-to-bo'],
   },
   tutorials: tutorials(),
 };