Commit a08fd77

mpolson64 authored and facebook-github-bot committed
Intro to BO
Summary: Basic doc introducing BO concepts like surrogate models, acquisition functions, etc.

Differential Revision: D69267374
1 parent 10c35f3 commit a08fd77

File tree

5 files changed: +126 −1 lines

docs/assets/ei.png (226 KB)

docs/assets/gpei.gif (203 KB)

docs/assets/surrogate.png (181 KB)

docs/intro-to-bo.md

Lines changed: 125 additions & 0 deletions
---
id: intro-to-bo
title: Introduction to Bayesian Optimization
---

# Introduction to Bayesian Optimization

Bayesian optimization (BO) is a highly effective adaptive experimentation method that excels at balancing exploration (learning how new parameterizations perform) and exploitation (refining parameterizations previously observed to be good). This method is the backbone of Ax's optimization.

BO has seen widespread use across a variety of domains. Notable examples include its use in [tuning the hyperparameters of AlphaGo](https://www.nature.com/articles/nature16961), a landmark model that defeated world champions in the board game Go. In materials science, researchers used BO to accelerate the curing process, increase the overall strength, and reduce the CO2 emissions of [concrete formulations](https://arxiv.org/abs/2310.18288), the most abundant human-made material in history. In chemistry, researchers used it to [discover 21 new, state-of-the-art molecules for tunable dye lasers](https://www.science.org/doi/10.1126/science.adk9227) (frequently used in quantum physics research), including the world's brightest molecule, whereas only a dozen or so such molecules had been discovered over the preceding decades.

Ax relies on [BoTorch](https://botorch.org/) for its implementation of state-of-the-art Bayesian optimization components.

## Bayesian Optimization

Bayesian optimization begins by building a smooth surrogate model of the outcomes using a statistical model. This surrogate can make predictions at unobserved parameterizations and quantify the uncertainty around them. The predictions and the uncertainty estimates are combined to derive an acquisition function, which quantifies the value of observing a particular parameterization. By optimizing the acquisition function we find the most promising candidate parameterizations, which we then evaluate. The loop then repeats: refit the surrogate model on all observed data, optimize the acquisition function to choose the next configuration, and evaluate it. The entire process is adaptive in the sense that the predictions and uncertainty estimates are updated as new observations are made.
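
This loop can be sketched end to end in a few dozen lines. The snippet below is a minimal illustrative implementation in plain NumPy, not Ax's or BoTorch's actual code: a zero-mean GP surrogate with an RBF kernel, analytic expected improvement, grid-based acquisition optimization, and a hypothetical one-dimensional objective.

```python
import math

import numpy as np


def gp_posterior(x_train, y_train, x_test, lengthscale=1.0, noise=1e-4):
    """Posterior mean/std of a zero-mean GP with an RBF kernel (toy surrogate)."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)
    K = k(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = k(x_train, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Analytic EI for maximization under a Gaussian posterior."""
    z = (mean - best) / std
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    return (mean - best) * cdf + std * pdf


def objective(x):
    """Hypothetical black-box objective; its maximum is at x = pi/2."""
    return math.sin(x)


grid = np.linspace(0.0, 5.0, 101)  # candidate parameterizations
x_obs = [0.0, 5.0]                 # small initial design
y_obs = [objective(x) for x in x_obs]

for _ in range(15):                # the Bayesian optimization loop
    mean, std = gp_posterior(np.array(x_obs), np.array(y_obs), grid)
    ei = expected_improvement(mean, std, max(y_obs))
    x_next = float(grid[int(np.argmax(ei))])  # most promising candidate
    x_obs.append(x_next)
    y_obs.append(objective(x_next))           # evaluate, then refit next pass

best_x = x_obs[int(np.argmax(y_obs))]
```

In Ax, surrogate fitting, acquisition optimization over high-dimensional mixed search spaces, and stopping logic are all handled by BoTorch-backed generation strategies; the sketch above only illustrates the control flow.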

The strategy of relying on successive surrogate models to update knowledge of the objective allows BO to strike a balance between the conflicting goals of exploration (trying out parameterizations with high uncertainty in their outcomes) and exploitation (converging on configurations that are likely to be good). As a result, BO is able to find better configurations with fewer evaluations than is generally possible with grid search or other global optimization techniques. This makes it a good choice for applications where a limited number of function evaluations can be made.

## Surrogate Models

Because the objective function is a black-box process, we treat it as a random function and place a prior over it. This prior captures beliefs about the objective, and it is updated as data is observed to form the posterior.

This is typically done using a Gaussian process (GP), a probabilistic model that defines a probability distribution over possible functions that fit a set of points. Importantly for Bayesian optimization, GPs can be used to map points in input space (the parameters we wish to tune) to distributions in output space (the objectives we wish to optimize).
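
To make this concrete, here is a small illustrative sketch in plain NumPy (not the GP implementation Ax or BoTorch actually use) of how a zero-mean GP with an RBF kernel turns a handful of observations into a predictive mean and uncertainty at new points:

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential covariance between two 1-D input arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)             # predictive mean
    cov = rbf_kernel(x_test, x_test) - K_s.T @ np.linalg.solve(K, K_s)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))        # predictive std
    return mean, std


# Five observations of a hypothetical objective.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.sin(x_train)

# Predict at a well-explored point (x = 1) and a far-away one (x = 10).
mean, std = gp_posterior(x_train, y_train, np.array([1.0, 10.0]))
```

Near the data, the posterior mean tracks the observations and the standard deviation is small; far from the data, the mean reverts to the prior and the standard deviation grows back toward the prior's.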

In the one-dimensional example below, a GP surrogate model is fit to five noisy observations. The model predicts the objective (solid line) and places uncertainty estimates (proportional to the width of the shaded bands) over the entire x-axis, which represents the range of possible parameter values. Importantly, the model is able to predict the outcome and quantify the uncertainty of configurations that have not yet been tested. Intuitively, the uncertainty bands are tight in regions that are well explored and become wider as we move away from them.

![GP surrogate model](assets/surrogate.png)

## Acquisition Functions

The acquisition function is a mathematical function that quantifies the utility of observing a given point in the domain. Ax supports the most commonly used acquisition functions in BO, including:

- **Expected Improvement (EI)**, which captures the expected amount by which a point's observation will exceed the current best value.
- **Probability of Improvement (PI)**, which captures the probability that a point will produce an observation better than the current best value.
- **Upper Confidence Bound (UCB)**, which sums the predicted mean and a multiple of the predicted standard deviation.
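
As an illustration of how these differ, the sketch below evaluates all three from a surrogate's posterior mean and standard deviation at a candidate point. These are the standard textbook forms written with only the Python standard library; the function names are ours for illustration, not Ax's API.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI(x) = E[max(f(x) - f*, 0)] under a Gaussian posterior (maximizing)."""
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def probability_of_improvement(mean, std, best):
    """PI(x) = P(f(x) > f*) under a Gaussian posterior."""
    return normal_cdf((mean - best) / std)


def upper_confidence_bound(mean, std, beta=2.0):
    """UCB(x) = mean + beta * std; beta sets the exploration bonus."""
    return mean + beta * std


# Hypothetical candidates: a confident, mediocre one vs. an uncertain, risky one.
best_seen = 1.0
safe = expected_improvement(mean=1.05, std=0.05, best=best_seen)
risky = expected_improvement(mean=0.90, std=0.80, best=best_seen)
```

Even though the risky candidate's predicted mean is below the incumbent, its large uncertainty gives it the higher EI here, illustrating how EI trades exploitation against exploration.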

Each of these acquisition functions will lead to different behavior during the optimization. Expected improvement is a popular acquisition function owing to its natural tendency to both explore regions of high uncertainty and exploit regions known to be good, an analytic form that is easy to compute, and overall good practical performance. As the name suggests, it rewards evaluation of the objective $f$ based on the expected improvement relative to the current best. If $f^* = \max_i y_i$ is the current best observed outcome and our goal is to maximize $f$, then EI is defined as follows:

$$
\text{EI}(x) = \mathbb{E}\bigl[\max(f(x) - f^*, 0)\bigr]
$$
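
Because the surrogate's posterior at a point $x$ is Gaussian with mean $\mu(x)$ and standard deviation $\sigma(x)$, this expectation has a well-known closed form (a standard result, not specific to Ax):

$$
\text{EI}(x) = (\mu(x) - f^*)\,\Phi(z) + \sigma(x)\,\varphi(z), \qquad z = \frac{\mu(x) - f^*}{\sigma(x)},
$$

where $\Phi$ and $\varphi$ are the standard normal CDF and PDF. The first term grows with the predicted mean (exploitation) and the second with the predicted uncertainty (exploration).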

A visualization of the expected improvement based on the surrogate model's predictions is shown below; the next suggestion is the point at which the expected improvement attains its maximum.

![Expected Improvement (EI) acquisition function](assets/ei.png)

Once the point with the highest EI is selected and evaluated, the surrogate model is retrained on the new data and a new suggestion is made. This process continues in a loop until a stopping condition set by the user is reached.

![Full Bayesian optimization loop](assets/gpei.gif)

Using an acquisition function like EI to sample new points initially promotes quick exploration because its values, like the uncertainty estimates, are higher in unexplored regions. Once the parameter space is adequately explored, EI naturally narrows in on locations where there is a high likelihood of a good objective value.

While the combination of a Gaussian process surrogate model and the expected improvement acquisition function is shown above, other combinations of surrogate models and acquisition functions can be used. Different surrogates (GPs with different behaviors, or entirely different probabilistic models) and different acquisition functions present various tradeoffs in optimization performance, computational load, and more.

website/sidebars.js

Lines changed: 1 addition & 1 deletion

```diff
@@ -45,7 +45,7 @@ const tutorials = () => {

 export default {
   docs: {
-    Introduction: ['why-ax', 'intro-to-ae'],
+    Introduction: ['why-ax', 'intro-to-ae', 'intro-to-bo'],
   },
   tutorials: tutorials(),
 };
```
