Commit 657151f

NP Regression Documentation
1 parent c55d7a9 commit 657151f

2 files changed: +397, -359 lines

docs/acquisition.md (+207, -172)

---
id: acquisition
title: Acquisition Functions
---

Acquisition functions are heuristics employed to evaluate the usefulness of one
or more design points for achieving the objective of maximizing the underlying
black box function.

BoTorch supports both analytic as well as (quasi-) Monte-Carlo based acquisition
functions. It provides a generic
[`AcquisitionFunction`](../api/acquisition.html#acquisitionfunction) API that
abstracts away from the particular type, so that optimization can be performed
on the same objects.

## Monte Carlo Acquisition Functions

Many common acquisition functions can be expressed as the expectation of some
real-valued function of the model output(s) at the design point(s):

$$
\alpha(X) = \mathbb{E}\bigl[ a(\xi) \mid
\xi \sim \mathbb{P}(f(X) \mid \mathcal{D}) \bigr]
$$

where $X = (x_1, \dotsc, x_q)$, and $\mathbb{P}(f(X) \mid \mathcal{D})$ is the
posterior distribution of the function $f$ at $X$ given the data $\mathcal{D}$
observed so far.

Evaluating the acquisition function thus requires evaluating an integral over
the posterior distribution. In most cases, this is analytically intractable. In
particular, analytic expressions generally do not exist for batch acquisition
functions that consider multiple design points jointly (i.e. $q > 1$).

An alternative is to use Monte-Carlo (MC) sampling to approximate the integrals.
An MC approximation of $\alpha$ at $X$ using $N$ MC samples is

$$ \alpha(X) \approx \frac{1}{N} \sum_{i=1}^N a(\xi_{i}) $$

where $\xi_i \sim \mathbb{P}(f(X) \mid \mathcal{D})$.

For instance, for q-Expected Improvement (qEI), we have:

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\\{ \max(\xi_{ij} - f^\*, 0) \bigr\\},
\qquad \xi_{i} \sim \mathbb{P}(f(X) \mid \mathcal{D})
$$

where $f^\*$ is the best function value observed so far (assuming noiseless
observations). Using the reparameterization trick ([^KingmaWelling2014],
[^Rezende2014]),

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\\{ \max\bigl( \mu(X)\_j + (L(X) \epsilon_i)\_j - f^\*, 0 \bigr) \bigr\\},
\qquad \epsilon_{i} \sim \mathcal{N}(0, I)
$$

where $\mu(X)$ is the posterior mean of $f$ at $X$, and $L(X)L(X)^T = \Sigma(X)$
is a root decomposition of the posterior covariance matrix.
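
To make the reparameterized estimator concrete, the following minimal sketch
implements it in plain PyTorch for a single q-batch. The helper name
`qei_reparam_estimate` and the toy posterior are made up for illustration;
BoTorch's `qExpectedImprovement` performs this computation for you and also
supports batched inputs and custom objectives.

```python
import torch

def qei_reparam_estimate(mu, L, best_f, num_samples=512, seed=0):
    """MC estimate of qEI via the reparameterization trick.

    mu:     posterior mean at the q candidate points, shape (q,)
    L:      root of the posterior covariance, L @ L.T = Sigma, shape (q, q)
    best_f: best function value observed so far (scalar)
    """
    q = mu.shape[-1]
    gen = torch.Generator().manual_seed(seed)
    # base samples eps_i ~ N(0, I), shape (num_samples, q)
    eps = torch.randn(num_samples, q, generator=gen, dtype=mu.dtype)
    # posterior samples xi_i = mu + L eps_i, shape (num_samples, q)
    xi = mu + eps @ L.transpose(-1, -2)
    # improvement over best_f, maximized over the q points, averaged over samples
    improvement = (xi - best_f).clamp_min(0.0)
    return improvement.max(dim=-1).values.mean()

# Toy example: a 2-point q-batch with an invented posterior.
mu = torch.tensor([0.1, 0.3])
Sigma = torch.tensor([[0.2, 0.05], [0.05, 0.1]])
L = torch.linalg.cholesky(Sigma)
print(qei_reparam_estimate(mu, L, best_f=0.25))
```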

All MC-based acquisition functions in BoTorch are derived from
[`MCAcquisitionFunction`](../api/acquisition.html#mcacquisitionfunction).

Acquisition functions expect input tensors $X$ of shape
$\textit{batch_shape} \times q \times d$, where $d$ is the dimension of the
feature space, $q$ is the number of points considered jointly, and
$\textit{batch_shape}$ is the batch-shape of the input tensor. The output
$\alpha(X)$ will have shape $\textit{batch_shape}$, with each element
corresponding to the respective $q \times d$ batch tensor in the input $X$.
Note that for analytic acquisition functions, it must be that $q=1$.
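
For example, the following minimal sketch evaluates `qExpectedImprovement` on a
`batch_shape x q x d` input. The toy training data is invented, the model
hyperparameters are left untrained to keep the example short, and details such
as the default sampler may vary across BoTorch versions.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.acquisition import qExpectedImprovement

# A toy model on a d=3 feature space (hyperparameters left at their defaults).
train_X = torch.rand(10, 3, dtype=torch.double)
train_Y = train_X.sum(dim=-1, keepdim=True)
model = SingleTaskGP(train_X, train_Y)

qEI = qExpectedImprovement(model=model, best_f=train_Y.max())

# batch_shape x q x d input: 5 candidate q-batches, each with q=2 points in d=3.
X = torch.rand(5, 2, 3, dtype=torch.double)
print(qEI(X).shape)  # torch.Size([5]): one acquisition value per q-batch
```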

### MC, q-MC, and Fixed Base Samples

BoTorch relies on the re-parameterization trick and (quasi)-Monte-Carlo sampling
for optimization and estimation of the batch acquisition functions [^Wilson2017].
The results below show the reduced variance when estimating an expected
improvement (EI) acquisition function using base samples obtained via quasi-MC
sampling versus standard MC sampling.

![MC_qMC](assets/EI_MC_qMC.png)

In the plots above, the base samples used to estimate each point are resampled.
As discussed in the [Overview](./overview), a single set of base samples can be
used for optimization when the re-parameterization trick is employed. What are the
trade-offs between using a fixed set of base samples versus re-sampling on every
MC evaluation of the acquisition function? Below, we show that fixing base samples
produces functions that are potentially much easier to optimize, without resorting to
stochastic optimization methods.

![resampling_fixed](assets/EI_resampling_fixed.png)

If the base samples are fixed, the problem of optimizing the acquisition function
is deterministic, allowing for conventional quasi-second order methods such as
L-BFGS or sequential least-squares programming (SLSQP) to be used. These have
faster convergence rates than first-order methods and can speed up acquisition
function optimization significantly.
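
To illustrate why fixed base samples help, here is a deliberately simplified
sketch in plain PyTorch: the one-dimensional "posterior" `mu`/`sigma`, the
threshold `best_f`, and the helper `ei_fixed_base_samples` are all invented for
illustration and are not BoTorch APIs. Because the base samples `eps` are drawn
once, the MC estimate is a deterministic, differentiable function of the design,
so a quasi-second order optimizer such as L-BFGS can be applied directly.

```python
import torch

torch.manual_seed(0)
# Fixed base samples, drawn once up front (qMC draws could be used instead).
eps = torch.randn(500)

# A toy 1-d posterior (made up for illustration): mean and standard deviation
# as differentiable functions of the design x.
def mu(x):
    return torch.sin(3.0 * x)

def sigma(x):
    return 0.2 + 0.1 * torch.cos(x) ** 2

best_f = 0.5

def ei_fixed_base_samples(x):
    # Reparameterized MC estimate of EI; deterministic given the fixed eps.
    samples = mu(x) + sigma(x) * eps
    return (samples - best_f).clamp_min(0.0).mean()

# Because the objective is deterministic, a quasi-second order method applies.
x = torch.tensor([0.1], requires_grad=True)
opt = torch.optim.LBFGS([x], max_iter=50)

def closure():
    opt.zero_grad()
    loss = -ei_fixed_base_samples(x)  # maximize EI = minimize its negative
    loss.backward()
    return loss

opt.step(closure)
print(x.item(), ei_fixed_base_samples(x).item())
```

In BoTorch itself this pattern is handled by the QMC samplers and the
optimization utilities rather than by hand-written code as above.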

One concern is that the approximated acquisition function is *biased* for any
fixed set of base samples, which may adversely affect the solution. However, we
find that in practice, both the optimal value and the optimal solution of these
biased problems for standard acquisition functions converge quite rapidly to
their true counterparts as more samples are used. Note that for evaluation of
the acquisition function we integrate over a $qo$-dimensional space (where
$q$ is the number of points in the q-batch and $o$ is the number of outputs
included in the objective). Therefore, the MC integration problem can be quite
low-dimensional even for models on high-dimensional feature spaces (large $d$).
Because using additional samples is relatively cheap computationally,
we default to 500 base samples in the MC acquisition functions.

On the other hand, when re-sampling is used in conjunction with a stochastic
optimization algorithm, the kind of bias noted above is no longer a concern.
The trade-off here is that the optimization may be less effective, as discussed
above.


## Analytic Acquisition Functions

BoTorch also provides implementations of analytic acquisition functions that
do not depend on MC sampling. These acquisition functions are subclasses of
[`AnalyticAcquisitionFunction`](../api/acquisition.html#analyticacquisitionfunction)
and only exist for the case of a single candidate point ($q = 1$). These
include classical acquisition functions such as Expected Improvement (EI),
Upper Confidence Bound (UCB), and Probability of Improvement (PI). An example
comparing [`ExpectedImprovement`](../api/acquisition.html#expectedimprovement),
the analytic version of EI, to its MC counterpart
[`qExpectedImprovement`](../api/acquisition.html#qexpectedimprovement)
can be found in
[this tutorial](../tutorials/compare_mc_analytic_acquisition).

Analytic acquisition functions allow for an explicit expression in terms of the
summary statistics of the posterior distribution at the evaluated point(s).
A popular acquisition function is Expected Improvement of a single point
for a Gaussian posterior, given by

$$ \text{EI}(x) = \mathbb{E}\bigl[
\max(y - f^\*, 0) \mid y\sim \mathcal{N}(\mu(x), \sigma^2(x))
\bigr] $$

where $\mu(x)$ and $\sigma^2(x)$ are the posterior mean and variance of $f$ at the
point $x$, and $f^\*$ is again the best function value observed so far (assuming
noiseless observations). It can be shown that

$$ \text{EI}(x) = \sigma(x) \bigl( z \Phi(z) + \varphi(z) \bigr)$$

where $z = \frac{\mu(x) - f^\*}{\sigma(x)}$ and $\Phi$ and $\varphi$ are
the cdf and pdf of the standard normal distribution, respectively.
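
The closed-form expression above translates almost directly into code. The
sketch below uses a hypothetical `expected_improvement` helper built on
`torch.distributions`; it is not BoTorch's `ExpectedImprovement` class, which
additionally handles minimization and numerically degenerate cases.

```python
import torch
from torch.distributions import Normal

def expected_improvement(mu, sigma, best_f):
    """Analytic EI for a Gaussian posterior N(mu, sigma^2)."""
    std_normal = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    z = (mu - best_f) / sigma
    # sigma * (z * Phi(z) + phi(z))
    return sigma * (z * std_normal.cdf(z) + std_normal.log_prob(z).exp())

mu = torch.tensor([0.3])
sigma = torch.tensor([0.2])
print(expected_improvement(mu, sigma, best_f=0.25))
```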

With some additional work, it is also possible to express the gradient of
the Expected Improvement with respect to the design $x$. Classic Bayesian
Optimization software will implement this gradient function explicitly, so that
it can be used for numerically optimizing the acquisition function.

BoTorch, in contrast, harnesses PyTorch's automatic differentiation feature
("autograd") in order to obtain gradients of acquisition functions. This makes
implementing new acquisition functions much less cumbersome, as it does not
require analytically deriving gradients. All that is required is that the
operations performed in the acquisition function computation allow for the
back-propagation of gradient information through the posterior and the model.
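
For example, continuing the hypothetical `expected_improvement` sketch above, a
gradient with respect to the design is obtained by ordinary back-propagation.
The differentiable toy posterior below is again made up for illustration; in
BoTorch the posterior comes from the model.

```python
x = torch.tensor([0.4], requires_grad=True)
# Invented differentiable posterior statistics as functions of x.
mu, sigma = torch.sin(x), 0.1 + 0.1 * x ** 2
ei = expected_improvement(mu, sigma, best_f=0.25)
ei.backward()
print(x.grad)  # d EI / d x, computed by autograd; no hand-derived gradient needed
```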


[^KingmaWelling2014]: D. P. Kingma, M. Welling. Auto-Encoding Variational Bayes.
ICLR, 2013.

[^Rezende2014]: D. J. Rezende, S. Mohamed, D. Wierstra. Stochastic
Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.

[^Wilson2017]: J. T. Wilson, R. Moriconi, F. Hutter, M. P. Deisenroth.
The Reparameterization Trick for Acquisition Functions. NeurIPS Workshop on
Bayesian Optimization, 2017.


## Latent Information Gain

In the high-dimensional spatiotemporal domain, Expected Information Gain becomes
less informative for selecting useful observations, and its parameters can be
difficult to calculate. To overcome these limitations, we propose a novel
acquisition function that computes the expected information gain in the latent
space rather than the observational space [^Wu2023arxiv]. To design this
acquisition function, we prove the equivalence between the expected information
gain in the observational space and the expected KL divergence in the latent
processes w.r.t. a candidate parameter $\theta$, as illustrated by the following
proposition.

**Proposition 1.** The expected information gain (EIG) for a Neural
Process is equivalent to the KL divergence between the prior and
posterior in the latent process, that is

$$ \text{EIG}(\hat{x}_{1:T}, \theta) := \mathbb{E} \left[ H(\hat{x}_{1:T}) -
H(\hat{x}_{1:T} \mid z_{1:T}, \theta) \right]
= \mathbb{E}_{p(\hat{x}_{1:T} \mid \theta)}
\text{KL} \left( p(z_{1:T} \mid \hat{x}_{1:T}, \theta) \,\|\, p(z_{1:T}) \right)
$$

Inspired by this fact, we propose a novel acquisition function computing the
expected KL divergence in the latent processes and name it Latent Information
Gain (LIG). Specifically, the trained Neural Process (NP) model produces a
variational posterior given the current dataset. For every parameter $\theta$
remaining in the search space, we can predict $\hat{x}_{1:T}$ with the decoder.
We use $\hat{x}_{1:T}$ and $\theta$ as input to the encoder to re-evaluate the
posterior. LIG computes the distributional difference between this re-evaluated
posterior and the prior with respect to the latent process.
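
As a rough sketch of the quantity involved (not the authors' implementation; the
Gaussian latent distributions below simply stand in for the outputs of the NP
encoder), the KL term for a single candidate $\theta$ might be computed as
follows:

```python
import torch
from torch.distributions import Normal, kl_divergence

def latent_information_gain(prior_mu, prior_sigma, post_mu, post_sigma):
    """Sketch of LIG for one candidate theta: KL between the latent posterior
    re-evaluated on the decoded prediction and the latent prior, summed over
    time steps and latent dimensions."""
    posterior = Normal(post_mu, post_sigma)   # q(z_{1:T} | x_hat_{1:T}, theta)
    prior = Normal(prior_mu, prior_sigma)     # p(z_{1:T})
    return kl_divergence(posterior, prior).sum()

# Toy latent process: T=10 time steps, latent dimension 4 (values made up).
T, dz = 10, 4
prior_mu, prior_sigma = torch.zeros(T, dz), torch.ones(T, dz)
post_mu, post_sigma = 0.3 * torch.randn(T, dz), 0.5 * torch.ones(T, dz)
print(latent_information_gain(prior_mu, prior_sigma, post_mu, post_sigma))
```

In practice, the outer expectation over $p(\hat{x}_{1:T} \mid \theta)$ in the
proposition would be approximated by averaging this KL term over several decoded
samples.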

[^Wu2023arxiv]: D. Wu, R. Niu, M. Chinazzi, A. Vespignani, Y.-A. Ma, R. Yu.
Deep Bayesian Active Learning for Accelerating Stochastic Simulation.
arXiv preprint arXiv:2106.02770, 2023. https://arxiv.org/abs/2106.02770
