---
id: acquisition
title: Acquisition Functions
---

Acquisition functions are heuristics employed to evaluate the usefulness of one
or more design points for achieving the objective of maximizing the underlying
black-box function.

BoTorch supports both analytic and (quasi-) Monte-Carlo based acquisition
functions. It provides a generic
[`AcquisitionFunction`](../api/acquisition.html#acquisitionfunction) API that
abstracts away from the particular type, so that optimization can be performed
on the same objects.


## Monte Carlo Acquisition Functions

Many common acquisition functions can be expressed as the expectation of some
real-valued function of the model output(s) at the design point(s):

$$
\alpha(X) = \mathbb{E}\bigl[ a(\xi) \mid
    \xi \sim \mathbb{P}(f(X) \mid \mathcal{D}) \bigr]
$$

where $X = (x_1, \dotsc, x_q)$, and $\mathbb{P}(f(X) \mid \mathcal{D})$ is the
posterior distribution of the function $f$ at $X$ given the data $\mathcal{D}$
observed so far.

Evaluating the acquisition function thus requires evaluating an integral over
the posterior distribution. In most cases, this is analytically intractable. In
particular, analytic expressions generally do not exist for batch acquisition
functions that consider multiple design points jointly (i.e. $q > 1$).

An alternative is to use Monte-Carlo (MC) sampling to approximate the integrals.
An MC approximation of $\alpha$ at $X$ using $N$ MC samples is

$$ \alpha(X) \approx \frac{1}{N} \sum_{i=1}^N a(\xi_{i}) $$

where $\xi_i \sim \mathbb{P}(f(X) \mid \mathcal{D})$.

For instance, for q-Expected Improvement (qEI), we have:

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\\{ \max(\xi_{ij} - f^\*, 0) \bigr\\},
\qquad \xi_{i} \sim \mathbb{P}(f(X) \mid \mathcal{D})
$$

where $f^\*$ is the best function value observed so far (assuming noiseless
observations). Using the reparameterization trick ([^KingmaWelling2014],
[^Rezende2014]),

$$
\text{qEI}(X) \approx \frac{1}{N} \sum_{i=1}^N \max_{j=1,..., q}
\bigl\\{ \max\bigl( \mu(X)\_j + (L(X) \epsilon_i)\_j - f^\*, 0 \bigr) \bigr\\},
\qquad \epsilon_{i} \sim \mathcal{N}(0, I)
$$

where $\mu(X)$ is the posterior mean of $f$ at $X$, and $L(X)L(X)^T = \Sigma(X)$
is a root decomposition of the posterior covariance matrix.

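As an illustration, the reparameterized estimator above can be written in a few
lines of plain PyTorch. This is only a sketch: `mean`, `cov_root` (a root $L(X)$
of the posterior covariance), and `best_f` are assumed to be given, and in
practice one would use the MC acquisition functions BoTorch provides rather than
rolling one's own estimator.

```python
import torch

def qei_mc_estimate(mean, cov_root, best_f, num_samples=500):
    """MC estimate of qEI via the reparameterization trick (sketch).

    mean:     posterior mean mu(X), shape (q,)
    cov_root: root decomposition L(X) with L L^T = Sigma(X), shape (q, q)
    best_f:   best objective value observed so far (scalar)
    """
    # epsilon_i ~ N(0, I): one row of base samples per MC sample
    eps = torch.randn(num_samples, mean.shape[-1])
    # xi_i = mu(X) + L(X) epsilon_i
    samples = mean + eps @ cov_root.transpose(-1, -2)
    # improvement over best_f, clamped at zero, maximized over the q candidates
    improvement = (samples - best_f).clamp_min(0.0).max(dim=-1).values
    return improvement.mean()
```
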
All MC-based acquisition functions in BoTorch are derived from
[`MCAcquisitionFunction`](../api/acquisition.html#mcacquisitionfunction).

Acquisition functions expect input tensors $X$ of shape
$\textit{batch_shape} \times q \times d$, where $d$ is the dimension of the
feature space, $q$ is the number of points considered jointly, and
$\textit{batch_shape}$ is the batch-shape of the input tensor. The output
$\alpha(X)$ will have shape $\textit{batch_shape}$, with each element
corresponding to the respective $q \times d$ batch tensor in the input $X$.
Note that for analytic acquisition functions, it must be that $q=1$.
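
For example, evaluating an MC acquisition function on a batch of q-batches could
look as follows, assuming (hypothetically) a fitted BoTorch model `model` on a
$d = 6$ dimensional feature space and a scalar incumbent value `best_f`:

```python
import torch
from botorch.acquisition import qExpectedImprovement

qEI = qExpectedImprovement(model, best_f=best_f)

X = torch.rand(20, 2, 6)  # batch_shape = 20, q = 2, d = 6
acq_values = qEI(X)       # shape: torch.Size([20]), one value per q-batch
```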

### MC, q-MC, and Fixed Base Samples

BoTorch relies on the re-parameterization trick and (quasi)-Monte-Carlo sampling
for optimization and estimation of the batch acquisition functions [^Wilson2017].
The results below show the reduced variance when estimating an expected
improvement (EI) acquisition function using base samples obtained via quasi-MC
sampling versus standard MC sampling.
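
To see the effect in isolation, the toy snippet below (not BoTorch code) draws
standard-normal base samples both i.i.d. and via scrambled Sobol quasi-MC and
plugs them into a single-point EI estimate; `mu`, `sigma`, and `best_f` are
assumed posterior summaries for one design point. Across repeated runs, the
quasi-MC estimate typically shows noticeably lower variance.

```python
import torch
from torch.quasirandom import SobolEngine

mu, sigma, best_f = 0.5, 1.0, 1.0  # assumed posterior mean/std and incumbent value
normal = torch.distributions.Normal(0.0, 1.0)

def ei_estimate(base_samples):
    # EI estimate from N(0, 1) base samples via y = mu + sigma * eps
    y = mu + sigma * base_samples
    return (y - best_f).clamp_min(0.0).mean()

n = 128
mc_base = torch.randn(n)                              # i.i.d. MC base samples
sobol = SobolEngine(dimension=1, scramble=True, seed=0)
u = sobol.draw(n).squeeze(-1).clamp(1e-6, 1 - 1e-6)   # uniform QMC points in (0, 1)
qmc_base = normal.icdf(u)                             # quasi-MC base samples

print(ei_estimate(mc_base), ei_estimate(qmc_base))
```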



In the plots above, the base samples used to estimate each point are resampled.
As discussed in the [Overview](./overview), a single set of base samples can be
used for optimization when the re-parameterization trick is employed. What are the
trade-offs between using a fixed set of base samples versus re-sampling on every
MC evaluation of the acquisition function? Below, we show that fixing base samples
produces functions that are potentially much easier to optimize, without resorting to
stochastic optimization methods.



If the base samples are fixed, the problem of optimizing the acquisition function
is deterministic, allowing for conventional quasi-second order methods such as
L-BFGS or sequential least-squares programming (SLSQP) to be used. These have
faster convergence rates than first-order methods and can speed up acquisition
function optimization significantly.
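
BoTorch follows this approach by default: its MC acquisition functions use
quasi-MC samplers whose base samples are held fixed across evaluations, and the
`optimize_acqf` helper then optimizes the resulting deterministic function with
multi-start gradient-based methods (L-BFGS-B via SciPy by default). A minimal
sketch, assuming a fitted model `model`, a scalar `best_f`, and a `2 x d` tensor
`bounds` of box constraints:

```python
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf

qEI = qExpectedImprovement(model, best_f=best_f)  # default sampler is quasi-MC

candidates, acq_value = optimize_acqf(
    acq_function=qEI,
    bounds=bounds,     # 2 x d tensor of lower/upper bounds
    q=2,               # number of candidates to generate jointly
    num_restarts=10,   # multi-start optimization
    raw_samples=512,   # raw samples for the initialization heuristic
)
```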

One concern is that the approximated acquisition function is *biased* for any
fixed set of base samples, which may adversely affect the solution. However, we
find that in practice, both the optimal value and the optimal solution of these
biased problems for standard acquisition functions converge quite rapidly to
their true counterparts as more samples are used. Note that for evaluation of
the acquisition function we integrate over a $qo$-dimensional space (where
$q$ is the number of points in the q-batch and $o$ is the number of outputs
included in the objective). Therefore, the MC integration problem can be quite
low-dimensional even for models on high-dimensional feature spaces (large $d$).
Because using additional samples is relatively cheap computationally,
we default to 500 base samples in the MC acquisition functions.

On the other hand, when re-sampling is used in conjunction with a stochastic
optimization algorithm, the kind of bias noted above is no longer a concern.
The trade-off here is that the optimization may be less effective, as discussed
above.


## Analytic Acquisition Functions

BoTorch also provides implementations of analytic acquisition functions that
do not depend on MC sampling. These acquisition functions are subclasses of
[`AnalyticAcquisitionFunction`](../api/acquisition.html#analyticacquisitionfunction)
and only exist for the case of a single candidate point ($q = 1$). These
include classical acquisition functions such as Expected Improvement (EI),
Upper Confidence Bound (UCB), and Probability of Improvement (PI). An example
comparing [`ExpectedImprovement`](../api/acquisition.html#expectedimprovement),
the analytic version of EI, to its MC counterpart
[`qExpectedImprovement`](../api/acquisition.html#qexpectedimprovement)
can be found in
[this tutorial](../tutorials/compare_mc_analytic_acquisition).
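
As a quick sanity check along the same lines (assuming a fitted single-output
model `model`, a scalar `best_f`, and a feature dimension `d`), the two can be
evaluated side by side; for $q = 1$ the MC estimates should closely track the
analytic values:

```python
import torch
from botorch.acquisition import ExpectedImprovement, qExpectedImprovement

EI = ExpectedImprovement(model, best_f=best_f)    # analytic, requires q = 1
qEI = qExpectedImprovement(model, best_f=best_f)  # MC-based, arbitrary q

X = torch.rand(10, 1, d)  # 10 q-batches of a single point each
print(EI(X))   # exact values
print(qEI(X))  # MC estimates of the same quantity
```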

Analytic acquisition functions allow for an explicit expression in terms of the
summary statistics of the posterior distribution at the evaluated point(s).
A popular acquisition function is Expected Improvement of a single point
for a Gaussian posterior, given by

$$ \text{EI}(x) = \mathbb{E}\bigl[
\max(y - f^\*, 0) \mid y\sim \mathcal{N}(\mu(x), \sigma^2(x))
\bigr] $$

where $\mu(x)$ and $\sigma(x)$ are the posterior mean and standard deviation of
$f$ at the point $x$, and $f^\*$ is again the best function value observed so
far (assuming noiseless observations). It can be shown that

$$ \text{EI}(x) = \sigma(x) \bigl( z \Phi(z) + \varphi(z) \bigr)$$

where $z = \frac{\mu(x) - f^\*}{\sigma(x)}$ and $\Phi$ and $\varphi$ are
the cdf and pdf of the standard normal distribution, respectively.
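
This closed-form expression translates directly into code. The following
stand-alone sketch (not BoTorch's implementation) evaluates it from the
posterior mean, standard deviation, and incumbent value:

```python
import torch
from torch.distributions import Normal

def analytic_ei(mu, sigma, best_f):
    """EI(x) = sigma(x) * (z * Phi(z) + phi(z)), z = (mu(x) - f*) / sigma(x)."""
    standard_normal = Normal(0.0, 1.0)
    z = (mu - best_f) / sigma
    return sigma * (z * standard_normal.cdf(z) + standard_normal.log_prob(z).exp())

# e.g. analytic_ei(torch.tensor(0.6), torch.tensor(0.2), best_f=0.5)
```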

With some additional work, it is also possible to express the gradient of
the Expected Improvement with respect to the design $x$. Classic Bayesian
Optimization software will implement this gradient function explicitly, so that
it can be used for numerically optimizing the acquisition function.

BoTorch, in contrast, harnesses PyTorch's automatic differentiation feature
("autograd") in order to obtain gradients of acquisition functions. This makes
implementing new acquisition functions much less cumbersome, as it does not
require analytically deriving gradients. All that is required is that the
operations performed in the acquisition function computation allow for the
back-propagation of gradient information through the posterior and the model.
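
Concretely, gradients with respect to the candidate set are obtained by marking
the input tensor as requiring gradients and calling `backward` (or
`torch.autograd.grad`) on the acquisition value. The generic sketch below
assumes some acquisition function `acq_func` and dimensions `q` and `d`:

```python
import torch

X = torch.rand(q, d, requires_grad=True)
value = acq_func(X.unsqueeze(0)).sum()     # add a batch dimension, reduce to a scalar
grad_X = torch.autograd.grad(value, X)[0]  # d(value)/dX, shape q x d
```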


[^KingmaWelling2014]: D. P. Kingma, M. Welling. Auto-Encoding Variational Bayes.
ICLR, 2014.

[^Rezende2014]: D. J. Rezende, S. Mohamed, D. Wierstra. Stochastic
Backpropagation and Approximate Inference in Deep Generative Models. ICML, 2014.

[^Wilson2017]: J. T. Wilson, R. Moriconi, F. Hutter, M. P. Deisenroth.
The Reparameterization Trick for Acquisition Functions. NeurIPS Workshop on
Bayesian Optimization, 2017.

## Latent Information Gain

In high-dimensional spatiotemporal domains, the expected information gain in the
observational space becomes less informative for selecting useful observations,
and it can be difficult to compute. To overcome these limitations, we propose a
novel acquisition function [^Wu2023arxiv] that computes the expected information
gain in the latent space rather than in the observational space. To design this
acquisition function, we prove the equivalence between the expected information
gain in the observational space and the expected KL divergence in the latent
processes with respect to a candidate parameter $\theta$, as stated in the
following proposition.

**Proposition 1.** The expected information gain (EIG) for a Neural Process is
equivalent to the expected KL divergence between the posterior and the prior
over the latent process, that is

$$ \text{EIG}(\hat{x}_{1:T}, \theta) := \mathbb{E} \left[ H(\hat{x}_{1:T}) -
H(\hat{x}_{1:T} \mid z_{1:T}, \theta) \right]
= \mathbb{E}_{p(\hat{x}_{1:T} \mid \theta)}
\text{KL} \left( p(z_{1:T} \mid \hat{x}_{1:T}, \theta) \,\|\, p(z_{1:T}) \right)
$$

Inspired by this fact, we propose a novel acquisition function that computes the
expected KL divergence in the latent processes, and name it Latent Information
Gain (LIG). Specifically, the trained NP model produces a variational posterior
given the current dataset. For every parameter $\theta$ remaining in the search
space, we predict $\hat{x}_{1:T}$ with the decoder, and then feed $\hat{x}_{1:T}$
and $\theta$ into the encoder to re-evaluate the posterior. LIG scores $\theta$
by the distributional difference between this re-evaluated posterior and the
latent process before the update.
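
A rough sketch of this computation in PyTorch is given below. The `encoder` and
`decoder` here are hypothetical callables standing in for the trained NP model's
components (the real interface depends on the NP implementation), with the
encoder returning a `torch.distributions.Normal` over the latent variables:

```python
import torch
from torch.distributions import kl_divergence

def latent_information_gain(theta, context, encoder, decoder, num_samples=16):
    """Sketch of LIG for a candidate theta (not a BoTorch API).

    encoder(context[, x_hat, theta]) -> Normal over the latents z_{1:T}
    decoder(z, theta)                -> predicted outputs x_hat_{1:T}
    """
    current = encoder(context)                    # latent distribution given current data
    kls = []
    for _ in range(num_samples):
        z = current.rsample()                     # sample a latent trajectory
        x_hat = decoder(z, theta)                 # predict x_hat_{1:T} for this theta
        updated = encoder(context, x_hat, theta)  # re-evaluate the posterior
        # distributional difference w.r.t. the latent process before the update
        kls.append(kl_divergence(updated, current).sum())
    return torch.stack(kls).mean()
```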

[^Wu2023arxiv]: D. Wu, R. Niu, M. Chinazzi, A. Vespignani, Y.-A. Ma, R. Yu.
Deep Bayesian Active Learning for Accelerating Stochastic Simulation. arXiv
preprint arXiv:2106.02770, 2023.