Commit 98b5395

Minor updates and rename of laplace approximation algorithm (design-doc #16)
1 parent 914a47f commit 98b5395

File tree

1 file changed (+30 −33 lines)

designs/0016-hessian_optimize_cmdstan.md → designs/0016-laplace_approximation_algorithm.md

@@ -12,9 +12,8 @@ form of approximate inference.
 # Motivation
 [motivation]: #motivation
 
-I have a Stan model that is too slow to sample. I would like to do something
-better than optimization. Laplace approximations are a pretty standard way
-of doing this.
+Laplace approximations are quick to compute. If a model can be fit with a Laplace
+approximation, that will probably be much faster than running full hmc.
 
 # Guide-level explanation
 [guide-level-explanation]: #guide-level-explanation
@@ -33,7 +32,7 @@ where `jac(g, u)` is the Jacobian of the function `g` at `u`. In the Laplace
 approximation, we search for a mode (a maximum) of ```log(p(u))```. Call this
 `u_mode`. This is not the same optimization that is done in the `optimizing`
 algorithm. That searches for a mode of `log(p(g(u)))` (or the equation above
-without the `log(det(jac(g, u)))` term. This is different.
+without the `log(det(jac(g, u)))` term). This is different.
 
 We can form a second order Taylor expansion of `log(p(u))` around `u_mode`:
 
@@ -44,10 +43,10 @@ log(p(u)) = log(p(u_mode))
 + O(||u - u_mode||^3)
 ```
 
-where `gradient(log(p), u)` is the gradient of `log(p)` at `u` and
-`hessian(log(p), u)` is the hessian of `log(p)` at `u`. Because the gradient
-is zero at the mode, the linear term drops out. Ignoring the third order
-terms gives us a new distribution `p_approx(u)`:
+where `gradient(log(p), u_mode)` is the gradient of `log(p(u))` at `u_mode` and
+`hessian(log(p), u_mode)` is the hessian of `log(p(u))` at `u_mode`. Because the
+gradient is zero at the mode, the linear term drops out. Ignoring the third
+order terms gives us a new distribution `p_approx(u)`:
 
 ```
 log(p_approx(u)) = K + 0.5 * (u - u_mode)^T * hessian(log(p), u_mode) * (u - u_mode)
@@ -59,9 +58,10 @@ where K is a constant to make this normalize. `u_approx` (`u` sampled from
 u_approx ~ N(u_mode, -(hessian(log(p), u_mode))^{-1})
 ```
 
-Taking draws from `u_approx` gives us draws from our distribution on Stan's
-unconstrained space. Once constrained, these draws can be used in the same
-way that regular draws from the `sampling` algorithm are used.
+Taking draws from `u_approx` gives us draws from the distribution on the
+unconstrained space. Once constrained (via the transform `g(u)`), these draws
+can be used in the same way that regular draws from the `sampling` algorithm
+are used.
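
As a minimal sketch (not part of the proposed CmdStan interface, and using assumed variable names), drawing from `u_approx` could look like this in NumPy once `u_mode` and the Hessian are available:

```python
import numpy as np

def laplace_draws(u_mode, hessian, num_samples, seed=0):
    """Draw from u_approx ~ N(u_mode, -(hessian)^{-1}) on the unconstrained space.

    u_mode  : mode of log(p(u)) found by the Jacobian-adjusted optimization (1-D array)
    hessian : Hessian of log(p(u)) evaluated at u_mode (2-D array, negative definite)
    """
    rng = np.random.default_rng(seed)
    covariance = np.linalg.inv(-hessian)  # negate the Hessian, then invert
    return rng.multivariate_normal(u_mode, covariance, size=num_samples)
```

Each row of the result would still need to be mapped back through the constraining transform `g(u)` before being reported to the user.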

 The `laplace` algorithm would take in the same arguments as the `optimize`
 algorithm plus two additional ones:
@@ -79,7 +79,7 @@ A model can be called by:
 ./model laplace num_samples=100 data file=data.dat
 ```
 
-or with the diagonal:
+or with a small value added to the diagonal:
 ```
 ./model laplace num_samples=100 add_diag=1e-10 data file=data.dat
 ```
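
A rough sketch of what `add_diag` would mean numerically, assuming the covariance is formed by inverting the negative Hessian (the function name here is hypothetical):

```python
import numpy as np

def laplace_covariance(hessian, add_diag=0.0):
    """Covariance of the Laplace approximation with optional diagonal jitter.

    Adding a small value such as 1e-10 to the diagonal of -hessian can make a
    Hessian that is only numerically negative semi-definite safe to invert.
    """
    neg_hessian = -hessian + add_diag * np.eye(hessian.shape[0])
    return np.linalg.inv(neg_hessian)
```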
@@ -92,7 +92,10 @@ The three algorithm specific parameters are:
 1. ```log_p__``` - The log density of the model itself
 2. ```log_g__``` - The log density of the Laplace approximation
 3. ```rejected__``` - A boolean data indicating whether it was possible to
-evaluate the log density of the model at this parameter value
+evaluate the log density of the model at this parameter value
+
+Reporting both the model and approximate log density allows for importance
+sampling diagnostics to be applied to the model.
 
 For instance, the new output might look like:
 
@@ -106,15 +109,24 @@ log_p__, log_g__, rejected__, b.1,b.2
 -3, -4, 1, 0, 0
 ```
 
+There are two additional diagnostics for the `laplace` approximation. The
+`log_p__` and `log_g__` outputs can be used to do the Pareto-k
+diagnostic from "Pareto Smoothed Importance Sampling" (Vehtari, 2015).
+
+There could also be a diagnostic printout for how many times it is not
+possible to evaluate the log density of the model at a certain point from the
+approximate posterior (this would indicate the approximation is likely
+questionable).
+
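
The Pareto-k diagnostic operates on the importance ratios between the model and the approximation. A minimal sketch of forming those ratios from the output columns (array inputs assumed); a PSIS routine as in the cited paper would then fit a generalized Pareto distribution to the largest ratios and report the shape parameter k, with values above roughly 0.7 signalling an unreliable approximation:

```python
import numpy as np

def importance_log_weights(log_p, log_g, rejected):
    """Log importance ratios log p(u) - log g(u) for the accepted draws.

    log_p, log_g, rejected : the log_p__, log_g__, and rejected__ output columns.
    The ratios are shifted by their maximum for numerical stability before
    being handed to a PSIS / Pareto-k routine.
    """
    keep = np.asarray(rejected) == 0
    log_w = np.asarray(log_p)[keep] - np.asarray(log_g)[keep]
    return log_w - np.max(log_w)
```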
 # Reference-level explanation
 [reference-level-explanation]: #reference-level-explanation
 
 The implementation of this would borrow heavily from the optimization code. The
 difference would be that the Jacobian would be turned on for the optimization.
 
-We will also need to implement a way of computing Hessians with finite
-differences of gradients. Simple finite differences were not sufficiently
-accurate for an example I was working on.
+The Hessians should be computed using finite differences of reverse mode autodiff.
+Higher order autodiff isn't possible for general Stan models, but finite differences
+of reverse mode autodiff should be more stable than second order finite differences.
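
A minimal sketch of the proposed scheme, forming the Hessian by central finite differences of a gradient function. In Stan the gradient would come from reverse-mode autodiff; here `grad` is any callable and the step size is an assumed default:

```python
import numpy as np

def hessian_from_gradients(grad, u, eps=1e-5):
    """Approximate the Hessian of log(p) at u by finite differences of gradients.

    grad : callable returning the gradient of log(p) at a point (reverse-mode
           autodiff in the real implementation)
    u    : point at which to evaluate, e.g. u_mode (1-D array)
    """
    n = u.size
    hess = np.empty((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        hess[:, i] = (grad(u + step) - grad(u - step)) / (2.0 * eps)
    return 0.5 * (hess + hess.T)  # enforce symmetry
```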

 # Drawbacks
 [drawbacks]: #drawbacks
@@ -137,24 +149,9 @@ lengths of their outputs (they may compute incorrect standard errors, etc).
 print zeros where usually the parameters would be.
 
 If the user does not check the `rejected__` column, then they will be using a
-bunch of zeros in their follow-up calculations that could mislead them.
-
-Similarly to divergent transitions, we can print to the output information about
-how many draws were rejected in any case.
-
-In the earlier part of the design document I assume #3 was implemented, but I
-think #2 might be better.
-
-# Rationale and alternatives
-[rationale-and-alternatives]: #rationale-and-alternatives
-
-An alternative that seems appealing at first is printing out the mode and
-hessian so that it would be possible for people to make their own posterior
-draws. This is not reasonable because the conversion from unconstrained to
-constrained space is not exposed in all of the interfaces.
+bunch of zeros in their follow-up calculations that could be misleading.
 
 # Prior art
 [prior-art]: #prior-art
 
-This is a pretty common Bayesian approximation. It just is not implemented in
-Stan.
+This is a common Bayesian approximation. It just is not implemented in Stan.
