@@ -12,9 +12,8 @@ form of approximate inference.
# Motivation
[motivation]: #motivation

- I have a Stan model that is too slow to sample. I would like to do something
- better than optimization. Laplace approximations are a pretty standard way
- of doing this.
+ Laplace approximations are quick to compute. If a model can be fit with a Laplace
+ approximation, that will probably be much faster than running full HMC.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
@@ -33,7 +32,7 @@ where `jac(g, u)` is the Jacobian of the function `g` at `u`. In the Laplace
approximation, we search for a mode (a maximum) of `log(p(u))`. Call this
`u_mode`. This is not the same optimization that is done in the `optimizing`
algorithm. That searches for a mode of `log(p(g(u)))` (or the equation above
- without the `log(det(jac(g, u)))` term. This is different.
+ without the `log(det(jac(g, u)))` term). This is different.

We can form a second order Taylor expansion of `log(p(u))` around `u_mode`:
@@ -44,10 +43,10 @@ log(p(u)) = log(p(u_mode))
+ O(||u - u_mode||^3)
```

- where `gradient(log(p), u)` is the gradient of `log(p)` at `u` and
- `hessian(log(p), u)` is the hessian of `log(p)` at `u`. Because the gradient
- is zero at the mode, the linear term drops out. Ignoring the third order
- terms gives us a new distribution `p_approx(u)`:
+ where `gradient(log(p), u_mode)` is the gradient of `log(p(u))` at `u_mode` and
+ `hessian(log(p), u_mode)` is the hessian of `log(p(u))` at `u_mode`. Because the
+ gradient is zero at the mode, the linear term drops out. Ignoring the third
+ order terms gives us a new distribution `p_approx(u)`:

```
log(p_approx(u)) = K + 0.5 * (u - u_mode)^T * hessian(log(p), u_mode) * (u - u_mode)
@@ -59,9 +58,10 @@ where K is a constant to make this normalize. `u_approx` (`u` sampled from
u_approx ~ N(u_mode, -(hessian(log(p), u_mode))^{-1})
```

- Taking draws from `u_approx` gives us draws from our distribution on Stan's
- unconstrained space. Once constrained, these draws can be used in the same
- way that regular draws from the `sampling` algorithm are used.
+ Taking draws from `u_approx` gives us draws from the distribution on the
+ unconstrained space. Once constrained (via the transform `g(u)`), these draws
+ can be used in the same way that regular draws from the `sampling` algorithm
+ are used.
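
To make the procedure concrete, here is a minimal sketch of the same steps in
Python/NumPy for a toy two-parameter density (a correlated Gaussian, for which
the Laplace approximation is exact). The functions `log_p`, `grad_log_p`, and
`hess_log_p` are stand-ins for what Stan evaluates on the unconstrained scale;
this is an illustration, not the proposed implementation.

```python
import numpy as np
from scipy.optimize import minimize

m = np.array([1.0, -2.0])
A = np.array([[2.0, 0.5], [0.5, 1.0]])   # precision matrix of the toy target

def log_p(u):
    return -0.5 * (u - m) @ A @ (u - m)

def grad_log_p(u):
    return -A @ (u - m)

def hess_log_p(u):
    return -A

# 1. find u_mode by maximizing log(p(u)) on the unconstrained space
res = minimize(lambda u: -log_p(u), x0=np.zeros(2), jac=lambda u: -grad_log_p(u))
u_mode = res.x

# 2. the Gaussian approximation is N(u_mode, -hessian(log(p), u_mode)^{-1})
cov = np.linalg.inv(-hess_log_p(u_mode))

# 3. draws on the unconstrained space; each would then be mapped back through g(u)
draws = np.random.multivariate_normal(u_mode, cov, size=100)
```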

The `laplace` algorithm would take in the same arguments as the `optimize`
algorithm plus two additional ones:
@@ -79,7 +79,7 @@ A model can be called by:
./model laplace num_samples=100 data file=data.dat
```

- or with the diagonal:
+ or with a small value added to the diagonal:
```
./model laplace num_samples=100 add_diag=1e-10 data file=data.dat
```
@@ -92,7 +92,10 @@ The three algorithm specific parameters are:
1. `log_p__` - The log density of the model itself
2. `log_g__` - The log density of the Laplace approximation
3. `rejected__` - A boolean flag indicating whether it was possible to
- evaluate the log density of the model at this parameter value
+ evaluate the log density of the model at this parameter value
+
+ Reporting both the model and approximate log density allows for importance
+ sampling diagnostics to be applied to the model.

For instance, the new output might look like:
@@ -106,15 +109,24 @@ log_p__, log_g__, rejected__, b.1,b.2
-3, -4, 1, 0, 0
```

+ There are two additional diagnostics for the `laplace` approximation. The
+ `log_p__` and `log_g__` outputs can be used to compute the Pareto-k
+ diagnostic from "Pareto Smoothed Importance Sampling" (Vehtari, 2015), as
+ sketched below.
+
+ There could also be a diagnostic printout for how many times it was not
+ possible to evaluate the log density of the model at a point drawn from the
+ approximate posterior (this would indicate the approximation is likely
+ questionable).
+
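As a sketch of how that Pareto-k check could be run on the proposed output
(the file name `output.csv`, and the use of pandas and `arviz.psislw` for the
PSIS computation, are assumptions for illustration, not part of this proposal):

```python
import pandas as pd
import arviz as az

# read the laplace output; '#' comment lines are CmdStan-style header/footer lines
out = pd.read_csv("output.csv", comment="#")

# drop draws where the model log density could not be evaluated
ok = out[out["rejected__"] == 0]

# log importance ratios of the model density over the approximate density
log_ratios = ok["log_p__"].to_numpy() - ok["log_g__"].to_numpy()

# Pareto smoothed importance sampling: smoothed log weights plus the k-hat diagnostic
smoothed_lw, k_hat = az.psislw(log_ratios)
print("Pareto k-hat:", float(k_hat))  # values above ~0.7 suggest the approximation is unreliable
```
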
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The implementation of this would borrow heavily from the optimization code. The
difference would be that the Jacobian would be turned on for the optimization.

- We will also need to implement a way of computing Hessians with finite
- differences of gradients. Simple finite differences were not sufficiently
- accurate for an example I was working on.
+ The Hessians should be computed using finite differences of reverse mode autodiff.
+ Higher order autodiff isn't possible for general Stan models, but finite differences
+ of reverse mode autodiff should be more stable than second order finite differences.
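
A minimal sketch of that scheme, using central differences of a gradient
function (here `grad_f` stands in for the reverse mode autodiff gradient, and
the step size is an arbitrary illustrative choice):

```python
import numpy as np

def fd_hessian(grad_f, u, eps=1e-5):
    """Approximate the Hessian at u with central finite differences of the gradient."""
    n = u.shape[0]
    H = np.empty((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        H[:, i] = (grad_f(u + step) - grad_f(u - step)) / (2.0 * eps)
    return 0.5 * (H + H.T)  # symmetrize; finite differencing breaks exact symmetry

# example: log_p(u) = -0.5 * u.u has gradient -u and Hessian -I
H = fd_hessian(lambda u: -u, np.array([0.3, -1.2]))
```
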
# Drawbacks
[drawbacks]: #drawbacks
@@ -137,24 +149,9 @@ lengths of their outputs (they may compute incorrect standard errors, etc).
print zeros where usually the parameters would be.

If the user does not check the `rejected__` column, then they will be using a
- bunch of zeros in their follow-up calculations that could mislead them.
-
- Similarly to divergent transitions, we can print to the output information about
- how many draws were rejected in any case.
-
- In the earlier part of the design document I assume #3 was implemented, but I
- think #2 might be better.
-
- # Rationale and alternatives
- [rationale-and-alternatives]: #rationale-and-alternatives
-
- An alternative that seems appealing at first is printing out the mode and
- hessian so that it would be possible for people to make their own posterior
- draws. This is not reasonable because the conversion from unconstrained to
- constrained space is not exposed in all of the interfaces.
+ bunch of zeros in their follow-up calculations that could be misleading.

# Prior art
[prior-art]: #prior-art

- This is a pretty common Bayesian approximation. It just is not implemented in
- Stan.
+ This is a common Bayesian approximation. It just is not implemented in Stan.