So far we've focused on classical inference techniques: asymptotically normal approximations, Wald confidence intervals, etc.
It is tempting to interpret the confidence interval as saying that
To make such a claim, we need to adopt a Bayesian perspective and reason about the posterior distribution of the parameters,
-
$p(\theta \mid x)$ is the posterior, -
$p(x \mid \theta)$ is the likelihood, -
$p(\theta)$ is the prior, and -
$p(x) = \int p(x \mid \theta) , p(\theta) \dif \theta$ is the marginal likelihood.
Once we have the posterior distribution, then we're really in business. Often, we are particularly interested in posterior expectations, like:
-
$\E_{p(\theta | x)}[\theta]$ , the posterior mean, -
$\E_{p(\theta | x)}[\bbI[\theta \in \cA]]$ , the probability of the parameters being in set$\cA$ , -
$\E_{p(\theta | x)}[p(x' \mid \theta)]$ , the posterior predictive density of new data$x'$ .
All of these can be written as
For point estimation, we may choose the mode,
We can also obtain an analogue of frequentist confidence intervals by summarizing the posterior in terms of a Bayesian credible interval: a set of parameters that captures
The posterior distribution depends on the choice of prior. Indeed, the subjective choice of prior distributions is the source of much of the criticism of Bayesian approaches. In cases where we truly know nothing about the parameter a priori, we can often specify "weak" or "uninformative" prior distributions. Under such assumptions, we'll find that Bayesian and frequentist approaches can yield similar estimates, with the advantage that the Bayesian credible interval admits the intuitive interpretation as a set where
When it comes to choosing a prior distribution, one desiderata is computational tractability. The hard part of Bayesian inference is typically integration: to normalize the posterior we need to compute the marginal likelihood, which is an integral over the parameter space; to compute posterior expectations, we need to do the same. Conjugate priors are distributions on
:::{admonition} Example: Bernoulli Likelihood with a Beta Prior :class: tip
The beta distribution is a conjugate prior for a Bernoulli likelihood,
\begin{align*}
\theta &\sim \mathrm{Beta}(\alpha, \beta)
\end{align*}
with support on
Under the beta prior, the posterior distribution over
The posterior mode — i.e., the maximum a posteriori (MAP) estimate — is \begin{align*} \hat{\theta}{\mathsf{MAP}} &= \frac{x + \alpha - 1}{n + \alpha + \beta - 2}, \end{align*} and under an uninformative prior with $\alpha = \beta = 1$, it is equivalent to the MLE, $\hat{\theta}{\mathsf{MLE}} = x / n$.
Bayesian credible intervals can be derived using the cumulative distribution function (cdf) of the beta distribution, which is given by the incomplete beta function.
In the large sample limit, the beta posterior is approximately Gaussian.
The variance of the posterior beta distribution is,
\begin{align*}
\Var[\theta \mid X]
&= \frac{(x + \alpha)(n - x + \beta)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)}
\end{align*}
In this limit,
Consider a general exponential family likelihood with natural parameter
Exponential family distributions have conjugate priors, \begin{align*} p(\theta; \chi, \nu) &\propto \exp \left { \langle \chi, \theta \rangle - \nu A(\theta) \right } \ &= \exp \left{ \langle \chi, \theta \rangle + \langle \nu, -A(\theta) \rangle - B(\chi, \nu) \right}. \end{align*} We recognize the conjugate prior as another exponential family distribution in which,
- the natural parameter
$\chi$ are pseudo-observations of the sufficient statistics (like statistics from fake data points), - the natural parameter
$\nu$ is a pseudo-count (like the number of fake data points), - the prior sufficient statistics are
$(\theta, -A(\theta))$ , - the prior log normalizer is
$B(\chi, \nu)$ , and
With a conjugate prior, the posterior distribution belongs to the same family as the prior,
\begin{align*}
p(\theta \mid {x_i}{i=1}^n; \chi, \nu)
&\propto p(\theta; \chi, \nu) \prod{i=1}^n p(x_i \mid \theta) \
&\propto \exp \left{ \chi + \sum_{i=1}^n t(x_i), \theta \rangle + \langle \nu + n, -A(\theta) \rangle \right} \
&= p(\theta \mid \chi', \nu')
\end{align*}
where
\begin{align*}
\chi' &= \chi + \sum_{i=1}^n t(x_i) \
\nu' &= \nu + n.
\end{align*}
The posterior is a function of two quantities of fixed dimension,
:::{admonition} Questions
- Does each exponential family likelihood have a unique conjugate prior?
- With a conjugate prior, the posterior is just a function of
$\chi'$ and$\nu'$ . Does that make it computationally tractable? - Do conjugate priors exist for likelihoods that are not exponential families? :::
Conjugate priors are a common choice for simple exponential family models, but we need more general approaches for more complex models.
Suppose you wanted to perform Bayesian inference of the weights in a logistic regression model, \begin{align*} p(y \mid x, \mbbeta) &= \prod_{i=1}^n \mathrm{Bern}(y_i \mid \sigma(x_i^\top \mbbeta)). \end{align*} Assume a Gaussian prior, \begin{align*} \mbbeta &\sim \mathrm{N}(\mbzero, \gamma^{-1} \mbI). \end{align*} Unfortunately, the posterior does not have a closed formation solution. Instead, a common form of approximate posterior inference is the Laplace approximation, \begin{align*} p(\mbbeta \mid x, y) &\approx \mathrm{N}(\hat{\mbbeta}{\mathsf{MAP}}, \widehat{\mbSigma}) \end{align*} where \begin{align*} \hat{\mbbeta}{\mathsf{MAP}} &= \arg \max_{\mbbeta} \cL(\mbbeta) \end{align*} is the maximum a posteriori (MAP) estimate, \begin{align*} \widehat{\mbSigma} &= -[\nabla^2 \cL(\hat{\mbbeta}{\mathsf{MAP}})]^{-1} = \cI(\hat{\mbbeta}{\mathsf{MAP}})^{-1} \end{align*} is an approximation of the posterior covariance, and \begin{align*} \cL(\mbbeta) &= \log p(\mbbeta) + \sum_{i=1}^n \log p(y_i \mid x_i, \mbbeta) \ &= \log \mathrm{N}(\mbbeta; \mbzero, \gamma^{-1} \mbI) + \sum_{i=1}^n \log \mathrm{Bern}(y_i \mid \sigma(x_i^\top \mbbeta)) \end{align*} is the log joint probability, not the loss function from previous chapters!
:::{admonition} Question How do posterior credible intervals under the Laplace approximation compare to Wald confidence intervals of the MLE under L2 regularization? :::
In the large data limit (as
Consider a simpler setting in which we have data
Under some conditions (e.g.
Likewise,
\begin{align*}
p(\theta \mid {x_i}_{i=1}^n) \to \mathrm{N} \big(\theta \mid \theta^\star, \tfrac{1}{n} \cI(\theta^\star)^{-1} \big)
\end{align*}
where
Generally, we can't analytically compute posterior expectations. In these cases, we need to resort to approximations. For example, we could use quadrature methods like Simpson's rule or the trapezoid rule to numerically approximate the integral over
Roughly,
\begin{align*}
\E_{p(\theta | x)}[f(\theta)] \approx \sum_{m=1}^M p(\theta_m \mid x) , f(\theta_m) , \Delta_m
\end{align*}
where
This works for low-dimensional problems (say, up to
Idea: approximate the expectation via sampling,
\begin{align*} \E_{p(\theta | x)}[f(\theta)] \approx \frac{1}{M} \sum_{m=1}^M f(\theta_m) \quad \text{where} \quad \theta_m \sim p(\theta \mid x). \end{align*}
Let
Clearly,
\begin{align*} \E[\hat{f}] = \frac{1}{M} \sum_{m=1}^M \E_{p(\theta | x)}[f(\theta)] = \E_{p(\theta | x)}[f(\theta)]. \end{align*}
Thus,
What about its variance? \begin{align*} \Var[\hat{f}] = \Var \left(\frac{1}{M} \sum_{m=1}^M f(\theta_m) \right) = \frac{1}{M^2} \left( \sum_{m=1}^M \Var[f(\theta)] + 2 \sum_{1 \leq m < m' \leq M} \mathrm{Cov} [f(\theta_m), f(\theta_{m'})] \right) \end{align*}
- If the samples are not only identically distributed but also uncorrelated, then
$\Var[\hat{f}] = \frac{1}{M} \Var[f(\theta)]$ . - In this case, the root mean squared error (RMSE) of the estimate is
$\sqrt{\Var[\hat{f}]} = O(M^{-\frac{1}{2}})$ . - Compare this to Simpson's rule, which for smooth 1D problems has an error rate of
$O(M^{-4})$ . That's roughly 8 times better than Monte Carlo! - However, for multidimensional problems, Simpson's rule is
$O(M^{-\frac{4}{D}})$ , whereas the error rate of Monte Carlo does not depend on the dimensionality!
So far so good: we'll just draw a lot of samples to drive down our Monte Carlo error. Here's the catch! How do you draw samples from the posterior
A Markov chain is a joint distribution of a sequence of variables,
- The distribution
$\pi_1(\theta_1)$ is called the initial distribution. - The distribution
$\pi(\theta_m \mid \theta_{m-1})$ is called the transition distribution. If the transition distribution is the same for each$m$ , the Markov chain is homogeneous.
Let
A distribution
How can we relate transition distributions and stationary distributions? A sufficient (but not necessary) condition for
To see that detailed balance is sufficient, integrate both sides to get,
\begin{align*}
\int \pi^\star(\theta') \pi(\theta \mid \theta') \dif \theta' = \int \pi^\star(\theta) \pi(\theta' \mid \theta) \dif \theta' = \pi^\star(\theta).
\end{align*}
Thus,
Detailed balance can be used to show that
The easiest way to prove ergodicity is to show that it is possible to reach any
:::{admonition} Note A more technical definition is that all pairs of sets communicate, in which case the chain is irreducible, and that each state is aperiodic. The definitions can be a bit overwhelming. :::
Finally, we come to our main objective: designing a Markov chain for which the posterior is the unique stationary distribution. That is, we want
Recall our constraint: we can only compute the joint probability (the numerator in Bayes' rule), not the marginal likelihood (the denominator). Fortunately, that still allows us to compute ratios of posterior densities! We have, \begin{align*} \frac{p(\theta \mid x)}{p(\theta' \mid x)} = \frac{p(\theta, x)}{p(x)} \frac{p(x)}{p(\theta', x)} = \frac{p(\theta, x)}{p(\theta', x)}. \end{align*} Now rearrange the detailed balance condition to relate ratios of transition probabilities to ratios of joint probabilities, \begin{align*} \frac{\pi(\theta \mid \theta')}{\pi(\theta' \mid \theta)} = \frac{\pi^\star(\theta)}{\pi^\star(\theta')} = \frac{p(\theta \mid x)}{p(\theta' \mid x)} = \frac{p(\theta, x)}{p(\theta', x)} \end{align*}
To construct such a transition distribution
- Sample a proposal
$\theta$ from a proposal distribution$q(\theta \mid \theta')$ , - Accept the proposal with acceptance probability
$a(\theta' \to \theta)$ . (Otherwise, set$\theta = \theta'$ .)
Thus, \begin{align*} \pi(\theta \mid \theta') = \begin{cases} q(\theta \mid \theta') , a(\theta' \to \theta) & \text{if } \theta' \neq \theta \ \int q(\theta'' \mid \theta') , (1 - a(\theta' \to \theta'')) \dif \theta'' & \text{if } \theta' = \theta \end{cases} \end{align*}
Detailed balance is trivially satisfied when
WLOG, assume $ A(\theta' \to \theta) \leq 1$. (If it's not, its inverse
We can succinctly capture both cases with, \begin{align*} a(\theta' \to \theta) = \min \left{1, , A(\theta' \to \theta) \right } = \min \left{1, , \frac{p(\theta, x) , q(\theta' \mid \theta)}{p(\theta', x) , q(\theta \mid \theta')} \right }. \end{align*}
Now consider the special case in which the proposal distribution is symmetric; i.e.
This is called the Metropolis algorithm and it has close connections to simulated annealing.
Gibbs is a special case of MH with proposals that always accept. Gibbs sampling updates one "coordinate" of
:label: gibbs_sampling
**Input:** Initial parameters $\theta^{(0)}$, observations $x$
- **For** $t=1,\ldots, T$
- **For** $d=1,\ldots, D$
- Sample $\theta_d^{(t)} \sim p(\theta_d \mid \theta_{1}^{(t)}, \ldots, \theta_{d-}^{(t)}, \theta_{d+1}^{(t-1)}, \ldots, \theta_D^{(t-1)}, x)$
**Return** samples $\{\theta^{(t)}\}_{t=1}^T$
You can think of Gibbs as cycling through
In other words, the proposal distribution
What is the probability of accepting this proposal?
\begin{align*}
a_d(\theta' \to \theta)
&= \min \left{ 1, , \frac{p(\theta, x) q_d(\theta' \mid \theta)}{p(\theta', x) q_d(\theta \mid \theta')} \right} \
&= \min \left{ 1, , \frac{p(\theta, x) p(\theta_d' \mid \theta_{\neg d}, x) \delta_{\theta_{\neg d}}(\theta'{\neg d})}{p(\theta', x) p(\theta_d \mid \theta'{\neg d}, x) \delta_{\theta'{\neg d}}(\theta{\neg d})} \right} \
&= \min \left{ 1, , \frac{p(\theta_{\neg d}, x) p(\theta_d \mid \theta_{\neg d}, x) p(\theta_d' \mid \theta_{\neg d}, x) \delta_{\theta_{\neg d}}(\theta'{\neg d})}{p(\theta'{\neg d}, x) p(\theta'd \mid \theta'{\neg d}, x) p(\theta_d \mid \theta'{\neg d}, x) \delta{\theta'{\neg d}}(\theta{\neg d})} \right} \
&= \min \left{1, 1 \right} = 1
\end{align*}
for all
:::{admonition} The Godfather :class: tip
The Gibbs proposal is an offer you cannot refuse. :::Of course, if we only update one coordinate, the chain can't be ergodic. However, if we cycle through coordinates it generally will be.
:::{admonition} Questions
- Does the order in which we update coordinates matter?
- If Gibbs sampling always accepts, is it strictly better than other Metropolis-Hastings algorithms? :::
This was obviously a whirlwind of an introduction to Bayesian inference! There's plenty more to be said about Bayesian statistics — choosing a prior, subjective vs objective vs empirical Bayesian approaches, the role of the marginal likelihood in Bayesian model comparison, varieties of MCMC, and other approaches to approximate Bayesian inference. We'll dig into some of these topics as the course goes on, but for now, we have some valuable tools for developing Bayesian modeling and inference with discrete data!