Let,
-
$N$ denote the number of data points -
$K$ denote the number of mixture components (i.e., clusters) -
$\mbx_n \in \reals^D$ denote the$n$ -th data point -
$z_n \in {1, \ldots, K}$ be a latent variable denoting the cluster assignment of the$n$ -th data point -
$\mbtheta_k$ be natural parameters of cluster$k$ . -
$\mbpi \in \Delta_{K-1}$ be cluster proportions (probabilities).
-
$\mbalpha \in \reals_+^{K}$ be the concentration of the prior on$\mbpi$ .
The generative model is as follows
Sample the proportions from a Dirichlet prior: \begin{align*} \mbpi &\sim \mathrm{Dir}(\mbalpha) \end{align*}
Sample the parameters for each component: \begin{align*} \mbtheta_k &\iid{\sim} p(\mbtheta) \quad \text{for } k = 1, \ldots, K \end{align*}
Sample the assignment of each data point: \begin{align*} z_n &\iid{\sim} \mbpi \quad \text{for } n = 1, \ldots, N \end{align*}
Sample data points given their assignments: \begin{align*} \mbx_n &\sim p(\mbx \mid \mbtheta_{z_n}) \quad \text{for } n = 1, \ldots, N \end{align*}
The joint distribution is, \begin{align*} p(\mbpi, {\mbtheta_k}{k=1}^K, {(z_n, \mbx_n)}{n=1}^N \mid \mbalpha) &\propto p(\mbpi \mid \mbalpha) \prod_{k=1}^K p(\mbtheta_k) \prod_{n=1}^N \prod_{k=1}^K \left[ \pi_k , p(\mbx_n \mid \mbtheta_k) \right]^{\bbI[z_n = k]} \end{align*}
Assume an exponential family likelihood of the form, \begin{align*} p(\mbx \mid \mbtheta_k) &= h(\mbx_n) \exp \left {\langle t(\mbx_n), \mbtheta_k \rangle - A(\mbtheta_k) \right } \end{align*} And a conjugate prior of the form, \begin{align*} p(\mbtheta_k \mid \mbphi, \nu) &\propto \exp \left { \langle \mbphi, \mbtheta_k \rangle - \nu A(\mbtheta_k) \right } \end{align*}
:::{admonition} Example: Gaussian Mixture Model (GMM)
Assume the conditional distribution of
The conjugate prior is a Gaussian prior on the mean: \begin{align*} p(\mbtheta_k) &= \mathrm{N}(\mbmu_0, \sigma_0^{2} \mbI) \nonumber \ &\propto \exp \left{-\tfrac{1}{2 \sigma_0^2} (\mbtheta_k - \mbmu_0)^\top (\mbtheta_k - \mbmu_0) \right} \end{align*}
In exponential family form, the prior precision is
:::
Let's stick with the Gaussian mixture model for now. Suppose we observe data points ${\mbx_n}{n=1}^N$ and want to infer the assignments ${z_n}{n=1}^N$ and means
We could obtain point estimates
:::{prf:algorithm} MAP Estimation for a GMM Repeat until convergence:
-
For each
$n=1,\ldots, N$ , fix the means$\mbtheta$ and set, \begin{align*} z_n &= \arg \min_{k \in {1,\ldots, K}} |\mbx_n - \mbtheta_k|_2 \end{align*} -
For each
$k=1,\ldots,K$ , fix all assignments$\mbz$ and set, \begin{align*} \mbtheta_k &= \frac{1}{N_k} \sum_{n=1}^K \bbI[z_n=k] \mbx_n \end{align*} :::
It turns out this algorithm goes by a more common name — you might recognize it as the k-means algorithm!
K-Means made hard assignments of data points to clusters in each iteration. That sounds a little extreme — do you really want to attribute a datapoint to a single class when it is right in the middle of two clusters? What if we used soft assignments instead?
:::{prf:algorithm} EM for a GMM Repeat until convergence:
-
For each data point
$n$ and component$k$ , compute the responsibility: \begin{align*} \omega_{nk} = \frac{\pi_k \mathrm{N}(\mbx_n \mid \mbtheta_k, \mbI)}{\sum_{j=1}^K \pi_j \mathrm{N}(\mbx_n \mid \mbtheta_j, \mbI)} \end{align*} -
For each component
$k$ , update the mean: \begin{align*} \mbtheta_k^\star &= \frac{1}{N_k} \sum_{n=1}^K \omega_{nk} \mbx_n \end{align*} :::
This is the Expectation-Maximization (EM) algorithm. As we will show, EM yields an estimate that maximizes the marginal likelihood of the data.
Rather than maximizing the joint probability, EM is maximizing the marginal probability,
\begin{align*}
\log p({\mbx_n}{n=1}^N, \mbtheta)
&= \log p(\mbtheta) + \log \sum{z_1=1}^K \cdots \sum_{z_N=1}^K p({\mbx_n, z_n}{n=1}^N \mid \mbtheta) \
&= \log p(\mbtheta) + \log \prod{n=1}^N \sum_{z_n=1}^K p(\mbx_n, z_n \mid \mbtheta) \
&= \log p(\mbtheta) + \sum_{n=1}^N \log \sum_{z_n=1}^K p(\mbx_n, z_n \mid \mbtheta)
\end{align*}
For discrete mixtures (with small enough
The key idea is to obtain a lower bound on the marginal probability,
\begin{align*}
\log p({\mbx_n}{n=1}^N, \mbtheta)
&= \log p(\mbtheta) + \sum{n=1}^N \log \sum_{z_n} p(\mbx_n, z_n \mid \mbtheta) \
&= \log p(\mbtheta) + \sum_{n=1}^N \log \sum_{z_n} q(z_n) \frac{p(\mbx_n, z_n \mid \mbtheta)}{q(z_n)} \
&= \log p(\mbtheta) + \sum_{n=1}^N \log \E_{q(z_n)} \left[\frac{p(\mbx_n, z_n \mid \mbtheta)}{q(z_n)} \right]
\end{align*}
where
:::{admonition} Jensen's Inequality
:class: tip
Jensen's inequality states that,
\begin{align*}
f(\E[Y]) \geq \E\left[ f(Y) \right]
\end{align*}
if
Applied to the log marginal probability, Jensen's inequality yields,
\begin{align*}
\log p({\mbx_n}{n=1}^N, \mbtheta)
&= \log p(\mbtheta) + \sum{n=1}^N \log \E_{q_n} \left[\frac{p(\mbx_n, z_n \mid \mbtheta)}{q_n(z_n)} \right] \
&\geq \log p(\mbtheta) + \sum_{n=1}^N \E_{q_n} \left[\log p(\mbx_n, z_n \mid \mbtheta) - \log q_n(z_n) \right] \
&\triangleq \cL[\mbtheta, \mbq]
\end{align*}
where
This is called the evidence lower bound, or ELBO for short. It is a function of
Suppose we fix
Now, recall our basic model,
Zooming in on just
As a function of
We also have two constraints:
Taking the partial derivative wrt
Note that \begin{align*} \omega_{nk}^\star &\propto p(z_n=k) , p(\mbx_n \mid z_n=k, \mbtheta) = p(z_n = k \mid \mbx_n, \mbtheta) . \end{align*} That is, the responsibilities equal the posterior probabilities!
Equivalently,
:::{admonition} EM as a minorize-maximize (MM) algorithm :class: tip Note that the The ELBO is tight after the E-step!.
We can view the EM algorihtm as a minorize-maximize (MM) algorithm where we iteratively lower bound the ELBO and and then maximize the lower bound. :::
Now let's consider the general Bayesian mixture with exponential family likelihoods and conjugate priors. As a function of
Zooming in on just
Recall that
In our first pass, we assumed
However, note that we can write the ELBO in a slightly different form,
\begin{align*}
\cL[\mbtheta, \mbq]
&= \log p(\mbtheta) + \sum_{n=1}^N \E_{q_n} \left[\log p(\mbx_n, z_n \mid \mbtheta) - \log q_n(z_n) \right] \
&= \log p(\mbtheta) + \sum_{n=1}^N \E_{q_n} \left[\log p(z_n \mid \mbx_n, \mbtheta) + \log p(\mbx_n \mid \mbtheta) - \log q_n(z_n) \right] \
&= \log p(\mbtheta) + \sum_{n=1}^N \left[\log p(\mbx_n \mid \mbtheta) -\KL{q_n(z_n)}{p(z_n \mid \mbx_n, \mbtheta)} \right] \
&= \log p({\mbx_n}{n=1}^N, \mbtheta) - \sum{n=1}^N \KL{q_n(z_n)}{p(z_n \mid \mbx_n, \mbtheta)}
\end{align*}
where
Recall, the KL divergence is defined as, \begin{align*} \KL{q(z)}{p(z)} &= \int q(z) \log \frac{q(z)}{p(z)} \dif z. \end{align*}
It gives a notion of how similar two distributions are, but it is not a metric! (It is not symmetric.) Still, it has some intuitive properties:
- It is non-negative,
$\KL{q(z)}{p(z)} \geq 0$ . - It equals zero iff the distributions are the same,
$\KL{q(z)}{p(z)} = 0 \iff q(z) = p(z)$ almost everywhere.
Maximizing the ELBO wrt
Mixture models are basic building blocks of statistics, and our first encounter with discrete latent variable models (LVMs). (Where have we seen continuous LVMs so far?) Mixture models have widespread uses in both density estimation (e.g., kernel density estimators) and data science (e.g., clustering). Next, we'll talk about how to extend mixture models to cases where the cluster assignments are correlated in time.