Denoising diffusion probabilistic models (DDPMs) {cite:p}sohl2015deep,ho2020denoising are the deep generative models underlying image generation tools like DALL-E 2 (from Open AI) and Stable Diffusion (from Stability AI). This lecture will unpack how they work. These notes are partly inspired by {cite:t}turner2024denoising.
Diffusion models work by
- Using a fixed (i.e., not learned), user-defined noising process to convert data into noise.
- Learning the inverse this process so that starting from noise, we can generate samples that approximate the data distribution.
We can think of the DDPM as a giant latent variable model, where the latent variables are noisy versions of the data. As with other latent variable models (e.g., VAEs), once we've inferred the latent variables, the problem of learning the mapping from latents to observed data reduces to a supervised regression problem.
DDPMs were originally proposed for modeling continuous data,
Let
For continuous data, the standard noising process is a first-order Gaussian autoregressive (AR) process,
\begin{align*}
q(x_t \mid x_{t-1}) &= \mathrm{N}(x_t \mid \lambda_t x_{t-1}, \sigma_{t}^2).
\end{align*}
The hyperparameters
Since the noising process has linear Gaussian dynamics, we can compute conditional distributions in closed form.
::::{admonition} Computing the conditional distributions
Show that the conditional distributions are,
\begin{align*}
q(x_t \mid x_0)
&= \mathrm{N}\left( x_t \mid \lambda_{t|0} x_0, \sigma_{t|0}^2 \right)
\end{align*}
where the parameters are defined recursively,
\begin{align*}
\lambda_{t|0} &= \lambda_t \lambda_{t-1|0} = \prod_{s=1}^t \lambda_s \
\sigma_{t|0}^2 &= \lambda_t^2 \sigma_{t-1|0}^2 + \sigma_t^2
\end{align*}
with base case
:::{admonition} Solution :class: dropdown, tip
Assume the equality above holds for
The first latent is distributed as
:::
::::
It is common to set,
\begin{align*}
\sigma_t^2 &= 1-\lambda_t^2,
\end{align*}
in which case the conditional variance simplifies to,
\begin{align*}
\sigma_{t|0}^2 = 1 - \prod_{s=1}^t \lambda_s^2 = 1 - \lambda_{t|0}^2.
\end{align*}
Under this setting, the noising process preserves the variance of the marginal distributions. If
Consider the following two limits:
-
As
$T \to \infty$ , the conditional distribution goes to a standard normal,$q(x_T \mid x_0) \to \mathrm{N}(0, 1)$ , which makes the marginal distribution$q(x_T)$ easy to sample from. -
When
$\lambda_t \to 1$ , the noising process adds infinitesimal noise so that$x_t \approx x_{t-1}$ , which makes the inverse process easier to learn.
Of course, these two limits are in conflict with one another. If we add a small amount of noise at each time step, the inverse process is easier to learn, but we need to take many time steps to converge to a Gaussian stationary distribution.
The generative process is a parameteric model that learns to invert the noise process,
\begin{align*}
p(x_{0:T}; \theta) &= p(x_T) \prod_{t=T-1}^0 p(x_t \mid x_{t+1}; \theta).
\end{align*}
The initial distribution
Like the other latent variable models we studied in this course, we will estimate the parameters by maximizing an evidence lower bound (ELBO),
\begin{align*}
\cL(\theta)
&= \E_{q(x_0)} \E_{q(x_{1:T} \mid x_0)} \left[ \log p(x_{0:T}; \theta) - \log q(x_{1:T} \mid x_0) \right] \
&= \E_{q(x_0)} \E_{q(x_{1:T} \mid x_0)} \left[ \log p(x_{0:T}; \theta) \right] + c,
\end{align*}
where
We can simplify further by expanding the log probability of the generative model, \begin{align*} \cL(\theta) &= \E_{q(x_0)} \sum_{t=0}^{T-1} \E_{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] \ &\propto \E_{q(x_0)} \mathbb{E}{t \sim \mathrm{Unif}(0,T-1)} \E{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] \end{align*} which only depends on pairwise conditionals.
Since the noising process above adds a small amount of Gaussian noise at each step, it is reasonable to model the generative process as Gaussian as well,
\begin{align*}
p(x_t \mid x_{t+1}; \theta)
&=
\mathrm{N}(x_t \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}t^2)
\end{align*}
where $\mu\theta: \reals \times [0,T] \mapsto \reals$ is a nonlinear mean function that should denoise
:::{admonition} Parameter sharing
Rather than learn a separate function for each time point, it is common to parameterize the mean function as a function of both the state
:::{admonition} Generative process variance
You could try to learn the generative process variance as a function of
-
$\widetilde{\sigma}_t^2 = \sigma_t^2 = 1-\lambda_t^2$ , the conditional variance in the noising process, which tends to overestimate the conditional variance of the true generative process; or - $\widetilde{\sigma}t^2 = \Var_q[x_t \mid x_0, x{t+1}]$, the conditional variance of the noising process given the data
$x_0$ and the next state$x_{t+1}$ . This tends to underestimate the conditional variance of the true generative process. :::
Under this Gaussian model for the generative process, we can analytically compute one of the expectations in the ELBO. This is called Rao-Blackwellization. It reduces the variance of the objective, which is good for SGD!
Using the chain rule and the Gaussian generative model, we have, \begin{align*} \E_{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] &= \E_{q(x_{t+1} \mid x_0)} \E_{q(x_t \mid x_{t+1}, x_0)} \left[\log \mathrm{N}(x_t \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}_t^2) \right] \end{align*}
We already computed the conditional distribution
::::{admonition} Conditionals of a Gaussian noising process
Show that
\begin{align*}
q(x_t \mid x_{t+1}, x_0)
&= \mathrm{N}(x_t \mid \mu_{t|t+1,0}, \sigma_{t|t+1,0}^2)
\end{align*}
where
\begin{align*}
\mu_{t|t+1,0} &= a_t x_0 + b_t x_{t+1}
\end{align*}
is a linear combination of
&= \left(\frac{1}{\sigma_{t|0}^2} + \frac{\lambda_{t+1}^2}{\sigma_{t+1}^2} \right)^{-1}
\end{align*}
:::{admonition} Solution
:class: dropdown, tip
By Bayes rule and the Markovian nature of the noising process,
\begin{align*}
q(x_t \mid x_{t+1}, x_0)
&\propto q(x_t \mid x_0) , q(x_{t+1} \mid x_t) \
&= \mathrm{N}(x_t \mid \lambda_{t|0} x_0, \sigma_{t|0}^2) , \mathrm{N}(x_{t+1} \mid \lambda_{t+1} x_t, \sigma_{t+1}^2) \
&= \mathrm{N}(x_t \mid \mu_{t|t+1,0}, \sigma_{t|t+1,0}^2)
\end{align*}
where, by completing the square,
\begin{align*}
\sigma_{t|t+1,0}^2 &= \left(\frac{1}{\sigma_{t|0}^2} + \frac{\lambda_{t+1}^2}{\sigma_{t+1}^2} \right)^{-1} \
\mu_{t|t+1,0} &= \sigma_{t|t+1,0}^2 \left(\frac{\lambda_{t|0} x_0}{\sigma_{t|0}^2} + \frac{\lambda_{t+1} x_{t+1}}{\sigma_{t+1}^2} \right).
\end{align*}
The forms for
::::
Finally, to simplify the objective we need the Gaussian cross-entropy,
::::{admonition} Gaussian cross-entropy
Let
Putting it all together,
\begin{align*}
\cL(\theta)
&= \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \E_{q(x_t | x_0, x_{t+1})} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] \
&= \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \log \mathrm{N}(a_t x_0 + b_t x_{t+1} \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}t^2) -\frac{1}{2} \frac{\sigma{t|t+1,0}^2}{\widetilde{\sigma}t^2} \right] \
&= \frac{1}{2} \E{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \frac{1}{\widetilde{\sigma}t^2} \left( a_t x_0 + b_t x{t+1} - \mu_\theta(x_{t+1}, t) \right)^2 \right] + c
\end{align*}
where we have absorbed terms that are independent of
The loss function above suggests a particular form of the mean function,
\begin{align*}
\mu_\theta(x_{t+1}, t) &= a_t \hat{x}0(x{t+1}, t; \theta) + b_t x_{t+1},
\end{align*}
where the only part that is learned is $\hat{x}0(x{t+1}, t; \theta)$, a function that attempts to denoise the current state. Since
Under this parameterization, the loss function reduces to, \begin{align*} \cL(\theta) &= \frac{1}{2} \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \frac{a_t^2}{\widetilde{\sigma}_t^2} \left(x_0 - \hat{x}0(x{t+1}, t; \theta) \right)^2 \right] + c \end{align*}
One nice thing about this formulation is that the mean function is always outputting "the same thing" — an estimate of the completely denoised data,
The generative process attempts to invert the noising process, but what is the actual inverse of the process? Since the noising process is a Markov chain, the reverse of the noising process must be Markovian as well. That is,
\begin{align*}
q(x_{0:T}) &= q(x_T) \prod_{t=T-1}^0 q(x_t \mid x_{t+1})
\end{align*}
for some sequence of transition distributions
Finally, we can give a simpler form for the optimal generative process,
\begin{align*}
q(x_t \mid x_{t+1})
&= \sum_{i=1}^n w_i(x_{t+1}) , q(x_t \mid x_0^{(i)}, x_{t+1}) \
&= \sum_{i=1}^n w_i(x_{t+1}) , \mathrm{N}(x_t \mid a_t x_0^{(i)} + b_t x_{t+1}, \sigma_{t|t+1,0}^2),
\end{align*}
which we recognize as a mixture of Gaussians, all with the same variance, with means biased toward each of the
For small step sizes, that mixture of Gaussians can be approximated by a single Gaussian with mean equal to the expected value of the mixture, \begin{align*} \E[x_t \mid x_{t+1}] &= \sum_{i=1}^n w_i(x_{t+1}) , \left(a_t x_0^{(i)} + b_t x_{t+1} \right) \end{align*} For small steps, this expected value is approximately, \begin{align*} \E[x_t \mid x_{t+1}] &\approx \frac{x_{t+1}}{\lambda_{t+1}} + \sigma_{t+1}^2 \sum_{i=1}^n w_i(x_{t+1}) \left(\frac{ \lambda_{t|0} x_0^{(i)} - x_{t+1}}{\sigma_{t|0}^2} \right) \end{align*}
:::{admonition} Derivation
:class: dropdown
Using the definitions from above,
\begin{align*}
a_t &= \frac{\sigma_{t|t+1,0}^2 , \lambda_{t|0} }{\sigma_{t|0}^2} \
b_t &= \frac{\sigma_{t|t+1,0}^2 , \lambda_{t+1}}{\sigma_{t+1}^2}
\end{align*}
we can write
\begin{align*}
\E[x_t \mid x_{t+1}]
&= \frac{\sigma_{t|t+1,0}^2 \lambda_{t+1}}{\sigma_{t+1}^2} x_{t+1} + \sum_{i=1}^n w_i(x_{t+1}) \frac{\sigma_{t|t+1,0}^2 \lambda_{t|0}}{\sigma_{t|0}^2} x_0^{(i)}
\end{align*}
Now add and subtract
Though it's not immediately obvious, the second term in the expectation is related to the marginal probability, \begin{align*} q(x_t) &= \frac{1}{n} \sum_{i=1}^n q(x_t \mid x_0^{(i)}) \ &= \frac{1}{n} \sum_{i=1}^n \mathrm{N}(x_t \mid \lambda_{t|0} x_0^{(i)}, \sigma_{t|0}^2) \end{align*} Specifically, the second term is the Stein score function of the marginal probability, \begin{align*} \nabla_{x} \log q_t(x_{t+1}) &= \frac{\nabla_{x} q_t(x_{t+1})}{q_t(x_{t+1})} \ &=\frac{\frac{1}{n} \sum_{i=1}^n \mathrm{N}(x_{t+1} \mid \lambda_{t|0} x_0^{(i)}, \sigma_{t|0}^2) \left(-\frac{(x_{t+1} - \lambda_{t|0}x_0^{(i)})}{\sigma_{t|0}^2} \right)}{\frac{1}{n} \sum_{j=1}^n \mathrm{N}(x_{t+1} \mid \lambda_{t|0} x_0^{(j)}, \sigma_{t|0}^2) } \ &= \sum_{i=1}^n w_i(x_{t+1}) \left(\frac{\lambda_{t|0}x_0^{(i)} - x_{t+1}}{\sigma_{t|0}^2} \right) \end{align*}
:::{admonition} Final form :class: tip Putting it all together, for small steps, the reverse process is approximately Gaussian with mean and variance, \begin{align*} \E[x_t \mid x_{t+1}] &\approx \frac{x_{t+1}}{\lambda_{t+1}} + \sigma_{t+1}^2 \nabla_x \log q_t(x_{t+1}) \ \Var[x_t \mid x_{t+1}] &\approx \sigma_{t+1}^2. \end{align*} This has a nice interpretation: to invert the noise process, first undo the contraction and then take a step in the direction of the Stein score! :::
In practice, the best performing diffusion models are based on a continuous-time formulation of the noising process as an SDE {cite:p}song2020score. To motivate this approach, think of the noise process above as a discretization of a continuous process
The reverse (generative) process can be cast as an SDE as well! Following our derivation of the inverse process above, we can show that the reverse process is,
\begin{align*}
\dif X = \left[f(x, t) - g(t)^2 \nabla_x \log q_t(x)\right] \dif t + g(t) \dif W
\end{align*}
where
Very few things need to change in order to apply this idea to multidimensional data
There's a lot we didn't cover. The Stein score function that appeared in the inverse of the noising process allows for connections between denoising score matching {cite:p}song2019generative and denoising diffusion models.
Another important topic is conditional generation. Suppose we want to take in text and spit out images, like DALL-E 2 or Stable Diffusion. One way to do so is using a diffusion model, but to steer the reverse diffusion based on the text prompt. So rather than just following the score function, the reverse process is also biased toward outputs that match the prompt.
Finally, this class was nominally about models for discrete data, but this lecture has focused on continuous diffusions. There has been recent work on discrete denoising diffusion models {cite:p}campbell2022continuous, which we'll have to cover another time!