The paper "Structured Denoising Diffusion Models in Discrete State-Spaces" introduces Discrete Denoising Diffusion Probabilistic Models (D3PMs), a generative model for discrete data that builds upon the success of Denoising Diffusion Probabilistic Models (DDPMs) in continuous spaces.
Structured Denoising Diffusion Models in Discrete State-Spaces by Austin et al.: https://arxiv.org/pdf/2107.03006
Diffusion models typically operate on continuous data like image pixels. D3PMs adapt this machinery to data with a finite set of values, such as images with discrete color levels or text over a fixed alphabet. D3PMs achieve this by employing a series of diffusion steps. In each step, a transition matrix, which captures the probability of transitioning from one state (e.g., a specific pixel value) to another, is used to gradually corrupt the data with noise. The model then learns to reverse this process, effectively denoising the data and generating new samples from the underlying distribution. This framework allows for incorporating domain knowledge by designing structured transition matrices that reflect specific data characteristics.
Imagine a data point that can take one of $K$ discrete values. Its state is described by a categorical distribution: a length-$K$ vector of probabilities, one per possible category.

The diffusion process in D3PMs is governed by a sequence of transition matrices, one for each diffusion step $t$ (from $1$ to $T$):

$$q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\, p = x_{t-1} Q_t\big)$$

Here, $x_{t-1}$ is the one-hot row vector encoding the state at step $t-1$, and $Q_t$ is a $K \times K$ transition matrix whose entry $Q_t(x_i, x_j)$ gives the probability of moving from state $x_i$ to state $x_j$ at step $t$.

These transition matrices essentially define the "corruption process" at each step. By multiplying the current data distribution by $Q_t$, we obtain a slightly noisier distribution over the $K$ categories; chaining these multiplications across steps gradually destroys the signal in the data.
In essence, the categorical distribution captures the probabilities of the data points across different categories, while the transition matrices govern how these probabilities evolve with the noise injection during the diffusion process.
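To make this concrete, here is a minimal NumPy sketch of a single noising step (all names and numbers are illustrative, not from the paper): the categorical distribution evolves through one vector-matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# Categorical distribution over K states (e.g., five discrete pixel values).
p = np.array([0.7, 0.1, 0.1, 0.05, 0.05])

# An arbitrary row-stochastic transition matrix: row i is the distribution
# over next states given that the current state is i.
Q_t = rng.random((K, K))
Q_t /= Q_t.sum(axis=1, keepdims=True)

# One diffusion step: the distribution evolves via a vector-matrix product.
p_next = p @ Q_t
assert np.isclose(p_next.sum(), 1.0)  # still a valid categorical distribution
```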
The effectiveness of D3PMs heavily relies on the concept of locality in the transition matrices. This principle dictates that the corruption applied at each step stays localized to the current state and its immediate neighbors in the data space.
Here's how locality plays a role:
- Limited Dependencies: The transition probability $Q_t(x_i, x_j)$ should be significant only for $x_i$ and $x_j$ that are "close" to each other in the data space.
- Reduced Complexity: By focusing on local transitions, the model complexity and computational cost are reduced compared to matrices with global dependencies.
- Incorporating Domain Knowledge: Locality allows us to design transition matrices that reflect specific data characteristics.
Examples of Transition Matrices:
- Uniform Transition Matrices: In a uniform transition matrix, each state has an equal probability of transitioning to any other state. This means that no matter what the current state is, it is equally likely to move to any of the possible states in the next step.
- Absorbing State Matrices: In an absorbing state matrix, certain states are designated as absorbing states. Once the process transitions into an absorbing state, it remains there indefinitely. For example, if state $K$ is an absorbing state, once the process reaches state $K$, it will stay in state $K$ and not transition to any other state.
- Discretized Gaussian Matrices: In a discretized Gaussian transition matrix, the probability of transitioning from one state to another is based on a Gaussian distribution. States that are closer to each other have higher transition probabilities, while states that are further apart have lower transition probabilities. This creates a smooth transition pattern where the process is more likely to move to nearby states.
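The following NumPy sketch constructs small versions of all three matrix types; the noise rate `beta`, the bandwidth `sigma`, and the choice of the last index as the absorbing state are assumptions of this example rather than values from the paper:

```python
import numpy as np

def uniform_matrix(K, beta):
    # Keep the current state with probability 1 - beta; otherwise jump uniformly.
    return (1 - beta) * np.eye(K) + beta * np.ones((K, K)) / K

def absorbing_matrix(K, beta):
    # With probability beta, collapse into the absorbing state (last index here).
    Q = (1 - beta) * np.eye(K)
    Q[:, -1] += beta
    return Q

def gaussian_matrix(K, sigma):
    # Transition probability decays with the squared distance between states.
    idx = np.arange(K)
    Q = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma**2))
    return Q / Q.sum(axis=1, keepdims=True)  # normalize each row

# Every variant must be row-stochastic: each row is a valid distribution.
K = 8
for Q in (uniform_matrix(K, 0.1), absorbing_matrix(K, 0.1), gaussian_matrix(K, 1.5)):
    assert np.allclose(Q.sum(axis=1), 1.0)
```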
The forward process in D3PMs refers to the gradual corruption of clean data with noise over a series of discrete steps. This process relies on categorical distributions, transition matrices, and the crucial property of convergence to a stationary distribution. Here's a breakdown with mathematical notation:
We start with clean data $x_0$: a sample from the data distribution, represented as a one-hot vector over the $K$ possible categories.

The corruption unfolds over $T$ discrete steps:

- In each step $t$, a transition matrix $Q_t$ (size $K \times K$) is applied.
- This matrix encodes the probability of transitioning from any state $x_i$ to any other state $x_j$ during step $t$: $Q_t(x_i, x_j)$.
The core of the forward process is iteratively updating the data distribution based on the transition matrix:

$$q(x_t) = q(x_{t-1}) \, Q_t$$

The resulting $q(x_t)$ is again a categorical distribution over the $K$ states, slightly noisier than $q(x_{t-1})$.

This process is repeated for all steps, with each application of $Q_t$ injecting a little more noise. For the forward process to be useful, one requirement must hold:
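The same update can be simulated at the level of a single sample by drawing each transition from the corresponding row of $Q_t$. A small sketch, assuming an illustrative uniform-noise matrix reused across steps:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, beta = 5, 100, 0.05
Q = (1 - beta) * np.eye(K) + beta * np.ones((K, K)) / K  # uniform-noise matrix

x = 2                              # clean data point: category index 2
trajectory = [x]
for t in range(T):
    x = rng.choice(K, p=Q[x])      # sample x_t ~ Cat(row x_{t-1} of Q_t)
    trajectory.append(int(x))

print(trajectory)  # the sample drifts away from category 2 as noise accumulates
```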
- The rows of the cumulative product $M_t = Q_1 \cdot Q_2 \cdot \ldots \cdot Q_t$ (representing the overall transition across all steps up to $t$) must converge to a known stationary distribution as $t$ approaches infinity.
A stationary distribution, denoted by $\pi$, is one that the noising process leaves unchanged: $\pi \, Q_t = \pi$.

In simpler terms, after a sufficient number of diffusion steps the data distribution stops changing and settles into the steady state $\pi$, which we can sample from to initialize the backward process.
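We can verify this convergence requirement numerically. In the sketch below a uniform-noise matrix is assumed, whose stationary distribution is uniform; the rows of the cumulative product $M_t$ all collapse onto $\pi$:

```python
import numpy as np

K, beta, T = 5, 0.05, 200
Q = (1 - beta) * np.eye(K) + beta * np.ones((K, K)) / K  # same matrix each step

M = np.eye(K)
for t in range(T):
    M = M @ Q                      # cumulative product M_t = Q_1 · Q_2 · ... · Q_t

pi = np.full(K, 1.0 / K)           # stationary distribution of this matrix
assert np.allclose(M, pi, atol=1e-3)  # every row of M_T has collapsed onto pi
```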
So far we have shown how to distort the original data distribution $q(x_0)$ into (approximately) the stationary noise distribution. Generation requires the opposite direction: a learned reverse process $p_\theta(x_{t-1} \mid x_t)$ that removes the noise one step at a time.

Note that the reverse process is conditioned on the later time step. Here the Markov transition probabilities are parameterized by a neural network with weights $\theta$, and this network can also depend on a conditioning variable (for example, an image embedding, as in our experiments below).

Once the reverse process is learned, we can use it to generate samples from the data distribution: draw $x_T$ from the stationary distribution, then repeatedly sample $x_{t-1} \sim p_\theta(x_{t-1} \mid x_t)$ until we reach $x_0$.
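A minimal sketch of this sampling loop, assuming a hypothetical trained network `model(x_t, t)` that returns the logits of $p_\theta(x_{t-1} \mid x_t)$:

```python
import numpy as np

def sample_reverse(model, K, T, rng):
    """Ancestral sampling: start at the stationary distribution, denoise to x_0."""
    x = int(rng.integers(K))             # x_T ~ uniform (stationary for uniform noise)
    for t in range(T, 0, -1):
        logits = model(x, t)             # logits of p_theta(x_{t-1} | x_t)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        x = int(rng.choice(K, p=probs))  # sample x_{t-1}
    return x                             # x_0: a new sample from the data distribution

# Usage with a stand-in for a trained network (returns random logits):
rng = np.random.default_rng(0)
print(sample_reverse(lambda x, t: rng.normal(size=5), K=5, T=10, rng=rng))
```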
In this project, we explore the application of categorical diffusion processes to the Food101 dataset, focusing on the task of food classification by conditioning on image features. Although a simple convolutional neural network (CNN) could be employed to achieve effective image classification, our aim is to delve into the emerging domain of categorical diffusion. Our discrete space consists of 101 states, corresponding to the 101 categories of food in the dataset. By iteratively denoising random category indices and conditioning on the rich semantic embeddings provided by the images, we investigate the potential of diffusion models in handling categorical data. This approach not only broadens our understanding of diffusion processes in discrete spaces but also opens up new avenues for leveraging such models in various classification tasks.
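For illustration, here is a hypothetical PyTorch sketch of such a conditional denoiser; the layer sizes, embedding dimensions, and fusion scheme are our assumptions for this example, not prescribed by the paper:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 101  # the Food101 categories are the states of the discrete diffusion

class ConditionalDenoiser(nn.Module):
    """Predicts logits of p(x_{t-1} | x_t, image) over the 101 categories."""

    def __init__(self, embed_dim=512, hidden=256, num_steps=1000):
        super().__init__()
        self.class_embed = nn.Embedding(NUM_CLASSES, hidden)   # noisy category index
        self.time_embed = nn.Embedding(num_steps + 1, hidden)  # diffusion step
        self.image_proj = nn.Linear(embed_dim, hidden)         # image embedding
        self.head = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, NUM_CLASSES),
        )

    def forward(self, x_t, t, image_embedding):
        h = torch.cat([
            self.class_embed(x_t),
            self.time_embed(t),
            self.image_proj(image_embedding),
        ], dim=-1)
        return self.head(h)

# Shape check: a batch of 4 noisy labels, time steps, and 512-d image features.
model = ConditionalDenoiser()
logits = model(torch.randint(0, NUM_CLASSES, (4,)),
               torch.randint(0, 1001, (4,)),
               torch.randn(4, 512))
assert logits.shape == (4, NUM_CLASSES)
```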
| Ground Truth | Backward Predictions |
|---|---|
| Beef tartare | ![]() |
| Apple pie | ![]() |
| Breakfast burrito | ![]() |




