
Commit 8c2522a

minimal paper corrections; notation, added refs
1 parent 371df9c commit 8c2522a

2 files changed (+25, −15 lines)


paper/paper.bib

Lines changed: 10 additions & 0 deletions
@@ -384,3 +384,13 @@ @article{Hutchinson
 https://doi.org/10.1080/03610919008812866
 }
 }
+
+@misc{lipman2023flowmatchinggenerativemodeling,
+  title={Flow Matching for Generative Modeling},
+  author={Yaron Lipman and Ricky T. Q. Chen and Heli Ben-Hamu and Maximilian Nickel and Matt Le},
+  year={2023},
+  eprint={2210.02747},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2210.02747},
+}

paper/paper.md

Lines changed: 15 additions & 15 deletions
@@ -34,9 +34,9 @@ Diffusion models [@diffusion; @ddpm; @sde] have emerged as the dominant paradigm

# Statement of need

-Diffusion-based generative models [@diffusion; @ddpm] are a method for sampling from high-dimensional distributions. A sub-class of these models, score-based diffusion generatives models (SBGMs, [@sde]), permit exact-likelihood estimation via a change-of-variables associated with the forward diffusion process [@sde_ml]. Diffusion models allow fitting generative models to high-dimensional data in a more efficient way than normalising flows since only one neural network model parameterises the diffusion process as opposed to a sequence of neural networks in typical normalising flow architectures. Whilst existing diffusion models [@ddpm; @vdms] allow for sampling, they are limited to innaccurate variational inference approaches for density estimation which limits their use for Bayesian inference. This code provides density estimation with diffusion models using GPU enabled ODE solvers in `jax` [@jax] and `diffrax` [@kidger]. Similar codes (e.g. [@azula]) exist for diffusion models but they do not implement log-likelihood calculations, various network architectures and parallelised ODE-sampling.
+Diffusion-based generative models [@diffusion; @ddpm] are a method for sampling from high-dimensional distributions. A sub-class of these models, score-based diffusion generative models (SBGMs, [@sde]), permits exact-likelihood estimation via a change-of-variables associated with the forward diffusion process [@sde_ml]. Diffusion models allow fitting generative models to high-dimensional data more efficiently than normalising flows, since only one neural network parameterises the diffusion process, as opposed to the sequence of neural networks in typical normalising flow architectures. Whilst existing diffusion models [@ddpm; @vdms] allow for sampling, they are limited to inaccurate variational inference approaches for density estimation, which limits their use for Bayesian inference. This code provides density estimation with diffusion models using GPU-enabled ODE solvers in `jax` [@jax] and `diffrax` [@kidger]. Similar codes (e.g. [@azula]) exist for diffusion models, but they do not implement log-likelihood calculations, a variety of network architectures, or parallelised computations for optimisation and SDE/ODE sampling.

-The software we present, `sbgm`, is designed to be used by researchers in machine learning and the natural sciences for fitting diffusion models with custom architectures for their research. These models can be fit easily with multi-accelerator training and inference within the code. Typical use cases for these kinds of generative models are emulator approaches [@emulating], simulation-based inference [@sbi], field-level inference [@field_level_inference] and general inverse problems [@inverse_problem_medical; @Remy; @Feng2023; @Feng2024] (e.g. image inpainting [@sde] and denoising [@ambientdiffusion; @blinddiffusion]). This code allows for seemless integration of diffusion models to these applications by providing data-generating models with easy conditioning of the data on any modality. Furthermore, the implementation in `equinox` [@equinox] guarantees safe integration of `sbgm` with any other sampling libraries (e.g. BlackJAX @blackjax) or `jax` [@jax] based codes.
+The software we present, `sbgm`, is designed to be used by researchers in machine learning and the natural sciences for fitting diffusion models with custom architectures for their research. These models can be fit easily with the multi-accelerator training and inference routines within the code (with demonstration examples provided). Typical use cases for these kinds of generative models are emulator approaches [@emulating], simulation-based inference [@sbi], field-level inference [@field_level_inference] and general inverse problems [@inverse_problem_medical; @Remy; @Feng2023; @Feng2024] (e.g. image inpainting [@sde] and denoising [@ambientdiffusion; @blinddiffusion]). This code allows for seamless integration of diffusion models into these applications by providing data-generating models with easy conditioning of the data on any modality (e.g. images, audio or model parameters). Furthermore, the implementation in `equinox` [@equinox] guarantees safe integration of `sbgm` with other sampling libraries (e.g. BlackJAX [@blackjax]) or `jax` [@jax] based codes.

![A diagram showing how to map data to a noise distribution (the prior) with an SDE, and reverse this SDE for generative modeling. One can also reverse the associated probability flow ODE, which yields a deterministic reverse process. Both the reverse-time SDE and probability flow ODE can be obtained by estimating the score.\label{fig:sde_ode}](sde_ode.png)

@@ -47,50 +47,50 @@ Diffusion in the context of generative modelling describes the process of adding
Score-based diffusion models [@sde] model a forward diffusion process with Stochastic Differential Equations (SDEs) of the form

$$
-\text{d}\boldsymbol{x} = f(\boldsymbol{x}, t)\text{d}t + g(t)\text{d}\boldsymbol{w},
+\text{d}\boldsymbol{x}_t = f(\boldsymbol{x}_t, t)\text{d}t + g(t)\text{d}\boldsymbol{w}_t,
$$

-where $f(\boldsymbol{x}, t)$ is a vector-valued function called the drift coefficient, $g(t)$ is the diffusion coefficient and $\text{d}\boldsymbol{w}$ is a sample of noise $\text{d}\boldsymbol{w}\sim \mathcal{G}[\text{d}\boldsymbol{w}|\mathbf{0}, \mathbf{I}]$. This equation describes the infinitely many samples of noise along the diffusion time $t$ that perturb the data. The diffusion path, defined by the SDE, begins at $t=0$ and ends at $T=0$ where the resulting distribution is then a multivariate Gaussian with mean zero and covariance $\mathbf{I}$.
+where $f(\boldsymbol{x}_t, t)$ is a vector-valued function called the drift coefficient, $g(t)$ is the diffusion coefficient and $\text{d}\boldsymbol{w}_t$ is a sample of noise $\text{d}\boldsymbol{w}_t\sim \mathcal{G}[\text{d}\boldsymbol{w}_t|\mathbf{0}, \mathbf{I}_{\boldsymbol{x}_t}]$. This equation describes the infinitely many samples of noise along the diffusion time $t$ that perturb the data. The diffusion path, defined by the SDE, begins at $t=0$ and ends at $t=T$, where the resulting distribution is a multivariate Gaussian with mean zero and covariance $\mathbf{I}$. The code implements various SDEs known in the diffusion model literature.
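
To make the forward process concrete, the following is a minimal sketch of the drift and diffusion coefficients of a variance-preserving (VP) SDE in `jax`; the linear $\beta(t)$ schedule, its constants and the `drift`/`diffusion` names are illustrative assumptions, not `sbgm`'s exact API.

```python
import jax.numpy as jnp

# Linear noise schedule beta(t) = beta_min + t * (beta_max - beta_min);
# the constants are illustrative, not sbgm's defaults.
BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def drift(x, t):
    # VP drift coefficient f(x_t, t) = -0.5 * beta(t) * x_t
    return -0.5 * beta(t) * x

def diffusion(t):
    # VP diffusion coefficient g(t) = sqrt(beta(t))
    return jnp.sqrt(beta(t))
```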

-The reverse of the SDE, mapping from multivariate Gaussian samples $\boldsymbol{x}(T)$ to samples of data $\boldsymbol{x}(0)$, is of the form
+The reverse of the SDE, mapping from multivariate Gaussian samples $\boldsymbol{x}_T$ to samples of data $\boldsymbol{x}_0$, is of the form

$$
-\text{d}\boldsymbol{x} = [f(\boldsymbol{x}, t) - g^2(t)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})]\text{d}t + g(t)\text{d}\boldsymbol{w},
+\text{d}\boldsymbol{x}_t = [f(\boldsymbol{x}_t, t) - g^2(t)\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)]\text{d}t + g(t)\text{d}\boldsymbol{w}_t,
$$

-where the score function $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ is substituted with a neural network $\boldsymbol{s}_{\theta}(\boldsymbol{x}(t), t)$ for the sampling process. The network is fit by score-matching [@score_matching; @score_matching2] across the time span $[0, T]$. This network predicts the noise added to the image at time $t$ with the forward diffusion process, in accordance with the SDE, and removes it. With a data-dimensional sample of Gaussian noise from the prior $p_T(\boldsymbol{x})$ (see Figure \ref{fig:sde_ode}) one can reverse the diffusion process to generate data.
+where the score function $\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)$ is substituted with a neural network $\boldsymbol{s}_{\theta}(\boldsymbol{x}_t, t)$ for the sampling process. The network is fit by score-matching [@score_matching; @score_matching2] across the time span $[0, T]$. This network predicts the noise added to the data at time $t$ by the forward diffusion process, in accordance with the SDE, and removes it. With a data-dimensional sample of Gaussian noise from the prior $p_T(\boldsymbol{x})$ (see Figure \ref{fig:sde_ode}) one can reverse the diffusion process to generate data.
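
As a sketch of how such a network could be fit, the denoising score-matching objective below perturbs a data point with the VP perturbation kernel and regresses the network onto the conditional score; the schedule constants, the omission of time weighting and the `score_fn` signature are assumptions for illustration, not `sbgm`'s training loop.

```python
import jax
import jax.numpy as jnp

def dsm_loss(key, score_fn, x0, t):
    # Denoising score-matching loss at one (x0, t) pair, assuming the VP
    # kernel p_t(x_t | x_0) = N(mean(t) * x_0, sigma(t)^2 I) with a linear
    # beta schedule. Time weighting is omitted for brevity.
    int_beta = 0.1 * t + 0.5 * t**2 * (20.0 - 0.1)  # \int_0^t beta(s) ds
    mean = jnp.exp(-0.5 * int_beta) * x0
    sigma = jnp.sqrt(1.0 - jnp.exp(-int_beta))
    eps = jax.random.normal(key, x0.shape)
    xt = mean + sigma * eps
    target = -eps / sigma  # score of p_t(x_t | x_0)
    return jnp.mean((score_fn(xt, t) - target) ** 2)
```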

The reverse SDE may be solved with Euler-Maruyama sampling [@sde] (or other annealed Langevin sampling methods), which is featured in the code.
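
A single reverse-time Euler-Maruyama update might look like the following sketch, again assuming VP coefficients and a trained `score_fn`; this is not `sbgm`'s exact sampler.

```python
import jax
import jax.numpy as jnp

def reverse_step(key, x, t, dt, score_fn):
    # One Euler-Maruyama step of the reverse-time VP SDE, stepping from
    # t to t - dt (dt > 0). The inline coefficients mirror the sketch
    # above and are assumptions for illustration.
    beta = 0.1 + t * (20.0 - 0.1)
    f = -0.5 * beta * x                      # forward drift f(x_t, t)
    g = jnp.sqrt(beta)                       # diffusion g(t)
    rev_drift = f - g**2 * score_fn(x, t)    # reverse-time drift
    noise = jax.random.normal(key, x.shape)
    return x - rev_drift * dt + g * jnp.sqrt(dt) * noise
```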

# Likelihood calculations with diffusion models

-Many of the applications of generative models depend on being able to calculate the likelihood of data. @sde show that any SDE may be converted into an ordinary differential equation (ODE) without changing the distributions, defined by the SDE, from which the noise is sampled from in the diffusion process (denoted $p_t(x)$ and shown in grey in Figure \ref{fig:sde_ode}). This ODE is known as the probability flow ODE [@sde; @sde_ml] and is written
+Many of the applications of generative models depend on being able to calculate the likelihood of data. @sde show that any SDE may be converted into an ordinary differential equation (ODE) without changing the marginal distributions $p_t(\boldsymbol{x}_t)$, defined by the SDE, from which the noise is sampled in the diffusion process (shown in grey in Figure \ref{fig:sde_ode}). This ODE is known as the probability flow ODE [@sde; @sde_ml] and is written

$$
-\text{d}\boldsymbol{x} = [f(\boldsymbol{x}, t) - g^2(t)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})]\text{d}t = f'(\boldsymbol{x}, t)\text{d}t.
+\text{d}\boldsymbol{x}_t = [f(\boldsymbol{x}_t, t) - \frac{1}{2}g^2(t)\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)]\text{d}t = f'(\boldsymbol{x}_t, t)\text{d}t.
$$

-This ODE can be solved with an initial-value problem. Starting with a data point $\boldsymbol{x}(0)\sim p(\boldsymbol{x})$, this point is mapped along the probability flow ODE path (see the right-hand side of Figure \ref{fig:sde_ode}) to a sample from the multivariate Gaussian prior. This inherits the formalism of continuous normalising flows [@neuralodes; @ffjord] without the expensive ODE simulations used to train these models - allowing for a likelihood estimate based on diffusion models [@sde_ml]. The initial value problem provides a solution $\boldsymbol{x}(T)$ and the change in probability along the path $\Delta=\log p(\boldsymbol{x}(0)) - \log p(\boldsymbol{x}(T))$ where $p(\boldsymbol{x}(T))$ is a simple multivariate Gaussian distribution.
+This ODE can be solved as an initial-value problem to sample new data or estimate its density. Starting with a data point $\boldsymbol{x}_0 \sim p(\boldsymbol{x})=p_0(\boldsymbol{x}_0)$, this point is mapped along the probability flow ODE path (see the right-hand side of Figure \ref{fig:sde_ode}) to a sample from the multivariate Gaussian prior $\boldsymbol{x}_T \sim p_T(\boldsymbol{x}_T)$. This inherits the formalism of continuous normalising flows [@neuralodes; @ffjord] without the expensive ODE simulations used to train these models, allowing for a likelihood estimate based on diffusion models [@sde_ml]. The initial value problem provides a solution $\boldsymbol{x}_T$ and the change in probability along the path $\Delta=\log p_0(\boldsymbol{x}_0) - \log p_T(\boldsymbol{x}_T)$, where $p_T(\boldsymbol{x}_T)$ is a simple multivariate Gaussian distribution. Various ODE solvers of different orders, provided by `diffrax` [@kidger], are available for a user to balance the speed and accuracy of sampling.
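
A minimal sketch of this initial-value problem with `diffrax`, mapping a data point to the Gaussian prior; the VP coefficients, the small starting time and the `push_to_prior`/`score_fn` names are hypothetical, not `sbgm`'s API.

```python
import diffrax
import jax.numpy as jnp

def push_to_prior(x0, score_fn):
    # Integrate the probability flow ODE from t ~ 0 to t = T = 1,
    # mapping a data point x0 to a latent x_T under the Gaussian prior.
    def vector_field(t, x, args):
        beta = 0.1 + t * (20.0 - 0.1)                # linear schedule
        f = -0.5 * beta * x                          # VP drift
        return f - 0.5 * beta * score_fn(x, t)       # f'(x_t, t), g(t)^2 = beta
    term = diffrax.ODETerm(vector_field)
    sol = diffrax.diffeqsolve(
        term, diffrax.Tsit5(), t0=1e-5, t1=1.0, dt0=0.01, y0=x0
    )
    return sol.ys[-1]
```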

-![A diagram showing a log-likelihood calculation over the support of a Gaussian mixture model with eight components. Data is drawn (shown in red) from this mixture to train the diffusion model that gives the likelihood in gray. The log-likelihood is calculated using the ODE and a trained diffusion model. \label{fig:8gauss}](8gauss.png){ width=50% }
+![A diagram showing a log-likelihood calculation over the support of a Gaussian mixture model with eight components. Data (shown in red) is drawn from this mixture to train the diffusion model; the log-likelihood defined by the trained model, calculated using the probability flow ODE, is shown in grey. \label{fig:8gauss}](8gauss.png){ width=50% }

The likelihood of data under a score-based diffusion model is estimated by solving the change-of-variables equation for continuous normalising flows

$$
-\frac{\partial}{\partial t} \log p(\boldsymbol{x}(t)) = \nabla_{\boldsymbol{x}} \cdot f(\boldsymbol{x}(t), t),
+\frac{\partial}{\partial t} \log p_t(\boldsymbol{x}_t) = -\nabla_{\boldsymbol{x}_t} \cdot f'(\boldsymbol{x}_t, t),
$$

-which gives the log-likelihood of a single datapoint $\boldsymbol{x}(0)$ as
+which gives the log-likelihood of a single datapoint $\boldsymbol{x}_0$ as

$$
-\log p(\boldsymbol{x}(0)) = \log p(\boldsymbol{x}(T)) + \int_{t=0}^{t=T}\text{d}t \; \nabla_{\boldsymbol{x}}\cdot f(\boldsymbol{x}, t).
+\log p_0(\boldsymbol{x}_0) = \log p_T(\boldsymbol{x}_T) + \int_{t=0}^{t=T}\text{d}t \; \nabla_{\boldsymbol{x}_t}\cdot f'(\boldsymbol{x}_t, t).
$$

The code also implements these calculations with the Hutchinson trace estimation method [@ffjord; @Hutchinson], which reduces the computational expense of the divergence estimate. Figure \ref{fig:8gauss} shows an example of a data-likelihood calculation using a trained diffusion model with the ODE associated with its SDE.
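
A single-probe Hutchinson estimate needs only one Jacobian-vector product per drift evaluation, which `jax.jvp` provides; the sketch below assumes `ode_drift` is the probability flow drift $f'$ and uses a Rademacher probe vector.

```python
import jax
import jax.numpy as jnp

def hutchinson_divergence(key, ode_drift, x, t):
    # Unbiased single-probe Hutchinson estimate of the divergence
    # (Jacobian trace) of ode_drift(., t) at x, using the identity
    # E_v[v^T (df'/dx) v] = tr(df'/dx) for Rademacher v.
    v = jax.random.rademacher(key, x.shape, dtype=x.dtype)
    _, jvp = jax.jvp(lambda xi: ode_drift(xi, t), (x,), (v,))
    return jnp.sum(jvp * v)
```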

# Implementations and future work

-Diffusion models are defined in `sbgm` via a score-network model $\boldsymbol{s}_{\theta}$ and an SDE. All the availble SDEs (variance exploding (VE), variance preserving (VP) and sub-variance preserving (SubVP) [@sde]) in the literature of score-based diffusion models are available. We provide implementations for UNet [@unet], Diffusion Transformers [@dit], MLP-Mixer [@mixer] and Residual Network [@resnet] models which are state-of-the-art for diffusion tasks. It is possible to fit score-based diffusion models to a conditional distribution $p(\boldsymbol{x}|\boldsymbol{\pi}, \boldsymbol{y})$ where in typical inverse problems $\boldsymbol{y}$ would be an image and $\boldsymbol{\pi}$ a set of parameters in a physical model for the data [@conditional_diffusion] (e.g. to solve inverse problems). The code is compatible with any model written in the `equinox` [@equinox] framework. We are extending the code to provide transformer-based [@dits] and latent diffusion models [@ldms].
+Diffusion models are defined in `sbgm` via a score-network model $\boldsymbol{s}_{\theta}$ and an SDE. All of the standard SDEs in the score-based diffusion model literature (variance exploding (VE), variance preserving (VP) and sub-variance preserving (SubVP) [@sde]) are available. We provide implementations for UNet [@unet], Diffusion Transformer [@dit], MLP-Mixer [@mixer] and Residual Network [@resnet] models, which are state-of-the-art for diffusion tasks. It is possible to fit score-based diffusion models to a conditional distribution $p(\boldsymbol{x}|\boldsymbol{\pi}, \boldsymbol{y})$, where in typical inverse problems $\boldsymbol{y}$ would be an image and $\boldsymbol{\pi}$ a set of parameters in a physical model for the data [@conditional_diffusion]. The code is compatible with any model written in the `equinox` [@equinox] framework, as sketched below. We recently extended the code to provide transformer-based diffusion models [@dits] and plan to extend to latent diffusion models [@ldms] and flow matching [@lipman2023flowmatchinggenerativemodeling].
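
As a sketch of that compatibility, a minimal conditional score network can be written as a plain `equinox` module; the architecture and names here are illustrative, not one of `sbgm`'s provided models.

```python
import equinox as eqx
import jax
import jax.numpy as jnp

class ConditionalScoreNet(eqx.Module):
    # A toy conditional score network s_theta(x_t, t, pi): an MLP over
    # the concatenated data, scalar time and conditioning parameters.
    mlp: eqx.nn.MLP

    def __init__(self, data_dim, param_dim, *, key):
        self.mlp = eqx.nn.MLP(
            in_size=data_dim + param_dim + 1,  # x, pi and scalar t
            out_size=data_dim,
            width_size=128,
            depth=3,
            key=key,
        )

    def __call__(self, x, t, pi):
        h = jnp.concatenate([x, pi, jnp.atleast_1d(t)])
        return self.mlp(h)

# Usage sketch:
# model = ConditionalScoreNet(data_dim=2, param_dim=3, key=jax.random.PRNGKey(0))
# score = model(jnp.zeros(2), 0.5, jnp.zeros(3))
```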

# Acknowledgements
