# Summary

Diffusion models [@diffusion; @ddpm; @sde] have emerged as the dominant paradigm for generative modelling, based on their performance in a variety of tasks [@ldms; @dits]. They subsume the advantages of normalising flows [@flows; @ffjord], VAEs [@vaes] and GANs [@gans]: accurate density estimation and high-quality samples. Implicit and neural-network-based likelihood models face significant limitations in modelling normalised probability distributions and in sampling speed; score-matching diffusion models are more efficient at these tasks than previous generative modelling algorithms. The diffusion process is also agnostic to the data representation, meaning different types of data, such as audio, point clouds, videos and images, can be modelled.
# Statement of need

Diffusion-based generative models [@diffusion; @ddpm] are a method for sampling from high-dimensional distributions. A sub-class of these models, score-based diffusion generative models (SBGMs, [@sde]), permits exact-likelihood estimation via a change-of-variables associated with the forward diffusion process [@sde_ml]. Diffusion models allow fitting generative models to high-dimensional data more efficiently than normalising flows, since a single neural network parameterises the diffusion process, as opposed to the sequence of neural networks in typical normalising flow architectures. Whilst existing diffusion models [@ddpm; @vdms] allow for sampling, they are limited to inaccurate variational-inference approaches for density estimation, which limits their use for Bayesian inference. This code provides density estimation with diffusion models using GPU-enabled ODE solvers in `jax` [@jax] and `diffrax` [@kidger]. Similar codes (e.g. [@azula]) exist for diffusion models, but they do not implement log-likelihood calculations, a suite of network architectures or parallelised ODE sampling.
The software we present, `sbgm`, is designed to be used by researchers in machine learning and the natural sciences to fit diffusion models with custom architectures for their research. These models can be fit easily with multi-accelerator training and inference within the code. Typical use cases for these kinds of generative models are emulator approaches [@emulating], simulation-based inference [@sbi], field-level inference [@field_level_inference] and general inverse problems [@inverse_problem_medical; @Remy; @Feng2023; @Feng2024] (e.g. image inpainting [@sde] and denoising [@ambientdiffusion; @blinddiffusion]). This code allows for seamless integration of diffusion models into these applications by providing data-generating models with easy conditioning on any modality, such as parameters, class variables or other data like images. Furthermore, the implementation in `equinox` [@equinox] guarantees safe integration of `sbgm` with any other sampling libraries (e.g. BlackJAX [@blackjax]) or `jax`-based [@jax] codes.
# Diffusion
Diffusion in the context of generative modelling describes the process of adding small amounts of noise sequentially to samples of data $\boldsymbol{x}$ [@diffusion]. A generative model for the data arises from training a neural network to reverse this process by subtracting the noise added to the data.
Score-based diffusion models [@sde] model a forward diffusion process with Stochastic Differential Equations (SDEs) of the form

$$\text{d}\boldsymbol{x} = f(\boldsymbol{x}, t)\,\text{d}t + g(t)\,\text{d}\boldsymbol{w},$$
where $f(\boldsymbol{x}, t)$ is a vector-valued function called the drift coefficient, $g(t)$ is the diffusion coefficient and $\text{d}\boldsymbol{w}$ is a sample of noise $\text{d}\boldsymbol{w}\sim \mathcal{G}[\text{d}\boldsymbol{w}|\mathbf{0}, \mathbf{I}]$. This equation describes the infinitely many samples of noise along the diffusion time $t$ that perturb the data. The diffusion path, defined by the SDE, begins at $t=0$ and ends at $t=T$, where the resulting distribution is a multivariate Gaussian with mean zero and covariance $\mathbf{I}$.
The reverse of the SDE, mapping from multivariate Gaussian samples $\boldsymbol{x}(T)$ to samples of data $\boldsymbol{x}(0)$, is of the form

$$\text{d}\boldsymbol{x} = \left[f(\boldsymbol{x}, t) - g(t)^2\,\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right]\text{d}t + g(t)\,\text{d}\bar{\boldsymbol{w}},$$
where the score function $\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})$ is substituted with a neural network $\boldsymbol{s}_{\theta}(\boldsymbol{x}(t), t)$ for the sampling process, and $\text{d}\bar{\boldsymbol{w}}$ is a noise sample in reversed time. The network is fit by score-matching [@score_matching; @score_matching2] across the time span $[0, T]$. This network predicts the noise added to the data at time $t$ by the forward diffusion process, in accordance with the SDE, and removes it. With a data-dimensional sample of Gaussian noise from the prior $p_T(\boldsymbol{x})$, one can reverse the diffusion process to generate data. In Figure \ref{fig:sde_ode} the forward and reverse diffusion processes are shown for samples from a one-dimensional mixture of Gaussians, with their corresponding SDE and ODE paths.
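
To make the fitting procedure concrete, below is a minimal sketch of a single denoising score-matching update in `jax` [@jax], `equinox` [@equinox] and `optax`. The linear $\beta(t)$ schedule, the transition-kernel coefficients and the plain MLP score network are illustrative assumptions, not the package's own training loop.

```python
import equinox as eqx
import jax
import jax.numpy as jnp
import jax.random as jr
import optax

def vp_mean_std(t, beta_min=0.1, beta_max=20.0):
    # Transition kernel p_t(x(t) | x(0)) = G[x(t) | mu_t x(0), sigma_t^2 I]
    # for a VP SDE with a linear beta(t) schedule (an assumed choice).
    log_mu = -0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2)
    mu = jnp.exp(log_mu)
    sigma = jnp.sqrt(jnp.maximum(1.0 - mu ** 2, 1e-5))
    return mu, sigma

def single_loss(model, x0, t, key):
    mu, sigma = vp_mean_std(t)
    eps = jr.normal(key, x0.shape)
    xt = mu * x0 + sigma * eps                     # forward-diffused data point
    score = model(jnp.concatenate([xt, jnp.atleast_1d(t)]))
    # The conditional score target is -eps / sigma; rescaling by sigma
    # gives a well-conditioned objective across noise levels.
    return jnp.sum((sigma * score + eps) ** 2)

def batch_loss(model, x0, t, keys):
    return jnp.mean(jax.vmap(lambda x, u, k: single_loss(model, x, u, k))(x0, t, keys))

key, model_key = jr.split(jr.PRNGKey(0))
model = eqx.nn.MLP(in_size=3, out_size=2, width_size=64, depth=2, key=model_key)
optim = optax.adam(3e-4)
opt_state = optim.init(eqx.filter(model, eqx.is_array))

@eqx.filter_jit
def train_step(model, opt_state, x0, key):
    keys = jr.split(key, x0.shape[0] + 1)
    t = jr.uniform(keys[0], (x0.shape[0],), minval=1e-3, maxval=1.0)
    loss, grads = eqx.filter_value_and_grad(batch_loss)(model, x0, t, keys[1:])
    updates, opt_state = optim.update(grads, opt_state)
    return eqx.apply_updates(model, updates), opt_state, loss

x0 = jr.normal(key, (128, 2))                      # toy two-dimensional data
model, opt_state, loss = train_step(model, opt_state, x0, key)
```

Repeating `train_step` over minibatches drives $\boldsymbol{s}_{\theta}$ towards the score of the noise-perturbed data at every time $t$.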
The reverse SDE may be solved with Euler-Maruyama sampling [@sde] (or other annealed Langevin sampling methods), which is featured in the code.
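
As an illustration, a minimal Euler-Maruyama sampler for the reverse of the VP SDE is sketched below, assuming the same linear $\beta(t)$ schedule as above; the analytic `score` is a stand-in for a trained network $\boldsymbol{s}_{\theta}$.

```python
import jax
import jax.numpy as jnp
import jax.random as jr

def beta(t, beta_min=0.1, beta_max=20.0):
    return beta_min + t * (beta_max - beta_min)   # assumed linear schedule

def score(x, t):
    # Stand-in for s_theta(x(t), t): the exact score of a standard Gaussian,
    # for which the reverse SDE leaves G[x | 0, I] approximately invariant.
    return -x

def euler_maruyama_sample(key, dim, n_steps=1000, t1=1.0, t0=1e-3):
    ts = jnp.linspace(t1, t0, n_steps)            # integrate in reverse time
    dt = (t0 - t1) / (n_steps - 1)                # negative time increment

    def step(x, inputs):
        t, key = inputs
        # Reverse-SDE drift for the VP SDE: f(x, t) - g(t)^2 * score(x, t).
        drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
        x = x + drift * dt + jnp.sqrt(beta(t) * -dt) * jr.normal(key, x.shape)
        return x, None

    key, init_key = jr.split(key)
    x_T = jr.normal(init_key, (dim,))             # sample from the prior p_T
    x_0, _ = jax.lax.scan(step, x_T, (ts, jr.split(key, n_steps)))
    return x_0

sample = euler_maruyama_sample(jr.PRNGKey(0), dim=2)
```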
# Likelihood calculations with diffusion models
Many of the applications of generative models depend on being able to calculate the likelihood of data. @sde show that any SDE may be converted into an ordinary differential equation (ODE) without changing the distributions, defined by the SDE, from which the noise is sampled in the diffusion process (denoted $p_t(\boldsymbol{x})$ and shown in grey in Figure \ref{fig:sde_ode}). This ODE is known as the probability flow ODE [@sde; @sde_ml] and is written

$$\text{d}\boldsymbol{x} = \left[f(\boldsymbol{x}, t) - \frac{1}{2}g(t)^2\,\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right]\text{d}t.$$

Solving this ODE for a data point $\boldsymbol{x}(0)$, together with the divergence of its vector field, gives the exact log-likelihood via the change-of-variables formula [@sde_ml]

$$\log p_0(\boldsymbol{x}(0)) = \log p_T(\boldsymbol{x}(T)) + \int_0^T \text{d}t \; \nabla_{\boldsymbol{x}} \cdot \left[f(\boldsymbol{x}, t) - \frac{1}{2}g(t)^2\,\boldsymbol{s}_{\theta}(\boldsymbol{x}(t), t)\right].$$
The code also implements the Hutchinson trace estimator [@ffjord], which reduces the computational expense of estimating the divergence. Figure \ref{fig:8gauss} shows an example of a data-likelihood calculation using a trained diffusion model with the ODE associated with an SDE. It is also possible to train score-based diffusion models such that the score-matching loss bounds the Kullback-Leibler divergence between the model and the unknown data distribution for each data point $\boldsymbol{x}$, via a choice of loss weighting termed the 'likelihood weighting' [@sde_ml].
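
A minimal sketch of this calculation with `diffrax` [@kidger] is shown below. The ODE state is augmented with the divergence integral, and the divergence is estimated with a single Hutchinson probe; the VP coefficients and analytic `score` are the same illustrative assumptions as above.

```python
import diffrax
import jax
import jax.numpy as jnp
import jax.random as jr

def beta(t, beta_min=0.1, beta_max=20.0):
    return beta_min + t * (beta_max - beta_min)   # assumed linear schedule

def score(x, t):
    return -x                                     # stand-in for s_theta(x, t)

def ode_drift(x, t):
    # Probability flow ODE drift: f(x, t) - 0.5 * g(t)^2 * score(x, t).
    return -0.5 * beta(t) * (x + score(x, t))

def log_likelihood(x0, key, t0=1e-3, t1=1.0):
    eps = jr.rademacher(key, x0.shape, dtype=x0.dtype)   # Hutchinson probe

    def vector_field(t, state, args):
        x, _ = state
        drift, jvp = jax.jvp(lambda x_: ode_drift(x_, t), (x,), (eps,))
        return drift, jnp.sum(jvp * eps)          # unbiased divergence estimate

    sol = diffrax.diffeqsolve(
        diffrax.ODETerm(vector_field), diffrax.Tsit5(),
        t0=t0, t1=t1, dt0=1e-2, y0=(x0, 0.0),
    )
    x_T, delta = sol.ys[0][-1], sol.ys[1][-1]
    # log p_0(x) = log p_T(x(T)) + integral of the divergence over [0, T].
    return jnp.sum(jax.scipy.stats.norm.logpdf(x_T)) + delta

print(log_likelihood(jnp.zeros(2), jr.PRNGKey(0)))
```

Because the augmented system is an ordinary ODE, any `diffrax` solver can be used, and the whole computation is differentiable and GPU-compatible.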
# Implementations and future work
Diffusion models are defined in `sbgm` via a score-network model $\boldsymbol{s}_{\theta}$ and an SDE. All the SDEs in the literature of score-based diffusion models (variance exploding (VE), variance preserving (VP) and sub-variance preserving (SubVP) [@sde]) are available. We provide implementations of UNet [@unet], MLP-Mixer [@mixer] and Residual Network [@resnet] models, which are state-of-the-art for diffusion tasks. It is possible to fit score-based diffusion models to a conditional distribution $p(\boldsymbol{x}|\boldsymbol{\pi}, \boldsymbol{y})$, where in typical inverse problems $\boldsymbol{y}$ would be an image and $\boldsymbol{\pi}$ a set of parameters in a physical model for the data [@batziolis]. The code is compatible with any model written in the `equinox` [@equinox] framework. We are extending the code to provide transformer-based [@dits] and latent diffusion models [@ldms].
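
For example, a custom conditional score network compatible with the code's `equinox` requirement might be sketched as follows; the concatenation-based conditioning on $(\boldsymbol{\pi}, \boldsymbol{y})$ and the MLP backbone are illustrative choices, not the package's provided architectures.

```python
import equinox as eqx
import jax.numpy as jnp
import jax.random as jr

class ConditionalScoreNet(eqx.Module):
    net: eqx.nn.MLP

    def __init__(self, x_dim, pi_dim, y_dim, width, depth, *, key):
        # The network ingests x, the time t and the conditioning (pi, y),
        # and returns a score estimate with the shape of the data.
        self.net = eqx.nn.MLP(
            in_size=x_dim + 1 + pi_dim + y_dim,
            out_size=x_dim,
            width_size=width,
            depth=depth,
            key=key,
        )

    def __call__(self, x, t, pi, y):
        h = jnp.concatenate([x, jnp.atleast_1d(t), pi, y])
        return self.net(h)

model = ConditionalScoreNet(x_dim=2, pi_dim=3, y_dim=4, width=64, depth=2, key=jr.PRNGKey(0))
s = model(jnp.zeros(2), 0.5, jnp.ones(3), jnp.ones(4))   # s_theta(x, t, pi, y)
```

Such a module composes directly with `jax` transformations via `equinox`'s filtered variants (e.g. `eqx.filter_jit`, `eqx.filter_grad`), which is what permits safe integration with other `jax`-based codes.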