In this project we implement a variational autoencoder (VAE) built from convolutional and deconvolutional neural networks (CNN/dCNN) in the PyTorch library, in order to perform several image analysis tasks on the MNIST dataset of handwritten digits. We use 50,000 images for training, 10,000 for validation, and the remaining 10,000 for testing. The images are 28×28 greyscale pixel arrays.
A VAE consists of two parts: an encoder and a decoder, both of which are neural networks with similar (mirrored) architectures. The encoder learns the parameters of the approximate posterior distribution q(z|x) over the latent variables, in our case the mean and variance of a Gaussian.
The decoder learns the parameters of the likelihood p(x|z), i.e. it maps a latent sample z back to a distribution over images.
The loss function is derived from maximizing the log-likelihood of the data; since the exact log-likelihood is very costly to compute, we instead maximize its lower bound, the evidence lower bound (ELBO).
It can be shown that maximizing the ELBO, i.e. bringing it as close as possible to the true log-likelihood, is equivalent to minimizing the KL-divergence between the approximate posterior q(z|x) and the true posterior p(z|x), since the difference between the log-likelihood and the ELBO is exactly this (non-negative) KL term.
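As a concrete reference, a minimal PyTorch sketch of such an encoder/decoder pair and the resulting (negative) ELBO loss could look as follows; the layer sizes, the names (ConvVAE, negative_elbo), and the Bernoulli reconstruction term are illustrative assumptions, not the exact architecture of the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Minimal convolutional VAE sketch for 28x28 MNIST images (illustrative sizes)."""
    def __init__(self, latent_dim=2):
        super().__init__()
        # Encoder: image -> features -> (mean, log-variance) of q(z|x)
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 7 * 7, latent_dim)
        self.fc_logvar = nn.Linear(64 * 7 * 7, latent_dim)
        # Decoder: z -> deconvolutions -> Bernoulli logits over pixels
        self.fc_dec = nn.Linear(latent_dim, 64 * 7 * 7)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 7x7 -> 14x14
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),              # 14x14 -> 28x28
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def decode(self, z):
        h = self.fc_dec(z).view(-1, 64, 7, 7)
        return self.dec(h)  # logits of p(x|z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        # Reparameterisation trick: z = mu + sigma * eps
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decode(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    """Bernoulli reconstruction term plus KL(q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```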
We also discuss and evaluate the different output distributions that we tried for the decoder. We tested a Gaussian distribution, a Beta distribution with support [0, 1], a Categorical distribution over discretised (binned) pixel intensities, and a Bernoulli distribution with pixel values re-interpreted as probabilities of a given pixel being black or white. For this task the Bernoulli distribution is the most suitable choice: it is a binary distribution, assuming observations can take only two possible values, which matches near-binary black-and-white images such as the MNIST digits. More generally, distributions whose support is [0, 1] give better results here because that range matches the "nature" of our (normalised) pixel data.
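To make the comparison concrete, the reconstruction term of the ELBO can be sketched for each candidate likelihood with torch.distributions; the assumed decoder outputs (logits, mean, concentration maps, per-pixel bin logits) are illustrative, not the exact outputs used in the notebooks.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, Beta, Categorical

def reconstruction_log_prob(x, decoder_out, likelihood="bernoulli"):
    """Log-likelihood of a batch of images under different decoder distributions (sketch)."""
    if likelihood == "bernoulli":
        # Grey levels in [0, 1] treated as target probabilities of a pixel being white
        return -F.binary_cross_entropy_with_logits(decoder_out, x, reduction="sum")
    if likelihood == "gaussian":
        # decoder_out is the mean; a fixed unit variance is assumed for simplicity
        return Normal(decoder_out, torch.ones_like(decoder_out)).log_prob(x).sum()
    if likelihood == "beta":
        # decoder_out = (alpha, beta) positive concentration maps; keep x inside the open interval (0, 1)
        alpha, beta = decoder_out
        return Beta(alpha, beta).log_prob(x.clamp(1e-4, 1 - 1e-4)).sum()
    if likelihood == "categorical":
        # Pixel intensities discretised into K bins; decoder_out holds K logits per pixel
        bins = (x * (decoder_out.shape[-1] - 1)).round().long()
        return Categorical(logits=decoder_out).log_prob(bins).sum()
    raise ValueError(f"unknown likelihood: {likelihood}")
```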
Finally, for the first part we investigate the structure of the latent variables z and see how they capture structure that was implicitly present in the data. Initially the latent space is two-dimensional, i.e. each image is encoded as a point z ∈ ℝ², so the latent representation can be visualised directly in the plane.
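One possible way to produce this visualisation, assuming a trained model vae whose encode method returns the posterior mean and log-variance, and a test_loader as in the notebook, is sketched below:

```python
import torch
import matplotlib.pyplot as plt

# Collect the 2-D posterior means of the test images (assumes `vae` and `test_loader` exist)
vae.eval()
zs, labels = [], []
with torch.no_grad():
    for x, y in test_loader:
        mu, _ = vae.encode(x)
        zs.append(mu)
        labels.append(y)
zs = torch.cat(zs).numpy()
labels = torch.cat(labels).numpy()

# Scatter plot of the latent space, one colour per digit class
plt.figure(figsize=(6, 6))
scatter = plt.scatter(zs[:, 0], zs[:, 1], c=labels, cmap="tab10", s=4)
plt.colorbar(scatter, label="digit")
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.title("Posterior means in the 2-D latent space")
plt.show()
```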
One variation is to train the model with a K-dimensional latent space and reproduce the same procedure as before; in our setting we set K = 16. Once the model is trained, we take 1000 test data points and use the encoder to store their latent representations.
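The encoding step for the 16-dimensional model could be sketched as follows; the names vae16 and test_dataset, and the choice of storing the posterior means, are assumptions for illustration.

```python
import torch

# Encode 1000 test images with the K = 16 model and store their latent representations
vae16.eval()
x_test = torch.stack([test_dataset[i][0] for i in range(1000)])  # shape (1000, 1, 28, 28)
with torch.no_grad():
    mu, logvar = vae16.encode(x_test)
latents = mu  # (1000, 16) posterior means, kept for later analysis
```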
Rather than directly reconstructing images, one can perform operations on the encoded data points in the latent space, and then use the decoder part of the model to translate these operations back to the data domain.
In this approach we use linear interpolation between the latent codes z₁ and z₂ of two data points to obtain an interpolated latent sample z_α = (1 − α) · z₁ + α · z₂ with α ∈ [0, 1].
Finally we pass each interpolated latent sample through the decoder to obtain the corresponding image in data space.
We plot a figure containing a grid: on each row, the leftmost entry is the decoded first endpoint, the rightmost entry is the decoded second endpoint, and the columns in between show the decoded images along the linear interpolation path.
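A sketch of the interpolation-and-decoding step, reusing the encode/decode interface assumed above (the number of interpolation steps is an arbitrary choice):

```python
import torch

@torch.no_grad()
def interpolate_pair(vae, x1, x2, steps=10):
    """Linearly interpolate between the latent codes of two images and decode each point."""
    mu1, _ = vae.encode(x1.unsqueeze(0))
    mu2, _ = vae.encode(x2.unsqueeze(0))
    alphas = torch.linspace(0.0, 1.0, steps)
    # z_alpha = (1 - alpha) * z1 + alpha * z2 for each alpha
    zs = torch.stack([(1 - a) * mu1 + a * mu2 for a in alphas]).squeeze(1)
    logits = vae.decode(zs)
    return torch.sigmoid(logits)  # decoded images along the interpolation path
```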
For this final step, we attempt to estimate the approximate posterior distribution q(z|x) for a given image directly, without relying on the trained encoder.
We consider a single data point x from the test set.
We use the Bernoulli distribution to construct the likelihood p(x|z), i.e. the reconstruction term of the ELBO.
In the implementation, we set the decoder to be non-trainable (its weights are frozen) and construct two trainable parameters, the mean and the (log-)variance of the variational distribution, which together form Ψ.
For a single data point x, we then maximise the ELBO with respect to Ψ alone, which yields a per-datapoint approximation of the posterior.
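A minimal sketch of this per-datapoint optimisation, reusing the negative_elbo helper assumed earlier; the optimiser, learning rate, and number of steps are illustrative assumptions.

```python
import torch

def fit_psi(vae, x, latent_dim=16, steps=500, lr=1e-2):
    """Optimise the variational parameters Psi = (mu, log-variance) of q(z|x)
    for a single data point x, keeping the decoder weights frozen."""
    for p in vae.parameters():
        p.requires_grad_(False)  # the network itself is non-trainable here
    mu = torch.zeros(1, latent_dim, requires_grad=True)
    logvar = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, logvar], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps      # reparameterised sample from q(z|x)
        logits = vae.decode(z)
        loss = negative_elbo(x.unsqueeze(0), logits, mu, logvar)
        loss.backward()
        opt.step()
    return mu.detach(), logvar.detach()
```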
We randomly choose 10 data points from the test data; their reconstructions are shown in the code (we optimise Ψ separately for each point). The figure in the code contains a 3-column grid where, on each row, we display the original data point x, the reconstruction x' obtained with this procedure, and the reconstruction x'' obtained with the encoder from the first task (vae_bernulli.ipynb).