This framework facilitates the training and evaluation of various deep neural networks for the task of image colorization. In particular, it offers the following colorization models, features and evaluation methods:
Colorization models
- ResNet Colorization Network
- Conditional GAN (CGAN)
- U-Net
Evaluation methods and metrics
- The Mean Squared Error (MSE)
- The Mean LPIPS Perceptual Similarity (PS)
- Semantic Interpretability (SI)
The framework is implemented in Python (3.6) using PyTorch v1.0.1.
Please consult ./env/mlp_env.yml
for a full list of the dependencies of the Conda environment that was used in the development of this framework.
If Conda is used as a package and environment manager, one can use
conda create --name myenv --file ./env/mlp_env.txt
to recreate the aforementioned environment.
- train.py - main entry point of the framework
- src/options.py - parses arguments (e.g. task specification, model options)
- src/main.py - set-up of the task environment (e.g. models, dataset, evaluation method)
- src/dataloaders.py - downloads and (sub)samples datasets, and provides iterators over the dataset elements
- src/models.py - contains the implementations of the model architectures
- src/utils.py - contains various helper functions and classes
- src/colorizer.py - trains and validates colorization models
- src/classifier.py - trains and validates image-classification models (used for SI)
- src/eval_gen - contains helper functions for the evaluation of model colorizations
- src/eval_mse.py - evaluates colorizations by MSE
- src/eval_ps.py - evaluates colorizations by the Mean LPIPS Perceptual Similarity (PS)
- src/eval_si.py - evaluates colorizations by Semantic Interpretability (SI)
Training of models
python train.py [--option ...]
where the options are:
option | description | type | oneOf | default |
---|---|---|---|---|
seed | random seed | int | not applicable | 0 |
task | the task that should be executed | str | ['colorizer', 'classifier', 'eval-gen', 'eval-si', 'eval-ps', 'eval-mse'] | 'colorizer' |
experiment-name | the name of the experiment | str | not applicable | 'experiment_name' |
model-name | colorization model architecture that should be used | str | ['resnet', 'unet32', 'unet224', 'nazerigan32', 'nazerigan224', 'cgan'] | 'resnet' |
model-suffix | colorization model name suffix | str | not applicable | not applicable |
model-path | path for the pretrained models | str | not applicable | './models' |
dataset-name | the dataset to use | str | ['placeholder', 'cifar10', 'places100', 'places205', 'places365'] | 'placeholder' |
dataset-root-path | dataset root path | str | not applicable | './data' |
use-dataset-archive | load dataset from TAR archive | str2bool | [True, False] | False |
output-root-path | path for output (e.g. model weights, stats, colorizations) | str | not applicable | './output' |
max-epochs | maximum number of epochs to train for | int | not applicable | 5 |
train-batch-size | training batch size | int | not applicable | 100 |
val-batch-size | validation batch size | int | not applicable | 100 |
batch-output-frequency | frequency with which to output batch statistics | int | not applicable | 1 |
max-images | maximum number of images from the validation set to be saved (per epoch) | int | not applicable | 10 |
eval-root-path | the root path for evaluation images | str | not applicable | './eval' |
eval-type | the type of evaluation task to perform | str | ['original', 'grayscale', 'colorized'] | 'original' |
For example, one could train a cgan colorization model on the places365 dataset for 100 epochs by running:
python train.py \
--experiment-name cgan_experiment001 \
--model-name cgan \
--dataset-name places365 \
--max-epochs 100 \
--train-batch-size 16 \
--val-batch-size 16
The task of colorizing an image can be considered a pixel-wise regression problem where the model input X is a 1xHxW tensor containing the pixels of the grayscale image and the model output Y' is a tensor of shape nxHxW that represents the predicted colorization information. Specifically, the task aims to discover a mapping F: X → Y' that plausibly predicts the colorization given the greyscale input.
The CIE L*a*b* colour space lends itself well to this task since the L channel depicts the brightness of the image (X above) and the image colour is fully captured in the remaining a and b channels (Y' above). The L*a*b* colour model also has the advantage of being inspired by human colour perception, meaning that distances in L*a*b* space can be expected to be correlated with changes in human colour perception. The final output colorized image is created by recombining the input L layer with the predicted a and b layers.
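As a concrete illustration, the following minimal sketch (not part of the framework; it assumes scikit-image for the colour-space conversion) shows how an image can be split into the model input and target in L*a*b* space, and how a predicted colorization is recombined with the input lightness:

```python
# Illustrative sketch: splitting an RGB image into L (input) and a/b (target) channels,
# and recombining a predicted colorization with the original lightness channel.
import numpy as np
from skimage import color

def split_lab(rgb_image):
    """rgb_image: HxWx3 float array in [0, 1]."""
    lab = color.rgb2lab(rgb_image)
    X = lab[:, :, :1]   # 1 channel: lightness L (model input)
    Y = lab[:, :, 1:]   # 2 channels: a and b (prediction target)
    return X, Y

def recombine_lab(X, Y_pred):
    """Recombine the original L channel with the predicted a/b channels."""
    lab_pred = np.concatenate([X, Y_pred], axis=2)   # HxWx3 L*a*b* image
    return color.lab2rgb(lab_pred)                   # back to RGB for viewing/saving
```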
Three colorization architectures are currently supported in the framework.
This architecture consists of a CNN that starts with a set of convolutional layers which aim to extract low-level and semantic features from the input images, inspired by how representations are learned in Learning Representations for Automatic Colorization. Following the same idea as the VGG-16-Gray architecture in that paper, a modified version of the ResNet-18 image classification network is used to learn representations from a set of images. In particular, the network is modified so that it accepts greyscale images, and it is truncated to six layers. This set of layers is used to extract features from the images, which are represented by their lightness channels. Subsequently, a series of deconvolutional layers is applied to increase the spatial resolution of (i.e. 'upscale') the features. This up-scaling of features learned in a network is inspired by the 'upsampling' of features in the colorization network of Let There Be Color!
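A rough sketch of this idea in PyTorch is shown below; the decoder's exact layer sizes are assumptions chosen for illustration, and the framework's actual implementation lives in src/models.py:

```python
# Hedged sketch of a ResNet-based colorizer: a truncated, grayscale-input ResNet-18
# encoder followed by deconvolutional layers that predict the a/b channels.
import torch
import torch.nn as nn
import torchvision.models as models

class ResNetColorizer(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(pretrained=False)
        # Accept 1-channel (lightness) input instead of RGB.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Truncate the network to its first six (blocks of) layers:
        # 128 feature maps at 1/8 of the input resolution.
        self.encoder = nn.Sequential(*list(resnet.children())[:6])
        # Deconvolutional layers upscale the features back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                        # x: N x 1 x H x W grayscale input
        return self.decoder(self.encoder(x))     # N x 2 x H x W predicted a/b channels
```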
This network is inspired by U-Net: Convolutional Networks for Biomedical Image Segmentation where direct connections are added between contracting and expanding layers of equal size to prevent the loss of spatial context of the original image throughout the layers. In Image Colorization with Generative Adversarial Networks an approach is proposed that uses such a network for colorization since the preservation of the original greyscale image is of particular importance to this task.
The network implemented here has the same architecture as the one presented in the original U-Net paper (see image above), modified to take 224x224 inputs. Non-linearities are introduced by following convolutional and deconvolutional layers with leaky ReLUs with a slope of 0.2. Furthermore, batch normalisation is applied after every layer.
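The sketch below (an illustrative assumption, not the framework's unet224 implementation) shows how such skip connections between contracting and expanding layers can be wired up in PyTorch, using leaky ReLUs with slope 0.2 and batch normalisation:

```python
# Two-level toy U-Net illustrating skip connections; the real model is deeper.
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.down1 = down_block(1, 64)
        self.down2 = down_block(64, 128)
        self.up1 = up_block(128, 64)
        # The final up block takes the concatenation of the upsampled features
        # and the skip connection from the contracting path (64 + 64 channels).
        self.up2 = up_block(128, 2)

    def forward(self, x):                              # x: N x 1 x H x W
        d1 = self.down1(x)                             # N x 64 x H/2 x W/2
        d2 = self.down2(d1)                            # N x 128 x H/4 x W/4
        u1 = self.up1(d2)                              # N x 64 x H/2 x W/2
        return self.up2(torch.cat([u1, d1], dim=1))    # N x 2 x H x W
```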
Recent research on image colorization has demonstrated the potential for using GAN architectures for image colorization tasks. One of the compelling aspects of using GANs is their ability to learn a loss function that is task-specific.
GANs consist of two networks: a generator and a discriminator. In the context of image colorization the generator’s
task is to produce colorized images that are indistinguishable from real images. The discriminator’s task is to classify
whether a sample came from the generator or from the original
set of images. Traditionally, the generator is represented by a mapping $G: z \rightarrow y$, where z is a random noise variable which serves as the input of the generator. The discriminator is in a similar fashion represented by the mapping $D: x \rightarrow [0, 1]$, where x represents a real or synthetic input.
In the context of image colorization, the traditional GAN has to be modified into a Conditional GAN (CGAN) such that it takes image data as input instead of (random) noise. More specifically, the CGAN will take as input greyscale data (i.e. images represented by their lightness channel L in the L*a*b* colour space) and generate colorized images. The discriminator will be trained on both the generated colorized images and full-colour ground truth images.
Formally, the main objective of the CGAN can be described by a single mini-max game problem:

$$\min_G \max_D \; \mathbb{E}_{x, y \sim p_{data}(x, y)}\big[\log D(x, y)\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D(x, G(x))\big)\big]$$

where $p_{data}$ represents the original image distribution. So informally, the generator tries to minimise the function by generating samples according to a mapping $G: x \rightarrow y'$, taking as input greyscale images x from the original data, while the discriminator tries to maximise the same function by trying to distinguish between real images y from the original data distribution and generated samples $G(x)$.
In addition, the framework facilitates the addition of an L1-regularisation term in order to try to force the generator to produce results that are 'closer' (i.e. more similar) to images from the original data distribution. Theoretically, this should preserve the structure of the ground-truth images and, in addition, prevent the generator from producing images where it has given certain pixels or even whole image regions a random colour just to deceive the discriminator.
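The sketch below illustrates how these objectives can be combined during training; the BCE adversarial loss and the L1 weighting (lambda_l1) are illustrative assumptions rather than the framework's exact configuration:

```python
# Hedged sketch of CGAN losses for colorization: the discriminator sees the grayscale
# L channel together with either ground-truth or generated a/b channels, and the
# generator loss adds an L1 term between predicted and ground-truth colours.
import torch
import torch.nn as nn

bce = nn.BCELoss()      # adversarial loss on the discriminator's probability output
l1 = nn.L1Loss()        # pixel-wise L1 regularisation term
lambda_l1 = 100.0       # assumed weighting of the L1 term

def discriminator_loss(D, L, real_ab, fake_ab):
    # Real pairs (L, ground-truth a/b) should be classified as 1,
    # generated pairs (L, predicted a/b) as 0.
    d_real = D(torch.cat([L, real_ab], dim=1))
    d_fake = D(torch.cat([L, fake_ab.detach()], dim=1))
    return bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

def generator_loss(D, L, real_ab, fake_ab):
    # The generator wants its colorizations to be classified as real (1),
    # while the L1 term keeps predictions close to the ground-truth a/b channels.
    d_fake = D(torch.cat([L, fake_ab], dim=1))
    return bce(d_fake, torch.ones_like(d_fake)) + lambda_l1 * l1(fake_ab, real_ab)
```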