
Example MNIST 1 2 3

Alexandre Abraham edited this page Dec 12, 2019 · 3 revisions

TODO:

  • train dataset is not necessary, remove it
  • rescale and center image value

Welcome to our labeling and active learning plugin. In this tutorial, we walk through a simple project and highlight best practices for building your own active learning project in DSS.

Let's get started!

Setting up

Set your DSS Python environment

The plugin requires your workflow to run with Python 3. In your project settings, open the code env menu and select a Python 3 environment.

Selecting project env

Downloading the data

Before getting into the demo, we need the data! We will be working with image data. In DSS, image data is set up by:

  • Creating a folder where the image files are stored
  • Creating a dataset that references the images in the folder using their relative path, along with the labels and other features
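
As a rough illustration of the second point, here is a minimal sketch using only the Python standard library. It walks a folder and builds one row per image, holding the image path relative to that folder. The file names and the `list_image_rows` helper are hypothetical; in DSS the folder would be a managed folder and the rows would be written to a dataset.

```python
import os
import tempfile


def list_image_rows(folder, extensions=(".png", ".jpg")):
    """Return one row per image, keyed by its path relative to `folder`."""
    rows = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(extensions):
                full_path = os.path.join(root, name)
                rows.append({"path": os.path.relpath(full_path, folder)})
    return sorted(rows, key=lambda row: row["path"])


# Hypothetical folder holding two placeholder images and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("train_0.png", "train_1.png", "notes.txt"):
        open(os.path.join(folder, name), "wb").close()
    rows = list_image_rows(folder)

print(rows)
```

Storing relative paths keeps the dataset valid even if the folder itself is moved.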

For this demonstration, we will use data for which we already know the labels, so the first step is to download it. We provide a custom Python script for this:

  • Create a Python code recipe
  • When asked for them, create 4 outputs:
    • Managed folder MNIST-1-2-3-train
    • Managed folder MNIST-1-2-3-test
    • Dataset MNIST-1-2-3-train-data
    • Dataset MNIST-1-2-3-test-data
  • Copy and paste the following code, replacing each FILL_ME with the ID of the corresponding managed folder:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from keras.datasets import mnist
from PIL import Image
import os


mnist_train_path = dataiku.Folder('FILL_ME').get_path()
mnist_test_path = dataiku.Folder('FILL_ME').get_path()

(x_train, y_train), (x_test, y_test) = mnist.load_data()


# Sanity check: print the shape of one MNIST image
print(x_train[0].shape)

# Keep only the images of digits 1, 2 and 3, and save them as PNG files

train_data = []
for i in np.where(np.isin(y_train, [1, 2, 3]))[0]:
    filename = 'train_{}.png'.format(i)
    Image.fromarray(x_train[i], mode='L').save(os.path.join(mnist_train_path, filename))
    train_data.append((filename, ))

test_data = []
for i in np.where(np.isin(y_test, [1, 2, 3]))[0]:
    filename = 'test_{}.png'.format(i)
    Image.fromarray(x_test[i], mode='L').save(os.path.join(mnist_test_path, filename))
    test_data.append((filename, y_test[i]))
    
dataiku.Dataset('MNIST-1-2-3-train-data').write_with_schema(pd.DataFrame(train_data, columns=['path']))
dataiku.Dataset('MNIST-1-2-3-test-data').write_with_schema(pd.DataFrame(test_data, columns=['path', 'label']))

Did you notice something? Yes, the training data has no labels: we just forgot them! We are now in a real-life active learning situation, where the training data is unlabeled.
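
The selection step in the script above combines `np.isin` and `np.where`. The short sketch below shows what those two calls do on a toy label array; the values in `y` are made up for illustration and stand in for `y_train`.

```python
import numpy as np

# Toy labels standing in for y_train (hypothetical values).
y = np.array([5, 1, 3, 7, 2, 1])

# np.isin builds a boolean mask that is True where the label is 1, 2 or 3.
mask = np.isin(y, [1, 2, 3])

# np.where returns a tuple of index arrays; [0] takes the first (only) axis.
indices = np.where(mask)[0]

print(indices)     # positions of the selected samples
print(y[indices])  # their labels
```

Because the original index `i` is kept in each file name, a saved image can still be traced back to its position in the full MNIST arrays.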

Labeling some data

The first step is to start annotating data and set up a prediction model. Who knows? Maybe 10 samples will be enough! Fortunately, DSS 6 features a simplified setup for web applications and will handle the tedious work for you.

Setting up the labeling application

Start by creating a web application to label images:

  • Create a visual webapp, select "Image labeling". Call it "MNIST labeling".
  • Set the webapp settings as follows:
    • Label keeps its default value, label
    • For Labeling metadata dataset, select MNIST-1-2-3-train-data
    • For Images, select the MNIST-1-2-3-train folder
    • Labels is the dataset that will store the handpicked labels for the images. We do not have one yet, so create one named MNIST-1-2-3-annotations
    • Queries is the dataset that will store the samples selected by the active learning method to be labeled. Create a dataset for this one too, and name it MNIST-1-2-3-queries
    • Categories are the classes available for the classification. In this example, we are classifying 1, 2, and 3 from the MNIST dataset. Create one entry for each of them.
    • In the JavaScript security menu, grant all rights on the datasets you just created, and read access to MNIST-1-2-3-train-data. No need to grant access to the test data, since we will not be using it in the webapp.

You can save! Your configuration should look like this:

Webapp configuration

Label some images!

That's it! You can launch the webapp by going to the View tab. Start labeling! You do not need a lot of images to start doing active learning, but be sure to have several of each class. I would say that 30 images is a good start.

Note that you can use keystrokes to label images faster. Simply hit 1, 2, or 3 on your keyboard to select the class of the current image.

Once you are done, go back to your flow and open the MNIST-1-2-3-annotations dataset. You can see your first annotations there!

Annotations dataset.
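
Before training, it can help to check that every class has received several annotations. The sketch below counts labels per class with the standard library. The rows, the `path` and `label` column names, and the `underrepresented` helper are assumptions made for illustration; in DSS the rows would be read from the MNIST-1-2-3-annotations dataset.

```python
from collections import Counter

# Hypothetical annotation rows (column names are assumptions).
annotations = [
    {"path": "train_3.png", "label": "1"},
    {"path": "train_5.png", "label": "2"},
    {"path": "train_6.png", "label": "1"},
    {"path": "train_8.png", "label": "1"},
    {"path": "train_14.png", "label": "2"},
]


def underrepresented(rows, classes, min_count=2):
    """Return the classes that have fewer than `min_count` annotations."""
    counts = Counter(row["label"] for row in rows)
    return [c for c in classes if counts[c] < min_count]


counts = Counter(row["label"] for row in annotations)
print(dict(counts))
print(underrepresented(annotations, ["1", "2", "3"]))
```

Any class returned by `underrepresented` needs a few more labeled images before a model can learn to recognize it.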

Learning a model

Let's now train a model on our annotations. Click on the MNIST-1-2-3-annotations dataset, and then on Lab. Select Quick model > Prediction > label > Automated Machine Learning > Quick prototypes > Create. You now have an analysis set up. However, DSS should have detected that you have no usable features, since the image path cannot be used as-is.

Loading images as vectors

Select Design > Feature handling > path. Select the feature as input, indicate that it is Text, and select Custom preprocessing as handling. The problem we are facing now is that DSS has no built-in way to load images. Fortunately, a preprocessor is provided in our plugin. Just copy and paste the following code, which loads it from the plugin:

import numpy as np
import dataiku

dataiku.import_from_plugin('labeling-and-active-learning', 'image')
from image import ImgLoader

# 'MPNMCFiI' is presumably a managed folder ID: replace it with the ID of your own image folder
processor = ImgLoader('MPNMCFiI')

Note that using images as vectors in a model is not the best practice. One should rely on feature extraction, such as SIFT descriptors, or on deep learning. As it happens, using pixel values directly works with the MNIST dataset and is a simpler way to introduce active learning. For a more realistic demo, see the wall crack detection example (requires keras).
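
To make "images as vectors" concrete, the sketch below flattens a 28x28 grayscale array into a vector of 784 values and rescales it to the [0, 1] range. It only illustrates the idea with NumPy: it is not the plugin's ImgLoader implementation, and the `image_to_vector` helper is hypothetical.

```python
import numpy as np


def image_to_vector(image):
    """Flatten a 2D grayscale image (uint8, 0-255) into a float vector in [0, 1]."""
    return image.astype(np.float32).ravel() / 255.0


# Hypothetical 28x28 image: black background with a single white pixel.
image = np.zeros((28, 28), dtype=np.uint8)
image[0, 1] = 255

vector = image_to_vector(image)
print(vector.shape)
```

Each pixel becomes one feature, so the model sees no spatial structure, which is why feature extraction or deep learning is usually preferred on harder datasets.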

Train the model

You can now click the Train button! You should notice that the performance of the model is very low.