Example MNIST 1 2 3
TODO:
- train dataset is not necessary, remove it
- rescale and center image value
Welcome to our labeling and active learning plugin. In the following, we will introduce a simple project and highlight the best practices to build your own active learning project easily in DSS.
Let's get started!
For the plugin to work, you need to execute your workflow using Python 3. To do so, go to your project settings, open the code env menu, and select a Python 3 environment.
Before getting into the demo, we need the data! We will be working with image data. In DSS, this is done by:
- Creating a folder where the image files are stored
- Creating a dataset that references the images in the folder using their relative path, along with the labels and other features
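To illustrate this layout, here is a minimal sketch that builds a table of relative image paths from a local directory, using only the standard library. The function name `build_manifest` and the `path` column are assumptions made for this illustration; they are not part of DSS or of the plugin.

```python
import os


def build_manifest(folder_path, extensions=(".png", ".jpg", ".jpeg")):
    """Return one row per image found under folder_path.

    Each row is a dict with a single 'path' key holding the image path
    relative to folder_path. The function name and the 'path' column are
    hypothetical and only serve this illustration.
    """
    rows = []
    for root, _dirs, files in os.walk(folder_path):
        for name in files:
            # Compare extensions case-insensitively.
            if name.lower().endswith(extensions):
                full_path = os.path.join(root, name)
                rows.append({"path": os.path.relpath(full_path, folder_path)})
    # Sort so the result does not depend on directory traversal order.
    return sorted(rows, key=lambda row: row["path"])
```

Labels and other features would then be added as extra keys in each row, next to `path`.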
For this demonstration, we will use data for which we already know the labels. The first thing to do is therefore to download it. For this, we provide a custom Python script:
- Create a Python code recipe
- When asked for them, create 4 outputs:
  - Managed folder MNIST-1-2-3-train
  - Managed folder MNIST-1-2-3-test
  - Dataset MNIST-1-2-3-train-data
  - Dataset MNIST-1-2-3-test-data
- Copy and paste the following code, replacing each FILL_ME with the id of the corresponding managed folder
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from keras.datasets import mnist
from PIL import Image
import os

# Replace FILL_ME with the ids of the train and test managed folders
mnist_train_path = dataiku.Folder('FILL_ME').get_path()
mnist_test_path = dataiku.Folder('FILL_ME').get_path()

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train[0].shape)

# Select only 1-2-3 images and save them as PNG files in the train folder
train_data = []
for i in np.where(np.isin(y_train, [1, 2, 3]))[0]:
    filename = 'train_{}.png'.format(i)
    Image.fromarray(x_train[i], mode='L').save(os.path.join(mnist_train_path, filename))
    train_data.append((filename, ))

# Same selection for the test set
test_data = []
for i in np.where(np.isin(y_test, [1, 2, 3]))[0]:
    filename = 'test_{}.png'.format(i)
    Image.fromarray(x_test[i], mode='L').save(os.path.join(mnist_test_path, filename))
    test_data.append((filename, y_test[i]))

# Write the datasets that reference the images by relative path
dataiku.Dataset('MNIST-1-2-3-train-data').write_with_schema(pd.DataFrame(train_data, columns=['path']))
dataiku.Dataset('MNIST-1-2-3-test-data').write_with_schema(pd.DataFrame(test_data, columns=['path', 'label']))
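After running the recipe, it can be useful to check that every path written to a dataset points to a file that exists in the matching folder. The helper below is a minimal sketch of such a check; `find_missing_files` is a hypothetical name and not part of DSS or of the plugin.

```python
import os


def find_missing_files(folder_path, relative_paths):
    """Return the relative paths that do not point to a file under folder_path."""
    return [
        rel_path
        for rel_path in relative_paths
        if not os.path.isfile(os.path.join(folder_path, rel_path))
    ]
```

In the recipe above, `find_missing_files(mnist_train_path, [row[0] for row in train_data])` would be expected to return an empty list.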
Do you notice something? Yes, the training data has no labels, we just forgot them! We are now in a real-life active learning situation: our training data is unlabeled.
The first step is to start annotating data and set up a prediction model. Who knows? Maybe 10 samples will be enough! Fortunately, DSS 6 features a simplified setup for web applications and will do all the tedious work for you.
Start by creating a web application to label images:
- Create a visual webapp, select "Image labeling". Call it "MNIST labeling".
- Set the web app settings as follows:
  - Label remains label
  - For Labeling metadata dataset, select MNIST-1-2-3-train-data
  - For Images, select MNIST-1-2-3-train
  - Labels is the dataset that will store the handpicked labels for the images. We do not have one yet, so create one named MNIST-1-2-3-annotations
  - Queries will store the samples selected by the active learning method to be labeled. Create a dataset for this one too and name it MNIST-1-2-3-queries
  - Categories are the classes available for the classification. In this example, we are classifying 1, 2, and 3 from the MNIST dataset. Create one entry for each of them.
- In the JavaScript security menu, grant all rights for the datasets you just created, and read access to MNIST-1-2-3-train-data. No need to grant access to the test data, as we will not be using it in the webapp.
- You can now save your settings.
This is it! You can launch the webapp by going to the view tab. Start labeling! You do not need a lot of images to start doing active learning, but be sure to have several of each class. I would say that 30 images is a good start.
Notice that you can use keystrokes to label images faster. Simply hit 1, 2, or 3, on your keyboard to select the class of the current image.
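The queries dataset configured in the settings is meant to hold the samples that an active learning method picks for labeling once a first model exists. To give an idea of what such a method does, here is a minimal sketch of least-confidence uncertainty sampling, one common strategy. It is not necessarily the strategy used by the plugin; `select_queries` and the input format are assumptions made for this example.

```python
def select_queries(probabilities, n_queries):
    """Pick the samples the current model is least sure about.

    probabilities maps a sample identifier (for instance an image path) to
    the list of class probabilities predicted by the model. The uncertainty
    of a sample is 1 minus its highest class probability. Returns the
    n_queries identifiers with the highest uncertainty, most uncertain first.
    """
    uncertainty = {
        sample_id: 1.0 - max(class_probas)
        for sample_id, class_probas in probabilities.items()
    }
    ranked = sorted(uncertainty, key=uncertainty.get, reverse=True)
    return ranked[:n_queries]
```

Labeling the most uncertain samples first is what lets a model improve with fewer annotations than labeling images in random order.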
Once you are done, go back to your flow and open the MNIST-1-2-3-annotations dataset. You can see your first annotations there!
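Since the advice above is to have several images of each class, a quick count per class can confirm that the annotations are balanced enough. The sketch below assumes each annotation is a dict with a `label` key; that row format and the name `count_labels` are assumptions made for this example.

```python
from collections import Counter


def count_labels(rows, label_column="label", expected_classes=("1", "2", "3")):
    """Count annotations per class, including classes with no annotation yet.

    Labels are compared as strings so that 1 and "1" count as the same class.
    """
    counts = Counter(str(row[label_column]) for row in rows)
    return {cls: counts.get(cls, 0) for cls in expected_classes}
```

In a DSS notebook, the rows could for instance come from `dataiku.Dataset('MNIST-1-2-3-annotations').get_dataframe().to_dict('records')`. A class with a count of zero means more labeling is needed before training.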
Let's now train a model on our annotations. Click on the MNIST-1-2-3-annotations dataset, and then on Lab. Select Quick model > Prediction > label > Automated Machine Learning > Quick prototypes > Create. You are now set with an analysis. However, DSS should have detected that you have no usable feature, since the image path cannot be used as-is.
Select Design > Feature handling > path. Select the feature as input, indicate that it is a Text, and select Custom preprocessing as handling. The problem we are facing now is that DSS has no built-in way to load images. Fortunately, a preprocessor is provided in our plugin. Just copy and paste the following code, which loads it from the plugin:
import numpy as np
import dataiku
dataiku.import_from_plugin('labeling-and-active-learning', 'image')
from image import ImgLoader
processor = ImgLoader('MPNMCFiI')
Note that using images as vectors in a model is not the best practice. One should rely on feature extraction (such as SIFT descriptors) or deep learning. As it happens, using the pixel values directly works with the MNIST dataset and is a simpler way to introduce active learning. For a more realistic demo, see the wall crack detection example (requires Keras).
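To make "images as vectors" concrete, here is a minimal plain-Python sketch that flattens a grayscale image into a feature vector, rescales the values to [0, 1] and centers them. It is only an illustration and not the plugin's ImgLoader, whose exact behavior is not described here; `image_to_vector` is a hypothetical name, and real code would typically use NumPy.

```python
def image_to_vector(pixels, max_value=255.0):
    """Flatten a 2D grid of pixel intensities into a centered feature vector.

    pixels is a list of rows, each row a list of intensities in
    [0, max_value]. A 28x28 MNIST image yields a vector of 784 values.
    """
    # Flatten row by row and rescale each value to [0, 1].
    flat = [value / max_value for row in pixels for value in row]
    # Center the vector around zero by subtracting its mean.
    mean = sum(flat) / len(flat)
    return [value - mean for value in flat]
```

Each pixel becomes one feature, so the model sees no spatial structure. This is why feature extraction or deep learning is usually preferred on real images.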
You can now hit the Train button! You should notice that the performance of the model is very low.