
Example tabular


Welcome to our labeling and active learning plugin. In the following, we will introduce a simple project and highlight best practices so you can easily build your own active learning project in DSS. We will use data coming from this repository but, for the purpose of this demo, we have formatted the data in a more convenient way.

Let's get started!

In this demo, we will put ourselves in a real-life situation: we are a software company developing a Firefox/Chrome extension that blocks clickbait news articles. We released a first version based on very simple rules, which allowed us to gather some data from user reports. These labeled samples are listed in the file called clickbait_reported_by_users.csv.

We now want to make our extension even better by adding a machine-learning-based system. For this purpose, we have gathered news titles that are unfortunately not labeled yet. They are available in the file called clickbait_to_classify.csv.

Setting up

Downloading and preparing the data

Download the two files linked above. In order to load them in DSS:

  • Upload them
  • In the format menu
    • Quoting style is Escaping only
    • Separator is _
    • Skip first lines is 0
    • Check Parse next line as column headers
  • Create!

Note: The data to classify has only one column; this is totally normal!
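If you want to sanity-check the files before uploading them, a quick look with pandas is enough. This is only a convenience step; the paths are assumed to point to the downloaded files, and you may need to adjust the separator to match the one you set in DSS.

```python
import pandas as pd

# Labeled reports gathered by the first, rule-based version of the extension
reported = pd.read_csv("clickbait_reported_by_users.csv")
print(reported.columns.tolist())  # the news title plus its label
print(reported.head())

# Unlabeled titles to classify: a single column, which is expected
to_classify = pd.read_csv("clickbait_to_classify.csv")
print(to_classify.shape)
```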

For this demo, we will need at some point to merge the user-reported data with newly labeled data. We will therefore do an additional step now that may seem unnecessary but that will make things easier in the future. Select the clickbait_reported_by_users dataset and create a Stack recipe. Create as output a dataset named clickbait_stacked.
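For reference, the Stack recipe is just a row-wise concatenation of its input datasets. Outside of DSS, the same operation would look roughly like the sketch below; today the stack has a single input, and the labeled data produced later by the webapp will simply be appended to it.

```python
import pandas as pd

reported = pd.read_csv("clickbait_reported_by_users.csv")

# Placeholder for the clickbait_labeled dataset that the labeling webapp
# will produce later in this tutorial; it will become a second input.
labeled_later = pd.DataFrame(columns=reported.columns)

# The Stack recipe appends the rows of its inputs into one dataset.
stacked = pd.concat([reported, labeled_later], ignore_index=True)
stacked.to_csv("clickbait_stacked.csv", index=False)
```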

Your flow should look like this: Based flow with data

Set up your DSS python environment

For the plugin to work, you need to execute your workflow using Python 3. For this, go to your project settings, open the code env menu, and select a Python 3 environment. Selecting project env
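If you want to double-check which interpreter your notebooks and recipes actually run, a one-liner is enough:

```python
import sys

# Should report a Python 3.x interpreter if the code env is set correctly
print(sys.version)
```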

Building a first model

So far, we do not know whether active learning is necessary for our problem. Let's create a first model to see what performance we can achieve. Click on the clickbait_stacked dataset, and then on Lab. Select Quick model > Prediction > label > Automated Machine Learning > Quick prototypes > Create. You are now set up with an analysis. However, DSS has not detected the features you want to use as input. Go to Design > Feature handling and click on title. Set it as input, choose TF/IDF vectorization, and add English stop words. In Algorithms, disable Random forest; we will be working with Logistic Regression only.

Model configuration
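Under the hood, this configuration is essentially a TF/IDF text vectorization followed by a logistic regression. For readers who like to see code, here is a minimal scikit-learn equivalent; it assumes the stacked data is exported as a CSV with a title column and a label column, which is not required for the tutorial itself.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

data = pd.read_csv("clickbait_stacked.csv")

# TF/IDF vectorization with English stop words, then a logistic regression,
# mirroring the settings chosen in the visual analysis
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)

X_train, X_test, y_train, y_test = train_test_split(
    data["title"], data["label"], test_size=0.2, random_state=0
)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```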

Train the model

You can now hit the Train button! You should notice that the performance of the model is very low.

Click on the model and deploy it. DSS saves the current state of your model and makes it available in the flow for further predictions; this is called a Saved model in the DSS mumbo jumbo.

Putting active learning in place

Since our model performs badly, we would like to label more samples. Luckily, if you are a model whisperer, the model can tell you which samples it has trouble classifying, i.e. the samples that are the most interesting for it. Active learning does that for you automagically.

Creating the active learning recipe and the queries

Click on Add recipe, choose the Labeling and Active Learning plugin, and add a Query Sampler. Your saved model is the model you just deployed, the unlabeled data is clickbait_to_classify, and you need to create a queries dataset. Let's call it clickbait_queries.

Active learning recipe configuration

Create the recipe and choose Lowest confidence sampling as the strategy. Not to worry: on binary classification tasks, all the strategies are exactly the same. Run the recipe.
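For the curious, lowest confidence sampling simply ranks the unlabeled samples by the probability of their most likely class: the lower it is, the more informative the query. Here is a self-contained sketch of the idea with scikit-learn; the title and label column names are assumptions for illustration, not the plugin's internals.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = pd.read_csv("clickbait_stacked.csv")
unlabeled = pd.read_csv("clickbait_to_classify.csv")

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(labeled["title"], labeled["label"])

# Lowest confidence sampling: keep the titles whose most likely class
# has the smallest predicted probability
proba = model.predict_proba(unlabeled["title"])
unlabeled["confidence"] = proba.max(axis=1)
queries = unlabeled.sort_values("confidence").head(20)
print(queries)
```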

Setting up the labeling application

Start by creating a web application to label the news titles:

  • Create a visual webapp, select Tabular data labeling. Call it Clickbait labeling.
  • Set the web app settings as follows
    • Input
      • For Unlabeled data, select clickbait_to_classify
      • For categories, set clickbait and legit
    • Output
      • For Labeling metadata dataset, create a new dataset named clickbait_metadata
      • For Labels dataset, we are going to create a new dataset called clickbait_labeled
      • Labels target column name is clickbait
    • Active Learning specific
      • Queries is clickbait_queries created with the active learning recipe

Do not forget to set the rights of your app on your datasets! We need reading rights for inputs and writing rights for outputs: Dataset rights

You can save! Your configuration should look like: Webapp configuration

You can now start the webapp, which will initialize the output datasets. Do not start labeling yet, though; we still need to do some preparation! Go back to your flow: you should see that the clickbait_labeled dataset we have just created is not linked to our flow. We will add it by setting it as an input of the stacking recipe. Open the recipe, go to Settings, and click on Add input. Select clickbait_labeled. Run the recipe to update the schema. Your configuration should look like this: Stack configuration

And your flow: Final flow

Note: For identification purposes, the labeling webapp creates a unique hash for each sample. This is where the additional column comes from. It is not used anywhere else, so you should not include this column in any processing.
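As an illustration, such an identifier can be derived from the sample content itself, for example by hashing the title. This is only a sketch of the idea, not the exact scheme used by the webapp.

```python
import hashlib

def sample_id(title: str) -> str:
    """Deterministic identifier derived from the sample content."""
    return hashlib.sha1(title.encode("utf-8")).hexdigest()

print(sample_id("You won't believe what happens next"))
```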

Setting up the training scenario

The active learning process is intrinsically a loop in which the samples labeled by the user are used to select the following batch of samples. In DSS, this loop runs through the webapp, which uses the queries to fill the training data of the model, and a scenario that regularly retrains the model and generates new queries.
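To make the loop concrete, here is a tiny, self-contained simulation of it with scikit-learn on toy titles; in the project, the retraining is done by the scenario, the query selection by the Query Sampler recipe, and the annotator's answer comes from the labeling webapp.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pools, only to illustrate the mechanics of the loop
labeled_titles = ["10 tricks doctors hate", "Parliament votes new budget"]
labels = ["clickbait", "legit"]
unlabeled_titles = ["You won't believe this", "Central bank raises rates"]

for _ in range(2):
    # 1. Retrain on everything labeled so far (the scenario's job)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(labeled_titles, labels)

    # 2. Query the sample the model is least confident about (the recipe's job)
    proba = model.predict_proba(unlabeled_titles)
    query_idx = int(proba.max(axis=1).argmin())
    query = unlabeled_titles.pop(query_idx)

    # 3. Ask the annotator (the labeling webapp's job) and grow the training set
    answer = input(f"clickbait or legit? {query!r} ")
    labeled_titles.append(query)
    labels.append(answer)
```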

We are going to set up this scenario. Go to the scenario menu and create a new scenario. Our plugin provides a custom trigger that can be used to retrain the model every n labelings. Here are the steps to follow to put the training in place (a Python sketch of the equivalent steps follows the list):

  • Create the scenario, add a custom trigger Every n labeling.
  • In the scenario steps, add
A rebuild of the stacked dataset
    • A retrain of the saved model
    • A rebuild of the queries dataset
    • A restart of the webapp
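If you prefer scripting, most of these steps can also be expressed in a custom Python step with the scenario API. Treat the sketch below as an assumption to verify against the DSS API documentation for your version; the saved model id is a placeholder, and the webapp restart is left to the corresponding visual step.

```python
from dataiku.scenario import Scenario

scenario = Scenario()

# 1. Rebuild the stacked dataset with the freshly labeled samples
scenario.build_dataset("clickbait_stacked")

# 2. Retrain the saved model (replace with your own saved model id)
scenario.train_model("YOUR_SAVED_MODEL_ID")

# 3. Rebuild the queries with the Query Sampler recipe
scenario.build_dataset("clickbait_queries")
```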

Trigger configuration

First scenario step

Second scenario step

Third scenario step

Fourth scenario step

Turn the auto-trigger of the scenario on.

Preparing the dashboard

Go to Dashboards and create a dashboard; call it AL monitoring. In this dashboard, add a Metrics insight and set the settings as follows: Dashboard settings

Add it and, in the Metrics options, select History. Now keep a tab open with your dashboard so you can watch the insight, and open the webapp in a new tab.

Label data

This is it! You can launch the webapp by going to the View tab. Start labeling! The point of this experiment is to determine whether a news title is clickbait or not. Notice that you can use keystrokes to label samples faster: simply hit c or l on your keyboard to select the class of the current title.

Labeling webapp

As you label, you will see notifications pop up every 10 labelings. You can keep labeling and watch the dashboard from time to time. See the AUC rising!

AUC increasing