Example tabular
Welcome to our labeling and active learning plugin. In the following, we will introduce a simple project and highlight the best practices to build your own active learning project easily in DSS. We will use data coming from this repository but, for the purpose of this demo, we have formatted the data in a more convenient way.
Let's get started!
In this demo, we will put ourselves in a real-life situation. We are a software company that publishes a Firefox/Chrome extension blocking clickbait news articles. We released a first version based on very simple rules, which allowed us to gather some data from user reports. These labeled samples are listed in the file called clickbait_reported_by_users.csv.
We now want to make our extension even better by adding a machine-learning-based system. For this purpose, we have gathered news titles that are unfortunately not labeled yet. They are available in the file called clickbait_to_classify.csv.
Download the two files linked above. In order to load them in DSS:
- Upload them
- In the format menu
  - Quoting style is Escaping only
  - Separator is _
  - Skip first lines is 0
  - Check Parse next line as column headers
- Create!
Note: The data to classify has only one column. This is expected!
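To see what these format settings mean outside DSS, here is a minimal sketch using Python's standard csv module. The sample rows and the backslash escape character are assumptions made for illustration; they are not taken from the actual files.

```python
import csv
import io

# Hypothetical sample mimicking the layout described above:
# "_" separator, a header row, no quoting (escaping only).
sample = (
    "title_label\n"
    "You won't believe this trick_clickbait\n"
    "Budget vote delayed\\_again_legit\n"
)

def load_rows(text):
    """Parse "_"-separated text whose first line holds the column names."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="_",
        quoting=csv.QUOTE_NONE,  # "Escaping only": quotes are ordinary characters
        escapechar="\\",         # assumed escape character
    )
    return list(reader)

rows = load_rows(sample)
print(rows[0]["title"], "->", rows[0]["label"])
```

With quoting disabled, a separator that appears inside a title has to be escaped, which is why the second sample row contains an escaped underscore.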
For this demo, we will need at some point to merge the user-reported data with the newly labeled data. We will therefore do an additional step now that may seem unnecessary but that will make things easier in the future. Select the clickbait_reported_by_users dataset and create a Stack recipe. Create as output a dataset named clickbait_stacked.
Your flow should look like this:
For the plugin to work, you need to execute your workflow using Python 3. To do so, go to your project settings, open the code env menu, and select a Python 3 environment.
So far, we do not know if active learning is necessary for our problem. Let's create a first model to see which performance we can achieve. Click on the clickbait_stacked dataset, and then on Lab. Select Quick model > Prediction > label > Automated Machine Learning > Quick prototypes > Create. You are now set with an analysis. However, DSS has not detected the feature we want to use as input. Go to Design > Feature handling and click on title. Set it as input, choose TF/IDF vectorization, and add English stop words. In Algorithms, disable Random forest; we will be working with Logistic Regression only.
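To give an intuition of what TF/IDF vectorization with stop words does to a title, here is a small pure-Python sketch. The stop-word list, the titles, and the smoothed IDF formula are illustrative assumptions; the implementation used internally by DSS may differ.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far shorter than a real one).
STOP_WORDS = {"the", "a", "an", "of", "to", "you", "this", "is", "in"}

def tokenize(title):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?'\"") for w in title.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf(titles):
    """Return one {term: weight} dict per title."""
    docs = [tokenize(t) for t in titles]
    n_docs = len(docs)
    # Document frequency: number of titles containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF: terms found in fewer titles get a larger weight.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Made-up titles for illustration.
titles = [
    "You will not believe this trick",
    "Parliament delays the budget vote",
    "This trick will change the budget",
]
vectors = tfidf(titles)
print(vectors[0])
```

Words that appear in a single title, such as "believe", receive a higher weight than words shared by several titles, such as "trick", and stop words are dropped before weighting.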
You can now hit the Train button! You should notice that the performance of the model is very low.
Click on the model and deploy it. DSS saves the current state of your model and makes it available in the flow for further predictions. This is called a Saved model in DSS jargon.
Since our model performs badly, we would like to label more samples. Luckily, if you are a model whisperer, the model can tell you which samples it has trouble classifying, i.e. the samples that would be the most informative for it. Active learning does that for you automagically.
Click on add recipe, choose the Labeling and Active Learning plugin and add a Query Sampler. Your saved model is the model you just deployed, the unlabeled data is clickbait_to_classify, and you need to create a queries dataset. Let's call it clickbait_queries.
Create the recipe and choose Lowest confidence sampling as strategy. Not to worry: on binary classification tasks, all the strategies rank the samples in the same order. Run the recipe.
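The claim that the strategies coincide on binary tasks can be checked numerically. The sketch below uses the textbook definitions of lowest confidence, smallest margin, and highest entropy sampling (these definitions are an assumption; the plugin's exact formulas are not shown here) and ranks a set of made-up probabilities with each of them.

```python
import math

def lowest_confidence(p):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(p, 1.0 - p)

def smallest_margin(p):
    """Negated gap between the two class probabilities (higher = more uncertain)."""
    return -abs(p - (1.0 - p))

def entropy(p):
    """Shannon entropy of the binary prediction."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

def ranking(score, probas):
    """Indices of the samples, most uncertain first."""
    return sorted(range(len(probas)), key=lambda i: score(probas[i]), reverse=True)

# Made-up predicted probabilities of the positive class.
probas = [0.97, 0.55, 0.10, 0.42, 0.80, 0.51]

rankings = [ranking(s, probas) for s in (lowest_confidence, smallest_margin, entropy)]
print(rankings[0] == rankings[1] == rankings[2])
```

With only two classes, each of the three scores is a monotonic function of the distance between the predicted probability and 0.5, so they all query the same samples first.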
Start by creating a web application to label the samples:
- Create a visual webapp, select Tabular data labeling. Call it Clickbait labeling.
- Set the web app settings as follows
  - Input
    - For Unlabeled data, select clickbait_to_classify
    - For categories, set clickbait and legit
  - Output
    - For Labeling metadata dataset, create a new dataset named clickbait_metadata
    - For Labels dataset, we are going to create a new dataset called clickbait_labeled
    - Labels target column name is clickbait
  - Active Learning specific
    - Queries is clickbait_queries created with the active learning recipe
Do not forget to set the rights of your app on your datasets! The app needs read rights on the inputs and write rights on the outputs.
You can save! Your configuration should look like:
You can now start the webapp, which will initialize the output datasets. Do not start labeling yet though, we still need to do some preparation! Go back to your flow: you should see that the clickbait_labeled dataset that we have just created is not linked to our flow. We will add it by setting it as an input of the stacking recipe. Open the recipe, go to Settings and click on Add input. Select clickbait_labeled. Run the recipe to update the schema. Your configuration should look like this:
And your flow:
Note: For identification purposes, the labeling webapp creates a unique hash for each sample. This is where the additional column comes from. It is not used anywhere else, so you should not include this column in any processing.
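As an illustration of why an extra column shows up after stacking, here is a small sketch in plain Python. The column name sample_hash, the SHA-1 hashing, the sample rows, and the assumption that the stack keeps the union of the input columns are all hypothetical; the webapp's actual column name and hashing scheme are not specified here.

```python
import hashlib

def sample_hash(row):
    """Hypothetical identifier: SHA-1 of the row's title."""
    return hashlib.sha1(row["title"].encode("utf-8")).hexdigest()

def stack(*datasets):
    """Stack rows, keeping the union of columns; missing cells become None."""
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for rows in datasets
        for row in rows
    ]

# Made-up rows: user reports have no hash, webapp-labeled rows do.
reported = [{"title": "Ten tricks doctors hate", "label": "clickbait"}]
labeled = [{"title": "Senate passes budget", "label": "legit"}]
for row in labeled:
    row["sample_hash"] = sample_hash(row)

stacked = stack(reported, labeled)
print(stacked[0])
```

Rows coming from the user reports end up with an empty value in the hash column, which is one reason not to feed that column to the model.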
The active learning process is intrinsically a loop in which the samples labeled by the user are used to select the next batch of samples. In DSS, this loop runs through the webapp, which takes the queries and fills the training data of the model, and a scenario that regularly retrains the model and generates new queries.
We are going to set up this scenario. Go to the scenario menu and create a new scenario. Our plugin provides a custom trigger that can be used to retrain the model every n labelings. Here are the steps to follow to set up the training:
- Create the scenario, add a custom trigger Every n labeling.
- In the scenario steps, add
  - A rebuild of the stacked dataset
  - A retrain of the saved model
  - A rebuild of the queries dataset
  - A restart of the webapp
Set the auto-trigger of the scenario on.
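The loop that this scenario automates can be sketched as follows. This is a simplified stand-in, not the plugin's code: the ActiveLearningLoop class, its method names, and the step names are hypothetical, and the four steps only record that they ran.

```python
class ActiveLearningLoop:
    """Runs the scenario steps each time n new labels have been collected."""

    # Hypothetical step names mirroring the scenario steps listed above.
    STEPS = ("rebuild_stacked", "retrain_model", "rebuild_queries", "restart_webapp")

    def __init__(self, n):
        self.n = n
        self.label_count = 0
        self.last_trigger_count = 0
        self.history = []  # one entry per scenario run

    def should_trigger(self):
        """True once at least n labels have arrived since the last run."""
        return self.label_count - self.last_trigger_count >= self.n

    def run_scenario(self):
        # Placeholder for the real steps: here we only log their order.
        self.history.append(list(self.STEPS))
        self.last_trigger_count = self.label_count

    def add_label(self):
        """Called whenever the user labels one sample in the webapp."""
        self.label_count += 1
        if self.should_trigger():
            self.run_scenario()

loop = ActiveLearningLoop(n=10)
for _ in range(25):
    loop.add_label()
print(len(loop.history))
```

The order of the steps matters: the model has to be retrained on the freshly stacked labels before new queries are generated, and the webapp is restarted last so that it picks up those queries.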
Go to Dashboards and create a dashboard, call it AL monitoring. In this dashboard, add a Metrics insight and configure it as follows:
Add it and set the Metrics options to history. Now, keep a tab open with your dashboard to be able to watch the insights, and open the webapp in a new tab.
This is it! You can launch the webapp by going to the view tab. Start labeling! The point of this experiment is to determine whether the title of a news article is clickbait or not. Notice that you can use keystrokes to label samples faster. Simply hit c or l on your keyboard to select the class of the current title.
As you label, you will see a notification pop up every 10 labelings. You can keep labeling and watch the dashboard from time to time. See the AUC rising!
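For readers unfamiliar with the metric shown on the dashboard, the sketch below computes the AUC from its pairwise definition: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, counting ties as one half. The labels and scores are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# 1 = clickbait, 0 = legit; scores are made-up model outputs.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect ranking, so a rising curve on the dashboard means the newly labeled samples are helping the model separate the two classes.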