297 changes: 297 additions & 0 deletions best-practices/multitarget-stacking/multitarget_stacking.Rmd
@@ -0,0 +1,297 @@
---
title: "Multi-Target Stacking"
author: "Megan Kurka"
date: "August 17, 2018"
output:
  md_document:
    variant: markdown_github
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, error = FALSE)
```

## Multi-Target Stacking

Multi-Target stacking is a technique for predicting multiple target columns. Typical machine learning approaches model a single target column at a time. In this example, we will predict the columns `bad_loan` and `int_rate` using our cleaned Lending Club data: https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv.

The goal of this use case is to determine who will repay their loan and, for those who will, what the interest rate should be. This would typically be done by training two models:

* predict the target column: `bad_loan`
* predict the target column: `int_rate`

We can assume that `bad_loan` and `int_rate` are highly correlated and the information about `int_rate` may help us predict `bad_loan` and vice versa. Rather than split the problem up into two models, we propose a multi-target stacking approach.

## Multi-Target Stacking in H2O-3

Multi-Target stacking is performed in three steps:

1. Train base models on each target column
2. Extract cross validation predictions to create final model data
3. Train final models for each target using the cross validation predictions as features
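
The three steps above can be sketched in base R before bringing in H2O. This is only an illustration of the data flow: the data is simulated, the column names (`x1`, `rate`, `bad`) are hypothetical, and plain linear and logistic regressions stand in for the GBMs used in the rest of this document, so it says nothing about the size of any performance gain.

```r
# Simulated data (hypothetical): two related targets driven by one predictor
set.seed(42)
n <- 500
sim <- data.frame(x1 = rnorm(n))
sim$rate <- 10 + 2 * sim$x1 + rnorm(n)                   # numeric target
sim$bad  <- rbinom(n, 1, plogis(0.3 * (sim$rate - 10)))  # binary target

# Assign every row to one of 5 cross validation folds
folds <- sample(rep(1:5, length.out = n))

# Steps 1 and 2: train a base model per target on 4 folds and
# keep its predictions for the held-out fold
sim$rate_pred <- NA_real_
sim$bad_pred  <- NA_real_
for (k in 1:5) {
  held_out <- folds == k
  fit_rate <- lm(rate ~ x1, data = sim[!held_out, ])
  fit_bad  <- glm(bad ~ x1, data = sim[!held_out, ], family = binomial())
  sim$rate_pred[held_out] <- predict(fit_rate, newdata = sim[held_out, ])
  sim$bad_pred[held_out]  <- predict(fit_bad, newdata = sim[held_out, ],
                                     type = "response")
}

# Step 3: each final model uses the other target's held-out prediction
final_rate <- lm(rate ~ x1 + bad_pred, data = sim)
final_bad  <- glm(bad ~ x1 + rate_pred, data = sim, family = binomial())
```

Every row receives its prediction from a model that was trained without that row, which is the property the final models rely on.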


### Train Base Models

We will train a base model for each target column using cross validation.

```{r, results = "hide"}
library('h2o')
h2o.init()
h2o.no_progress()

df <- h2o.importFile("https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv")
df$bad_loan <- as.factor(df$bad_loan)
```

We will randomly split the data into 75% training and 25% testing. We will use the testing data to evaluate how well the model performs.

```{r}
# Split Frame into training and testing
splits <- h2o.splitFrame(df, seed = 1234, destination_frames=c("train.hex", "test.hex"), ratios = 0.75)
train <- splits[[1]]
test <- splits[[2]]
```

In this next step, we train a model for each target column. We will set `keep_cross_validation_predictions = TRUE` so that we can use the predictions later for our final models.

```{r}
predictors <- c("loan_amnt", "emp_length", "annual_inc", "dti", "delinq_2yrs", "revol_util", "total_acc",
                "longest_credit_length", "verification_status", "term", "purpose", "home_ownership")

# Train base model to predict bad_loan
gbm_bad_loan_base <- h2o.gbm(x = predictors, y = "bad_loan",
                             training_frame = train, nfolds = 5, # 5-fold cross validation
                             keep_cross_validation_predictions = TRUE,
                             score_each_iteration = TRUE, ntrees = 500,
                             seed = 1234,
                             stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                             model_id = "gbm-bad_loan-base.hex")

# Train base model to predict int_rate
gbm_int_rate_base <- h2o.gbm(x = predictors, y = "int_rate",
                             training_frame = train, nfolds = 5, # 5-fold cross validation
                             keep_cross_validation_predictions = TRUE,
                             score_each_iteration = TRUE, ntrees = 500,
                             seed = 1234,
                             stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                             model_id = "gbm-int_rate-base.hex")

```

What would our model performance be if we knew `int_rate` when predicting `bad_loan`, and vice versa? If these additional features improve model performance, this could indicate that Multi-Target Stacking can help.

```{r}
# Train model to predict bad_loan using int_rate
gbm_bad_loan_all_x <- h2o.gbm(x = c(predictors, "int_rate"), y = "bad_loan",
                              training_frame = train, nfolds = 5, # 5-fold cross validation
                              keep_cross_validation_predictions = TRUE,
                              score_each_iteration = TRUE, ntrees = 500,
                              seed = 1234,
                              stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                              model_id = "gbm-bad_loan-all_x.hex")

# Train model to predict int_rate using bad_loan
gbm_int_rate_all_x <- h2o.gbm(x = c(predictors, "bad_loan"), y = "int_rate",
                              training_frame = train, nfolds = 5, # 5-fold cross validation
                              keep_cross_validation_predictions = TRUE,
                              score_each_iteration = TRUE, ntrees = 500,
                              seed = 1234,
                              stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                              model_id = "gbm-int_rate-all_x.hex")

```
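
The idea behind this check can also be illustrated without H2O. The sketch below is a hypothetical base R example on simulated data (the columns `x1`, `rate`, and `bad` are made up for illustration): it compares a baseline linear model with an "oracle" model that is also given the true value of the other target, and measures both on held-out rows.

```r
# Simulated data (hypothetical): `bad` carries information about the
# part of `rate` that the predictor x1 cannot explain
set.seed(7)
n <- 1500
sim <- data.frame(x1 = rnorm(n))
noise <- rnorm(n)
sim$rate <- 10 + 2 * sim$x1 + noise
sim$bad  <- rbinom(n, 1, plogis(2 * noise))

sim_train <- sim[1:1000, ]
sim_test  <- sim[1001:n, ]

# Mean absolute error
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Baseline uses the predictor only; the oracle also sees the other target
baseline <- lm(rate ~ x1, data = sim_train)
oracle   <- lm(rate ~ x1 + bad, data = sim_train)

mae_baseline <- mae(sim_test$rate, predict(baseline, newdata = sim_test))
mae_oracle   <- mae(sim_test$rate, predict(oracle, newdata = sim_test))
c(baseline = mae_baseline, oracle = mae_oracle)
```

The oracle model cannot be used in practice because the other target is unknown at prediction time. It only gives a sense of how much there is to gain; stacking replaces the true value with a prediction of it.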

The performance metrics of these models on the testing data are shown below:
```{r, echo = FALSE}
# Get Performance Metrics
performance_comparison <- data.frame('Model' = c("baseline", "all predictors"),
                                     'bad_loan_AUC' = c(h2o.auc(h2o.performance(gbm_bad_loan_base, test)),
                                                        h2o.auc(h2o.performance(gbm_bad_loan_all_x, test))),
                                     'int_rate_MAE' = c(h2o.mae(h2o.performance(gbm_int_rate_base, test)),
                                                        h2o.mae(h2o.performance(gbm_int_rate_all_x, test))))
```


```{r, echo = FALSE}
library('knitr')
performance_comparison$bad_loan_AUC <- round(performance_comparison$bad_loan_AUC, 4)
performance_comparison$int_rate_MAE <- round(performance_comparison$int_rate_MAE, 4)
kable(performance_comparison, row.names = FALSE, col.names = c("Model", "AUC: bad_loan", "MAE: int_rate"))
```

We have much better performance in predicting `bad_loan` when we know the loan's interest rate. Likewise, we have much better performance in predicting `int_rate` when we know if the loan will be fully paid off. This indicates that stacking our base model predictions may help improve performance.
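
For readers less familiar with the two metrics, here is a minimal base R sketch of how they can be computed by hand. The helper names `mae` and `auc` are our own, and H2O's implementations may differ in detail, so small numerical differences from `h2o.auc` are possible. Higher AUC is better; lower MAE is better.

```r
# Mean absolute error: average size of the prediction error
mae <- function(actual, predicted) mean(abs(actual - predicted))

# AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
# randomly chosen positive row is scored higher than a randomly chosen negative row
auc <- function(labels, scores) {
  r <- rank(scores)            # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

mae(c(10, 12, 15), c(11, 12, 13))            # errors 1, 0, 2 -> 1
auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))   # 3 of 4 pairs ordered correctly -> 0.75
```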

### Create Final Model Data

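Now that we have our base models, we can add their cross validation holdout predictions to our training data. We cannot simply score the training data with the base models: those models have already seen these rows, so their predictions would be overly optimistic and the final models would be in danger of overfitting to them. Every prediction we use as a feature should come from a model that did not see that row during training.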

We will extend our dataset with the holdout predictions and use this extended data to train our final models.

```{r}
bad_loan_preds <- h2o.cross_validation_holdout_predictions(gbm_bad_loan_base)$p1
colnames(bad_loan_preds) <- c("bad_loan_pred")

int_rate_preds <- h2o.cross_validation_holdout_predictions(gbm_int_rate_base)$predict
colnames(int_rate_preds) <- c("int_rate_pred")

ext_train <- h2o.cbind(train, bad_loan_preds, int_rate_preds)
head(ext_train)
```
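
To see why in-sample predictions are risky, consider the following base R sketch. It is a hypothetical illustration on simulated data, not part of the H2O workflow: the target is pure noise, yet a model with many predictors appears to fit it well when it is scored on its own training rows. The out-of-fold predictions give a more honest picture.

```r
# Simulated data (hypothetical): 50 predictors, none related to the target
set.seed(1)
n <- 100
p <- 50
noise_df <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
noise_df$y <- rnorm(n)

# In-sample: score the rows the model was trained on
fit_all <- lm(y ~ ., data = noise_df)
in_sample_mae <- mean(abs(noise_df$y - predict(fit_all, newdata = noise_df)))

# Out-of-fold: each row is scored by a model trained without it
folds <- sample(rep(1:5, length.out = n))
oof <- rep(NA_real_, n)
for (k in 1:5) {
  held_out <- folds == k
  fit_k <- lm(y ~ ., data = noise_df[!held_out, ])
  oof[held_out] <- predict(fit_k, newdata = noise_df[held_out, ])
}
oof_mae <- mean(abs(noise_df$y - oof))

c(in_sample = in_sample_mae, out_of_fold = oof_mae)
```

If the in-sample predictions were used as a feature, a final model would learn to trust them far more than it should, because the same quality of prediction is not available for unseen rows.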

We can add these same predictions to our test data. Note that for our test dataset, we do not need to use the holdout cross validation predictions. Since the test data was not seen during training, we can simply predict using our base models on the test data to get our additional columns.

```{r}
bad_loan_preds <- h2o.predict(gbm_bad_loan_base, test)$p1
colnames(bad_loan_preds) <- c("bad_loan_pred")

int_rate_preds <- h2o.predict(gbm_int_rate_base, test)$predict
colnames(int_rate_preds) <- c("int_rate_pred")

ext_test <- h2o.cbind(test, bad_loan_preds, int_rate_preds)
```

### Train Final Models

Now that we have our extended training and testing data with predictions for `bad_loan` and `int_rate`, we can train our final models.

Our final models are the same as our base models, except that the `bad_loan` model has the additional feature `int_rate_pred` and the `int_rate` model has the additional feature `bad_loan_pred`.

```{r}

gbm_bad_loan_final <- h2o.gbm(x = c(predictors, "int_rate_pred"), y = "bad_loan",
                              training_frame = ext_train, validation_frame = ext_test,
                              ntrees = 500, score_each_iteration = TRUE,
                              stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                              model_id = "gbm-bad_loan-final.hex")

gbm_int_rate_final <- h2o.gbm(x = c(predictors, "bad_loan_pred"), y = "int_rate",
                              training_frame = ext_train, validation_frame = ext_test,
                              ntrees = 500, score_each_iteration = TRUE,
                              stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                              model_id = "gbm-int_rate-final.hex")
```

The performance metrics on the testing data are shown below:
```{r, echo = FALSE}
# Get Performance Metrics
performance_comparison <- data.frame('Model' = c("baseline", "multi-target_stacking"),
                                     'bad_loan_AUC' = c(h2o.auc(h2o.performance(gbm_bad_loan_base, test)),
                                                        h2o.auc(h2o.performance(gbm_bad_loan_final, ext_test))),
                                     'int_rate_MAE' = c(h2o.mae(h2o.performance(gbm_int_rate_base, test)),
                                                        h2o.mae(h2o.performance(gbm_int_rate_final, ext_test))))
```


```{r, echo = FALSE}
performance_comparison$bad_loan_AUC <- round(performance_comparison$bad_loan_AUC, 4)
performance_comparison$int_rate_MAE <- round(performance_comparison$int_rate_MAE, 4)
kable(performance_comparison, row.names = FALSE, col.names = c("Model", "AUC: bad_loan", "MAE: int_rate"))
```

We can see that the performance improves when we use the Multi-Target Stacking method compared to simply training a model per target.


## Putting It All Together

We will put the steps together into two functions: one that trains the Multi-Target Stacking models and one that scores new data with them.

```{r}

# Train Multi-Target Stacking

TrainMultiTargetStacking <- function(training_frame, validation_frame, x, y,
                                     nfolds = 5, seed = -1, score_tree_interval = 1){

  message("Train Base Models")

  base_models <- list()
  for(i in y){
    base_model <- h2o.gbm(x = x, y = i,
                          training_frame = training_frame,
                          nfolds = nfolds, seed = seed, keep_cross_validation_predictions = TRUE,
                          ntrees = 500, score_tree_interval = score_tree_interval,
                          stopping_rounds = 5, # early stopping
                          model_id = paste0("base-", i, ".hex"))
    base_models <- c(base_models, list(base_model))
  }

  message("Create Final Model Data")

  final_training_frame <- training_frame
  final_validation_frame <- validation_frame
  final_x <- x

  for(i in base_models){

    pred_name <- paste0("pred_", i@parameters$y)
    final_x <- c(final_x, pred_name)

    # Add Cross Validation Holdout Predictions to Training
    train_preds <- h2o.cross_validation_holdout_predictions(i)
    train_preds <- train_preds[ ,ncol(train_preds)]
    colnames(train_preds) <- pred_name
    final_training_frame <- h2o.cbind(final_training_frame, train_preds)

    # Add Predictions to Validation
    valid_preds <- h2o.predict(i, validation_frame)
    valid_preds <- valid_preds[ ,ncol(valid_preds)]
    colnames(valid_preds) <- pred_name
    final_validation_frame <- h2o.cbind(final_validation_frame, valid_preds)
  }

  message("Train Final Models")

  final_models <- list()
  for(i in y){
    final_model <- h2o.gbm(x = final_x, y = i,
                           training_frame = final_training_frame, validation_frame = final_validation_frame,
                           ntrees = 500, score_tree_interval = score_tree_interval,
                           stopping_rounds = 5, # early stopping
                           model_id = paste0("final-", i, ".hex"))
    final_models <- c(final_models, list(final_model))
  }

  names(final_models) <- y

  return(list('final_models' = final_models,
              'base_models' = base_models))
}

# Score with Multi-Target Stacking
ScoreMultiTargetStacking <- function(base_models, final_model, newdata){

  scoring_data <- newdata

  for(i in base_models){

    # Column name must match the one used when training the final models
    pred_name <- paste0("pred_", i@parameters$y)

    # Add Predictions to Scoring Data
    # (last column is p1 for a binary target, predict for a numeric target)
    preds <- h2o.predict(i, scoring_data)
    preds <- preds[ ,ncol(preds)]
    colnames(preds) <- pred_name
    scoring_data <- h2o.cbind(scoring_data, preds)
  }

  return(h2o.predict(final_model, scoring_data))
}
```

Building our final models using our new function:

```{r}
# Train Multi-Target Stacking Models
stacking_models <- TrainMultiTargetStacking(train, test, x = predictors, y = c("int_rate", "bad_loan"), seed = 1234)

# Predict with int_rate model
predictions <- ScoreMultiTargetStacking(stacking_models$base_models, stacking_models$final_models$int_rate, test)
head(h2o.cbind(test$int_rate, predictions))
```


## References

* [A Survey on Multi-Output Regression](http://cig.fi.upm.es/articles/2015/Borchani-2015-WDMKD.pdf)