---
title: "Multi-Target Stacking"
author: "Megan Kurka"
date: "August 17, 2018"
output:
  md_document:
    variant: markdown_github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, error = FALSE)
```
## Multi-Target Stacking

Multi-target stacking is a technique for predicting multiple target columns. In a typical machine learning workflow, a single target column is selected per model. In this example, we will predict both `bad_loan` and `int_rate` using our cleaned Lending Club data: https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv.

The goal of this use case is to determine which borrowers will pay back their loans and, for those who will, what the interest rate should be. This would typically be done by training two independent models:

* one to predict the target column `bad_loan`
* one to predict the target column `int_rate`

We expect that `bad_loan` and `int_rate` are highly correlated, so information about `int_rate` may help us predict `bad_loan` and vice versa. Rather than splitting the problem into two independent models, we propose a multi-target stacking approach.
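Before committing to this approach, it can help to sanity-check the assumed relationship between the two targets. A minimal sketch (not evaluated here; it assumes an H2OFrame `df` with the loan data, as loaded in the next section):

```{r, eval = FALSE}
# Sketch: inspect the correlation between the two targets.
# h2o.cor expects numeric columns, so the factor target is converted first.
h2o.cor(df$int_rate, as.numeric(df$bad_loan))
```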
## Multi-Target Stacking in H2O-3

Multi-target stacking is performed in three steps:

1. Train a base model for each target column.
2. Extract the cross-validation holdout predictions to create the final model data.
3. Train a final model for each target, using the cross-validation predictions as additional features.
### Train Base Models

We will train a base model for each target using cross-validation.
```{r, results = "hide"}
library('h2o')
h2o.init()
h2o.no_progress()
df <- h2o.importFile("https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv")
df$bad_loan <- as.factor(df$bad_loan)
```
We will randomly split the data into 75% training and 25% testing, and use the testing data to evaluate how well the models perform.
```{r}
# Split Frame into training and testing
splits <- h2o.splitFrame(df, seed = 1234, destination_frames=c("train.hex", "test.hex"), ratios = 0.75)
train <- splits[[1]]
test <- splits[[2]]
```
In this next step, we train a model for each target column. We will set `keep_cross_validation_predictions = TRUE` so that we can use the predictions later for our final models.
```{r}
predictors <- c("loan_amnt", "emp_length", "annual_inc", "dti", "delinq_2yrs", "revol_util", "total_acc",
                "longest_credit_length", "verification_status", "term", "purpose", "home_ownership")

# Train base model to predict bad_loan
gbm_bad_loan_base <- h2o.gbm(x = predictors, y = "bad_loan",
                             training_frame = train, nfolds = 5, # 5-fold cross validation
                             keep_cross_validation_predictions = TRUE,
                             score_each_iteration = TRUE, ntrees = 500,
                             seed = 1234,
                             stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                             model_id = "gbm-bad_loan-base.hex")

# Train base model to predict int_rate
gbm_int_rate_base <- h2o.gbm(x = predictors, y = "int_rate",
                             training_frame = train, nfolds = 5, # 5-fold cross validation
                             keep_cross_validation_predictions = TRUE,
                             score_each_iteration = TRUE, ntrees = 500,
                             seed = 1234,
                             stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                             model_id = "gbm-int_rate-base.hex")
```
What would our model performance be if we knew `int_rate` at the time we predicted `bad_loan`, and vice versa? If these additional features improve model performance, that is a sign that multi-target stacking can help.
```{r}
# Train model to predict bad_loan using int_rate
gbm_bad_loan_all_x <- h2o.gbm(x = c(predictors, "int_rate"), y = "bad_loan",
                              training_frame = train, nfolds = 5, # 5-fold cross validation
                              keep_cross_validation_predictions = TRUE,
                              score_each_iteration = TRUE, ntrees = 500,
                              seed = 1234,
                              stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                              model_id = "gbm-bad_loan-all_x.hex")

# Train model to predict int_rate using bad_loan
gbm_int_rate_all_x <- h2o.gbm(x = c(predictors, "bad_loan"), y = "int_rate",
                              training_frame = train, nfolds = 5, # 5-fold cross validation
                              keep_cross_validation_predictions = TRUE,
                              score_each_iteration = TRUE, ntrees = 500,
                              seed = 1234,
                              stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                              model_id = "gbm-int_rate-all_x.hex")
```
The performance metrics on the testing data are shown below:
```{r, echo = FALSE}
# Get Performance Metrics
performance_comparison <- data.frame('Model' = c("baseline", "all predictors"),
                                     'bad_loan_AUC' = c(h2o.auc(h2o.performance(gbm_bad_loan_base, test)),
                                                        h2o.auc(h2o.performance(gbm_bad_loan_all_x, test))),
                                     'int_rate_MAE' = c(h2o.mae(h2o.performance(gbm_int_rate_base, test)),
                                                        h2o.mae(h2o.performance(gbm_int_rate_all_x, test))))
```
```{r, echo = FALSE}
library('knitr')
performance_comparison$bad_loan_AUC <- round(performance_comparison$bad_loan_AUC, 4)
performance_comparison$int_rate_MAE <- round(performance_comparison$int_rate_MAE, 4)
kable(performance_comparison, row.names = FALSE, col.names = c("Model", "AUC: bad_loan", "MAE: int_rate"))
```
We have much better performance in predicting `bad_loan` when we know the loan's interest rate. Likewise, we have much better performance in predicting `int_rate` when we know if the loan will be fully paid off. This indicates that stacking our base model predictions may help improve performance.
### Create Final Model Data
Now that we have our base models, we can add their cross-validation holdout predictions to our training data. We cannot simply predict with the base models on the training data: those predictions would be made on rows the models were trained on, putting us in danger of overfitting. Any prediction we use as a feature should come from hold-out data.

We will extend our dataset with the holdout predictions and use this extended data to train our final models.
```{r}
bad_loan_preds <- h2o.cross_validation_holdout_predictions(gbm_bad_loan_base)$p1
colnames(bad_loan_preds) <- c("bad_loan_pred")
int_rate_preds <- h2o.cross_validation_holdout_predictions(gbm_int_rate_base)$predict
colnames(int_rate_preds) <- c("int_rate_pred")
ext_train <- h2o.cbind(train, bad_loan_preds, int_rate_preds)
head(ext_train)
```
We add the same kind of prediction columns to our test data. Note that for the test dataset we do not need holdout cross-validation predictions: since the test data was not seen during training, we can simply predict with the base models on the test data to get the additional columns.
```{r}
bad_loan_preds <- h2o.predict(gbm_bad_loan_base, test)$p1
colnames(bad_loan_preds) <- c("bad_loan_pred")
int_rate_preds <- h2o.predict(gbm_int_rate_base, test)$predict
colnames(int_rate_preds) <- c("int_rate_pred")
ext_test <- h2o.cbind(test, bad_loan_preds, int_rate_preds)
```
### Train Final Models
Now that we have our extended training and testing data with predictions for `bad_loan` and `int_rate`, we can train our final models.

Our final models are the same as our base models, except that the `bad_loan` model has the additional feature `int_rate_pred` and the `int_rate` model has the additional feature `bad_loan_pred`.
```{r}
gbm_bad_loan_final <- h2o.gbm(x = c(predictors, "int_rate_pred"), y = "bad_loan",
                              training_frame = ext_train, validation_frame = ext_test,
                              ntrees = 500, score_each_iteration = TRUE,
                              stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.001,
                              model_id = "gbm-bad_loan-final.hex")

gbm_int_rate_final <- h2o.gbm(x = c(predictors, "bad_loan_pred"), y = "int_rate",
                              training_frame = ext_train, validation_frame = ext_test,
                              ntrees = 500, score_each_iteration = TRUE,
                              stopping_rounds = 5, stopping_metric = "MAE", stopping_tolerance = 0.001,
                              model_id = "gbm-int_rate-final.hex")
```
The performance metrics on the testing data are shown below:
```{r, echo = FALSE}
# Get Performance Metrics
performance_comparison <- data.frame('Model' = c("baseline", "multi-target stacking"),
                                     'bad_loan_AUC' = c(h2o.auc(h2o.performance(gbm_bad_loan_base, test)),
                                                        h2o.auc(h2o.performance(gbm_bad_loan_final, ext_test))),
                                     'int_rate_MAE' = c(h2o.mae(h2o.performance(gbm_int_rate_base, test)),
                                                        h2o.mae(h2o.performance(gbm_int_rate_final, ext_test))))
```
```{r, echo = FALSE}
performance_comparison$bad_loan_AUC <- round(performance_comparison$bad_loan_AUC, 4)
performance_comparison$int_rate_MAE <- round(performance_comparison$int_rate_MAE, 4)
kable(performance_comparison, row.names = FALSE, col.names = c("Model", "AUC: bad_loan", "MAE: int_rate"))
```
We can see that performance improves when we use multi-target stacking compared to simply training one model per target.
## Putting It All Together

We will put the steps together into two functions: one that trains the multi-target stacking models and one that scores new data with them.
```{r}
# Train Multi-Target Stacking
TrainMultiTargetStacking <- function(training_frame, validation_frame, x, y,
                                     nfolds = 5, seed = -1, score_tree_interval = 1){

  message("Train Base Models")
  base_models <- list()
  for(i in y){
    base_model <- h2o.gbm(x = x, y = i,
                          training_frame = training_frame,
                          nfolds = nfolds, seed = seed, keep_cross_validation_predictions = TRUE,
                          ntrees = 500, score_tree_interval = score_tree_interval,
                          stopping_rounds = 5, # early stopping
                          model_id = paste0("base-", i, ".hex"))
    base_models <- c(base_models, list(base_model))
  }

  message("Create Final Model Data")
  final_training_frame <- training_frame
  final_validation_frame <- validation_frame
  final_x <- x
  for(i in base_models){
    pred_name <- paste0("pred_", i@parameters$y)
    final_x <- c(final_x, pred_name)

    # Add Cross Validation Holdout Predictions to Training
    train_preds <- h2o.cross_validation_holdout_predictions(i)
    train_preds <- train_preds[ , ncol(train_preds)]
    colnames(train_preds) <- pred_name
    final_training_frame <- h2o.cbind(final_training_frame, train_preds)

    # Add Predictions to Validation
    valid_preds <- h2o.predict(i, validation_frame)
    valid_preds <- valid_preds[ , ncol(valid_preds)]
    colnames(valid_preds) <- pred_name
    final_validation_frame <- h2o.cbind(final_validation_frame, valid_preds)
  }

  message("Train Final Models")
  final_models <- list()
  for(i in y){
    final_model <- h2o.gbm(x = final_x, y = i,
                           training_frame = final_training_frame, validation_frame = final_validation_frame,
                           ntrees = 500, score_tree_interval = score_tree_interval,
                           stopping_rounds = 5, # early stopping
                           model_id = paste0("final-", i, ".hex"))
    final_models <- c(final_models, list(final_model))
  }
  names(final_models) <- y

  return(list('final_models' = final_models,
              'base_models' = base_models))
}

# Score with Multi-Target Stacking
ScoreMultiTargetStacking <- function(base_models, final_model, newdata){
  scoring_data <- newdata
  for(i in base_models){
    pred_name <- paste0("pred_", i@parameters$y)

    # Add Base Model Predictions to Scoring Data
    preds <- h2o.predict(i, scoring_data)
    preds <- preds[ , ncol(preds)]
    colnames(preds) <- pred_name
    scoring_data <- h2o.cbind(scoring_data, preds)
  }
  return(h2o.predict(final_model, scoring_data))
}
```
Building our final models using our new function:
```{r}
# Train Multi-Target Stacking Models
stacking_models <- TrainMultiTargetStacking(train, test, x = predictors, y = c("int_rate", "bad_loan"), seed = 1234)
# Predict with int_rate model
predictions <- ScoreMultiTargetStacking(stacking_models$base_models, stacking_models$final_models$int_rate, test)
head(h2o.cbind(test$int_rate, predictions))
```
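Scoring with the `bad_loan` final model works the same way. A sketch (not evaluated here; it assumes the `stacking_models` object created above):

```{r, eval = FALSE}
# Predict with bad_loan model (mirrors the int_rate scoring above)
bad_loan_predictions <- ScoreMultiTargetStacking(stacking_models$base_models,
                                                 stacking_models$final_models$bad_loan, test)
head(h2o.cbind(test$bad_loan, bad_loan_predictions))
```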
## References
* [A Survey on Multi-Output Regression](http://cig.fi.upm.es/articles/2015/Borchani-2015-WDMKD.pdf)