<!DOCTYPE HTML>
<html>
<head>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-46774844-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<title>Steven Ellis</title>
<link rel="stylesheet" type="text/css" href="self.css">
<link rel="icon"
type="image/png"
href="/favicon.png">
<link rel="icon"
type="image/png"
href="favicon.png">
</head>
<body>
<a href="index.html"><img id="logo" src="ideographic_description_character_overlaid.svg" type="image/svg+xml"></a>
<div id="wrapdiv">
<div id="links"><h2>
<a href="index.html">Profile</a>
<a id="sel" href="research.html">Research</a>
<a href="visualization.html">Visualization</a>
</h2></div>
<br>
<div id="story">
<h1>KnowYourNeighbr</h1>
<p>KnowYourNeighbr is an ensemble modeling approach that combines a blackbox model with a whitebox model. It lets the high accuracy typical of blackbox models be augmented with the interpretability typical of whitebox models. In most applications, I have used information from an initial XGBoost step to improve a second multivariate matching or KNN step. I have also combined the results of the two steps, as a form of model stacking. The description below covers the non-stacking approach; a stacking approach simply goes one step further and combines the results of the two models.</p>
<a href="modeling.png"><div class="image" style="text-align:center;"><img src="knowyourneighbr_fig_1.png" width="500"></div></a>
<p>I have used this approach for both matching outcomes (generating synthetic controls, either for post-hoc inference or for prospective experimental group assignment) and neighbor outcomes (generating a KNN prediction, displaying the selected neighbors, or both). A matching approach underlies the post <a href="acu_brfss.html">"Does Community Acupuncture Ameliorate Chronic Illness"</a>.</p>
<p>The utility of this approach lies in producing output in the form of raw data (rows of your original inputs) that are matched or grouped with each column (predictor variable) weighted by its importance in predicting your outcome variable.</p>
<p>This can be useful when:</p>
<p>1) The data structure suggests that tree-based learning applies:</p>
<ol>
<li>complex non-linear interactions</li>
<li>theory-driven hierarchical structure</li>
</ol>
<p>and there is:</p>
<p>2) Potential benefit from match generation:</p>
<ol><li>synthetic controls</li>
<li>interpretable measurements of bias between/across variables</li>
</ol>
<p>3) Potential benefit from neighbor generation:</p>
<ol><li>end-user interpretability via neighbor display</li>
<li>internal (or external) quality-control (including regulatory oversight)</li>
</ol>
<h2>Process:</h2>
<p>First, generate a tree-based model:</p>
<p>Using an approach such as classification/regression trees, random forest, or XGBoost, fit a model that predicts your outcome variable from your predictor variables.</p>
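<code>library(xgboost)
library(data.table)
library(caret)
dat_trva <- data.table(input)
# createDataPartition() returns a one-column matrix; flatten it to a plain index vector
tr_rows <- as.vector(createDataPartition(dat_trva$outcome,
  p = .8, list = FALSE, times = 1))
va_rows <- setdiff(seq_len(nrow(dat_trva)), tr_rows)
# predictors only: the outcome column must not appear in the feature matrices
pred_cols <- setdiff(names(dat_trva), "outcome")
tr_dm <- data.matrix(dat_trva[tr_rows, ..pred_cols])
tr_lab <- dat_trva[tr_rows, outcome]
tr_packaged <- xgb.DMatrix(tr_dm, label = tr_lab)
va_dm <- data.matrix(dat_trva[va_rows, ..pred_cols])
va_lab <- dat_trva[va_rows, outcome]
va_packaged <- xgb.DMatrix(va_dm, label = va_lab)
# the last watchlist entry (validate) drives early stopping;
# some newer xgboost releases name this argument evals instead of watchlist
tr_va_xgb_m <- xgb.train(
  params = list(objective = "reg:squarederror",
    eta = .1,
    max_depth = 10),
  data = tr_packaged,
  nrounds = 1000,
  early_stopping_rounds = 100,
  print_every_n = 50,
  watchlist = list(train = tr_packaged,
    validate = va_packaged)
)
</code>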
<p>Second, extract feature importances:</p>
<p>Feature importances are stored in the fitted model object and can be extracted with the xgb.importance function.</p>
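<code># one row per feature used by the model; the outer parentheses print the table
(model_imp <- xgb.importance(model = tr_va_xgb_m))
imp_val <- model_imp$Gain      # importance value for each feature
imp_vars <- model_imp$Feature  # feature names, in the same row order
</code>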
<p>Third, normalize and scale non-outcome variables according to feature importances:</p>
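<code># z-score every column; missing values and constant columns (NaN after scaling)
# are set to 0, which is the column mean on the scaled data
dat_trva_scaled <- scale(data.matrix(dat_trva))
dat_trva_scaled[is.na(dat_trva_scaled)] <- 0
# one weight per feature: the larger of Cover and Frequency (columns 3 and 4
# of the importance table)
imp_wt <- model_imp[, pmax(Cover, Frequency)]
# nrow is given explicitly so that a single weight still yields a 1 x 1 matrix
to_diag <- diag(imp_wt, nrow = length(imp_wt))
# reorder columns to match the importance table; columns absent from it are dropped
col_matchup <- match(model_imp$Feature,
  colnames(dat_trva_scaled))
# drop = FALSE keeps a matrix even when only one row or column is selected
tr_x <- dat_trva_scaled[tr_rows, col_matchup, drop = FALSE] %*% to_diag
va_x <- dat_trva_scaled[va_rows, col_matchup, drop = FALSE] %*% to_diag
tr_va_x <- dat_trva_scaled[, col_matchup, drop = FALSE] %*% to_diag
</code>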
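<p>To see why the weighting step matters, here is a minimal sketch in base R. The three rows and the two weights are made up for illustration (they are not output from any model), and the z-scoring step is skipped to keep the arithmetic easy to follow. Multiplying by the diagonal weight matrix changes which training row is nearest to the query row.</p>

```r
# Toy data (made up): two predictors, two candidate rows (a, b) and one query row (q)
x <- rbind(a = c(0, 2),
           b = c(1.5, 0),
           q = c(0, 0))
colnames(x) <- c("f1", "f2")

# Hypothetical importance weights: f1 matters far more than f2
w <- c(f1 = 0.9, f2 = 0.1)

# Same operation as in the step above: scale each column by its weight
x_w <- x %*% diag(w, nrow = length(w))

# Euclidean distance from the query row q to rows a and b
dist_to_q <- function(m) {
  sqrt(rowSums(sweep(m[c("a", "b"), , drop = FALSE], 2, m["q", ])^2))
}

d_raw <- dist_to_q(x)    # unweighted: a = 2, b = 1.5
d_wtd <- dist_to_q(x_w)  # weighted:   a = 0.2, b = 1.35

names(which.min(d_raw))  # "b" is nearest before weighting
names(which.min(d_wtd))  # "a" is nearest after weighting
```

<p>Row a agrees with the query on the heavily weighted column f1, so it becomes the nearest neighbor once distances reflect importance.</p>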
<p>Fourth, apply your matching or knn model:</p>
<p>NB: to get access to the neighbor metadata (neighbor indices and distances), I had to modify the default knn.reg function in the R package FNN so that it attaches them to its result.</p>
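<code># The next two lines are the additions made inside the modified FNN knn.reg,
# where Z holds the neighbor search result and res is the object it returns
attr(res, "nn.index") <- matrix(Z$nn.index, ncol = k)
attr(res, "nn.dist") <- matrix(Z$nn.dist, ncol = k)
# With the modified function loaded and k (the number of neighbors) set beforehand;
# outcome is the same outcome column used in the first step
kr <- knn.reg(train = tr_x,
  test = va_x,
  y = dat_trva[tr_rows, outcome],
  k = k)
# neighbor metadata is then available as attributes of the result
nn_idx <- attr(kr, "nn.index")
nn_dst <- attr(kr, "nn.dist")
# relative error of each validation prediction (0 means an exact prediction)
kr_resid_centered <- kr$pred / dat_trva[va_rows, outcome] - 1
</code>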
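<p>For readers who would rather not patch FNN, the same idea can be sketched in base R. The helper knn_with_neighbors below is hypothetical (it is not part of FNN), uses a brute-force search that only suits small data, and runs on made-up toy numbers. It returns the prediction together with the neighbor indices and distances. FNN's get.knnx function also returns nn.index and nn.dist, which may be a more practical route on real data.</p>

```r
# Brute-force k-nearest-neighbor regression that also returns neighbor metadata.
# knn_with_neighbors is a hypothetical helper written for this sketch.
knn_with_neighbors <- function(train, test, y, k) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  stopifnot(ncol(train) == ncol(test),
            length(y) == nrow(train),
            k <= nrow(train))
  nn_index <- matrix(NA_integer_, nrow(test), k)
  nn_dist <- matrix(NA_real_, nrow(test), k)
  for (i in seq_len(nrow(test))) {
    # Euclidean distance from test row i to every training row
    d <- sqrt(rowSums(sweep(train, 2, test[i, ])^2))
    nearest <- order(d)[seq_len(k)]
    nn_index[i, ] <- nearest
    nn_dist[i, ] <- d[nearest]
  }
  # prediction = mean outcome of the k nearest training rows
  pred <- rowMeans(matrix(y[as.vector(nn_index)], nrow = nrow(test)))
  list(pred = pred, nn.index = nn_index, nn.dist = nn_dist)
}

# Toy data, made up for illustration: one predictor, four training rows
train_x <- cbind(c(0, 1, 2, 10))
train_y <- c(1, 2, 3, 100)
test_x <- cbind(c(0.4, 9))

res <- knn_with_neighbors(train_x, test_x, train_y, k = 2)
res$nn.index  # training rows selected as neighbors for each test row
res$nn.dist   # distance to each selected neighbor
res$pred      # mean outcome of the selected neighbors
```

<p>The rows listed in nn.index can be looked up in the original, unscaled data to display the neighbors behind each prediction.</p>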