---
title: 'POLS 503: Assignment 2'
date: '2017-04-21'
output:
  md_document: default
  html_document:
    includes:
      before_body: includes/before_body.html
  pdf_document:
    includes:
      in_header: includes/in_header.tex
    keep_tex: yes
bibliography: assignment2.bib
---
This assignment works through an example in @Yule1899a, an early published example of multiple regression analysis in its modern form.[^yule]
Yule wrote the paper to analyze the effect of policy changes and implementation on pauperism (the poor receiving benefits) in England under the [English Poor Laws](https://en.wikipedia.org/wiki/English_Poor_Laws). In 1834, a new poor law was passed that established a national welfare system in England and Wales. The New Poor Law created new administrative districts (Poor Law Unions) to administer the law. Most importantly, it attempted to standardize the provision of aid to the poor. There were two types of aid: in-relief, provided to paupers in the workhouses where they resided, and out-relief, provided to paupers residing at home. The New Poor Law aimed to decrease out-relief and increase in-relief in the belief that in-relief, in particular the quality of life in workhouses, was a deterrent to poverty and an encouragement for the poor to work harder to avoid it.
Yule identifies that there are various potential causes of the change in rate of pauperism, including changes in the (1) law, (2) economic conditions, (3) general social character, (4) moral character, (5) age distribution of the population (pg. 250).
He astutely notes the following:
> If, for example, we should find an increase in the proportion of out-relief associated with (1) an increase in the proportion of the aged to the whole population, and also (2) an increase in the rate of pauperism, it might be legitimate to interpret the result in the sense that changes in out-relief and pauperism were merely simultaneous concomitants of changes in the proportion of aged---the change of pauperism not being a direct consequence of the change of administration, but both direct consequences of the change in age distribution. It is evidently most important that we should be able to decide between two such different interpretations of the same facts. This the method I have used is perfectly competent to do --- @Yule1899a [pg. 250]
[^yule]: See @Freedman_1997, @Stigler1990a, @Stigler2016a, and @Plewis2017a for discussions of @Yule1899a.
# Setup
```{r message=FALSE}
library("tidyverse")
library("modelr")
# devtools::install_github("jrnold/resamplr")
library("resamplr")
```
While only a subset of the original data of @Yule1899a was printed in the article itself, @Plewis2015a reconstructed the original data and @Plewis2017a replicated the original paper. These data are included in the package **datums**. This package is not on CRAN, but can be downloaded from GitHub.
**IMPORTANT:** Install the latest version of **datums**, since a few fixes were recently made to the `pauperism` dataset.
```{r}
# devtools::install_github("jrnold/datums")
library("datums")
```
The data for @Yule1899a is split into two data frames: `pauperism_plu` contains data on the Poor Law Unions (PLU), and `pauperism_year`, panel data with the PLU-year as the unit of observation.
```{r}
pauperism <-
left_join(datums::pauperism_plu, datums::pauperism_year,
by = "ID") %>%
mutate(year = as.character(year))
```
The data consist of `r length(unique(pauperism$ID))` PLUs and the years: 1871, 1881, 1891 (years in which there was a UK census).
@Yule1899a is explicitly using regression for causal inference. The outcome variable of interest is:
- **Pauperism:** the percentage of the population in receipt of relief of any kind, less lunatics and vagrants
The treatment (policy intervention) is the ratio of numbers receiving outdoor relief to those receiving indoor relief.
- **Out-Relief Ratio:** the ratio of numbers relieved outdoors to those relieved indoors
He will control for two variables that may be associated with the treatment:
- **Proportion of Old:** the proportion of the aged (65 years and over) to the whole population, since the old are more likely to be poor.
- **Population:** in particular changes in population that may be proxying for changes in the economic, social, or moral factors of PLUs.
There is also **Grouping of Unions**, which is a locational classification based on population density that consists of Rural, Mixed, Urban, and Metropolitan.
Instead of taking differences or percentages, Yule worked with "percent ratio differences", $100 \times \frac{x_{t}}{x_{t-1}}$, because he did not want to work with negative signs, presumably a concern because he was doing arithmetic by hand, and negative numbers would make the calculations more tedious or error-prone.
## Original Specification
Run regressions of `pauper` using the yearly level data with the following specifications.
In @Yule1899a, the regressions are
- *M1:* `paupratiodiff ~ outratiodiff + year + Type`
- *M2:* `paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type)`
- *M3:* `paupratiodiff ~ -1 + (outratiodiff + popratiodiff + oldratiodiff) * (year + Type)`
- *M4:* `paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type)`
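As a sketch, the four specifications can be fit as follows (this assumes the merged `pauperism` data frame from the setup chunk above; in R formulas, the intercept is suppressed with `-1` on the right-hand side):

```r
# A sketch: fit the four specifications from Yule (1899), assuming the
# merged `pauperism` data frame from the setup chunk above.
m1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism)
m2 <- lm(paupratiodiff ~ outratiodiff +
           (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
# M3 drops the intercept; the `-1` goes on the right-hand side
m3 <- lm(paupratiodiff ~ -1 +
           (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m4 <- lm(paupratiodiff ~
           (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
```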
1. Present the regression results in a regression table.
2. Interpret the coefficients for `outratiodiff` for each model.
3. Write the equations for each or all models, and describe each model with a sentence or two. Try to be as concise as possible. Look at recent journal articles for examples of the wording and format.
4. What is the difference between *M3* and *M4*? What are the pros and cons of each parameterization?
5. Conduct F-tests of the hypotheses:
    1. All interactions in *M4* are 0.
    2. The coefficients on `outratiodiff` in *M4* are the same across years.
    3. The coefficients on `outratiodiff` in *M4* are the same across PLU Types.
    4. The coefficients on `outratiodiff` in *M4* are the same across PLU Types and years.

    You can conduct F-tests with the function `anova(mod_unrestricted, mod_restricted)`.
6. Calculate the predicted value and confidence interval for the PLU with the median value of `outratiodiff`, `popratiodiff`, and `oldratiodiff` in each year and PLU Type for these models. Plot these predicted values and confidence intervals as point-ranges.
7. As before, calculate the predicted value of the median PLU in each year and PLU Type, but include the prediction interval instead of the confidence interval. How do confidence and prediction intervals differ? What are their definitions?
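For example, the first F-test and the interval calculations might look like this (a sketch: `m4` is the model above, and the `newdata` values are illustrative placeholders, not the actual medians):

```r
# F-test that all interactions in M4 are zero: compare M4 against the
# restricted model with main effects only.
m4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m4_main <- lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff + year + Type,
              data = pauperism)
anova(m4_main, m4)

# Confidence vs. prediction intervals for a hypothetical PLU;
# replace the placeholder values with the medians by year and Type.
newdata <- tibble(outratiodiff = 100, popratiodiff = 100, oldratiodiff = 100,
                  year = "1881", Type = "Urban")
predict(m4, newdata = newdata, interval = "confidence")
predict(m4, newdata = newdata, interval = "prediction")
```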
## Functional Forms
The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can also be written as
$$
\begin{aligned}[t]
100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2}_t}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2}_{t-1}}
&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio}_{t-1}} \\
& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t-1}}
\end{aligned}
$$
1. Write a model that includes only the log differences ($\log(x_t) - \log(x_{t - 1})$) with only the `pauper2`, `outratio`, `Popn2`, and `Popn65` variables.
2. Estimate the model with logged difference predictors, year, and Type, and interpret the coefficient on $\log(outratio_t)$.
3. What are the pros and cons of this parameterization of the model relative to the one in @Yule1899a? Focus on interpretation and the desired goal of the inference rather than the formal tests of the regression. Can you think of other, better functional forms?
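One way to sketch the log-difference specification (assuming the `pauper2`, `outratio`, `Popn2`, and `Popn65` columns in the yearly data, and that the panel can be lagged within each PLU by year):

```r
# A sketch: construct log differences within each PLU, then fit the model.
# Zero values of outratio would produce infinite logs; handle as needed.
pauperism_log <-
  pauperism %>%
  group_by(ID) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(dlog_pauper   = log(pauper2)  - log(lag(pauper2)),
         dlog_outratio = log(outratio) - log(lag(outratio)),
         dlog_pop      = log(Popn2)    - log(lag(Popn2)),
         dlog_old      = log(Popn65)   - log(lag(Popn65))) %>%
  ungroup()

lm(dlog_pauper ~ dlog_outratio + dlog_pop + dlog_old + year + Type,
   data = pauperism_log)
```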
## Non-differenced Model
Suppose you estimate the model (*M5*) without differencing,
```
pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type)
```
- Interpret the coefficient on `outratio`. How is this different than model *M2*?
- What accounts for the difference in sample sizes in *M5* and *M2*?
- What model do you think will generally have less biased estimates of the effect of out-relief on pauperism: *M5* or *M2*? Explain your reasoning.
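A sketch of *M5* next to *M2*, with a quick check of the sample sizes:

```r
# A sketch: the non-differenced model M5 and the differenced model M2.
m5 <- lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type),
         data = pauperism)
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
# Differencing drops the first census year for each PLU (no lag available),
# so M2 is fit on fewer observations than M5.
nobs(m5)
nobs(m2)
```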
## Substantive Effects
Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing.
## Influential Observations and Outliers
### Influential Observations for the Regression
For this use *M2*:
1. For each observation, calculate and explain the following:
- hat value (`hatvalues`)
- standardized error (`rstandard`)
- studentized error (`rstudent`)
- Cook's distance (`cooksd`)
2. Create an outlier plot and label any outliers. See the example [here](https://jrnold.github.io/intro-methods-notes/outliers.html#iver-and-soskice-data)
3. Using the plot and rules of thumb identify outliers and influential observations
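These diagnostics can be sketched as follows (note that the base R function for Cook's distance is `cooks.distance`; `broom::augment` returns the same quantities as `.hat`, `.std.resid`, and `.cooksd`):

```r
# A sketch: influence diagnostics for M2, one value per observation.
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m2_diag <- tibble(
  hat    = hatvalues(m2),      # leverage
  rstd   = rstandard(m2),      # standardized residual
  rstu   = rstudent(m2),       # studentized residual
  cooksd = cooks.distance(m2)  # Cook's distance
)

# An outlier plot: leverage vs. studentized residual, sized by Cook's distance.
ggplot(m2_diag, aes(x = hat, y = rstu, size = cooksd)) +
  geom_point() +
  labs(x = "hat value", y = "studentized residual")
```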
## Influential Observations for a Coefficient
1. Run *M2*, deleting each observation and saving the coefficient for `outratiodiff`. This is a method called the jackknife. You can use a for loop to do this, or you can use the function `jackknife` in the package [resamplr](https://github.com/jrnold/resamplr).
1. For which observations is there the largest change in the coefficient on `outratiodiff`?
2. Which observations have the largest effect on the estimate of `outratiodiff`?
3. How do these observations compare with those that had the largest effect on the overall regression as measured with Cook's distance?
4. Compare the results of the jackknife to the `dfbeta` statistic for `outratiodiff`
2. @AronowSamii2015a note that an observation's influence on a particular regression coefficient is different from its influence on the regression as a whole. Calculate the observation weights for `outratiodiff`.
    1. Regress `outratiodiff` on the control variables.
    2. The effective observation weights are proportional to the squared residuals from this regression. Which observations have the highest weights?
    3. How do the observations with the highest regression weights compare with those with the largest changes in the regression coefficient from the jackknife?
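A sketch of the jackknife and the Aronow-Samii weights (the weight regression here regresses the treatment on the same controls as *M2*; adjust as the reading suggests):

```r
# A sketch: delete-one jackknife for the `outratiodiff` coefficient in M2.
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m2_data <- model.frame(m2)
jk <- map_dbl(seq_len(nrow(m2_data)), function(i) {
  coef(update(m2, data = m2_data[-i, ]))[["outratiodiff"]]
})
# Change in the coefficient from deleting each observation:
jk_change <- jk - coef(m2)[["outratiodiff"]]
# Compare with the analytic delete-one change reported by dfbeta:
dfb <- dfbeta(m2)[, "outratiodiff"]

# Aronow-Samii style weights: squared residuals from regressing the
# treatment on the controls.
w <- residuals(lm(outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type),
                  data = m2_data))^2
```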
## Omitted Variable Bias
An informal way to assess the potential impact of omitted variables on the coefficient of the variable of interest is to examine how that coefficient changes when covariates are added, as a measure of the potential for omitted variable bias [@Oster2016a].
@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic "provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect."
1. Run a regression without any controls. Denote the coefficient on the variable of interest as $\hat\beta_R$.
2. Run a regression with the full set of controls. Denote the coefficient on the variable of interest in this regression as $\hat\beta_F$.
3. The statistic is the ratio $\hat\beta_F / (\hat\beta_R - \hat\beta_F)$.
Calculate this statistic for *M2* and interpret it.
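As a sketch, for the `outratiodiff` coefficient:

```r
# A sketch: the Nunn-Wantchekon ratio for the `outratiodiff` coefficient.
# beta_R: no controls; beta_F: the full M2 specification.
beta_R <- coef(lm(paupratiodiff ~ outratiodiff,
                  data = pauperism))[["outratiodiff"]]
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
beta_F <- coef(m2)[["outratiodiff"]]
beta_F / (beta_R - beta_F)
```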
## Heteroskedasticity
### Robust Standard Errors
1. Run *M2* and *M3* with heteroskedasticity-consistent (HC, or "robust") standard errors. How does this affect the standard errors on the `outratio` coefficients? Use the **sandwich** package to compute the robust standard errors [@Zeileis2004a].
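One way to sketch this (using `lmtest::coeftest` to display the coefficient table, an assumption about workflow rather than a requirement):

```r
# A sketch: robust (HC) standard errors for M2 via sandwich + lmtest.
library("sandwich")
library("lmtest")
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
coeftest(m2, vcov = sandwich::vcovHC(m2))
```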
### Multiple Regressions
1. Run the model with interactions for all years and types
```r
lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism)
```
2. For each subset of `year` and `type` run the regression
```r
lm(pauper2 ~ outratio + Popn2 + Prop65)
```
3. Compare the coefficients, standard errors, and regression standard errors in these regressions.
To run the multiple regressions, save models as a list column `mod`,
then save the results of `glance` and `tidy` as list columns:
```{r}
all_interact <-
crossing(Type = pauperism$Type, year = c(1881, 1891)) %>%
mutate(mod = map2(year, Type,
function(yr, ty) {
lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff,
data = filter(pauperism,
year == yr,
Type == ty))
})) %>%
mutate(mod_glance = map(mod, broom::glance),
mod_tidy = map(mod, broom::tidy))
```
Now extract parts of the models, e.g., the standard errors of the regressions:
```{r}
all_interact %>%
mutate(sigma = map_dbl(mod_glance, function(x) x$sigma)) %>%
select(year, Type, sigma)
```
## Weighted Regression
1. Run *M2* and *M3* as weighted regressions, weighted by the population (`Popn`) and interpret the coefficients on `outratiodiff` and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more?
2. What are some rationales for weighting by population? See the discussion in @SolonHaiderWooldridge2013a and @AngristPischke2014a.
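A sketch of the weighted version of *M2* (the population column is `Popn`, per the task above):

```r
# A sketch: M2 weighted by PLU population.
m2_wt <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
            data = pauperism, weights = Popn)
summary(m2_wt)
```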
## Cross-Validation
When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables.
Another criterion for selecting a model is its fit to the data.
But a model's fit should not be assessed using only the in-sample data.
That leads to overfitting---by that criterion, the best model would always be one that includes an indicator variable for every observation.
Instead, a model's fit should be assessed by its out-of-sample fit.
One way to estimate the *expected* fit of a model to *new* data is cross-validation.
We want to compare the predictive performance of the following models:
```{r}
mod_formulas <-
list(
m0 = paupratiodiff ~ 1,
m1 = paupratiodiff ~ year + Type,
m2 = paupratiodiff ~ outratiodiff + year + Type,
m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
m4 = paupratiodiff ~ -1 + (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type
)
```
Let's split the data into 10 (train/test) folds for cross-validation,
```{r}
pauperism_nonmiss <-
pauperism %>%
filter(year %in% c(1881, 1891)) %>%
select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID) %>%
tidyr::drop_na()
pauperism_10folds <-
pauperism_nonmiss %>%
resamplr::crossv_kfold(10)
```
For each model formula `f`, training data set `train`, and test data set `test`,
run the model specified by `f` on `train`, predict new observations in `test`, and calculate the RMSE from the residuals:
```{r}
mod_rmse_fold <- function(f, train, test) {
fit <- lm(f, data = as.data.frame(train))
test_data <- as.data.frame(test)
err <- test_data$paupratiodiff - predict(fit, newdata = test_data)
sqrt(mean(err ^ 2))
}
```
E.g. for one fold and formula,
```{r}
mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]],
pauperism_10folds$test[[1]])
```
Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with `train` and `test` list-columns:
```{r}
mod_rmse <- function(f, data) {
map2_dbl(data$train, data$test,
function(train, test) {
mod_rmse_fold(f, train, test)
}) %>%
mean()
}
```
```{r}
mod_rmse(mod_formulas[[1]], pauperism_10folds)
```
Finally, we want to run `mod_rmse` for each formula in `mod_formulas`.
It will be easiest to store this in a data frame:
```{r}
cv_results <- tibble(
model_formula = mod_formulas,
.id = names(mod_formulas),
# Formula as a string
.name = map(model_formula,
function(x) gsub(" +", " ", paste0(deparse(x), collapse = "")))
)
```
Use `map` to run `mod_rmse` for each model and save the results as a list column in the data frame,
```{r}
cv_results <-
mutate(cv_results,
cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds))
```
In the case of linear regression, the MSE of the Leave-one-out ($n$-fold) cross-validation can be analytically calculated without having to run $n$ regressions.
```{r}
loocv <- function(x) {
mean((residuals(x) / (1 - hatvalues(x))) ^ 2)
}
```
We can apply this to each formula:
```{r}
cv_results <-
mutate(cv_results,
rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss)))))
```
1. In the 10-fold cross-validation, which model has the best out-of-sample prediction?
2. Using leave-one-out cross-validation, which model has the best out-of-sample prediction?
3. Do the prediction metric (RMSE) and prediction task---predicting individual PLUs from other PLUs---make sense? Can you think of others that you would prefer?
## Bootstrapping
Estimate 95% confidence intervals for the model using simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows:
Let $\hat\theta$ be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure.
For samples $b = 1, ..., B$.
1. Draw a sample with replacement from the data
2. Estimate the statistic of interest and call it $\theta_b^*$.
Let $\theta^* = \{\theta_1^*, \dots, \theta_B^*\}$ be the set of bootstrapped statistics.
- standard error: the bootstrap standard error of $\hat\theta$ is $\mathrm{sd}(\theta^*)$.
- confidence interval:
    - normal approximation: the confidence interval is calculated as usual, but uses the bootstrapped standard error instead of the classical OLS standard error: $\hat\theta \pm t_{\alpha/2,df} \cdot \mathrm{sd}(\theta^*)$
    - quantiles: A 95% confidence interval uses the 2.5% and 97.5% quantiles of $\theta^*$ for its lower and upper bounds.
Original model
```{r}
mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type
mod_orig <- lm(mod_formula, data = pauperism_nonmiss)
```
```{r}
bs_coef_se <-
resamplr::bootstrap(pauperism_nonmiss, 1024) %>%
# extract the strap column
`[[`("sample") %>%
# run
map_df(function(dat) {
lm(mod_formula, data = dat) %>%
broom::tidy() %>%
select(term, estimate)
}) %>%
# calculate 2.5%, 97.5% and sd of estimates
group_by(term) %>%
summarise(
std.error_bs = sd(estimate),
    conf.low_bsq = quantile(estimate, 0.025),
    conf.high_bsq = quantile(estimate, 0.975)
)
```
Now compare the standard errors of the original model and the bootstrap for `outratiodiff`:
```{r}
broom::tidy(mod_orig, conf.int = TRUE) %>%
select(term, estimate, std.error) %>%
filter(term == "outratiodiff") %>%
left_join(bs_coef_se, by = "term")
```
The bootstrap standard error is slightly higher.
It is similar to the standard error generated using the heteroskedasticity consistent standard error.
```{r}
sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"])
```
It is likely that there is correlation between the error terms of observations.
At the very least, each PLU is included twice; these observations are likely
correlated, so we are effectively overstating the sample size of our data.
One way to account for that is to resample "PLUs", not PLU-years.
This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves.
```{r}
pauperism_nonmiss %>%
group_by(ID) %>%
resamplr::bootstrap(1024) %>%
# extract the strap column
`[[`("sample") %>%
# run
map_df(function(dat) {
lm(mod_formula, data = dat) %>%
broom::tidy() %>%
select(term, estimate)
}) %>%
# calculate 2.5%, 97.5% and sd of estimates
group_by(term) %>%
summarise(
std.error_bs = sd(estimate),
    conf.low_bsq = quantile(estimate, 0.025),
    conf.high_bsq = quantile(estimate, 0.975)
) %>%
filter(term == "outratiodiff")
```
However, this yields a standard error not much different from the robust standard error.
1. Try bootstrapping by "Region" and "BoothGroup". Does either of these make much difference in the standard errors?
## References