-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathLab 4:14.Rmd
383 lines (286 loc) · 17.4 KB
/
Lab 4:14.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
---
title: 'Lab Session 3: Multiple Regression'
subtitle: 'Yule on Pauperism'
author: "Jeffrey Arnold and Daniel Yoo"
date: "4/14/2017"
output:
html_document: default
pdf_document: default
---
```{r}
library("dplyr")
library("tidyr")
library("tidyverse")
library("modelr")
library("viridis")
library("broom")
library("ggplot2")
library("modelr")
library("viridis")
```
## The Effect of English Poor Laws on Pauperism
[Yule (1899)](https://www.jstor.org/stable/2979889?seq=1#page_scan_tab_contents) is a published example multiple regression analysis in its modern form.[^yule]
Yule wrote this paper to analyze the effect of policy changes and implementation on pauperism (poor receiving benefits) in England under the [English Poor Laws](https://en.wikipedia.org/wiki/English_Poor_Laws). In 1834, a new poor law was passed that established a national welfare system in England and Wales. The New Poor Law created new administrative districts (Poor Law Unions) to adminster the law. Most importantly, it attempted to standardize the provision of aid to the poor. There were two types of aid provided: in-relief or aid provided to paupers in workhouses where they resided, and out-relief or aid provided to paupers residing at home. The New Poor Law wanted to decrease out-relief and increase in-relief in the belief that in-relief, in particular the quality of life in workhouses, was a deterrence to poverty and an encouragement for the poor to work harder to avoid poverty.
Yule identifies that there are various potential causes of the change in rate of pauperism, including changes in the (1) law, (2) economic conditions, (3) general social character, (4) moral character, (5) age distribution of the population (pg. 250).
He astutely notes the following:
*"If, for example, we should find an increase in the proportion of out-relief associated with (1) an increase in the proportion of the aged to the whole population, and also (2) an increase in the rate of pauperism, it might be legitimate to interpret the result in the sense that changes in out-relief and pauperism were merely simultaneous concomitants of changes in the proportion of aged-the change of pauperism not being a direct consequence of the change of administration, but both direct consequenices of the change in age distribution. It is evidently most important that we should be able to decide between two such differenit ilnterpretations of the same facts. This the method I have used is perfectly competernt to do"*
-Yule (1899, pg. 250)
[^yule]: See Stigler(2016) and Stigler(1990) for a discussion.
### Data
The Yule's data on pauperism is included in the package datums at jrnold/datums.
```{r}
devtools::install_github("jrnold/datums")
library(datums)
```
It consists of two datasets: pauperism_plu contains data on the Poor Law Unions, and pauperism_year has the PLU-year as the unit of observation and contains data on the levels of pauperism in 1871, 1881, and 1891 in each PLU.
```{r}
pauperism_plu <- datums::pauperism_plu
pauperism_year <- datums::pauperism_year
glimpse(pauperism_year)
```
There are four variables of primary interest to Yule (pg. 252-254):
**Pauperism**--the percentage of the population in receipt of relief of any kind, less lunatics and vagrants;
**Out-Relief Ratio**--the ratio of numbers relieved outdoors to those relieved indoors;
**Proportion of Old**--the proportion of the aged (65 years) to the whole population;
**Population**--used to capture economic, social, or moral factors.
There is also **Grouping of Unions**, which is a locational classification based on population density that consists of Rural, Mixed, Urban, and Metropolitan.
Instead of taking differences or percentages, Yule worked with "percent ratio differences", $100 \times \frac{x_{t}}{x_{t-1}}$, because he did not want to work with negative signs, presumably a concern at the because he was doing arithmetic by hand and this would make calculations more tedious or error-prone.
We can construct the **Proportion of Old** variable below since it is not included in the datasets.
```{r}
pctratiodiff <- function(x) { #function to compute percent ratio difference
z <- 100 * (x / lag(x))
z[is.infinite(z)] <- NA_real_
z
}
pauperism <- #adjust the dataset by creating Popn65 and Prop65 variables
pauperism_year %>%
mutate(Popn65 = F65 + M65,#total of men and women over 65
Prop65 = Popn65 / Popn,#proportion of total population 65 and above
year = as.integer(year)) %>%
arrange(ID, year) %>%
group_by(ID) %>%
mutate(Prop65_diff = pctratiodiff(Prop65)) %>%#for each PLU, create Prop65_diff using the above function
left_join(pauperism_plu, by = "ID") %>%#merge with the pauperism_plu dataset
filter(Type != "NaN") %>%
ungroup()
select(pauperism, ID, year, Prop65_diff)
```
## Summary Statistics
Using these datasets, we can try to reconstruct the tables of summary statistics in Yule's paper.
### Number of Unions in Group, England Only (Table A, p. 255)
```{r}
pauperism %>%
filter(year >= 1881) %>%
select(ID, year, Type, paupratiodiff, outratiodiff, Prop65_diff, popratiodiff) %>%
drop_na() %>%
count(year, Type)
```
### Metropolitan Group, 1871-1881 (Table XIX, p. 286)
```{r}
filter(pauperism, Type == "Metropolitan") %>%
filter(year == 1881) %>%
select(ID, Union, paupratiodiff, outratiodiff, Prop65_diff, popratiodiff) %>%
arrange(ID) %>%
ungroup() %>%
select(-ID) %>%
knitr::kable()
```
### Table of Means and Standard Deviations of Percentage Ratios (Table 1, pg. 279)
```{r}
pauperism %>%
filter(year >= 1881) %>%
select(year, Type, paupratiodiff, outratiodiff, Prop65_diff, popratiodiff) %>%
drop_na() %>%
gather(variable, value, -year, -Type) %>%
group_by(year, Type, variable) %>%
summarise_at(vars(value), funs(mean, sd)) %>%
knitr::kable()
```
### Table of the Correlation Coefficients (Gross) and their Probable Errors (Table 2, pg. 279)
```{r}
pauperism %>%
filter(year >= 1881) %>%
select(year, Type, paupratiodiff, outratiodiff, Prop65_diff, popratiodiff) %>%
drop_na() %>%
group_by(year, Type) %>%
do({
cor(.[ , c("paupratiodiff", "outratiodiff", "Prop65_diff", "popratiodiff")]) %>%
tidy() %>%
gather(.colnames, value, -.rownames) %>%
filter(.rownames < .colnames) %>%
unite(variable, .rownames, .colnames) %>%
spread(variable, value)
})
```
Yule observes in Table 1 that upon comparing the mean changes in pauperism and out-relief ratio between 1881-1891, there were decreases in pauperism among rural and mixed groups and increases in their out-relief ratio, while in the earlier decade, there was a decrease in both. He therefore concludes that out-relief ratio could not be the only factor influerncing changes in pauperism (pg. 258).
## Multiple Linear Regression
### Specification search and omitted variable bias
Yule was interested in testing the hypothesis that an decrease in the out-relief ratio lowers pauperism. We can therefore express this in the following model:
$$ \begin{aligned}[t] \mathtt{Pauperism} &= \beta_0 + \beta_1\mathtt{OutReliefRatio} + \varepsilon \end{aligned} $$
But, as we discussed, if out-relief ratio is correlated with another variable in the error term that influences pauperism, then our estimate on the out-relief ratio will be biased. For example, if the elderly people are more likely to be receiving out relief, and they are more likely to be paupers, then our estimate of the out-relief ratio will be upward biased.
The error term includes **everything** we have not measured and included in the regression. We know that if anything in the error term that (1) influences the dependent variable and (2) is correlated with the regressions in the equation, then OLS will not give unbiased parameter estimates.
Should include every variable imaginable in the regression? Why not?
* Loss of degrees of freedom
* Loss of variance in covariates after conditioning on others (lowers precision)
* Overspecification and post-treatment bias
Yule's regression equation is specified as follows (pg. 258)
$$ \begin{aligned}[t] \mathtt{Pauperism} &= \beta_0 + \beta_1\mathtt{OutReliefRatio} + \beta_2\mathtt{ProportionOfOld} + \beta_3\mathtt{Population} + \varepsilon \end{aligned} $$
Referring back to Berk (2010), there are five kinds of responses to this kind of problem. Including relevant control variables into the regression equation is a step in the direction of the second response, that the current assortment of regression diagnostics can find serious problems in a regression model and that these problems can then be readily fixed.
We can fit the simple regression of pauper on out-relief ratio, then fit the regression of pauper on Yule's specification. Interpret and compare the results. What do you find?
```{r}
pauperism_diff <-
pauperism %>%
filter(year > 1871) %>%
mutate(year = as.factor(year)) %>%
select(ID, Union, Type, year, paupratiodiff, outratiodiff, popratiodiff, Prop65_diff) %>%
drop_na()
summary(lm(paupratiodiff ~ outratiodiff, data = pauperism_diff))
summary(lm(paupratiodiff ~ outratiodiff + Prop65_diff + popratiodiff, data = pauperism_diff))
```
We can include other covariates or control variables if we suspect they are a source of omitted variable bias. In this example, the type of location and the year may be the source of OMVB. Why might this be? What do they control for?
We can include these as follows:
```{r}
summary(lm(paupratiodiff ~ outratiodiff + popratiodiff + Prop65_diff + Type + year, data = pauperism_diff))
```
What do these results tell us?
We can also include the interactions between type, year, and our explanatory variables.
```{r}
summary(lm(paupratiodiff ~ Type * year + outratiodiff + popratiodiff + Prop65_diff, data = pauperism_diff))
```
How do we interpret the estimates on the interaction terms?
We can also include higher order interaction terms.
```{r}
summary(lm(paupratiodiff ~ Type * (year + outratiodiff + Prop65_diff + popratiodiff), data = pauperism_diff))
```
We can also create histograms of our primary variables of interest to observe their distribution according to year and location type.
### Histogram of Out-Relief Ratio by Location and Year
```{r}
ggplot(select(filter(pauperism_diff, !is.na(outratiodiff)),
outratiodiff, ID, year, Type),
aes(x = outratiodiff, y = ..density..)) +
geom_histogram(binwidth = 20) +
facet_grid(year ~ Type)
```
### Histogram of Pauperism by Location and Year
```{r}
ggplot(select(filter(pauperism_diff, !is.na(paupratiodiff)),
paupratiodiff, ID, year, Type),
aes(x = paupratiodiff, y = ..density..)) +
geom_histogram(binwidth = 15) +
facet_grid(year ~ Type)
```
There appear to be some big outliers in the ratio difference in pauperism. We can try to get a better view of this by plotting the regression, in this case for 1871.
```{r}
datums::pauperism_year %>%
filter(year == 1871) %>%
select(outratio, pauper2) %>%
drop_na() %>%
ggplot(aes(x = outratio, pauper2)) +
geom_point() +
geom_smooth(method = "lm")
datums::pauperism_year %>%
filter(year == 1871) %>%
select(outratio, pauper2) %>%
drop_na() %>%
cor()
```
We can also run this regression below.
```{r}
datums::pauperism_year %>%
group_by(year) %>%
summarise(mod = list(lm(outratio ~ pauper2, data = .)))
datums::pauperism_year %>%
filter(year == 1871) %>%
select(outratio, pauper2) %>%
drop_na() %>%
lm(pauper2 ~ outratio, data = .)
mod_1871 <-
datums::pauperism_year %>%
filter(year == 1871) %>%
select(outratio, pauper2) %>%
{lm(pauper2 ~ outratio, data = .)}
ggplot(augment(mod_1871), aes(x = outratio)) +
geom_ref_line(v = mean(filter(pauperism_year, year == 1871)$outratio, na.rm = TRUE)) +
geom_point(mapping = aes(y = pauper2, size = .hat, colour = .hat)) +
geom_path(mapping = aes(y = .fitted)) +
scale_color_viridis()
```
The observations that have the highest weight in determining the fitted values are those furthest from the mean of X.
We can also view the hat values in a multidimensional space:
```{r}
pauperism_mds <-
pauperism %>%
filter(year == 1891) %>%
select(outratiodiff, popratiodiff, Prop65_diff) %>%
drop_na() %>%
dist() %>%
cmdscale() %>%
tidy()
lm(paupratiodiff ~ outratiodiff + popratiodiff + Prop65_diff,
data = filter(pauperism, year == 1891)) %>%
augment() %>%
bind_cols(pauperism_mds) %>%
ggplot(aes(x = X1, y = X2, size = .hat, colour = .hat)) +
geom_ref_line(h = 0) +
geom_ref_line(v = 0) +
geom_point()
```
How would you deal with these outliers?
### Goodness of Fit and Variable Transformations
#### Goodness of Fit
Goodness of fit tests in multiple regression are analagous to those in simple linear regression. Two of these tests are (1) $R^{2}$ and (2) standard error of the regression.
The familiar $R^{2}$ is the proportion of the variance in $y$ explained by $x$ or
$$R^{2} = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$
where $R^2$ is bound by 0 and 1. Why is $R^{2}$ potentially misleading? An alternative is the adjusted $R^{2}$, which adds a penalty term to the $R^{2}$:
$$\bar{R}^{2} = 1 - \frac{n-1}{n-k-1}\frac{SSR}{TSS}$$
Adding a regressor decreases SSR but it also lowers the ratio $\frac{n-1}{n-k-1}$.
The standard error of the regression (SER) or the standard error of $\hat{e}_{i}$ is the average deviation from $Y_{i}$ from $\hat{Y}_{i}$:
$$SER^2 = \frac{1}{n-k-1}\sum^{n}_{i=1}\hat{e}_{i}^2=\frac{SSR}{n-k-1}$$
Adding a regressor shrinks the error term, but the SER may rise because $n-k-1$ falls.
Cross validation may be a more persuasive goodness of fit test.
#### Log Transformations
We often use log transformations in our data, in our explanatory variables or outcome variables or both. There are several reasons for this. For one, logging the outcome variable may more closely satisfy the assumptions of the normal linear regression model if its empirical distribution is non-normal. Another reason for this is interpretation:
1. Level - Log
* $y = \beta_{0} + \beta_{1}ln(x) + e$
* A 1% change in $x$ is associated with a $.01 \times \beta_{1}$ change in $y$
2. Log - Level
* $ln(y) - \beta_{0} + \beta_{1}x + e$
* A one unit change in $x$ is associated with $100 \times \beta_{1}%$ change in $y$
3. Log - Log
* $ln(y) = \beta_{0} + \beta{1}ln(x) + e$
* A 1% change in $x$ is associated with a $\beta_{1}%$ change in $y$
#### Polynomials
In some cases, it may seem apparent that fitting a quadratic curve would fit the data better. OLS is a linear estimator, but this means that it is linear in parameters not variables. We may want to fit:
$$y = \beta_{0} + \beta_{1}x_{1} \times x_{2} + \beta_{2}x_{1}^{x_3} + \beta_{3}\frac{x_2}{x_3} + e$$
We can just rewrite as
$$y = \beta_{0} = \beta_{1}x_{4} + \beta_{2}x_{5}+\beta_{3}x_{6} + e$$
where $x_{4}\equiv x_{1} \times x_{2}$, $x_{5} \equiv x_{1}^{x_3}$, and $x_{6} \equiv \frac{x_2}{x_3}$
When adding higher order terms, such as in $y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{1}^{2} + e$, the slope of the regression line now depends on our starting point. In non-linear specification, you must consider the entire function to find the marginal effect of $x$ on $y$. It does not make sense to interpret the increase in $x$ while holding $x^2$ constant.
One important caveat is that searching over multiple variable polynomials amounts to data mining or curve fitting. Significance tests may suggest adding polynomials, but this not does justify their inclusion. Higher order polynomials will always fit the sample well, but may not fit the population well
### Multiple regression anatomy
$$ \beta_k = \frac{Cov(Y_i, \tilde{X}_{k,i})}{Var(\tilde{X}_{x,i})} $$ where $\tilde{X}_{k,i}$ is the residual of the regression of $X$ on all the variables other than $k$.
```{r}
pauperism_diff <- select(pauperism, paupratiodiff, outratiodiff,
Prop65_diff, popratiodiff) %>% drop_na()
mod_prop65_diff <- lm(paupratiodiff ~ Prop65_diff, data = pauperism_diff)
mod_prop65_diff
pauperism_diff$.fitted1 <- fitted(mod_prop65_diff)
pauperism_diff$.resid1 <- residuals(mod_prop65_diff)
mod_popratiodiff <- lm(popratiodiff ~ Prop65_diff, data = pauperism_diff)
mod_popratiodiff
pauperism_diff$.fitted2 <- fitted(mod_popratiodiff)
pauperism_diff$.resid2 <- residuals(mod_popratiodiff)
lm(pauperism_diff$.resid1 ~ pauperism_diff$.resid2)
lm(paupratiodiff ~ popratiodiff + Prop65_diff, data=pauperism_diff)
```
1. Regress $y$ on $x_2$, get residuals $\hat{r}^{y}_{x_2}$
2. Regress $x_1$ on $x_2$, get residuals $\hat{r}^{x_1}_{x_2}$
3. Regress $\hat{r}^{y}_{x_2}$ on $\hat{r}^{x_1}_{x_2}$ and get $\hat\beta_{1}$
The Frisch-Waugh-Lovell theorem shows us this is always true in OLS.
## Yule's conclusion
Yule concluded that:
*Changes in rates of total pauperism always exhibit marked correlation with changes in out-relief ratio, but very little correlation with changes in population or in proportion of old in the different unions.*
*Changes in out-relief ratio exhibit no correlation one way or the other with changes of population or proportion of old.*
*It seems impossible to attribute the greater part, at all events, of the observed correlation between changes in pauperism and changes in out-relief ratio to anything but a direct influence of change of policy on change of pauperism, the change in policy not being due to any external causes such as growth of population or economic changes.*
*Assuming such a direct relation, it appears that some five-eighths of the decrease of pauperism during 1871-81 was due to chaniged policy. The decrease during 1881-91 cannot be so accounted for, policy having scarcely changed during that decade.*
What are your thoughts?