---
title: "Regression intro"
author:
  name: Grant R. McDermott & Ed Rubin
  affiliation: University of Oregon
  # email: [email protected]
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_document:
    theme: flatly
    highlight: haddock
    # code_folding: show
    toc: yes
    toc_depth: 3
    toc_float: yes
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = FALSE)
```
# Packages
In the [previous section](https://raw.githack.com/grantmcdermott/R-intro/master/rIntro.html), we learned about the importance of packages in R. We're going to be using several packages in this section, both for analysis and for the built-in datasets that they provide. None of these are strictly necessary: "base" R provides all the support you need for basic regression analysis. However, these packages will make things easier and more fun. You can install them all as follows:
```{r, eval=FALSE}
install.packages(c("tidyverse", "hrbrthemes", "estimatr", "lfe", "huxtable", "margins"))
```
Once that's done, let's start by loading the `tidyverse`, which is really a bunch of different packages bundled together. We're going to be using the `starwars` data frame (just for you Ben!), which also comes bundled with the `tidyverse`.
```{r, message=FALSE}
library(tidyverse)
starwars
```
# Regression basics: The `lm()` function
To run an OLS regression in R, we use the `lm()` function that gets automatically loaded with the base `stats` package. The "lm" stands for "**l**inear **m**odels" and running a regression follows a pretty intuitive syntax.^[Indeed, all other regression packages in R that I'm aware of --- including those that allow for much more advanced and flexible models --- closely follow the `lm` syntax.]
```r
lm(y ~ x1 + x2 + x3 + ...)
```
Let's run a simple bivariate regression of starwars characters' mass on their height.
```{r}
ols1 <- lm(mass ~ height, data = starwars)
# ols1 <- lm(starwars$mass ~ starwars$height) ## Also works
ols1
```
The resulting object is pretty terse, but that's only because it buries most of its valuable information --- of which there is a lot --- within its internal list structure. You can use the `str()` function to view this structure.
```{r}
str(ols1)
```
So we see that this `ols1` object has a bunch of important slots, containing everything from the regression coefficients, to vectors of the residuals and fitted (i.e. predicted) values, to the rank of the design matrix, to the input data, and so on. To summarise the key pieces of information, we can use the --- wait for it --- generic `summary()` function. This will look pretty similar to the default regression output from Stata that many of you will be used to.
```{r}
summary(ols1)
```
We can then dig down further by extracting a summary of the regression coefficients:
```{r}
summary(ols1)$coefficients
```
# Get "tidy" regression coefficients with the `broom` package
While I've just shown you how to extract regression coefficients via the `summary` function, in practice I always use the `broom` package to do so. This package has a bunch of neat features to convert regression (and other statistical) objects into "tidy" data frames. This is especially useful because regression output is so often used as an input to something else, e.g. a plot of coefficients / marginal effects. Here I use the `broom::tidy()` function.
```{r}
library(broom)
tidy(ols1, conf.int = T)
```
Another useful function is `broom::glance()`, which summarises the model "meta" data (R<sup>2</sup>, AIC, etc.) in a data frame.
```{r}
glance(ols1)
```
# Wrangling and plotting our data
Our simple model isn't particularly good; our R<sup>2</sup> is only `r I(round(glance(ols1)$r.squared, 3))`. Different species and homeworlds aside, we may have an extreme outlier in our midst...
```{r}
library(hrbrthemes) ## This package just provides the "theme_ipsum" plotting theme that I like
starwars %>%
  ggplot(aes(x=height, y=mass)) +
  geom_point(alpha=0.5) +
  geom_point(
    data = starwars %>% filter(mass==max(mass, na.rm=T)),
    col="red"
  ) +
  geom_text(
    aes(label=name),
    data = starwars %>% filter(mass==max(mass, na.rm=T)),
    col="red", vjust = 0, nudge_y = 25
  ) +
  labs(title = "Spot the outlier...") +
  theme_ipsum()
```
You might already have noticed it from the above code chunk, but R (through the `tidyverse`) makes it really easy to wrangle data. One particularly nice feature is the pipe operator: `%>%`. This lets us chain objects and functions together in a way that is much easier to read than standard nested code. For example:
```{r}
starwars %>%
  arrange(desc(mass)) %>%
  select(name, mass, height)
```
Maybe we should exclude Jabba from our regression? Remember that we can also keep multiple objects in memory in R, so we can just create a new data frame that excludes him using the `filter` command.
```{r}
starwars2 <-
  starwars %>%
  filter(name != "Jabba Desilijic Tiure")
  # filter(!(grepl("Jabba", name))) ## Regular expressions also work
ols2 <- lm(mass ~ height, data = starwars2)
summary(ols2)
```
Running a regression directly on a subsetted data frame is equally easy.
```{r}
ols2a <- lm(mass ~ height, data = starwars %>% filter(!(grepl("Jabba", name))))
summary(ols2a)
```
The overall model fit is much improved by the exclusion of this outlier, with R<sup>2</sup> increasing to `r I(round(glance(ols2)$r.squared, 3))`.
# Robust and clustered standard errors
What about robust or clustered standard errors? Well, there are *lots* of ways to get these in R. However, my preferred way these days is to use the [`estimatr` package](https://declaredesign.org/r/estimatr/articles/getting-started.html). Let's illustrate with the `ols1` object that we created earlier (which still includes the crazy Jabba outlier).
```{r}
library(estimatr)
ols1_robust <- lm_robust(mass ~ height, data = starwars)
tidy(ols1_robust, conf.int = T)
```
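(For reference, one of those other ways is the classic `sandwich` + `lmtest` combination. Here's a hedged sketch, not evaluated in this document, assuming both packages are installed:)
```{r, eval=FALSE}
## An alternative route to robust SEs (a sketch): compute an HC covariance
## matrix with sandwich and pass it to lmtest::coeftest().
library(sandwich)
library(lmtest)
coeftest(ols1, vcov = vcovHC(ols1, type = "HC2"))
```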
The `estimatr` package also supports clustered standard errors. I'll return to the issue of (multiway) clustering in the section on panel regression [further below](#high-dimensional_fixed_effects_and_(multiway)_clustering), but here's a quick example just to illustrate:
```{r}
ols1_robust_clustered <- lm_robust(mass ~ height, data = starwars, clusters = homeworld)
tidy(ols1_robust_clustered, conf.int = T)
```
Finally, you can even be explicit about using Stata robust standard errors if you want to replicate code from that language. (See [here](https://declaredesign.org/r/estimatr/articles/stata-wls-hat.html) for more details on why this isn't the default and why Stata's robust standard errors differ from those in R and Python.)
```{r}
ols1_robust_stata <- lm_robust(mass ~ height, data = starwars, se_type = "stata")
tidy(ols1_robust_stata, conf.int = T)
```
# Fixed effects (and dummy variables)
Manually excluding outliers is often a risky strategy (overfitting, etc.). Maybe we should use some fixed effects instead? Again, a visual inspection of the plotted data suggests this could be useful... although the small number of observations per species means it won't be a very robust model.
```{r}
starwars %>%
filter(!(grepl("Jabba", name))) %>%
ggplot(aes(x=height, y=mass, col=species)) +
geom_point(alpha=0.5) +
scale_colour_viridis_d() +
theme_ipsum()
```
## Dummy variables as *factors*
The simplest (and least efficient) way to include fixed effects in a regression model is, of course, to use dummy variables. Compared to other statistical languages, R has a very convenient framework for evaluating dummy variables in a regression: You simply specify the variable of interest as a factor. R will take care of everything else for you.^[No need to tabulate/append a whole new matrix of binary variables.]
```{r}
starwars$species <- as.factor(starwars$species)
ols3 <- lm(mass ~ height + species, data = starwars)
coefs3 <- tidy(ols3, conf.int = T)
summary(ols3)
```
**Aside:** In fact, we didn't even need to specify that the "species" variable was a factor. R is smart enough to realise that any string variable (i.e. text) *must* be a factor if it is going to be interpreted sensibly in a regression. You can confirm this for yourself by converting "species" back to a character (`starwars$species <- as.character(starwars$species)`) and re-running the above regression.
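Here's a minimal sketch of that check (not evaluated here, so that it doesn't clobber the factor version we just created; `ols3_chr` is just an illustrative name):
```{r, eval=FALSE}
## lm() coerces character variables to factors automatically, so this
## should reproduce ols3 exactly. A sketch only; not run in this document.
starwars$species <- as.character(starwars$species)
ols3_chr <- lm(mass ~ height + species, data = starwars)
all.equal(coef(ols3), coef(ols3_chr))
```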
Ignoring the modelling problems that I mentioned above (that insane R<sup>2</sup> is a clear sign we're overfitting because of small within-group samples), this approach works well enough. However, it isn't very efficient or scalable. What's the point of learning all that stuff about the Frisch-Waugh-Lovell theorem, within-group transformations, etcetera, etcetera, if we can't use them in our software routines?
## Fixed effects with the `lfe` package
One of my favourite packages in the entire R catalogue is `lfe` ("linear fixed effects"). This package has a tonne of options built into it (instrumental variables, multi-level clustering, etc.). It can also be used to run simple linear regressions *a la* `lm`. The main functionality, however, is for running fixed effects regressions via the `lfe::felm()` function.^[There are other packages for running panel regressions in R, in particular the `plm` package. However, I think that `lfe` supersedes these in virtually all aspects.]
```{r, message=FALSE}
library(lfe)
ols4 <- felm(mass ~ height | species, data = starwars) ## Fixed effect(s) go after the "|"
coefs4 <- tidy(ols4, conf.int = T)
summary(ols4)
```
Note that the resulting `felm` object drops all of the species intercepts, since it has abstracted them away as fixed effects. Let's confirm that our main coefficient on "height" is the same across this and the previous model. (Note the different indexing. Why is that?)
```{r}
all.equal(coefs3$estimate[2], coefs4$estimate[1])
```
We could also have used some `dplyr` syntax, which is more verbose but perhaps easier to read (and less prone to indexing errors).
```{r}
all.equal(
  coefs3 %>% filter(term == "height") %>% pull(estimate),
  coefs4 %>% filter(term == "height") %>% pull(estimate)
)
```
## High-dimensional fixed effects and (multiway) clustering
One reason that I prefer the `lfe` package to other options --- e.g. the panel data-focused `plm` package --- is because it supports high dimensional fixed effects *and* (multiway) clustering.^[It is very similar to the excellent [reghdfe](http://scorreia.com/software/reghdfe/) package in Stata.] In the below example, I'm going to add "homeworld" as an additional fixed effect to the model and also cluster by this variable. I'm not claiming that this is a particularly good or sensible model, but just go with it. (Maybe the scales of different homeworlds are similarly biased??) Note that, since we specify "homeworld" in the fixed effects slot below, `felm()` automatically converts it to a factor even though we didn't explicitly tell it to.
```{r}
ols5 <-
  felm(
    mass ~ height |
      species + homeworld | ## Two fixed effects go here after the first "|"
      0 | ## This is where your IV equation goes, but we put 0 since we aren't instrumenting.
      homeworld, ## The final slot is where we specify our cluster variables
    data = starwars)
coefs5 <- tidy(ols5, conf.int = T)
coefs5
```
Visually, we can easily compare changes in the coefficients across models thanks to the fact that we saved the output in data frames with `broom::tidy()` above.
```{r}
bind_rows(
  coefs4 %>% mutate(reg = "Model 4 (LDFE and no clustering)"),
  coefs5 %>% mutate(reg = "Model 5 (HDFE and clustering)")
) %>%
  ggplot(aes(x=reg, y=estimate, ymin=conf.low, ymax=conf.high)) +
  geom_pointrange() +
  labs(title = "Marginal effect of height on mass") +
  theme_ipsum() +
  theme(axis.title.x = element_blank())
```
Normally we expect our standard errors to blow up with clustering, but here that effect appears to be outweighed by the increased precision brought on by the additional fixed effects. (As suggested earlier, our level of clustering probably doesn't make much sense either.)
# Other topics
## Interaction terms
Like dummy variables, R provides a convenient syntax for specifying interaction terms directly in the regression model without having to create them manually beforehand.^[Although there are very good reasons that you might want to modify your parent variables before doing so (e.g. centering them). As it happens, I'm [on record](https://twitter.com/grant_mcdermott/status/903691491414917122) as stating that interaction effects are the most widely misunderstood and misapplied concept in econometrics. However, that's a topic for another day. (Read the paper in the link!)] You can just use `x1:x2` (to include only the interaction term) or `x1*x2` (to include the parent terms and the interaction term). Generally speaking, you are best advised to include the parent terms alongside an interaction term, which makes the `*` option a good default.
```{r}
humans <- starwars %>% filter(species=="Human")
ols6 <-
  lm(
    mass ~ gender*height,
    data = humans
  )
summary(ols6)
```
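As a quick sanity check on that `*` expansion claim, spelling out the parent terms and the interaction explicitly should reproduce the same coefficients (`ols6_long` is just an illustrative name):
```{r}
## mass ~ gender*height expands to mass ~ gender + height + gender:height,
## so this long-hand version should match ols6 exactly.
ols6_long <- lm(mass ~ gender + height + gender:height, data = humans)
all.equal(coef(ols6), coef(ols6_long))
```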
## Marginal effects
Calculating marginal effects in a regression is utterly straightforward in cases where there are no non-linearities: just look at the coefficient values! However, that quickly goes out the window when we have interaction effects, probit or logit models, etc. Luckily, the `margins` package (which is modelled on its namesake in Stata) goes a long way towards automating the process. You can read more in the [package vignette](https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html), but here's a very simple example to illustrate:
```{r}
library(margins)
# ols6 %>% margins() %>% summary() ## Piping also works
summary(margins(ols6))
```
If we want to compare marginal effects at specific values --- e.g. how the ME of height on mass differs across genders --- then that's easily done too.
```{r}
summary(margins(ols6, at = list(gender = c("male", "female"))))
```
You can also plot it using `margins::cplot()`:
```{r}
cplot(ols6, x="gender", dx="height")
```
In this case, it doesn't make much sense to read a lot into the larger standard errors on the female group; they are being driven by a very small sub-sample size.
One downside that I want to highlight briefly is that the `margins` package does [not yet work](https://github.com/leeper/margins/issues/73) with `lfe::felm` objects. There are [potential ways](https://stackoverflow.com/questions/30491545/predict-method-for-felm-from-lfe-package) around this, or you can just calculate the marginal effects manually, but it's admittedly a pain.
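For what it's worth, the manual route is doable for simple cases like our interaction model. A hedged sketch, reading the marginal effects of height straight off `ols6`'s coefficients (and assuming the same gender coding as in the `margins` output above):
```{r}
## Manual MEs of height on mass from ols6: with a gender x height interaction,
## the ME of height is the height coefficient plus the relevant interaction term.
b <- coef(ols6)
c(
  female = unname(b["height"]), ## Base category
  male   = unname(b["height"] + b["gendermale:height"])
)
```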
## Probit, logit and other generalized linear models
See `?stats::glm`.
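Still, it would be remiss not to show at least one example. Here's a minimal logit sketch; the `is_human` outcome is something we create on the fly purely for illustration:
```{r}
## A minimal logit sketch: does height predict being human? (Illustrative only;
## rows with missing values are dropped automatically.)
sw_glm <- starwars %>% mutate(is_human = as.integer(species == "Human"))
logit1 <- glm(is_human ~ height, family = binomial, data = sw_glm)
summary(logit1)
```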
## Exporting regression results and descriptive tables (LaTeX, etc.)
There are loads of different options here. I've historically favoured the `stargazer` package (see [here](https://www.jakeruss.com/cheatsheets/stargazer/)), but I also like `pixiedust` (which [parallels](https://cran.r-project.org/web/packages/pixiedust/) the `broom` package), and `huxtable` is a newer package that looks very promising too (see [here](https://hughjonesd.github.io/huxtable/design-principles.html) for a handy comparison of different table "engines" in R). Here's a bare-bones example using the latter, since it works well with R Markdown documents.
```{r, message=F}
library(huxtable)
huxreg(ols4, ols5, ols6)
```
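If you want that table as a standalone LaTeX file rather than inline output, `huxtable`'s quick-output helpers can do it. A sketch, with an illustrative file name:
```{r, eval=FALSE}
## Export the same regression table to a .tex file (illustrative path).
quick_latex(huxreg(ols4, ols5, ols6), file = "regression_table.tex")
```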
# Further reading
- Ed has outstanding lecture notes for both undergrad- and PhD-level econometrics courses on [his website](http://edrub.in/teaching.html). I believe that he is turning these notes into a book with some coauthors, so stay tuned.