---
title: 'March 31st: Class Example'
author: "Jeffrey Arnold and Daniel Yoo"
date: "3/31/2017"
output:
html_document: default
pdf_document: default
---
```{r}
library(tidyverse)
```
## What is regression?
Regression analysis is a procedure by which conditional relationships in data may be described (Berk 2010).
Linear regression is a method that summarizes how average values of a numerical outcome variable vary over subpopulations defined by linear functions of predictors (Gelman and Hill 2007).
The linear regression model is a regression model in which the outcome variable is a linear function of one or several explanatory variables plus an error term (Wooldridge 2002):
$$y = \beta_{0} + \beta_{1}x + e$$
where $y$ is the outcome variable, $\beta_{0} + \beta_{1}x$ is the linear conditional expectation function (or the regression line), and $e$ is the error term. $y = \beta_{0} + \beta_{1}x + e$ can also be written as $y = E(y|x) + e$.
Fundamentally, regression is a procedure used to summarize *conditional* relationships: the average value of an outcome variable conditional on different values of one or more explanatory variables.
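As a quick illustration (a sketch with simulated data; the variable names are illustrative, not part of the lab), `lm()` recovers $\beta_{0}$ and $\beta_{1}$ from data generated by this model:
```{r}
# Sketch: simulate y = beta0 + beta1 * x + e and recover the coefficients with lm()
set.seed(123)
x <- rnorm(1000)
y <- 1 + 2 * x + rnorm(1000)   # beta0 = 1, beta1 = 2
coef(lm(y ~ x))                # estimates should be close to c(1, 2)
```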
### Conditional expectation function with discrete covariates
Consider the `Titanic` dataset included in the recommended R package **datasets**.
It is a cross-tabulation of 2,201 observations with four variables:
```{r}
Titanic <- as_tibble(datasets::Titanic) %>%
mutate(Survived = (Survived == "Yes"))
```
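A quick look at the resulting tibble (optional) shows the four classification variables plus the cell count `n`:
```{r}
glimpse(Titanic)
```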
Consider the outcome variable `Survived`.
The overall mean of `Survived` (a proportion, since the variable is binary) is:
```{r}
summarise(Titanic, prop_survived = sum(n * Survived) / sum(n))
```
A conditional expectation function (CEF) is a function that gives the mean of `Y` at each value of `X`. For example, the CEF of `Survived` given `Age` is the proportion who survived within each age group.
Calculate the CEF for `Survived` conditional on
- `Age`
```{r}
Titanic %>% group_by(Age) %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
- `Sex`
```{r}
Titanic %>% group_by(Sex) %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
- `Class`
```{r}
Titanic %>% group_by(Class) %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
- `Age, Sex, Class`
```{r}
Titanic %>% group_by(Class, Age, Sex) %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
What is the predicted probability of survival for each of these characters from the movie *Titanic*?
- Rose (Kate Winslet) : survived, 1st class, adult, female
```{r}
Titanic %>% filter(Class=="1st", Age=="Adult", Sex=="Female") %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
- Jack (Leonardo DiCaprio) : died, 3rd class, adult, male
```{r}
Titanic %>% filter(Class=="3rd", Age=="Adult", Sex=="Male") %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
- Cal (Billy Zane) : survived, 1st class, adult, male
```{r}
Titanic %>% filter(Class=="1st", Age=="Adult", Sex=="Male") %>% summarise(prop_survived = sum(n * Survived) / sum(n))
```
### Linear regression and the conditional expectation function
Now compute the predicted probabilities of survival for each of these characters using linear regression.
```{r}
# Expand the aggregated counts so that each row of data.ext is one passenger
data.ext <- NULL
for (i in 1:nrow(Titanic)) {
  data.ext <- rbind(data.ext,
                    as.data.frame(lapply(Titanic[i, ], rep, times = Titanic$n[i])))
}
Age.reg <- lm(Survived ~ Age, data=data.ext)
summary(Age.reg)
Sex.reg <- lm(Survived ~ Sex, data=data.ext)
summary(Sex.reg)
Class.reg <- lm(Survived ~ Class, data=data.ext)
summary(Class.reg)
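# Saturated model: with all interactions included, the fitted values reproduce
# the conditional group means (the CEF) computed above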
titanic.reg <- lm(Survived ~ Class + Age + Sex + Class:Age + Class:Sex + Age:Sex + Class:Age:Sex, data=data.ext)
summary(titanic.reg)
titanic.characters <- rbind( c("1st", "Adult", "Female"),
c("3rd", "Adult", "Male"),
c("1st", "Adult", "Male"))
titanic.characters <- as.data.frame(titanic.characters)
rownames(titanic.characters) <- c("Rose", "Jack", "Cal")
colnames(titanic.characters) <- c("Class", "Age", "Sex")
predict(titanic.reg, titanic.characters)
```
Why do we use the conditional expectation function to predict or explain the outcome variable? This is because the CEF is the function that best predicts the outcome variable, in the sense that it minimizes the mean squared prediction error.
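A quick check of this claim (a sketch, using `data.ext` and the saturated regression `titanic.reg` from the chunk above): the fitted group means have a smaller mean squared prediction error than the unconditional mean.
```{r}
# Sketch: MSE of the CEF (fitted group means) vs. MSE of the overall mean
mean((data.ext$Survived - fitted(titanic.reg))^2)
mean((data.ext$Survived - mean(data.ext$Survived))^2)
```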
## Regression to the Mean
Francis Galton (1886) examined the joint distribution of the heights of parents and their children. He was estimating the average height of children conditional upon the height of their parents. He found that this relationship was approximately linear with a slope of 2/3.
This means that, on average, taller parents had taller children, but the children of taller parents were on average shorter than their parents, and the children of shorter parents were on average taller than their parents. In other words, children's heights were closer to the average than their parents' heights.
This phenomenon was called regression to the mean, and the term regression is now used to describe conditional relationships (Hansen 2010).
His key insight was that if the marginal distributions of two variables are the same, then the linear slope will be less than one.
He also found that when the variables are standardized, the slopes of the regression of $y$ on $x$ and of $x$ on $y$ are the same: both equal the correlation between $x$ and $y$, and both show regression to the mean.
```{r}
library("HistData")
```
```{r}
Galton <- as_tibble(Galton)
Galton
```
1. Calculate the regression of children's heights on parents' heights. Interpret the regression.
```{r}
child.reg <- lm(child ~ parent, data=Galton)
child.reg
```
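As a quick check (a sketch, not part of the numbered exercises), the OLS slope equals $\mathrm{cor}(\text{parent}, \text{child}) \times s_{\text{child}} / s_{\text{parent}}$, which is why the slope is below one when the two marginal distributions are similar:
```{r}
# Sketch: the regression slope equals cor(x, y) * sd(y) / sd(x)
with(Galton, cor(parent, child) * sd(child) / sd(parent))
coef(child.reg)["parent"]
```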
2. Plot the conditional expectation function. What do you observe?
```{r}
plot(Galton$parent, Galton$child, pch=16, cex=0.5, main="Height of Children vs. Parents", xlab="Parent Height (inches)", ylab="Child Height (inches)")
abline(child.reg, col="Blue")
```
### Reverse Regression
3. Calculate the regression of parents' heights on children's heights. Interpret the regression.
```{r}
parent.reg <- lm(parent ~ child, data=Galton)
parent.reg
```
4. Plot the CEF. What do you observe?
```{r}
plot(Galton$child, Galton$parent, pch=16, cex=0.5, main="Height of Parents vs Children", xlab="Child Height (inches)", ylab="Parent Height (inches)")
abline(parent.reg, col="Red")
```
5. Check the means and variances of parents' and children's heights.
```{r}
mean(Galton$parent)
mean(Galton$child)
var(Galton$parent)
var(Galton$child)
```
6. Perform both regressions using standardized variables.
```{r}
parent.std <- (Galton$parent-mean(Galton$parent))/sd(Galton$parent)
child.std <- (Galton$child-mean(Galton$child))/sd(Galton$child)
summary(child.std.reg <- lm(child.std ~ parent.std))
summary(parent.std.reg <- lm(parent.std ~ child.std))
par(mfrow=c(1,2))
plot(parent.std, child.std, pch=16, cex=0.5, main="Height of Children vs Parents", xlab="Parent Height (standardized)", ylab="Child Height (standardized)")
abline(child.std.reg, col="Blue")
plot(child.std, parent.std, pch=16, cex=0.5, main="Height of Parents vs Children", xlab="Child Height (standardized)", ylab="Parent Height (standardized)")
abline(parent.std.reg, col="Red")
```
Regression estimates the conditional expectation function, $Y = E(Y \mid X) + \epsilon$, but we could instead jointly model $Y$ and $X$. This is a topic for multivariate statistics (principal components, factor analysis, clustering).
In this case, an alternative would be to model the heights of parents and children as a bivariate normal distribution.
```{r}
ggplot(Galton, aes(y = child, x = parent)) +
geom_jitter() +
geom_density2d()
```
```{r}
# mean vector
Galton_mean <- c(mean(Galton$parent), mean(Galton$child))
# variance-covariance matrix
Galton_cov <- cov(Galton)
Galton_cov
var(Galton$parent)
var(Galton$child)
cov(Galton$parent, Galton$child)
```
Calculate the density of the fitted multivariate normal distribution:
```{r}
library("mvtnorm")
Galton_mvnorm <- function(parent, child) {
# mu and Sigma will use the values calculated earlier
dmvnorm(cbind(parent, child), mean = Galton_mean,
sigma = Galton_cov)
}
```
```{r}
Galton_mvnorm(Galton$parent[1], Galton$child[1])
```
```{r}
library("modelr")
Galton_dist <- Galton %>%
modelr::data_grid(parent = seq_range(parent, 50), child = seq_range(child, 50)) %>%
mutate(dens = map2_dbl(parent, child, Galton_mvnorm))
```
Why don't I calculate the mean and density using the data grid?
```{r}
library("viridis")
ggplot(Galton_dist, aes(x = parent, y = child)) +
geom_raster(mapping = aes(fill = dens)) +
#geom_contour(mapping = aes(z = dens), colour = "white", alpha = 0.3) +
#geom_jitter(data = Galton, colour = "white", alpha = 0.2) +
scale_fill_viridis() +
theme_minimal() +
theme(panel.grid = element_blank()) +
labs(y = "Parent height (in)", x = "Child height (in)")
```
Using the [plotly](https://plot.ly/r/getting-started/) library
we can make an interactive 3D plot:
```{r}
x <- unique(Galton_dist$parent)
y <- unique(Galton_dist$child)
z <- Galton_dist %>%
arrange(child, parent) %>%
spread(parent, dens) %>%
select(-child) %>%
as.matrix()
plotly::plot_ly(z = z, type = "surface")
```
But with regression we model only one conditional relationship: the mean of the child's height given the parent's height.
```{r}
Galton_means <- Galton %>%
group_by(parent) %>%
summarise(child = mean(child))
ggplot(Galton, aes(x = factor(parent), y = child)) +
geom_jitter(width = 0) +
geom_point(data = Galton_means, colour = "red")
```
Note that in this example, it doesn't really matter since a bivariate normal distribution happens to describe the data very well.
This is not true in general, and we are simplifying our analysis by calculating the CEF rather than jointly modeling both.