---
title: 04c - The Many Variables & The Spurious Waffles
author: Abdullah Mahmood
date: last-modified
format:
  html:
    theme: cosmo
    css: quarto-style/style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 4
    toc: true
    toc-location: right
    toc-depth: 4
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    code-links:
      - text: GitHub Repo
        icon: github
        href: https://github.com/abdullahau/bayesian-analysis
      - text: Quarto Markdown
        icon: file-code
        href: https://github.com/abdullahau/bayesian-analysis/blob/main/04c%20-%20The%20Many%20Variables%20%26%20The%20Suprious%20Waffles.qmd
    html-math-method:
      method: mathjax
      url: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
---
### Imports
```{python}
# ruff: noqa: F405
from init import *
from causalgraphicalmodels import CausalGraphicalModel
import daft as daft
%config InlineBackend.figure_formats = ['svg']
```
## Categorical Variables
A common question for statistical methods is to what extent an outcome changes as a result of presence or absence of a category. A category here means discrete and unordered. For example, consider the different species in the milk energy data again. Some of them are apes, while others are New World monkeys. We might want to ask how predictions should vary when the species is an ape instead of a monkey. Taxonomic group is a **categorical variable**, because no species can be half-ape and half-monkey (discreteness), and there is no sense in which one is larger or smaller than the other (unordered). Other common examples of categorical variables include:
- Sex: male, female
- Developmental status: infant, juvenile, adult
- Geographic region: Africa, Europe, Melanesia
Many readers will already know that variables like this, routinely called **factors**, can easily be included in linear models. But what is not widely understood is how these variables are represented in a model. The computer does all of the work for us, hiding the machinery. But there are some subtleties that make it worth exposing the machinery. Knowing how the machine works both helps you interpret the posterior distribution and gives you additional power in building the model.
------------------------------------------------------------------------
#### Continuous countries
With automated software and lack of attention, categorical variables can be dangerous. In 2015, a high-impact journal published a study of 1170 children from six countries, finding a strong negative association between religiosity and generosity. The paper caused a small stir among religion researchers, because it disagreed with the existing literature. Upon reanalysis, it was found that the country variable, which is categorical, was entered as a continuous variable instead. This made Canada (value 2) twice as much “country” as the United States (value 1). After reanalysis with country as a categorical variable, the result vanished, and the original paper has since been retracted. This is a happy ending, because the authors shared their data. How many cases like this exist, undiscovered because the data have never been shared and are possibly lost forever?
------------------------------------------------------------------------
### Binary categories
In the simplest case, the variable of interest has only two categories, like male and female. Let’s rewind to the Kalahari data from earlier. Back then, we ignored sex when predicting height, but obviously we expect males and females to have different averages. Take a look at the variables available:
```{python}
d = pd.read_csv("data/Howell1.csv", sep=';')
d.head()
```
The `male` variable is our new predictor, an example of an **indicator variable**. Indicator variables—sometimes also called “dummy” variables—are devices for encoding unordered categories into quantitative models. There is no sense here in which “male” is one more than “female.” The purpose of the `male` variable is to indicate when a person in the sample is “male.” So it takes the value 1 whenever the person is male, but it takes the value 0 when the person belongs to any other category. It doesn’t matter which category is indicated by the 1. The model won’t care. But correctly interpreting the model demands that you remember, so it’s a good idea to name the variable after the category assigned the 1 value.
There are two ways to make a model with this information. The first is to use the indicator variable directly inside the linear model, as if it were a typical predictor variable. The effect of an indicator variable is to turn a parameter on for those cases in the category. Simultaneously, the variable turns the same parameter off for those cases in another category. This will make more sense, once you see it in the mathematical definition of the model. Consider again a linear model of height, from earlier. Now we’ll ignore weight and the other variables and focus only on sex.
$$
\begin{align*}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_m m_i \\
\alpha &\sim \text{Normal}(178,20) \\
\beta_m &\sim \text{Normal}(0,10)\\
\sigma &\sim \text{Uniform}(0,50) \\
\end{align*}
$$
where $h$ is height and $m$ is the dummy variable indicating a male individual. The parameter $\beta_m$ influences prediction only for those cases where $m_i = 1$. When $m_i = 0$, it has no effect on prediction, because it is multiplied by zero inside the linear model, $\alpha + \beta_m m_i$, canceling it out, whatever its value. This is just to say that, when $m_i = 1$, the linear model is $\mu_i = \alpha + \beta_m$. And when $m_i = 0$, the linear model is simply $\mu_i = \alpha$.
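As a tiny numeric illustration (made-up parameter values, not estimates from any model), the indicator simply switches $\beta_m$ on and off:
```{python}
# Made-up values, purely to illustrate the indicator mechanics
alpha, beta_m = 150.0, 7.0    # hypothetical female mean and male difference
m = np.array([1, 0, 1, 0])    # 1 = male, 0 = female
alpha + beta_m * m            # males get alpha + beta_m, females get alpha
```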
Using this approach means that $\beta_m$ represents the expected *difference* between males and females in height. The parameter $\alpha$ is used to predict both female and male heights. But male height gets an extra $\beta_m$. This also means that $\alpha$ is no longer the average height in the sample, but rather just the average female height. This can make assigning sensible priors a little harder. If you don’t have a sense of the expected difference in height—what would be reasonable before seeing the data?—then this approach can be a bother. Of course you could get away with a vague prior in this case—there is a lot of data.
Another consequence of having to assign a prior to the difference is that this approach necessarily assumes there is more uncertainty about one of the categories—“male” in this case—than the other. Why? Because a prediction for a male includes two parameters and therefore two priors. We can simulate this directly from the priors. The prior distributions for $\mu$ for females and males are:
```{python}
mu_female = stats.norm.rvs(loc=178, scale=20, size=int(1e4))
mu_male = stats.norm.rvs(loc=178, scale=20, size=int(1e4)) + stats.norm.rvs(loc=0, scale=10, size=int(1e4))
utils.precis(pd.DataFrame({'mu_female': mu_female, 'mu_male': mu_male}))
```
The prior for males is wider, because it uses both parameters. In a regression this simple, the priors will wash out very quickly, but in general we should be careful. We aren’t actually more unsure about male height than female height, *a priori*. Is there another way?
Another approach available to us is an **index variable**. An index variable contains integers that correspond to different categories. The integers are just names, but they also let us reference a list of corresponding parameters, one for each category. In this case, we can construct our index like this:
```{python}
d['sex'] = np.where(d['male'] == 1, 2, 1)
d.head()
```
Now “1” means female and “2” means male. No order is implied. These are just labels. And the mathematical version of the model becomes:
$$
\begin{align*}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha_{\text{sex}[i]} \\
\alpha_j &\sim \text{Normal}(178,20) \quad \text{for } j=1 \ldots 2 \\
\sigma &\sim \text{Uniform}(0,50) \\
\end{align*}
$$
What this does is create a list of $\alpha$ parameters, one for each unique value in the index variable. So in this case we end up with two $\alpha$ parameters, named $\alpha_1$ and $\alpha_2$. The numbers correspond to the values in the index variable `sex`. I know this seems overly complicated, but it solves our problem with the priors. Now the same prior can be assigned to each, corresponding to the notion that all the categories are the same, prior to the data. Neither category has more prior uncertainty than the other. And as you’ll see in a bit, this approach extends effortlessly to contexts with more than two categories.
Let’s approximate the posterior for the above model, the one using an index variable.
```{python}
m5_8 = """
data {
int<lower=1> N;
vector[N] height;
array[N] int sex;
}
parameters {
vector[2] a;
real<lower=0, upper=50> sigma;
}
transformed parameters {
vector[N] mu;
mu = a[sex];
}
model {
height ~ normal(mu, sigma);
sigma ~ uniform(0, 50);
a[1] ~ normal(178, 20);
a[2] ~ normal(178, 20);
}
generated quantities {
real diff_fm;
diff_fm = a[1] - a[2];
}
"""
data = {
'N': len(d),
'height': d.height.tolist(),
'sex': d.sex.tolist()
}
m5_8_model = utils.StanQuap('stan_models/m5_8', m5_8, data=data, generated_var=['diff_fm', 'mu'])
m5_8_model.precis().round(2)
```
Interpreting these parameters is easy enough—they are the expected heights in each category. But often we are interested in differences between categories. In this case, what is the expected difference between females and males? We can compute this using samples from the posterior. In fact, I’ll extract posterior samples into a data frame and insert our calculation directly into the same frame:
```{python}
post = m5_8_model.extract_samples(n=1000)
diff_fm = post['a'][:,0] - post['a'][:,1]
post_df = pd.DataFrame({'sigma': post['sigma'], 'a.1': post['a'][:,0], 'a.2': post['a'][:,1], 'diff_fm': diff_fm})
utils.precis(post_df).round(2)
```
Additionally, while setting up the model, I have added a `diff_fm` quantity inside the `generated quantities` block that directly computes the difference of $\alpha_1$ and $\alpha_2$ sampled from the posterior.
```{python}
post = m5_8_model.laplace_sample(draws=1000).stan_variables()
post_df = pd.DataFrame({'sigma': post['sigma'], 'a.1': post['a'][:,0], 'a.2': post['a'][:,1], 'diff_fm': post['diff_fm']})
utils.precis(post_df).round(2)
```
Our calculation appears at the bottom, as a new parameter in the posterior. This is the expected difference between a female and male in the sample. This kind of calculation is called a **contrast**. No matter how many categories you have, you can use samples from the posterior to compute the contrast between any two.
### Many categories
Binary categories are easy, whether you use an indicator variable or instead an index variable. But when there are more than two categories, the indicator variable approach explodes. You’ll need a new indicator variable for each new category. If you have $k$ unique categories, you need $k−1$ indicator variables. Automated tools do in fact go this route, constructing $k−1$ indicator variables for you and returning $k−1$ parameters (in addition to the intercept).
But we’ll instead stick with the index variable approach. It does not change at all when you add more categories. You do get more parameters, of course, just as many as in the indicator variable approach. But the model specification looks just like it does in the binary case. And the priors continue to be easier, unless you really do have prior information about contrasts. It is also important to get used to index variables, because multilevel models depend upon them.
Let’s explore an example using the primate milk data again. We’re interested now in the `clade` variable, which encodes the broad taxonomic membership of each species:
```{python}
d = pd.read_csv("data/milk.csv", sep=';')
d.clade.unique()
```
We want an index value for each of these four categories. You could do this by hand, but just coercing the factor to an integer will do the job:
```{python}
d['clade_id'] = d.clade.astype('category').cat.codes + 1
```
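For comparison only (a small sketch, not used in the models below), the indicator-variable route that automated tools take would expand `clade` into $k-1$ dummy columns, with the omitted category absorbed into the intercept:
```{python}
# k - 1 indicator columns for the 4 clades; one category becomes the reference
pd.get_dummies(d['clade'], drop_first=True).astype(int).head()
```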
Let’s use a model to measure the average milk energy in each clade. In math form:
$$
\begin{align*}
K_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha_{\text{CLADE}[i]} \\
\alpha_j &\sim \text{Normal}(0,0.5) \quad \text{for } j=1 \ldots 4 \\
\sigma &\sim \text{Exponential}(1) \\
\end{align*}
$$
Remember, $K$ is the standardized kilocalories. We have widened the prior on $\alpha$ a little, to allow the different clades to disperse, if the data wants them to. But I encourage you to play with that prior and repeatedly re-approximate the posterior so you can see how the posterior differences among the categories depend upon it. Firing up `StanQuap` now:
```{python}
d['K'] = utils.standardize(d['kcal.per.g'])
m5_9 = """
data {
int N;
vector[N] K;
array[N] int clade_id;
}
parameters {
real<lower=0> sigma;
vector[4] a;
}
transformed parameters {
vector[N] mu;
mu = a[clade_id];
}
model {
K ~ normal(mu, sigma);
sigma ~ exponential(1);
a ~ normal(0, .5);
}
"""
data = {
'N': len(d),
'K': d.K.tolist(),
'clade_id': d.clade_id.tolist()
}
m5_9_model = utils.StanQuap('stan_models/m5_9', m5_9, data=data, generated_var=['mu'])
summary = m5_9_model.precis()
summary.round(2)
```
```{python}
def plot_forest():
    # Forest plot: posterior means and 89% intervals for each clade intercept
    params = ['a.4', 'a.3', 'a.2', 'a.1']
    labels = ['a.4: Strepsirrhine', 'a.3: Old World Monkey',
              'a.2: New World Monkey', 'a.1: Ape']
    y_range = np.arange(4)
    for y, p in zip(y_range, params):
        plt.hlines(y, xmin=summary['5.5%'][p], xmax=summary['94.5%'][p], linewidth=2)
    mean_range = np.array([summary['Mean'][p] for p in params])
    plt.plot(mean_range, y_range, 'o', fillstyle='none')
    plt.axvline(0, linestyle='--', linewidth=0.3)
    plt.yticks(y_range, labels=labels)
    plt.xlabel('Expected kcal (std)')
    plt.show()

plot_forest()
```
In practice, we have to be very careful to keep track of which index values go with which categories.
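One quick way to recover that mapping here (a small check, since `clade_id` was built from the category codes above):
```{python}
# Integer index -> clade label, matching clade_id = cat.codes + 1
dict(enumerate(d['clade'].astype('category').cat.categories, start=1))
```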
If you have another kind of categorical variable that you’d like to add to the model, the approach is just the same. For example, let’s randomly assign these primates to some made up categories: \[1\] Gryffindor, \[2\] Hufflepuff, \[3\] Ravenclaw, and \[4\] Slytherin.
```{python}
np.random.seed(63)
d['house'] = np.random.choice(np.tile(np.arange(1, 5), 8), len(d), replace=False)
d.head()
```
Now we can include these categories as another predictor in the model:
```{python}
m5_10 = """
data {
int N;
vector[N] K;
array[N] int clade_id;
array[N] int house;
}
parameters {
real<lower=0> sigma;
vector[4] a;
vector[4] h;
}
transformed parameters {
vector[N] mu;
mu = a[clade_id] + h[house];
}
model {
K ~ normal(mu, sigma);
sigma ~ exponential(1);
a ~ normal(0, .5);
h ~ normal(0, .5);
}
"""
data = {
'N': len(d),
'K': d.K.tolist(),
'clade_id': d.clade_id.tolist(),
'house': d.house.tolist()
}
m5_10_model = utils.StanQuap('stan_models/m5_10', m5_10, data=data, generated_var=['mu'])
summary = m5_10_model.precis()
summary.round(2)
```
If you inspect the posterior, you’ll see that \[h.1\] Gryffindor and \[h.4\] Slytherin stand out.
------------------------------------------------------------------------
#### Differences and statistical significance
A common error in interpretation of parameter estimates is to suppose that because one parameter is sufficiently far from zero—is “significant”—and another parameter is not—is “not significant”—that the difference between the parameters is also significant. This is not necessarily so. This isn’t just an issue for non-Bayesian analysis: If you want to know the distribution of a difference, then you must compute that difference, a **contrast**. It isn’t enough to just observe, for example, that a slope among males overlaps a lot with zero while the same slope among females is reliably above zero. You must compute the posterior distribution of the difference in slope between males and females. For example, suppose you have posterior distributions for two parameters, $\beta_f$ and $\beta_m$. $\beta_f$’s mean and standard deviation is $0.15±0.02$, and $\beta_m$’s is $0.02±0.10$. So while $\beta_f$ is reliably different from zero (“significant”) and $\beta_m$ is not, the difference between the two (assuming they are uncorrelated) is $(0.15−0.02) ± \sqrt{0.02^2 + 0.1^2} ≈ 0.13 ± 0.10.$ The distribution of the difference overlaps a lot with zero. In other words, you can be confident that $\beta_f$ is far from zero, but you cannot be sure that the difference between $\beta_f$ and $\beta_m$ is far from zero.
In the context of non-Bayesian significance testing, this phenomenon arises from the fact that statistical significance is inferentially powerful in one way: difference from the null. When $\beta_m$ overlaps with zero, it may also overlap with values very far from zero. Its value is uncertain. So when you then compare $\beta_m$ to $\beta_f$, that comparison is also uncertain, manifesting in the width of the posterior distribution of the difference $\beta_f−\beta_m$. Lurking underneath this example is a more fundamental mistake in interpreting statistical significance: The mistake of accepting the null hypothesis. Whenever an article or book says something like “we found no difference” or “no effect,” this usually means that some parameter was not significantly different from zero, and so the authors adopted zero as the estimate. This is both illogical and extremely common.
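A minimal simulation of the contrast arithmetic described above, treating the two posteriors as approximately Gaussian and uncorrelated (illustrative only, not drawn from any fitted model):
```{python}
# Approximate the two marginal posteriors by Gaussians with the stated means/SDs
beta_f = stats.norm.rvs(loc=0.15, scale=0.02, size=int(1e4))
beta_m = stats.norm.rvs(loc=0.02, scale=0.10, size=int(1e4))
diff = beta_f - beta_m
utils.precis(pd.DataFrame({'beta_f': beta_f, 'beta_m': beta_m, 'diff': diff})).round(2)
```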
------------------------------------------------------------------------
## Summary
This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: *What is the value of knowing each predictor, once we already know the other predictors?* The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions. In the next chapter we’ll continue building the DAG framework and see how adding predictor variables can create as many problems as it can solve.
## Practice Problems
### 5E1
Which of the linear models below are multiple linear regressions?
(2) $\mu_i = \beta_x x_i + \beta_z z_i$ and (4) $\mu_i = \alpha + \beta_x x_i + \beta_z z_i$
Only (2) and (4) are multiple linear regressions. Both have more than one predictor variable and corresponding coefficients in the linear model. Model (1) has only a single predictor variable, $x$. Model (3) has two predictor variables, but only their difference for each case enters the model, so effectively it is a univariate regression with a single slope parameter.
### 5E2
Write down a multiple regression to evaluate the claim: *Animal diversity is linearly related to latitude, but only after controlling for plant diversity*. You just need to write down the model definition.
A verbal model statement like this will always be somewhat ambiguous. That is why mathematical notation is needed in scientific communication. However, the conventional interpretation of the statement would be:
$$
\begin{align*}
A_i &\sim \text{Normal}(\mu_i , \sigma) \\
\mu_i &= \alpha + \beta_L L_i + \beta_P P_i \\
\end{align*}
$$
where A is animal diversity, L is latitude, and P is plant diversity. This linear model “controls” for plant diversity, while estimating a linear relationship between latitude and animal diversity.
### 5E3
Write down a multiple regression to evaluate the claim: *Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree*. Write down the model definition and indicate which side of zero each slope parameter should be on.
Define $T$ as time to PhD degree, the outcome variable implied by the problem. Define $F$ as amount of funding and $S$ as size of laboratory, the implied predictor variables. Then the model (ignoring priors) might be:
$$
\begin{align*}
T_i &\sim \text{Normal}(\mu_i , \sigma) \\
\mu_i &= \alpha + \beta_F F_i + \beta_S S_i \\
\end{align*}
$$
The slopes $\beta_F$ and $\beta_S$ should both be positive. How can both be positively associated with the outcome in a multiple regression, but neither by itself? If they are negatively correlated with one another, then considering each alone may miss the positive relationships with the outcome. For example, large labs have less funding per student. Small labs have more funding per student, but poorer intellectual environments. So both could be positive influences on time to degree, but be negatively associated in nature.
### 5E4
Suppose you have a single categorical predictor with 4 levels (unique values), labeled $A$, $B$, $C$ and $D$. Let $A_i$ be an indicator variable that is $1$ where case $i$ is in category $A$. Also suppose $B_i$, $C_i$, and $D_i$ for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model.
1. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i$
2. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i$
3. $\mu_i = \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i$
4. $\mu_i = \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$
5. $\mu_i = \alpha_A (1-B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$
First, the answer will actually depend upon the priors, which aren’t mentioned in the problem. But assuming weakly informative or flat priors, the answer is that (1), (3), (4), and (5) are inferentially equivalent. They’ll make the same predictions, and you can convert among them after model fitting. (2) stands out because it has a redundant parameter, the intercept $α$.
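For example, the posterior of model (4) can be recovered from the posterior of model (1), where category $C$ is the reference, by computing sample by sample:
$$
\alpha_A = \alpha + \beta_A, \qquad \alpha_B = \alpha + \beta_B, \qquad \alpha_C = \alpha, \qquad \alpha_D = \alpha + \beta_D
$$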
### 5M1
Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).
There are many good answers to this question. The easiest approach is to think of some context that follows the divorce rate pattern in the chapter: one predictor influences both the outcome and the other predictor. For example, we might consider predicting whether or not a scientific study replicates. Two predictor variables are available: (1) sample size and (2) statistical significance. Sample size influences both statistical significance and reliability of a finding. This induces a correlation between significance and successful replication, even though significance is not associated with replication, once sample size is taken into account.
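A hedged simulation sketch of this story (entirely made-up numbers): sample size drives both the chance of a significant result and the chance of successful replication, so the two end up correlated even though they are independent once sample size is known.
```{python}
# Made-up simulation: z-statistics scale with sqrt(N); "significant" and
# "replicates" are independent given N, yet correlated marginally.
np.random.seed(11)
n_studies = 1000
N = np.random.choice([20, 50, 100, 200, 500], size=n_studies)
effect = 0.3
z_original = stats.norm.rvs(loc=effect * np.sqrt(N), scale=1, size=n_studies)
z_replication = stats.norm.rvs(loc=effect * np.sqrt(N), scale=1, size=n_studies)
sim = pd.DataFrame({
    'N': N,
    'significant': (z_original > 1.64).astype(int),
    'replicates': (z_replication > 1.64).astype(int)
})
sim.corr().round(2)
```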
### 5M2
Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.
Again, many good answers are possible. The pattern from the milk energy example in the chapter is the simplest. Consider for example the influences of income and drug use on health. Income is positively associated, in reality, with health. Drug use is, for the sake of the example, negatively associated with health. But wealthy people consume more drugs than the poor, simply because the wealthy can afford them. So income and drug use are positively associated in the population. If this positive association is strong enough, examining either income or drug use alone will show only a weak relationship with health, because each works on health in opposite directions.
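A hedged simulation sketch of this masking pattern (made-up coefficients): income and drug use are positively correlated, they push health in opposite directions, and each pairwise correlation with health comes out weak.
```{python}
# Made-up coefficients, purely illustrative of masking
np.random.seed(7)
n = 500
income = stats.norm.rvs(size=n)
drug_use = stats.norm.rvs(loc=0.7 * income, scale=np.sqrt(1 - 0.7**2), size=n)
health = stats.norm.rvs(loc=income - drug_use, scale=1, size=n)
sim = pd.DataFrame({'income': income, 'drug_use': drug_use, 'health': health})
sim.corr().round(2)  # weak pairwise correlations with health, strong income-drug_use correlation
```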
### 5M3
It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?
Divorce might lead to, or be in expectation of, remarriage. Thus divorce could cause marriage rate to rise. In order to examine this idea, or another like it, the data would need to be structured into more categories, such as remarriage rate versus first marriage rate. Better yet would be longitudinal data. In many real empirical contexts, causation involves feedback loops that can render regression fairly useless, unless some kind of time series framework is used.
### 5M4
In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.
It is worth finding and entering the values yourself, for the practice at data management. But here are the values I found, scraped from Wikipedia and merged into the original data:
```{python}
# load data
d = pd.read_csv("data/WaffleDivorce.csv", sep=';')
# standardize variables
d['D'] = utils.standardize(d.Divorce)
d['M'] = utils.standardize(d.Marriage)
d['A'] = utils.standardize(d.MedianAgeMarriage)
d['pct_LDS'] = [0.75, 4.53, 6.18, 1, 2.01, 2.82, 0.43, 0.55, 0.38,
0.75, 0.82, 5.18, 26.35, 0.44, 0.66, 0.87, 1.25, 0.77, 0.64, 0.81,
0.72, 0.39, 0.44, 0.58, 0.72, 1.14, 4.78, 1.29, 0.61, 0.37, 3.34,
0.41, 0.82, 1.48, 0.52, 1.2, 3.85, 0.4, 0.37, 0.83, 1.27, 0.75,
1.21, 67.97, 0.74, 1.13, 3.99, 0.92, 0.44, 11.5]
d.head()
```
A first regression model including this variable might be:
```{python}
m_5M4 = """
data {
int N;
vector[N] Divorce;
vector[N] Marriage;
vector[N] MedianAgeMarriage;
vector[N] pct_LDS;
}
parameters {
real<lower=0, upper=10> sigma;
real a;
real bR;
real bA;
real bM;
}
transformed parameters {
vector[N] mu;
mu = a + bR * Marriage + bA * MedianAgeMarriage + bM * pct_LDS;
}
model {
Divorce ~ normal(mu, sigma);
sigma ~ uniform(0, 10);
a ~ normal(0, 100);
bR ~ normal(0, 10);
bA ~ normal(0, 10);
bM ~ normal(0, 10);
}
"""
data = {
'N': len(d),
'Divorce': d.Divorce.tolist(),
'Marriage': d.Marriage.tolist(),
'MedianAgeMarriage': d.MedianAgeMarriage.tolist(),
'pct_LDS': d.pct_LDS.tolist()
}
m_5M4_model = utils.StanQuap(
'stan_models/m_5M4', m_5M4,
data=data, generated_var=['mu'],
algorithm='LBFGS'
)
m_5M4_model.precis().round(2)
```
As expected, there is a negative association between percent LDS and divorce rate. This model assumes the relationship between divorce rate and percent LDS is linear. This makes sense if the LDS community has a lower divorce rate within itself only, and so as it makes up more of a State’s population, that State’s divorce rate declines. This is to say that the expected divorce rate of State $i$ is a “convex” mix of two average divorce rates:
$$
D_i = (1-P)D_G + PD_{LDS}
$$
where $D_i$ is the divorce rate for State $i$, $P$ is the proportion of the State’s population that is LDS, and $D_G$ and $D_{LDS}$ are the divorce rates for gentiles (non-LDS) and LDS members, respectively. If $D_G > D_{LDS}$, then as $P$ increases, the value of $D_i$ decreases linearly. But maybe the percent LDS in the population has a secondary impact as a marker of a State-level cultural environment that has lower divorce in more demographic groups than just LDS. In that case, this model will miss that impact. Can you think of a way to address this? (Perhaps we could examine how $D_G$ itself differs across States as a function of $1-P$, the proportion of the population that is not LDS.)
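As a hedged sketch of the transformation the problem hints at (not part of the model above): the raw percentages are heavily right-skewed, with a couple of extreme values (26.35 and 67.97) dominating the scale, so a log transform before standardizing may behave better as a linear predictor.
```{python}
# Hypothetical alternative predictor: standardized log of the LDS proportion
d['log_LDS'] = utils.standardize(np.log(d['pct_LDS'] / 100))
utils.precis(d[['pct_LDS', 'log_LDS']]).round(2)
```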
### 5M5
One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.
This is an open-ended question with many good, expanding answers. Here’s the basic outline of an approach. The first two implied variables are the rate of obesity $O$ and the price of gasoline $P$. The first proposed mechanism suggests that higher price $P$ reduces driving $D$, which in turn increases exercise $X$, which then reduces obesity $O$. As a set of regressions, this mechanism implies:
1. $D$ as a declining function of $P$
2. $X$ as a declining function of $D$
3. $O$ as a declining function of $X$
In other words, for the mechanism to work, each predictor above needs to be negatively associated with each outcome. Note that each outcome becomes a predictor. That’s just how these causal chains look. A bunch of reasonable control variables could be added to each of the regressions above. Consider for example that a very wealthy person will be more insensitive to changes in price, so we might think to interact $P$ with income. The second proposed mechanism suggests that price $P$ reduces driving $D$, which reduces eating out $E$, which reduces obesity $O$. A similar chain of regressions is implied (a combined regression is sketched after the list below):
1. $D$ as a declining function of $P$
2. $E$ as an increasing function of $D$
3. $O$ as an increasing function of $E$
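One hedged way to combine the final links of both mechanisms into a single multiple regression of obesity on exercise and eating out (ignoring priors), where we would expect $\beta_X < 0$ and $\beta_E > 0$:
$$
\begin{align*}
O_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_X X_i + \beta_E E_i \\
\end{align*}
$$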
### 5H1
In the divorce example, suppose the DAG is: $M →A →D$. What are the implied conditional independencies of the graph? Are the data consistent with it?
```{python}
dag_5H1 = CausalGraphicalModel(nodes=["A", "D", "M"], edges=[("M", "A"), ("A", "D")])
pgm = daft.PGM()
coordinates = {"A": (0, 0), "D": (1, 1), "M": (2, 0)}
for node in dag_5H1.dag.nodes:
pgm.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag_5H1.dag.edges:
pgm.add_edge(*edge)
pgm.render()
plt.gca().invert_yaxis()
```
```{python}
print('DAG:', dag_5H1.get_distribution())
# get_all_independence_relationships() method
# Returns a list of all pairwise conditional independence relationships
# implied by the graph structure.
print(dag_5H1.get_all_independence_relationships())
```
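The only implied conditional independency is $D \perp\!\!\!\perp M \mid A$. To ask whether the data are consistent with it, we can regress $D$ on both $M$ and $A$ and check whether the coefficient on $M$ is concentrated near zero. A minimal sketch, assuming `d` still holds the standardized WaffleDivorce variables from 5M4 above and using a hypothetical model file `stan_models/m_5H1` (priors follow the chapter's divorce models):
```{python}
m_5H1 = """
data {
    int N;
    vector[N] D;
    vector[N] M;
    vector[N] A;
}
parameters {
    real a;
    real bM;
    real bA;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = a + bM * M + bA * A;
}
model {
    D ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bM ~ normal(0, 0.5);
    bA ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
"""
data = {'N': len(d), 'D': d.D.tolist(), 'M': d.M.tolist(), 'A': d.A.tolist()}
m_5H1_model = utils.StanQuap('stan_models/m_5H1', m_5H1, data=data, generated_var=['mu'])
m_5H1_model.precis().round(2)
```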
### 5H2
Assuming that the DAG for the divorce example is indeed $M→A→D$, fit a new model and use it to estimate the counterfactual effect of halving a State’s marriage rate $M$. Use the counterfactual example from the chapter as a template.