---
title: 04b - The Many Variables & The Spurious Waffles
author: Abdullah Mahmood
date: last-modified
format:
  html:
    theme: cosmo
    css: quarto-style/style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 4
    toc: true
    toc-location: right
    toc-depth: 4
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    code-links:
      - text: GitHub Repo
        icon: github
        href: https://github.com/abdullahau/bayesian-analysis
      - text: Quarto Markdown
        icon: file-code
        href: https://github.com/abdullahau/bayesian-analysis/blob/main/04b%20-%20The%20Many%20Variables%20%26%20The%20Suprious%20Waffles.qmd
    html-math-method:
      method: mathjax
      url: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
---
### Imports
```{python}
# ruff: noqa: F405
from init import *
from causalgraphicalmodels import CausalGraphicalModel
import daft
%config InlineBackend.figure_formats = ['svg']
```
## Masked Relationship
The divorce rate example demonstrates that multiple predictor variables are useful for knocking out spurious association. A second reason to use more than one predictor variable is to measure the direct influences of multiple factors on an outcome, when none of those influences is apparent from bivariate relationships. This kind of problem tends to arise when there are two predictor variables that are correlated with one another. However, one of these is positively correlated with the outcome and the other is negatively correlated with it.
You’ll consider this kind of problem in a new data context: information about the composition of milk across primate species, as well as some facts about those species, like body mass and brain size. Milk is a huge investment, being much more expensive than gestation. Such an expensive resource is likely adjusted in subtle ways, depending upon the physiological and developmental details of each mammal species. Let’s load the data:
```{python}
d = pd.read_csv("data/milk.csv", sep=';')
```
You should see in the structure of the data frame that there are 29 rows and 8 variables. The variables we’ll consider for now are `kcal.per.g` (kilocalories of energy per gram of milk), `mass` (average female body mass, in kilograms), and `neocortex.perc` (percent of total brain mass that is neocortex mass).
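A quick structural check confirms this, and also previews the missing values we will have to deal with shortly:

```{python}
d.info()
```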
A popular hypothesis has it that primates with larger brains produce more energetic milk, so that brains can grow quickly. Answering questions of this sort consumes a lot of effort in evolutionary biology, because there are many subtle statistical issues that arise when comparing species. It doesn’t help that many biologists have no reference model other than a series of regressions, and so the output of the regressions is not really interpretable. The causal meaning of statistical estimates always depends upon information outside the data.
We won’t solve these problems here. But we will explore a useful example. The question here is to what extent energy content of milk, measured here by kilocalories, is related to the percent of the brain mass that is neocortex. Neocortex is the gray, outer part of the brain that is especially elaborate in some primates. We’ll end up needing female body mass as well, to see the masking that hides the relationships among the variables. Let’s standardize these three variables. As in previous examples, standardizing helps us both get a reliable approximation of the posterior as well as build reasonable priors.
```{python}
d['K'] = utils.standardize(d['kcal.per.g'])
d['N'] = utils.standardize(d['neocortex.perc'])
d['M'] = utils.standardize(np.log(d['mass']))
```
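For reference, `utils.standardize` is assumed here to be the usual z-score transform. A minimal sketch of an equivalent (the actual helper in this repo may differ in details, such as its degrees-of-freedom convention):

```{python}
# Hypothetical stand-in for utils.standardize: center on the mean and
# divide by the standard deviation, ignoring missing values.
def standardize_sketch(x):
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)
```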
The first model to consider is the simple bivariate regression between kilocalories and neocortex percent. You already know how to set up this regression. In mathematical form:
$$
\begin{align*}
K_i &\sim \text{Normal}(\mu_i,\sigma) \\
\mu_i &= \alpha + \beta_N N_{i} \\
\end{align*}
$$
where $K$ is standardized kilocalories and $N$ is standardized neocortex percent. We still need to consider the priors. But first let’s just try to run this as a `StanQuap` model with some vague priors, because there is another key modeling issue to address first. Note that the column `d.N` contains `NaN` values, i.e., missing data. If you pass a vector like this to Stan's likelihood function, it doesn’t know what to do. After all, what’s the probability of a missing value? Whatever the answer, it isn’t a number, and so Stan returns an error, failing to evaluate the log probability at the initial value.
```{python}
d.N
```
This is easy to fix. What you need to do here is manually drop all the cases with missing values. This is known as a **complete case analysis**. Some automated model fitting commands will silently drop such cases for you. But this isn’t always a good thing. First, its validity depends upon the process that caused these particular values to go missing. Later, we'll explore this in much more depth. Second, once you start comparing models, you must compare models fit to the same data. If some variables have missing values that others do not, automated tools will silently produce misleading comparisons. Let’s march forward for now, dropping any cases with missing values. It’s worth learning how to do this yourself. To make a new data frame with only complete cases, use:
```{python}
dcc = d.dropna(subset=['K', 'N', 'M']).reset_index(drop=True)
dcc.shape
```
This makes a new data frame, `dcc`, that consists of the 17 rows from `d` that have no missing values in any of the variables passed to the `subset` argument of `dropna`. Now let’s work with the new data frame:
```{python}
m5_5_draft = '''
data {
    int<lower=0> n;
    vector[n] N_seq;
}
generated quantities {
    real a = normal_rng(0, 1);
    real bN = normal_rng(0, 1);
    vector[n] mu = a + bN * N_seq;
}
'''
data = {'n': 2, 'N_seq': [-2, 2]}
m5_5_draft_model = utils.Stan('stan_models/m5_5_draft', m5_5_draft)
```
Before considering the posterior predictions, let’s consider those priors. As in many simple linear regression problems, these priors are harmless. But are they reasonable? It is important to build reasonable priors, because as the model becomes less simple, the priors can be very helpful, but only if they are scientifically reasonable. To simulate and plot 50 prior regression lines:
```{python}
mu = m5_5_draft_model.laplace_sample(data=data, draws=1000).stan_variable('mu')
```
```{python}
def plot_prior():
    plt.plot([-2, 2], mu[:50].T, 'k', alpha=0.2)
    plt.xlim(-2.1, 2.1)
    plt.ylim(-2.1, 2.1)
    plt.xlabel('Neocortex Percent (std)')
    plt.ylabel('Kilocal per g (std)')
    plt.title(r'$\alpha$ ~ Normal(0, 1), $\beta_N$ ~ Normal(0, 1)')
    plt.show()
plot_prior()
```
We’ve shown a range of 2 standard deviations for both variables, so that is most of the outcome space. These lines are crazy. As in previous examples, we can do better by tightening the $\alpha$ prior so that it sticks closer to zero: with two standardized variables, when the predictor is zero, the expected value of the outcome should also be zero. The slope $\beta_N$ needs to be a bit tighter as well, so that it doesn’t regularly produce impossibly strong relationships. Here’s an attempt:
```{python}
m5_5 = '''
data {
    int<lower=0> n;
    vector[n] N;
    vector[n] K;
    int<lower=0> n_tilde;
    vector[n_tilde] N_seq;
}
parameters {
    real a;
    real bN;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bN * N;
}
model {
    K ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bN ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
generated quantities {
    real a_sim = normal_rng(0, 0.2);
    real bN_sim = normal_rng(0, 0.5);
    vector[n_tilde] mu_sim = a_sim + bN_sim * N_seq;
}
'''
data = {'n': len(dcc),
        'N': dcc.N.tolist(),
        'K': dcc.K.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2]}
m5_5_model = utils.StanQuap('stan_models/m5_5', m5_5, data=data,
                            generated_var=['mu_sim', 'mu', 'a_sim', 'bN_sim'])
```
```{python}
mu = m5_5_model.extract_samples(n=1000, select=['mu_sim'])
```
```{python}
def plot_prior():
    plt.plot([-2, 2], mu['mu_sim'][:50].T, 'k', alpha=0.2)
    plt.xlim(-2.1, 2.1)
    plt.ylim(-2.1, 2.1)
    plt.xlabel('Neocortex Percent (std)')
    plt.ylabel('Kilocal per g (std)')
    plt.title(r'$\alpha$ ~ Normal(0, 0.2), $\beta_N$ ~ Normal(0, 0.5)')
    plt.show()
plot_prior()
```
The modified priors produce the plot above. These are still very vague priors, but at least the lines stay within the high-probability region of the observable data.
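To put a number on "impossibly strong relationships": with both variables standardized, a slope beyond $\pm 1$ means a 1 standard deviation change in neocortex percent moves kilocalories by more than 1 standard deviation. Using `scipy.stats` (available here as `stats` via the `init` import), we can compare how often each prior produces such slopes:

```{python}
# Prior probability that |bN| > 1 under each prior scale
print(2 * (1 - stats.norm.cdf(1, loc=0, scale=1)))    # Normal(0, 1): about 0.32
print(2 * (1 - stats.norm.cdf(1, loc=0, scale=0.5)))  # Normal(0, 0.5): about 0.05
```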
Now let's look at the posterior:
```{python}
m5_5_model.precis().round(2)
```
From this summary, you can possibly see that this is neither a strong nor a very precise association. The standard deviation of `bN` is almost twice its posterior mean. But as always, it’s much easier to see this if we draw a picture. We can plot the predicted mean and 89% compatibility interval for the mean to see this more easily. The code below contains no surprises. But we have extended the range of $N$ values to consider, in `xseq`, so that the plot looks nicer.
```{python}
xseq = np.linspace(dcc.N.min()-0.15, dcc.N.max()+0.15, 30)
def lm(post, predictor):
    a, bN, sigma = post.values()
    mu = a + bN * predictor.reshape(-1, 1)
    return mu.T
mu = m5_5_model.link(lm_func=lm, predictor=xseq, select=['a', 'bN', 'sigma'])
mu_mean, mu_plo, mu_phi = utils.precis(mu)
def plot_post():
    plt.plot(dcc.N, dcc.K, 'o', fillstyle='none')
    plt.plot(xseq, mu_mean, c="black")
    plt.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    plt.xlabel("Neocortex Percent (std)")
    plt.ylabel("Kilocal per g (std)")
    plt.show()
plot_post()
```
The posterior mean line is weakly positive, but it is highly imprecise. A lot of mildly positive and negative slopes are plausible, given this model and these data.
Now consider another predictor variable, adult female body mass, `mass` in the data frame. Let’s use the logarithm of mass, `np.log(mass)`, as a predictor as well. Why the logarithm of mass instead of the raw mass in kilograms? It is often true that *scaling measurements like body mass are related by magnitudes to other variables*. Taking the *log of a measure translates the measure into magnitudes*. So by using the logarithm of body mass here, we’re saying that we suspect that the magnitude of a mother’s body mass is related to milk energy, in a linear fashion. Much later, we’ll see why these logarithmic relationships are almost inevitable results of the physics of organisms.
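A quick check shows what the transform accomplishes here (no claims about the exact values, just the spans):

```{python}
# Raw mass spans orders of magnitude across species; log-mass turns
# those multiplicative differences into an even, additive scale.
print(d['mass'].min(), d['mass'].max())
print(np.log(d['mass']).min(), np.log(d['mass']).max())
```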
Now we construct a similar model, but consider the bivariate relationship between kilocalories and body mass. Since body mass is also standardized, we can use the same priors and stay within possible outcome values. But if you were a domain expert in growth, you could surely do better than this.
```{python}
m5_6 = '''
data {
    int<lower=0> n;
    vector[n] M;
    vector[n] K;
    int<lower=0> n_tilde;
    vector[n_tilde] xseq;
}
parameters {
    real a;
    real bM;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bM * M;
}
model {
    K ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bM ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
generated quantities {
    vector[n_tilde] y_tilde = a + bM * xseq;
}
'''
xseq = np.linspace(dcc.M.min()-0.15, dcc.M.max()+0.15, 30)
data = {'n': len(dcc),
        'M': dcc.M.tolist(),
        'K': dcc.K.tolist(),
        'n_tilde': len(xseq),
        'xseq': xseq.tolist()}
m5_6_model = utils.StanQuap('stan_models/m5_6', m5_6, data=data,
                            generated_var=['mu', 'y_tilde'])
m5_6_model.precis().round(2)
```
```{python}
mu = m5_6_model.extract_samples(n=1000, select=['y_tilde'])
mu_mean, mu_plo, mu_phi = utils.precis(mu['y_tilde'])
def plot_post():
    plt.plot(dcc.M, dcc.K, 'o', fillstyle='none')
    plt.plot(xseq, mu_mean, c="black")
    plt.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    plt.xlabel("Log Body Mass (std)")
    plt.ylabel("Kilocal per g (std)")
    plt.show()
plot_post()
```
Log-mass is negatively associated with kilocalories. This association does seem stronger than that of neocortex percent, although in the opposite direction. It is quite uncertain though, with a wide compatibility interval that is consistent with a wide range of both weak and stronger relationships.
Now let’s see what happens when we add both predictor variables at the same time to the regression. This is the multivariate model, in math form:
$$
\begin{align*}
K_i &\sim \text{Normal}(\mu_i,\sigma) \\
\mu_i &= \alpha + \beta_N N_{i} + \beta_M M_i \\
\alpha &\sim \text{Normal}(0,0.2) \\
\beta_N &\sim \text{Normal}(0,0.5) \\
\beta_M &\sim \text{Normal}(0,0.5) \\
\sigma &\sim \text{Exponential}(1) \\
\end{align*}
$$
```{python}
m5_7 = '''
data {
    int<lower=0> n;
    vector[n] K;
    vector[n] N;
    vector[n] M;
    int<lower=0> n_tilde;     // Number of counterfactual simulations
    vector[n_tilde] N_seq;    // Counterfactual N values
    vector[n_tilde] M_seq;    // Counterfactual M values
}
parameters {
    real a;
    real bN;
    real bM;
    real<lower=0> sigma;
}
transformed parameters {
    vector[n] mu;
    mu = a + bN * N + bM * M;  // N -> K <- M
}
model {
    // Priors
    a ~ normal(0, 0.2);
    bN ~ normal(0, 0.5);
    bM ~ normal(0, 0.5);
    sigma ~ exponential(1);
    // Likelihood
    K ~ normal(mu, sigma);
}
generated quantities {
    vector[n_tilde] mu_tilde;
    vector[n_tilde] K_tilde;
    for (i in 1:n_tilde) {
        mu_tilde[i] = a + bN * N_seq[i] + bM * M_seq[i];
    }
    for (i in 1:n_tilde) {
        K_tilde[i] = normal_rng(mu_tilde[i], sigma);
    }
}
'''
xseq = np.linspace(dcc.M.min()-0.15, dcc.M.max()+0.15, 30)
data = {'n': len(dcc),
        'K': dcc.K.tolist(),
        'N': dcc.N.tolist(),
        'M': dcc.M.tolist(),
        'n_tilde': len(xseq),
        'N_seq': np.zeros_like(xseq).tolist(),
        'M_seq': xseq.tolist()}
m5_7_model = utils.StanQuap('stan_models/m5_7', m5_7, data=data, algorithm='LBFGS',
                            generated_var=['mu_tilde', 'K_tilde', 'mu'])
```
```{python}
m5_7_model.precis().round(2)
```
By incorporating both predictor variables in the regression, the posterior association of both with the outcome has increased. Visually comparing this posterior to those of the previous two models helps to see the pattern of change:
```{python}
def plot_forest():
    p5_5, p5_6, p5_7 = m5_5_model.precis(), m5_6_model.precis(), m5_7_model.precis()
    y_range = np.arange(4)
    plt.hlines(y_range[0], xmin=p5_5['5.5%']['bN'], xmax=p5_5['94.5%']['bN'], linewidth=2)
    plt.hlines(y_range[1], xmin=p5_7['5.5%']['bN'], xmax=p5_7['94.5%']['bN'], linewidth=2)
    plt.hlines(y_range[2], xmin=p5_6['5.5%']['bM'], xmax=p5_6['94.5%']['bM'], linewidth=2)
    plt.hlines(y_range[3], xmin=p5_7['5.5%']['bM'], xmax=p5_7['94.5%']['bM'], linewidth=2)
    mean_range = np.array([p5_5['Mean']['bN'], p5_7['Mean']['bN'],
                           p5_6['Mean']['bM'], p5_7['Mean']['bM']])
    plt.plot(mean_range, y_range, 'o', fillstyle='none')
    plt.axvline(0, linestyle='--', linewidth=0.3)
    plt.yticks(y_range, labels=['bN - m5_5', 'bN - m5_7', 'bM - m5_6', 'bM - m5_7'])
    plt.xlabel('Estimate')
    plt.show()
plot_forest()
```
The posterior means for neocortex percent and log-mass have both moved away from zero. Adding both predictors to the model seems to have made their estimates move apart.
What happened here? Why did adding neocortex and body mass to the same model lead to stronger associations for both? This is a context in which there are two variables correlated with the outcome, but one is positively correlated with it and the other is negatively correlated with it. In addition, both of the explanatory variables are positively correlated with one another. Try a simple pairs plot to appreciate this pattern of correlation. The result of this pattern is that the variables tend to cancel one another out.
```{python}
def plot_pairs():
    _, ax = plt.subplots(3, 3)
    az.plot_pair({'K': dcc.K, 'M': dcc.M, 'N': dcc.N}, marginals=True, ax=ax)
    plt.show()
plot_pairs()
```
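The correlation matrix makes the same point numerically: the two predictors are positively correlated with one another, while they lean in opposite directions with respect to the outcome:

```{python}
dcc[['K', 'N', 'M']].corr().round(2)
```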
This is another case in which multiple regression automatically finds the most revealing cases and uses them to produce inferences. What the regression model does is ask if species that have high neocortex percent for their body mass have higher milk energy. Likewise, the model asks if species with high body mass for their neocortex percent have higher milk energy. Bigger species, like apes, have milk with less energy. But species with more neocortex tend to have richer milk. The fact that these two variables, body size and neocortex, are correlated across species makes it hard to see these relationships, unless we account for both.
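One way to see this "relative to" logic directly (not something computed in the text above, just an illustrative check) is to regress one predictor on the other and correlate the residuals with the outcome. Neocortex percent with body mass "subtracted out" should associate with milk energy more strongly than raw neocortex percent does:

```{python}
# Regress N on M (np.polyfit returns the slope first), then remove that
# component of N and correlate the residual with K.
b1, b0 = np.polyfit(dcc.M, dcc.N, 1)
N_resid = dcc.N - (b0 + b1 * dcc.M)
print(np.corrcoef(dcc.N, dcc.K)[0, 1])    # raw association: weakly positive
print(np.corrcoef(N_resid, dcc.K)[0, 1])  # after adjusting for M: stronger
```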
Some DAGs will help. There are at least three graphs consistent with these data.
```{python}
dag5_1 = CausalGraphicalModel(nodes=["M", "N", "K"], edges=[("M", "N"), ("M", "K"), ("N", "K")])
pgm1 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0)}
for node in dag5_1.dag.nodes:
    pgm1.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag5_1.dag.edges:
    pgm1.add_edge(*edge)
pgm1.render()
plt.gca().invert_yaxis()
```
```{python}
dag5_2 = CausalGraphicalModel(nodes=["M", "N", "K"], edges=[("N", "M"), ("M", "K"), ("N", "K")])
pgm2 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0)}
for node in dag5_2.dag.nodes:
    pgm2.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag5_2.dag.edges:
    pgm2.add_edge(*edge)
pgm2.render()
plt.gca().invert_yaxis()
```
```{python}
dag5_3 = CausalGraphicalModel(nodes=["M", "N", "K", "U"], edges=[("U", "M"), ("U", "N"), ("M", "K"), ("N", "K")])
pgm3 = daft.PGM()
coordinates = {"M": (0, 0), "K": (1, 1), "N": (2, 0), "U": (1, 0)}
for node in dag5_3.dag.nodes:
    if node != "U":
        pgm3.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
    else:
        pgm3.add_node(node, node, *coordinates[node])
for edge in dag5_3.dag.edges:
    pgm3.add_edge(*edge)
pgm3.render()
plt.gca().invert_yaxis()
```
In the first DAG above, body mass ($M$) influences neocortex percent ($N$), and both then influence kilocalories in milk ($K$). In the second, neocortex percent instead influences body mass; the two variables still end up correlated in the sample. Finally, in the third, there could be an unobserved variable $U$ that influences both $M$ and $N$, producing a correlation between them. In this book, I’ll circle variables that are unobserved. One of the threats to causal inference is that there are potentially many unobserved variables that influence an outcome or the predictors. We’ll consider this more in the next chapter.
Which of these graphs is right? We can’t tell from the data alone, because these graphs imply the same set of **conditional independencies**. In this case, there are no conditional independencies: each DAG above implies that all pairs of variables are associated, regardless of what we condition on. A set of DAGs with the same conditional independencies is known as a **Markov equivalence** set. Next, I’ll show you how to simulate observations consistent with each of these DAGs, how each can produce the masking phenomenon, and how to compute the complete set of Markov equivalent DAGs (below, using the `dagitty` package in R). Remember that while the data alone can never tell you which causal model is correct, your scientific knowledge of the variables will eliminate a large number of silly, but Markov equivalent, DAGs.
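We can check the claim about conditional independencies directly, using the `get_all_independence_relationships` method of `CausalGraphicalModel`: the first two DAGs imply none at all, and the third implies only ones that involve the unobserved $U$, so the three graphs are indistinguishable from observations of $M$, $N$, and $K$ alone:

```{python}
# dag5_1 and dag5_2 should print empty lists; dag5_3's entries all involve U
for name, dag in [('dag5_1', dag5_1), ('dag5_2', dag5_2), ('dag5_3', dag5_3)]:
    print(name, dag.get_all_independence_relationships())
```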
The final thing we’d like to do with these models is to make *counterfactual* plots again. Suppose the third DAG above is the right one. Then imagine manipulating $M$ and $N$, breaking the influence of $U$ on each. In the real world, such experiments are impossible. If we change an animal’s body size, natural selection would then change the other features to match it. But these counterfactual plots do help us see how the model views the association between each predictor and the outcome. Here is the code:
```{python}
def generate_counterfactual_data(dcc, xseq, hold_variable):
    return {
        'n': len(dcc),
        'K': dcc.K.tolist(),
        'N': dcc.N.tolist(),
        'M': dcc.M.tolist(),
        'n_tilde': len(xseq),
        'N_seq': np.zeros_like(xseq).tolist() if hold_variable == 'N' else xseq.tolist(),
        'M_seq': xseq.tolist() if hold_variable == 'N' else np.zeros_like(xseq).tolist()
    }

def plot_counterfactual(ax, xseq, mu_mean, mu_plo, mu_phi, xlabel, title):
    ax.plot(xseq, mu_mean, c="black")
    ax.fill_between(xseq, mu_plo, mu_phi, alpha=0.2)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Kilocal per g (std)")
    ax.set_title(title)

def plot_counterfactuals():
    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    counterfactuals = [
        {"xseq": np.linspace(dcc.M.min() - 0.15, dcc.M.max() + 0.15, 30),
         "hold_variable": "N",
         "xlabel": "Log Body Mass (std)",
         "title": "Counterfactual holding N = 0"},
        {"xseq": np.linspace(dcc.N.min() - 0.15, dcc.N.max() + 0.15, 30),
         "hold_variable": "M",
         "xlabel": "Neocortex Percent (std)",
         "title": "Counterfactual holding M = 0"}
    ]
    for ax, cf in zip(axes, counterfactuals):
        data = generate_counterfactual_data(dcc, cf["xseq"], cf["hold_variable"])
        mu = m5_7_model.laplace_sample(data=data, draws=1000).stan_variable('mu_tilde')
        mu_mean, mu_plo, mu_phi = utils.precis(mu)
        plot_counterfactual(ax, cf["xseq"], mu_mean, mu_plo, mu_phi, cf["xlabel"], cf["title"])
    plt.tight_layout()
    plt.show()

plot_counterfactuals()
```
------------------------------------------------------------------------
#### Simulating a Masking Relationship
Just as with understanding spurious association, it may help to simulate data in which two meaningful predictors act to mask one another. In the previous section, I showed three DAGs consistent with this. To simulate data consistent with the first DAG:
```{python}
# M -> K <- N
# M -> N
n = 100
M = stats.norm.rvs(size=n)
N = stats.norm.rvs(loc=M)
K = stats.norm.rvs(loc=N-M)
d_sim = pd.DataFrame({'K':K, 'N': N, 'M':M})
```
You can quickly see the masking pattern of inferences by replacing `dcc` with `d_sim` in models `m5_5`, `m5_6`, and `m5_7`. Look at the summaries and you’ll see the same masking pattern where the slopes become more extreme in `m5_7`.
```{python}
data = {'n': len(d_sim),
        'N': d_sim.N.tolist(),
        'K': d_sim.K.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2]}
m5_5_dsim = utils.StanQuap('stan_models/m5_5', m5_5, data=data,
                           generated_var=['mu_sim', 'mu', 'a_sim', 'bN_sim'])
print(m5_5_dsim.precis().round(2))

data = {'n': len(d_sim),
        'M': d_sim.M.tolist(),
        'K': d_sim.K.tolist(),
        'n_tilde': 2,
        'xseq': [-2, 2]}
m5_6_dsim = utils.StanQuap('stan_models/m5_6', m5_6, data=data,
                           generated_var=['mu', 'y_tilde'])
print(m5_6_dsim.precis().round(2))

data = {'n': len(d_sim),
        'K': d_sim.K.tolist(),
        'N': d_sim.N.tolist(),
        'M': d_sim.M.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2],
        'M_seq': [-2, 2]}
m5_7_dsim = utils.StanQuap('stan_models/m5_7', m5_7, data=data, algorithm='LBFGS',
                           generated_var=['mu_tilde', 'K_tilde', 'mu'])
print(m5_7_dsim.precis().round(2))
```
```{python}
def plot_pairs():
    _, ax = plt.subplots(3, 3)
    az.plot_pair({'K': K, 'M': M, 'N': N}, marginals=True, ax=ax)
    plt.show()
plot_pairs()
```
The other two DAGs can be simulated like this:
```{python}
# M -> K <- N
# N -> M
n = 100
N = stats.norm.rvs(size = n)
M = stats.norm.rvs(loc=N)
K = stats.norm.rvs(loc=N-M)
d_sim2 = pd.DataFrame({'K':K, 'N': N, 'M':M})
# M -> K <- N
# M <- U -> N
n = 100
U = stats.norm.rvs(size=n)
M = stats.norm.rvs(loc=U)
N = stats.norm.rvs(loc=U)
K = stats.norm.rvs(loc=N-M)
d_sim3 = pd.DataFrame({'K':K, 'N': N, 'M':M})
```
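As before, you can swap `d_sim2` or `d_sim3` into the three models to confirm that the masking pattern appears under these DAGs as well. For example, with `d_sim2`:

```{python}
# Refit the full model m5_7 to the second simulated dataset; bN and bM
# should again be more extreme than in the bivariate fits.
data = {'n': len(d_sim2),
        'K': d_sim2.K.tolist(),
        'N': d_sim2.N.tolist(),
        'M': d_sim2.M.tolist(),
        'n_tilde': 2,
        'N_seq': [-2, 2],
        'M_seq': [-2, 2]}
m5_7_dsim2 = utils.StanQuap('stan_models/m5_7', m5_7, data=data, algorithm='LBFGS',
                            generated_var=['mu_tilde', 'K_tilde', 'mu'])
print(m5_7_dsim2.precis().round(2))
```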
In the primate milk example, it may be that the positive association between large body size and neocortex percent arises from a tradeoff between lifespan and learning. Large animals tend to live a long time. And in such animals, an investment in learning may be a better investment, because learning can be amortized over a longer lifespan. Both large body size and large neocortex then influence milk composition, but in different directions, for different reasons. This story implies that the DAG with an arrow from $M$ to $N$, the first one, is the right one. But with the evidence at hand, we cannot easily see which is right. We should compute the **Markov equivalence** set and eliminate those that we think are not valid based upon our scientific knowledge of the variables.
We can compute the **Markov equivalence** set in R or Julia using the `dagitty` package. The DAGs below were produced with the following R code:
``` r
dag5.7 <- dagitty( "dag{
M -> K <- N
M -> N }" )
coordinates(dag5.7) <- list( x=c(M=0,K=1,N=2) , y=c(M=0.5,K=1,N=0.5) )
MElist <- equivalentDAGs(dag5.7)
drawdag(MElist)
```
{fig-align="center" width="389"}
The `MElist` variable contains six different DAGs. Which of these do you think you could eliminate, based upon scientific knowledge of the variables?
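As a cross-check in Python, here is a brute-force sketch (assuming the `CausalGraphicalModel` constructor rejects cyclic edge sets with an exception, which it does via an acyclicity check): enumerate every DAG on $\{M, N, K\}$ and keep those implying the same conditional independencies as `dag5_1`, which here means none at all. This recovers the same six DAGs as `equivalentDAGs`:

```{python}
from itertools import product

pairs = [("M", "N"), ("M", "K"), ("N", "K")]
target = dag5_1.get_all_independence_relationships()  # empty list for dag5_1
equivalent = []
for choice in product([0, 1, 2], repeat=3):  # each pair: absent / -> / <-
    edges = []
    for (a, b), c in zip(pairs, choice):
        if c == 1:
            edges.append((a, b))
        elif c == 2:
            edges.append((b, a))
    try:
        g = CausalGraphicalModel(nodes=["M", "N", "K"], edges=edges)
    except Exception:
        continue  # cyclic orientation, not a DAG
    # target is empty, so plain list equality is a safe comparison here
    if g.get_all_independence_relationships() == target:
        equivalent.append(edges)
print(len(equivalent))  # 6, matching dagitty's MElist
```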
------------------------------------------------------------------------