---
title: 04c - The Many Variables & The Spurious Waffles
author: Abdullah Mahmood
date: last-modified
format:
  html:
    theme: cosmo
    css: quarto-style/style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 4
    toc: true
    toc-location: right
    toc-depth: 4
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    code-links:
      - text: GitHub Repo
        icon: github
        href: https://github.com/abdullahau/bayesian-analysis
      - text: Quarto Markdown
        icon: file-code
        href: https://github.com/abdullahau/bayesian-analysis/blob/main/04c%20-%20The%20Many%20Variables%20%26%20The%20Suprious%20Waffles.qmd
    html-math-method:
      method: mathjax
      url: https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
---
### Imports
```{python}
# ruff: noqa: F405
from init import *
from causalgraphicalmodels import CausalGraphicalModel
import daft as daft
%config InlineBackend.figure_formats = ['svg']
```
## Categorical Variables
A common question for statistical methods is to what extent an outcome changes as a result of presence or absence of a category. A category here means discrete and unordered. For example, consider the different species in the milk energy data again. Some of them are apes, while others are New World monkeys. We might want to ask how predictions should vary when the species is an ape instead of a monkey. Taxonomic group is a **categorical variable**, because no species can be half-ape and half-monkey (discreteness), and there is no sense in which one is larger or smaller than the other (unordered). Other common examples of categorical variables include:
- Sex: male, female
- Developmental status: infant, juvenile, adult
- Geographic region: Africa, Europe, Melanesia
Many readers will already know that variables like this, routinely called **factors**, can easily be included in linear models. But what is not widely understood is how these variables are represented in a model. The computer does all of the work for us, hiding the machinery. But there are some subtleties that make it worth exposing the machinery. Knowing how the machine works both helps you interpret the posterior distribution and gives you additional power in building the model.
------------------------------------------------------------------------
#### Continuous countries
With automated software and lack of attention, categorical variables can be dangerous. In 2015, a high-impact journal published a study of 1170 children from six countries, finding a strong negative association between religiosity and generosity. The paper caused a small stir among religion researchers, because it disagreed with the existing literature. Upon reanalysis, it was found that the country variable, which is categorical, was entered as a continuous variable instead. This made Canada (value 2) twice as much “country” as the United States (value 1). After reanalysis with country as a categorical variable, the result vanished, and the original paper has since been retracted. This is a happy ending, because the authors shared their data. How many cases like this exist, undiscovered because the data have never been shared and are possibly lost forever?
------------------------------------------------------------------------
### Binary categories
In the simplest case, the variable of interest has only two categories, like male and female. Let’s rewind to the Kalahari data from earlier. Back then, we ignored sex when predicting height, but obviously we expect males and females to have different averages. Take a look at the variables available:
```{python}
d = pd.read_csv("data/Howell1.csv", sep=';')
d.head()
```
The `male` variable is our new predictor, an example of an **indicator variable**. Indicator variables—sometimes also called “dummy” variables—are devices for encoding unordered categories into quantitative models. There is no sense here in which “male” is one more than “female.” The purpose of the `male` variable is to indicate when a person in the sample is “male.” So it takes the value 1 whenever the person is male, but it takes the value 0 when the person belongs to any other category. It doesn’t matter which category is indicated by the 1. The model won’t care. But correctly interpreting the model demands that you remember, so it’s a good idea to name the variable after the category assigned the 1 value.
There are two ways to make a model with this information. The first is to use the indicator variable directly inside the linear model, as if it were a typical predictor variable. The effect of an indicator variable is to turn a parameter on for those cases in the category. Simultaneously, the variable turns the same parameter off for those cases in another category. This will make more sense, once you see it in the mathematical definition of the model. Consider again a linear model of height, from earlier. Now we’ll ignore weight and the other variables and focus only on sex.
$$
\begin{align*}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_m m_i \\
\alpha &\sim \text{Normal}(178,20) \\
\beta_m &\sim \text{Normal}(0,10)\\
\sigma &\sim \text{Uniform}(0,50) \\
\end{align*}
$$
where $h$ is height and $m$ is the dummy variable indicating a male individual. The parameter $\beta_m$ influences prediction only for those cases where $m_i = 1$. When $m_i = 0$, it has no effect on prediction, because it is multiplied by zero inside the linear model, $\alpha + \beta_m m_i$, canceling it out, whatever its value. This is just to say that, when $m_i = 1$, the linear model is $\mu_i = \alpha + \beta_m$. And when $m_i = 0$, the linear model is simply $\mu_i = \alpha$.
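As a tiny numeric illustration (made-up parameter values, not estimates from any model), the indicator simply switches $\beta_m$ on and off:
```{python}
# Made-up values, purely to illustrate the indicator mechanics
alpha, beta_m = 150.0, 7.0    # hypothetical female mean and male difference
m = np.array([1, 0, 1, 0])    # 1 = male, 0 = female
alpha + beta_m * m            # males get alpha + beta_m, females get alpha
```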
Using this approach means that $\beta_m$ represents the expected *difference* between males and females in height. The parameter $\alpha$ is used to predict both female and male heights. But male height gets an extra $\beta_m$. This also means that $\alpha$ is no longer the average height in the sample, but rather just the average female height. This can make assigning sensible priors a little harder. If you don’t have a sense of the expected difference in height—what would be reasonable before seeing the data?—then this approach can be a bother. Of course you could get away with a vague prior in this case—there is a lot of data.
Another consequence of having to assign a prior to the difference is that this approach necessarily assumes there is more uncertainty about one of the categories—“male” in this case—than the other. Why? Because a prediction for a male includes two parameters and therefore two priors. We can simulate this directly from the priors. The prior distributions for $\mu$ for females and males are:
```{python}
mu_female = stats.norm.rvs(loc=178, scale=20, size=int(1e4))
mu_male = stats.norm.rvs(loc=178, scale=20, size=int(1e4)) + stats.norm.rvs(loc=0, scale=10, size=int(1e4))
utils.precis(pd.DataFrame({'mu_female': mu_female, 'mu_male': mu_male}))
```
The prior for males is wider, because it uses both parameters. In a regression this simple, the priors will wash out very quickly, but in general we should be careful. We aren’t actually more unsure about male height than female height, *a priori*. Is there another way?
Another approach available to us is an **index variable**. An index variable contains integers that correspond to different categories. The integers are just names, but they also let us reference a list of corresponding parameters, one for each category. In this case, we can construct our index like this:
```{python}
d['sex'] = np.where(d['male'] == 1, 2, 1)
d.head()
```
Now “1” means female and “2” means male. No order is implied. These are just labels. And the mathematical version of the model becomes:
$$
\begin{align*}
h_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha_{\text{sex}[i]} \\
\alpha_j &\sim \text{Normal}(178,20) \quad \text{for } j=1 \ldots 2 \\
\sigma &\sim \text{Uniform}(0,50) \\
\end{align*}
$$
What this does is create a list of $\alpha$ parameters, one for each unique value in the index variable. So in this case we end up with two $\alpha$ parameters, named $\alpha_1$ and $\alpha_2$. The numbers correspond to the values in the index variable `sex`. I know this seems overly complicated, but it solves our problem with the priors. Now the same prior can be assigned to each, corresponding to the notion that all the categories are the same, prior to the data. Neither category has more prior uncertainty than the other. And as you’ll see in a bit, this approach extends effortlessly to contexts with more than two categories.
Let’s approximate the posterior for the above model, the one using an index variable.
```{python}
m5_8 = """
data {
int<lower=1> N;
vector[N] height;
array[N] int sex;
}
parameters {
vector[2] a;
real<lower=0, upper=50> sigma;
}
transformed parameters {
vector[N] mu;
mu = a[sex];
}
model {
height ~ normal(mu, sigma);
sigma ~ uniform(0, 50);
a[1] ~ normal(178, 20);
a[2] ~ normal(178, 20);
}
generated quantities {
real diff_fm;
diff_fm = a[1] - a[2];
}
"""
data = {
'N': len(d),
'height': d.height.tolist(),
'sex': d.sex.tolist()
}
m5_8_model = utils.StanQuap('stan_models/m5_8', m5_8, data=data, generated_var=['diff_fm', 'mu'])
m5_8_model.precis().round(2)
```
Interpreting these parameters is easy enough—they are the expected heights in each category. But often we are interested in differences between categories. In this case, what is the expected difference between females and males? We can compute this using samples from the posterior. In fact, I’ll extract posterior samples into a data frame and insert our calculation directly into the same frame:
```{python}
post = m5_8_model.extract_samples(n=1000)
diff_fm = post['a'][:,0] - post['a'][:,1]
post_df = pd.DataFrame({'sigma': post['sigma'], 'a.1': post['a'][:,0], 'a.2': post['a'][:,1], 'diff_fm': diff_fm})
utils.precis(post_df).round(2)
```
Additionally, while setting up the model, I have added a `diff_fm` quantity inside the `generated quantities` block that directly computes the difference of $\alpha_1$ and $\alpha_2$ sampled from the posterior.
```{python}
post = m5_8_model.laplace_sample(draws=1000).stan_variables()
post_df = pd.DataFrame({'sigma': post['sigma'], 'a.1': post['a'][:,0], 'a.2': post['a'][:,1], 'diff_fm': post['diff_fm']})
utils.precis(post_df).round(2)
```
Our calculation appears at the bottom, as a new parameter in the posterior. This is the expected difference between a female and male in the sample. This kind of calculation is called a **contrast**. No matter how many categories you have, you can use samples from the posterior to compute the contrast between any two.
### Many categories
Binary categories are easy, whether you use an indicator variable or instead an index variable. But when there are more than two categories, the indicator variable approach explodes. You’ll need a new indicator variable for each new category. If you have $k$ unique categories, you need $k−1$ indicator variables. Automated tools do in fact go this route, constructing $k−1$ indicator variables for you and returning $k−1$ parameters (in addition to the intercept).
But we’ll instead stick with the index variable approach. It does not change at all when you add more categories. You do get more parameters, of course, just as many as in the indicator variable approach. But the model specification looks just like it does in the binary case. And the priors continue to be easier, unless you really do have prior information about contrasts. It is also important to get used to index variables, because multilevel models depend upon them.
Let’s explore an example using the primate milk data again. We’re interested now in the `clade` variable, which encodes the broad taxonomic membership of each species:
```{python}
d = pd.read_csv("data/milk.csv", sep=';')
d.clade.unique()
```
We want an index value for each of these four categories. You could do this by hand, but just coercing the factor to an integer will do the job:
```{python}
d['clade_id'] = d.clade.astype('category').cat.codes + 1
```
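For comparison only (a small sketch, not used in the models below), the indicator-variable route that automated tools take would expand `clade` into $k-1$ dummy columns, with the omitted category absorbed into the intercept:
```{python}
# k - 1 indicator columns for the 4 clades; one category becomes the reference
pd.get_dummies(d['clade'], drop_first=True).astype(int).head()
```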
Let’s use a model to measure the average milk energy in each clade. In math form:
$$
\begin{align*}
K_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha_{\text{CLADE}[i]} \\
\alpha_j &\sim \text{Normal}(0,0.5) \quad \text{for } j=1 \ldots 4 \\
\sigma &\sim \text{Exponential}(1) \\
\end{align*}
$$
Remember, $K$ is the standardized kilocalories. We have widened the prior on $\alpha$ a little, to allow the different clades to disperse, if the data wants them to. But I encourage you to play with that prior and repeatedly re-approximate the posterior so you can see how the posterior differences among the categories depend upon it. Firing up `StanQuap` now:
```{python}
d['K'] = utils.standardize(d['kcal.per.g'])
m5_9 = """
data {
int N;
vector[N] K;
array[N] int clade_id;
}
parameters {
real<lower=0> sigma;
vector[4] a;
}
transformed parameters {
vector[N] mu;
mu = a[clade_id];
}
model {
K ~ normal(mu, sigma);
sigma ~ exponential(1);
a ~ normal(0, .5);
}
"""
data = {
'N': len(d),
'K': d.K.tolist(),
'clade_id': d.clade_id.tolist()
}
m5_9_model = utils.StanQuap('stan_models/m5_9', m5_9, data=data, generated_var=['mu'])
summary = m5_9_model.precis()
summary.round(2)
```
```{python}
def plot_forest():
    # Forest plot: posterior means and 89% intervals for each clade intercept
    params = ['a.4', 'a.3', 'a.2', 'a.1']
    labels = ['a.4: Strepsirrhine', 'a.3: Old World Monkey',
              'a.2: New World Monkey', 'a.1: Ape']
    y_range = np.arange(4)
    for y, p in zip(y_range, params):
        plt.hlines(y, xmin=summary['5.5%'][p], xmax=summary['94.5%'][p], linewidth=2)
    mean_range = np.array([summary['Mean'][p] for p in params])
    plt.plot(mean_range, y_range, 'o', fillstyle='none')
    plt.axvline(0, linestyle='--', linewidth=0.3)
    plt.yticks(y_range, labels=labels)
    plt.xlabel('Expected kcal (std)')
    plt.show()

plot_forest()
```
In practice, we have to be very careful to keep track of which index values go with which categories.
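One quick way to recover that mapping here (a small check, since `clade_id` was built from the category codes above):
```{python}
# Integer index -> clade label, matching clade_id = cat.codes + 1
dict(enumerate(d['clade'].astype('category').cat.categories, start=1))
```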
If you have another kind of categorical variable that you’d like to add to the model, the approach is just the same. For example, let’s randomly assign these primates to some made up categories: \[1\] Gryffindor, \[2\] Hufflepuff, \[3\] Ravenclaw, and \[4\] Slytherin.
```{python}
np.random.seed(63)
d['house'] = np.random.choice(np.tile(np.arange(1, 5), 8), len(d), replace=False)
d.head()
```
Now we can include these categories as another predictor in the model:
```{python}
m5_10 = """
data {
int N;
vector[N] K;
array[N] int clade_id;
array[N] int house;
}
parameters {
real<lower=0> sigma;
vector[4] a;
vector[4] h;
}
transformed parameters {
vector[N] mu;
mu = a[clade_id] + h[house];
}
model {
K ~ normal(mu, sigma);
sigma ~ exponential(1);
a ~ normal(0, .5);
h ~ normal(0, .5);
}
"""
data = {
'N': len(d),
'K': d.K.tolist(),
'clade_id': d.clade_id.tolist(),
'house': d.house.tolist()
}
m5_10_model = utils.StanQuap('stan_models/m5_10', m5_10, data=data, generated_var=['mu'])
summary = m5_10_model.precis()
summary.round(2)
```
If you inspect the posterior, you’ll see that \[h.1\] Gryffindor and \[h.4\] Slytherin stand out.
------------------------------------------------------------------------
#### Differences and statistical significance
A common error in interpretation of parameter estimates is to suppose that because one parameter is sufficiently far from zero—is “significant”—and another parameter is not—is “not significant”—that the difference between the parameters is also significant. This is not necessarily so. This isn’t just an issue for non-Bayesian analysis: If you want to know the distribution of a difference, then you must compute that difference, a **contrast**. It isn’t enough to just observe, for example, that a slope among males overlaps a lot with zero while the same slope among females is reliably above zero. You must compute the posterior distribution of the difference in slope between males and females. For example, suppose you have posterior distributions for two parameters, $\beta_f$ and $\beta_m$. $\beta_f$’s mean and standard deviation is $0.15±0.02$, and $\beta_m$’s is $0.02±0.10$. So while $\beta_f$ is reliably different from zero (“significant”) and $\beta_m$ is not, the difference between the two (assuming they are uncorrelated) is $(0.15−0.02) ± \sqrt{0.02^2 + 0.1^2} ≈ 0.13 ± 0.10.$ The distribution of the difference overlaps a lot with zero. In other words, you can be confident that $\beta_f$ is far from zero, but you cannot be sure that the difference between $\beta_f$ and $\beta_m$ is far from zero.
In the context of non-Bayesian significance testing, this phenomenon arises from the fact that statistical significance is inferentially powerful in one way: difference from the null. When $\beta_m$ overlaps with zero, it may also overlap with values very far from zero. Its value is uncertain. So when you then compare $\beta_m$ to $\beta_f$, that comparison is also uncertain, manifesting in the width of the posterior distribution of the difference $\beta_f−\beta_m$. Lurking underneath this example is a more fundamental mistake in interpreting statistical significance: The mistake of accepting the null hypothesis. Whenever an article or book says something like “we found no difference” or “no effect,” this usually means that some parameter was not significantly different from zero, and so the authors adopted zero as the estimate. This is both illogical and extremely common.
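A minimal simulation of the contrast arithmetic described above, treating the two posteriors as approximately Gaussian and uncorrelated (illustrative only, not drawn from any fitted model):
```{python}
# Approximate the two marginal posteriors by Gaussians with the stated means/SDs
beta_f = stats.norm.rvs(loc=0.15, scale=0.02, size=int(1e4))
beta_m = stats.norm.rvs(loc=0.02, scale=0.10, size=int(1e4))
diff = beta_f - beta_m
utils.precis(pd.DataFrame({'beta_f': beta_f, 'beta_m': beta_m, 'diff': diff})).round(2)
```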
------------------------------------------------------------------------
## Summary
This chapter introduced multiple regression, a way of constructing descriptive models for how the mean of a measurement is associated with more than one predictor variable. The defining question of multiple regression is: *What is the value of knowing each predictor, once we already know the other predictors?* The answer to this question does not by itself provide any causal information. Causal inference requires additional assumptions. Simple directed acyclic graph (DAG) models of causation are one way to represent those assumptions. In the next chapter we’ll continue building the DAG framework and see how adding predictor variables can create as many problems as it can solve.
## Practice Problems
### 5E1
Which of the linear models below are multiple linear regressions?
(2) $\mu_i = \beta_x x_i + \beta_z z_i$ and (4) $\mu_i = \alpha + \beta_x x_i + \beta_z z_i$
Only (2) and (4) are multiple linear regressions. Both have more than one predictor variable and corresponding coefficients in the linear model. Model (1) has only a single predictor variable, $x$. Model (3) has two predictor variables, but only their difference for each case enters the model, so effectively it is a univariate regression with a single slope parameter.
### 5E2
Write down a multiple regression to evaluate the claim: *Animal diversity is linearly related to latitude, but only after controlling for plant diversity*. You just need to write down the model definition.
A verbal model statement like this will always be somewhat ambiguous. That is why mathematical notation is needed in scientific communication. However, the conventional interpretation of the statement would be:
$$
\begin{align*}
A_i &\sim \text{Normal}(\mu_i , \sigma) \\
\mu_i &= \alpha + \beta_L L_i + \beta_P P_i \\
\end{align*}
$$
where A is animal diversity, L is latitude, and P is plant diversity. This linear model “controls” for plant diversity, while estimating a linear relationship between latitude and animal diversity.
### 5E3
Write down a multiple regression to evaluate the claim: *Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree*. Write down the model definition and indicate which side of zero each slope parameter should be on.
Define $T$ as time to PhD degree, the outcome variable implied by the problem. Define $F$ as amount of funding and $S$ as size of laboratory, the implied predictor variables. Then the model (ignoring priors) might be:
$$
\begin{align*}
T_i &\sim \text{Normal}(\mu_i , \sigma) \\
\mu_i &= \alpha + \beta_F F_i + \beta_S S_i \\
\end{align*}
$$
The slopes $\beta_F$ and $\beta_S$ should both be positive. How can both be positively associated with the outcome in a multiple regression, but neither by itself? If they are negatively correlated with one another, then considering each alone may miss the positive relationships with the outcome. For example, large labs have less funding per student. Small labs have more funding per student, but poorer intellectual environments. So both could be positive influences on time to degree, but be negatively associated in nature.
### 5E4
Suppose you have a single categorical predictor with 4 levels (unique values), labeled $A$, $B$, $C$ and $D$. Let $A_i$ be an indicator variable that is $1$ where case $i$ is in category $A$. Also suppose $B_i$, $C_i$, and $D_i$ for the other categories. Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression? Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model.
1. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i$
2. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i$
3. $\mu_i = \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i$
4. $\mu_i = \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$
5. $\mu_i = \alpha_A (1-B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$
First, the answer will actually depend upon the priors, which aren’t mentioned in the problem. But assuming weakly informative or flat priors, the answer is that (1), (3), (4), and (5) are inferentially equivalent. They’ll make the same predictions, and you can convert among them after model fitting. (2) stands out because it has a redundant parameter, the intercept $α$.
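For example, the posterior of model (4) can be recovered from the posterior of model (1), where category $C$ is the reference, by computing sample by sample:
$$
\alpha_A = \alpha + \beta_A, \qquad \alpha_B = \alpha + \beta_B, \qquad \alpha_C = \alpha, \qquad \alpha_D = \alpha + \beta_D
$$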
### 5M1
Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).
There are many good answers to this question. The easiest approach is to think of some context that follows the divorce rate pattern in the chapter: one predictor influences both the outcome and the other predictor. For example, we might consider predicting whether or not a scientific study replicates. Two predictor variables are available: (1) sample size and (2) statistical significance. Sample size influences both statistical significance and reliability of a finding. This induces a correlation between significance and successful replication, even though significance is not associated with replication, once sample size is taken into account.
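A hedged simulation sketch of this story (entirely made-up numbers): sample size drives both the chance of a significant result and the chance of successful replication, so the two end up correlated even though they are independent once sample size is known.
```{python}
# Made-up simulation: z-statistics scale with sqrt(N); "significant" and
# "replicates" are independent given N, yet correlated marginally.
np.random.seed(11)
n_studies = 1000
N = np.random.choice([20, 50, 100, 200, 500], size=n_studies)
effect = 0.3
z_original = stats.norm.rvs(loc=effect * np.sqrt(N), scale=1, size=n_studies)
z_replication = stats.norm.rvs(loc=effect * np.sqrt(N), scale=1, size=n_studies)
sim = pd.DataFrame({
    'N': N,
    'significant': (z_original > 1.64).astype(int),
    'replicates': (z_replication > 1.64).astype(int)
})
sim.corr().round(2)
```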
### 5M2
Invent your own example of a masked relationship. An outcome variable should be correlated with both predictor variables, but in opposite directions. And the two predictor variables should be correlated with one another.
Again, many good answers are possible. The pattern from the milk energy example in the chapter is the simplest. Consider for example the influences of income and drug use on health. Income is positively associated, in reality, with health. Drug use is, for the sake of the example, negatively associated with health. But wealthy people consume more drugs than the poor, simply because the wealthy can afford them. So income and drug use are positively associated in the population. If this positive association is strong enough, examining either income or drug use alone will show only a weak relationship with health, because each works on health in opposite directions.
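A hedged simulation sketch of this masking pattern (made-up coefficients): income and drug use are positively correlated, they push health in opposite directions, and each pairwise correlation with health comes out weak.
```{python}
# Made-up coefficients, purely illustrative of masking
np.random.seed(7)
n = 500
income = stats.norm.rvs(size=n)
drug_use = stats.norm.rvs(loc=0.7 * income, scale=np.sqrt(1 - 0.7**2), size=n)
health = stats.norm.rvs(loc=income - drug_use, scale=1, size=n)
sim = pd.DataFrame({'income': income, 'drug_use': drug_use, 'health': health})
sim.corr().round(2)  # weak pairwise correlations with health, strong income-drug_use correlation
```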
### 5M3
It is sometimes observed that the best predictor of fire risk is the presence of firefighters— States and localities with many firefighters also have more fires. Presumably firefighters do not cause fires. Nevertheless, this is not a spurious correlation. Instead fires cause firefighters. Consider the same reversal of causal inference in the context of the divorce and marriage data. How might a high divorce rate cause a higher marriage rate? Can you think of a way to evaluate this relationship, using multiple regression?
Divorce might lead to, or be in expectation of, remarriage. Thus divorce could cause marriage rate to rise. In order to examine this idea, or another like it, the data would need to be structured into more categories, such as remarriage rate versus first marriage rate. Better yet would be longitudinal data. In many real empirical contexts, causation involves feedback loops that can render regression fairly useless, unless some kind of time series framework is used.
### 5M4
In the divorce data, States with high numbers of members of the Church of Jesus Christ of Latter-day Saints (LDS) have much lower divorce rates than the regression models expected. Find a list of LDS population by State and use those numbers as a predictor variable, predicting divorce rate using marriage rate, median age at marriage, and percent LDS population (possibly standardized). You may want to consider transformations of the raw percent LDS variable.
It is worth finding and entering the values yourself, for the practice at data management. But here are the values I found, scraped from Wikipedia and merged into the original data:
```{python}
# load data
d = pd.read_csv("data/WaffleDivorce.csv", sep=';')
# standardize variables
d['D'] = utils.standardize(d.Divorce)
d['M'] = utils.standardize(d.Marriage)
d['A'] = utils.standardize(d.MedianAgeMarriage)
d['pct_LDS'] = [0.75, 4.53, 6.18, 1, 2.01, 2.82, 0.43, 0.55, 0.38,
0.75, 0.82, 5.18, 26.35, 0.44, 0.66, 0.87, 1.25, 0.77, 0.64, 0.81,
0.72, 0.39, 0.44, 0.58, 0.72, 1.14, 4.78, 1.29, 0.61, 0.37, 3.34,
0.41, 0.82, 1.48, 0.52, 1.2, 3.85, 0.4, 0.37, 0.83, 1.27, 0.75,
1.21, 67.97, 0.74, 1.13, 3.99, 0.92, 0.44, 11.5]
d.head()
```
A first regression model including this variable might be:
```{python}
m_5M4 = """
data {
int N;
vector[N] Divorce;
vector[N] Marriage;
vector[N] MedianAgeMarriage;
vector[N] pct_LDS;
}
parameters {
real<lower=0, upper=10> sigma;
real a;
real bR;
real bA;
real bM;
}
transformed parameters {
vector[N] mu;
mu = a + bR * Marriage + bA * MedianAgeMarriage + bM * pct_LDS;
}
model {
Divorce ~ normal(mu, sigma);
sigma ~ uniform(0, 10);
a ~ normal(0, 100);
bR ~ normal(0, 10);
bA ~ normal(0, 10);
bM ~ normal(0, 10);
}
"""
data = {
'N': len(d),
'Divorce': d.Divorce.tolist(),
'Marriage': d.Marriage.tolist(),
'MedianAgeMarriage': d.MedianAgeMarriage.tolist(),
'pct_LDS': d.pct_LDS.tolist()
}
m_5M4_model = utils.StanQuap(
'stan_models/m_5M4', m_5M4,
data=data, generated_var=['mu'],
algorithm='LBFGS'
)
m_5M4_model.precis().round(2)
```
As expected, there is a negative association between percent LDS and divorce rate. This model assumes the relationship between divorce rate and percent LDS is linear. This makes sense if the LDS community has a lower divorce rate within itself only, and so as it makes up more of a State’s population, that State’s divorce rate declines. This is to say that the expected divorce rate of State $i$ is a “convex” mix of two average divorce rates:
$$
D_i = (1-P)D_G + PD_{LDS}
$$
where $D_i$ is the divorce rate for State $i$, $P$ is the proportion of the State’s population that is LDS, and $D_G$ and $D_{LDS}$ are the divorce rates for gentiles (non-LDS) and LDS members, respectively. If $D_G > D_{LDS}$, then as $P$ increases, the value of $D_i$ decreases linearly. But maybe the percent LDS in the population has a secondary impact as a marker of a State-level cultural environment that has lower divorce in more demographic groups than just LDS. In that case, this model will miss that impact. Can you think of a way to address this? (Perhaps we could examine how $D_G$ itself differs across States as a function of $1-P$, the proportion of the population that is not LDS.)
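As a hedged sketch of the transformation the problem hints at (not part of the model above): the raw percentages are heavily right-skewed, with a couple of extreme values (26.35 and 67.97) dominating the scale, so a log transform before standardizing may behave better as a linear predictor.
```{python}
# Hypothetical alternative predictor: standardized log of the LDS proportion
d['log_LDS'] = utils.standardize(np.log(d['pct_LDS'] / 100))
utils.precis(d[['pct_LDS', 'log_LDS']]).round(2)
```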
### 5M5
One way to reason through multiple causation hypotheses is to imagine detailed mechanisms through which predictor variables may influence outcomes. For example, it is sometimes argued that the price of gasoline (predictor variable) is positively associated with lower obesity rates (outcome variable). However, there are at least two important mechanisms by which the price of gas could reduce obesity. First, it could lead to less driving and therefore more exercise. Second, it could lead to less driving, which leads to less eating out, which leads to less consumption of huge restaurant meals. Can you outline one or more multiple regressions that address these two mechanisms? Assume you can have any predictor data you need.
This is an open-ended question with many good, expanding answers. Here’s the basic outline of an approach. The first two implied variables are the rate of obesity $O$ and the price of gasoline $P$. The first proposed mechanism suggests that higher price $P$ reduces driving $D$, which in turn increases exercise $X$, which then reduces obesity $O$. As a set of regressions, this mechanism implies:
1. $D$ as a declining function of $P$
2. $X$ as a declining function of $D$
3. $O$ as a declining function of $X$
In other words, for the mechanism to work, each predictor above needs to be negatively associated with each outcome. Note that each outcome becomes a predictor. That’s just how these causal chains look. A bunch of reasonable control variables could be added to each of the regressions above. Consider for example that a very wealthy person will be more insensitive to changes in price, so we might think to interact $P$ with income. The second proposed mechanism suggests that price $P$ reduces driving $D$, which reduces eating out $E$, which reduces obesity $O$. A similar chain of regressions is implied (a combined regression is sketched after the list below):
1. $D$ as a declining function of $P$
2. $E$ as an increasing function of $D$
3. $O$ as an increasing function of $E$
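One hedged way to combine the final links of both mechanisms into a single multiple regression of obesity on exercise and eating out (ignoring priors), where we would expect $\beta_X < 0$ and $\beta_E > 0$:
$$
\begin{align*}
O_i &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta_X X_i + \beta_E E_i \\
\end{align*}
$$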
### 5H1
In the divorce example, suppose the DAG is: $M →A →D$. What are the implied conditional independencies of the graph? Are the data consistent with it?
```{python}
dag_5H1 = CausalGraphicalModel(nodes=["A", "D", "M"], edges=[("M", "A"), ("A", "D")])
pgm = daft.PGM()
coordinates = {"A": (0, 0), "D": (1, 1), "M": (2, 0)}
for node in dag_5H1.dag.nodes:
pgm.add_node(node, node, *coordinates[node], plot_params={'linewidth': 0})
for edge in dag_5H1.dag.edges:
pgm.add_edge(*edge)
pgm.render()
plt.gca().invert_yaxis()
```
```{python}
print('DAG:', dag_5H1.get_distribution())
# get_all_independence_relationships() method
# Returns a list of all pairwise conditional independence relationships
# implied by the graph structure.
print(dag_5H1.get_all_independence_relationships())
```
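The only implied conditional independency is $D \perp\!\!\!\perp M \mid A$. To ask whether the data are consistent with it, we can regress $D$ on both $M$ and $A$ and check whether the coefficient on $M$ is concentrated near zero. A minimal sketch, assuming `d` still holds the standardized WaffleDivorce variables from 5M4 above and using a hypothetical model file `stan_models/m_5H1` (priors follow the chapter's divorce models):
```{python}
m_5H1 = """
data {
    int N;
    vector[N] D;
    vector[N] M;
    vector[N] A;
}
parameters {
    real a;
    real bM;
    real bA;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    mu = a + bM * M + bA * A;
}
model {
    D ~ normal(mu, sigma);
    a ~ normal(0, 0.2);
    bM ~ normal(0, 0.5);
    bA ~ normal(0, 0.5);
    sigma ~ exponential(1);
}
"""
data = {'N': len(d), 'D': d.D.tolist(), 'M': d.M.tolist(), 'A': d.A.tolist()}
m_5H1_model = utils.StanQuap('stan_models/m_5H1', m_5H1, data=data, generated_var=['mu'])
m_5H1_model.precis().round(2)
```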
### 5H2
Assuming that the DAG for the divorce example is indeed $M→A→D$, fit a new model and use it to estimate the counterfactual effect of halving a State’s marriage rate $M$. Use the counterfactual example from the chapter as a template.