---
title: 'POLS 503: Assignment 2'
date: '2017-04-21'
output:
  md_document: default
  html_document:
    includes:
      before_body: includes/before_body.html
  pdf_document:
    includes:
      in_header: includes/in_header.tex
    keep_tex: yes
bibliography: assignment2.bib
---
This assignment works through an example in @Yule1899a, an early published example of multiple regression analysis in its modern form.[^yule]
Yule wrote the paper to analyze the effect of policy changes and implementation on pauperism (the poor receiving benefits) in England under the [English Poor Laws](https://en.wikipedia.org/wiki/English_Poor_Laws). In 1834, a new poor law was passed that established a national welfare system in England and Wales. The New Poor Law created new administrative districts (Poor Law Unions) to administer the law. Most importantly, it attempted to standardize the provision of aid to the poor. There were two types of aid: in-relief, provided to paupers in the workhouses where they resided, and out-relief, provided to paupers residing at home. The New Poor Law aimed to decrease out-relief and increase in-relief in the belief that in-relief, in particular the quality of life in workhouses, was a deterrent to poverty and an encouragement for the poor to work harder to avoid it.
Yule identifies that there are various potential causes of the change in rate of pauperism, including changes in the (1) law, (2) economic conditions, (3) general social character, (4) moral character, (5) age distribution of the population (pg. 250).
He astutely notes the following:
> If, for example, we should find an increase in the proportion of out-relief associated with (1) an increase in the proportion of the aged to the whole population, and also (2) an increase in the rate of pauperism, it might be legitimate to interpret the result in the sense that changes in out-relief and pauperism were merely simultaneous concomitants of changes in the proportion of aged---the change of pauperism not being a direct consequence of the change of administration, but both direct consequences of the change in age distribution. It is evidently most important that we should be able to decide between two such different interpretations of the same facts. This the method I have used is perfectly competent to do --- @Yule1899a [pg. 250]
[^yule]: See @Freedman_1997, @Stigler1990a, @Stigler2016a, and @Plewis2017a for discussions of @Yule1899a.
# Setup
```{r message=FALSE}
library("tidyverse")
library("modelr")
# devtools::install_github("jrnold/resamplr")
library("resamplr")
```
While only a subset of the original data of @Yule1899a was printed in the article itself, @Plewis2015a reconstructed the original data and @Plewis2017a replicated the original paper. These data are included in the package **datums**. This package is not on CRAN, but can be downloaded from GitHub.
**IMPORTANT:** Install the latest version of **datums**, since a few fixes were recently made to the `pauperism` dataset.
```{r}
# devtools::install_github("jrnold/datums")
library("datums")
```
The data for @Yule1899a is split into two data frames: `pauperism_plu` contains data on the Poor Law Unions (PLU), and `pauperism_year`, panel data with the PLU-year as the unit of observation.
```{r}
pauperism <-
left_join(datums::pauperism_plu, datums::pauperism_year,
by = "ID") %>%
mutate(year = as.character(year))
```
The data consist of `r length(unique(pauperism$ID))` PLUs and the years: 1871, 1881, 1891 (years in which there was a UK census).
@Yule1899a is explicitly using regression for causal inference. The outcome variable of interest is:
- **Pauperism:** the percentage of the population in receipt of relief of any kind, less lunatics and vagrants
The treatment (policy intervention) is the ratio of numbers receiving outdoor relief to those receiving indoor relief.
- **Out-Relief Ratio:** the ratio of numbers relieved outdoors to those relieved indoors
He will control for two variables that may be associated with the treatment:
- **Proportion of Old:** the proportion of the aged (65 years and over) to the whole population, since the old are more likely to be poor.
- **Population:** in particular changes in population that may be proxying for changes in the economic, social, or moral factors of PLUs.
There is also **Grouping of Unions**, which is a locational classification based on population density that consists of Rural, Mixed, Urban, and Metropolitan.
Instead of taking differences or percentages, Yule worked with "percent ratio differences", $100 \times \frac{x_{t}}{x_{t-1}}$, because he did not want to work with negative signs, presumably a concern because he was doing arithmetic by hand, and negative numbers would make the calculations more tedious or error-prone.
## Original Specification
Run regressions of `pauper` using the yearly level data with the following specifications.
In @Yule1899a, the regressions are
- *M1:* `paupratiodiff ~ outratiodiff + year + Type`
- *M2:* `paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type)`
- *M3:* `paupratiodiff ~ -1 + (outratiodiff + popratiodiff + oldratiodiff) * (year + Type)`
- *M4:* `paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type)`
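As a sketch, the four specifications can be fit as follows (this assumes the merged `pauperism` data frame from the setup chunk above; in R formulas, the intercept is suppressed with `-1` on the right-hand side):

```r
# A sketch: fit the four specifications from Yule (1899), assuming the
# merged `pauperism` data frame from the setup chunk above.
m1 <- lm(paupratiodiff ~ outratiodiff + year + Type, data = pauperism)
m2 <- lm(paupratiodiff ~ outratiodiff +
           (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
# M3 drops the intercept; the `-1` goes on the right-hand side
m3 <- lm(paupratiodiff ~ -1 +
           (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m4 <- lm(paupratiodiff ~
           (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
```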
1. Present the regression results in a regression table.
2. Interpret the coefficients for `outratiodiff` for each model.
3. Write the equations for each or all models, and describe each model with a sentence or two. Try to be as concise as possible. Look at recent journal articles for examples of the wording and format.
4. What is the difference between *M3* and *M4*? What are the pros and cons of each parameterization?
5. Conduct F-tests of the hypotheses:
    1. All interactions in *M4* are 0.
    2. The coefficients on `outratiodiff` in *M4* are the same across years.
    3. The coefficients on `outratiodiff` in *M4* are the same across PLU Types.
    4. The coefficients on `outratiodiff` in *M4* are the same across PLU Types and years.

    You can conduct F-tests with the function `anova(mod_unrestricted, mod_restricted)`.
6. Calculate the predicted value and confidence interval for the PLU with the median value of `outratiodiff`, `popratiodiff`, and `oldratiodiff` in each year and PLU Type for these models. Plot these predicted values and confidence intervals as point-ranges.
7. As before, calculate the predicted value of the median PLU in each year and PLU Type, but include the prediction interval instead of the confidence interval. How do confidence and prediction intervals differ? What are their definitions?
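For example, the first F-test and the interval calculations might look like this (a sketch: `m4` is the model above, and the `newdata` values are illustrative placeholders, not the actual medians):

```r
# F-test that all interactions in M4 are zero: compare M4 against the
# restricted model with main effects only.
m4 <- lm(paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m4_main <- lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff + year + Type,
              data = pauperism)
anova(m4_main, m4)

# Confidence vs. prediction intervals for a hypothetical PLU;
# replace the placeholder values with the medians by year and Type.
newdata <- tibble(outratiodiff = 100, popratiodiff = 100, oldratiodiff = 100,
                  year = "1881", Type = "Urban")
predict(m4, newdata = newdata, interval = "confidence")
predict(m4, newdata = newdata, interval = "prediction")
```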
## Functional Forms
The regression line of the model estimated in @Yule1899a (ignoring the year and region terms and interactions) can also be written as
$$
\begin{aligned}[t]
100 \times \frac{\mathtt{pauper2}_t / \mathtt{Popn2}_t}{\mathtt{pauper2}_{t-1} / \mathtt{Popn2}_{t-1}}
&= \beta_0 + \beta_1 \times 100 \times \frac{\mathtt{outratio}_t}{\mathtt{outratio}_{t-1}} \\
& \quad + \beta_2 \times 100 \times \frac{\mathtt{Popn65}_t / \mathtt{Popn2}_{t}}{\mathtt{Popn65}_{t-1} / \mathtt{Popn2}_{t-1}} + \beta_3 \times 100 \times \frac{\mathtt{Popn2}_t}{\mathtt{Popn2}_{t-1}}
\end{aligned}
$$
1. Write a model that includes only the log differences ($\log(x_t) - \log(x_{t - 1})$) with only the `pauper2`, `outratio`, `Popn2`, and `Popn65` variables.
2. Estimate the model with logged difference predictors, year, and Type, and interpret the coefficient on $\log(outratio_t)$.
3. What are the pros and cons of this parameterization of the model relative to the one in @Yule1899a? Focus on interpretation and the desired goal of the inference rather than the formal tests of the regression. Can you think of other, better functional forms?
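One way to sketch the log-difference specification (assuming the `pauper2`, `outratio`, `Popn2`, and `Popn65` columns in the yearly data, and that the panel can be lagged within each PLU by year):

```r
# A sketch: construct log differences within each PLU, then fit the model.
# Zero values of outratio would produce infinite logs; handle as needed.
pauperism_log <-
  pauperism %>%
  group_by(ID) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(dlog_pauper   = log(pauper2)  - log(lag(pauper2)),
         dlog_outratio = log(outratio) - log(lag(outratio)),
         dlog_pop      = log(Popn2)    - log(lag(Popn2)),
         dlog_old      = log(Popn65)   - log(lag(Popn65))) %>%
  ungroup()

lm(dlog_pauper ~ dlog_outratio + dlog_pop + dlog_old + year + Type,
   data = pauperism_log)
```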
## Non-differenced Model
Suppose you estimate the model (*M5*) without differencing,
```
pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type)
```
- Interpret the coefficient on `outratio`. How is this different than model *M2*?
- What accounts for the difference in sample sizes in *M5* and *M2*?
- What model do you think will generally have less biased estimates of the effect of out-relief on pauperism: *M5* or *M2*? Explain your reasoning.
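A sketch of *M5* next to *M2*, with a quick check of the sample sizes:

```r
# A sketch: the non-differenced model M5 and the differenced model M2.
m5 <- lm(pauper2 ~ outratio + (Popn2 + Prop65) * (year + Type),
         data = pauperism)
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
# Differencing drops the first census year for each PLU (no lag available),
# so M2 is fit on fewer observations than M5.
nobs(m5)
nobs(m2)
```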
## Substantive Effects
Read @Gross2014a and @McCaskeyRainey2015a. Use the methods described in those papers to assess the substantive effects of out-ratio on the rate of pauperism. Use the model(s) of your choosing.
## Influential Observations and Outliers
### Influential Observations for the Regression
For this use *M2*:
1. For each observation, calculate and explain the following:
- hat value (`hatvalues`)
- standardized error (`rstandard`)
- studentized error (`rstudent`)
- Cook's distance (`cooksd`)
2. Create an outlier plot and label any outliers. See the example [here](https://jrnold.github.io/intro-methods-notes/outliers.html#iver-and-soskice-data)
3. Using the plot and rules of thumb identify outliers and influential observations
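These diagnostics can be sketched as follows (note that the base R function for Cook's distance is `cooks.distance`; `broom::augment` returns the same quantities as `.hat`, `.std.resid`, and `.cooksd`):

```r
# A sketch: influence diagnostics for M2, one value per observation.
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m2_diag <- tibble(
  hat    = hatvalues(m2),      # leverage
  rstd   = rstandard(m2),      # standardized residual
  rstu   = rstudent(m2),       # studentized residual
  cooksd = cooks.distance(m2)  # Cook's distance
)

# An outlier plot: leverage vs. studentized residual, sized by Cook's distance.
ggplot(m2_diag, aes(x = hat, y = rstu, size = cooksd)) +
  geom_point() +
  labs(x = "hat value", y = "studentized residual")
```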
## Influential Observations for a Coefficient
1. Run *M2*, deleting each observation and saving the coefficient for `outratiodiff`. This is a method called the jackknife. You can use a for loop to do this, or you can use the function `jackknife` in the package [resamplr](https://github.com/jrnold/resamplr).
1. For which observations is there the largest change in the coefficient on `outratiodiff`?
2. Which observations have the largest effect on the estimate of `outratiodiff`?
3. How do these observations compare with those that had the largest effect on the overall regression as measured with Cook's distance?
4. Compare the results of the jackknife to the `dfbeta` statistic for `outratiodiff`
2. @AronowSamii2015a note that an observation's influence on a particular regression coefficient is different from its influence on the regression as a whole. Calculate the observation weights for `outratiodiff`.
    1. Regress `outratiodiff` on the control variables.
    2. The effective observation weights are proportional to the squared residuals from this regression. Which observations have the highest weights?
    3. How do the observations with the highest regression weights compare with those with the largest changes in the regression coefficient from the jackknife?
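A sketch of the jackknife and the Aronow-Samii weights (the weight regression here regresses the treatment on the same controls as *M2*; adjust as the reading suggests):

```r
# A sketch: delete-one jackknife for the `outratiodiff` coefficient in M2.
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
m2_data <- model.frame(m2)
jk <- map_dbl(seq_len(nrow(m2_data)), function(i) {
  coef(update(m2, data = m2_data[-i, ]))[["outratiodiff"]]
})
# Change in the coefficient from deleting each observation:
jk_change <- jk - coef(m2)[["outratiodiff"]]
# Compare with the analytic delete-one change reported by dfbeta:
dfb <- dfbeta(m2)[, "outratiodiff"]

# Aronow-Samii style weights: squared residuals from regressing the
# treatment on the controls.
w <- residuals(lm(outratiodiff ~ (popratiodiff + oldratiodiff) * (year + Type),
                  data = m2_data))^2
```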
## Omitted Variable Bias
An informal way to assess the potential impact of omitted variables on the coefficient of the variable of interest is to examine how that coefficient changes when covariates are added, as a measure of the potential for omitted variable bias [@Oster2016a].
@NunnWantchekon2011a (Table 4) calculate a simple statistic for omitted variable bias in OLS. This statistic "provide[s] a measure to gauge the strength of the likely bias arising from unobservables: how much stronger selection on unobservables, relative to selection on observables, must be to explain away the full estimated effect."
1. Run a regression without any controls. Denote the coefficient on the variable of interest as $\hat\beta_R$.
2. Run a regression with the full set of controls. Denote the coefficient on the variable of interest in this regression as $\hat\beta_F$.
3. The statistic is the ratio $\hat\beta_F / (\hat\beta_R - \hat\beta_F)$.
Calculate this statistic for *M2* and interpret it.
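As a sketch, for the `outratiodiff` coefficient:

```r
# A sketch: the Nunn-Wantchekon ratio for the `outratiodiff` coefficient.
# beta_R: no controls; beta_F: the full M2 specification.
beta_R <- coef(lm(paupratiodiff ~ outratiodiff,
                  data = pauperism))[["outratiodiff"]]
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
beta_F <- coef(m2)[["outratiodiff"]]
beta_F / (beta_R - beta_F)
```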
## Heteroskedasticity
### Robust Standard Errors
1. Run *M2* and *M3* with heteroskedasticity-consistent (HC, or "robust") standard errors. How does this affect the standard errors on the `outratio` coefficients? Use the **sandwich** package to compute the robust standard errors [@Zeileis2004a].
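One way to sketch this (using `lmtest::coeftest` to display the coefficient table, an assumption about workflow rather than a requirement):

```r
# A sketch: robust (HC) standard errors for M2 via sandwich + lmtest.
library("sandwich")
library("lmtest")
m2 <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
         data = pauperism)
coeftest(m2, vcov = sandwich::vcovHC(m2))
```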
### Multiple Regressions
1. Run the model with interactions for all years and types
```r
lm(pauper2 ~ (outratio + Popn2 + Prop65) * year * Type - 1, data = pauperism)
```
2. For each subset of `year` and `type` run the regression
```r
lm(pauper2 ~ outratio + Popn2 + Prop65)
```
3. Compare the coefficients, standard errors, and regression standard errors in these regressions.
To run the multiple regressions, save models as a list column `mod`,
then save the results of `glance` and `tidy` as list columns:
```{r}
all_interact <-
crossing(Type = pauperism$Type, year = c(1881, 1891)) %>%
mutate(mod = map2(year, Type,
function(yr, ty) {
lm(paupratiodiff ~ outratiodiff + popratiodiff + oldratiodiff,
data = filter(pauperism,
year == yr,
Type == ty))
})) %>%
mutate(mod_glance = map(mod, broom::glance),
mod_tidy = map(mod, broom::tidy))
```
Now extract parts of the models, e.g., the standard errors of the regressions:
```{r}
all_interact %>%
mutate(sigma = map_dbl(mod_glance, function(x) x$sigma)) %>%
select(year, Type, sigma)
```
## Weighted Regression
1. Run *M2* and *M3* as weighted regressions, weighted by the population (`Popn`) and interpret the coefficients on `outratiodiff` and interactions. Informally assess the extent to which the coefficients are different. Which one does it seem to affect more?
2. What are some rationales for weighting by population? See the discussion in @SolonHaiderWooldridge2013a and @AngristPischke2014a.
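A sketch of the weighted version of *M2* (the population column is `Popn`, per the task above):

```r
# A sketch: M2 weighted by PLU population.
m2_wt <- lm(paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
            data = pauperism, weights = Popn)
summary(m2_wt)
```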
## Cross-Validation
When using regression for causal inference, model specification and choice should largely be based on avoiding omitted variables.
Another criterion for selecting a model is its fit to the data.
But a model's fit should not be assessed using only the in-sample data.
That leads to overfitting---by that criterion, the best model would always be one that includes an indicator variable for every observation.
Instead, a model's fit should be assessed by its out-of-sample fit.
One way to estimate the *expected* fit of a model to *new* data is cross-validation.
We want to compare the predictive performance of the following models:
```{r}
mod_formulas <-
list(
m0 = paupratiodiff ~ 1,
m1 = paupratiodiff ~ year + Type,
m2 = paupratiodiff ~ outratiodiff + year + Type,
m3 = paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * (year + Type),
m4 = paupratiodiff ~ -1 + (outratiodiff + popratiodiff + oldratiodiff) * (year + Type),
m5 = paupratiodiff ~ (outratiodiff + popratiodiff + oldratiodiff) * year * Type
)
```
Let's split the data into 10 (train/test) folds for cross-validation,
```{r}
pauperism_nonmiss <-
pauperism %>%
filter(year %in% c(1881, 1891)) %>%
select(paupratiodiff, outratiodiff, popratiodiff, oldratiodiff, year, Type, Region, ID) %>%
tidyr::drop_na()
pauperism_10folds <-
pauperism_nonmiss %>%
resamplr::crossv_kfold(10)
```
For each model formula `f`, training data set `train`, and test data set `test`,
run the model specified by `f` on `train`, predict new observations in `test`, and calculate the RMSE from the residuals:
```{r}
mod_rmse_fold <- function(f, train, test) {
fit <- lm(f, data = as.data.frame(train))
test_data <- as.data.frame(test)
err <- test_data$paupratiodiff - predict(fit, newdata = test_data)
sqrt(mean(err ^ 2))
}
```
E.g. for one fold and formula,
```{r}
mod_rmse_fold(mod_formulas[[1]], pauperism_10folds$train[[1]],
pauperism_10folds$test[[1]])
```
Now write a function that will calculate the average RMSE across folds for a formula and a cross-validation data frame with `train` and `test` list-columns:
```{r}
mod_rmse <- function(f, data) {
map2_dbl(data$train, data$test,
function(train, test) {
mod_rmse_fold(f, train, test)
}) %>%
mean()
}
```
```{r}
mod_rmse(mod_formulas[[1]], pauperism_10folds)
```
Finally, we want to run `mod_rmse` for each formula in `mod_formulas`.
It will be easiest to store this in a data frame:
```{r}
cv_results <- tibble(
model_formula = mod_formulas,
.id = names(mod_formulas),
# Formula as a string
.name = map(model_formula,
function(x) gsub(" +", " ", paste0(deparse(x), collapse = "")))
)
```
Use `map` to run `mod_rmse` for each model and save the results as a list column in the data frame,
```{r}
cv_results <-
mutate(cv_results,
cv10_rmse = map(model_formula, mod_rmse, data = pauperism_10folds))
```
In the case of linear regression, the MSE of the Leave-one-out ($n$-fold) cross-validation can be analytically calculated without having to run $n$ regressions.
```{r}
loocv <- function(x) {
mean((residuals(x) / (1 - hatvalues(x))) ^ 2)
}
```
We can apply this to each formula:
```{r}
cv_results <-
mutate(cv_results,
rmse_loo = map(mod_formulas, function(f) sqrt(loocv(lm(f, data = pauperism_nonmiss)))))
```
1. In the 10-fold cross-validation, which model has the best out-of-sample prediction?
2. Using leave-one-out cross-validation, which model has the best out-of-sample prediction?
3. Do the prediction metric (RMSE) and prediction task---predicting individual PLUs from other PLUs---make sense? Can you think of others that you would prefer?
## Bootstrapping
Estimate 95% confidence intervals for the model using simple non-parametric bootstrapped standard errors. The non-parametric bootstrap works as follows:
Let $\hat\theta$ be the estimate of a statistic. To calculate bootstrapped standard errors and confidence intervals use the following procedure.
For samples $b = 1, ..., B$.
1. Draw a sample with replacement from the data
2. Estimate the statistic of interest and call it $\theta_b^*$.
Let $\theta^* = \{\theta_1^*, \dots, \theta_B^*\}$ be the set of bootstrapped statistics.
- standard error: the bootstrap standard error of $\hat\theta$ is $\mathrm{sd}(\theta^*)$.
- confidence interval:
    - normal approximation: the confidence interval is calculated as usual, but uses the bootstrapped standard error instead of the classical OLS standard error: $\hat\theta \pm t_{\alpha/2,df} \cdot \mathrm{sd}(\theta^*)$
    - quantiles: A 95% confidence interval uses the 2.5% and 97.5% quantiles of $\theta^*$ for its lower and upper bounds.
Original model
```{r}
mod_formula <- paupratiodiff ~ outratiodiff + (popratiodiff + oldratiodiff) * year * Type
mod_orig <- lm(mod_formula, data = pauperism_nonmiss)
```
```{r}
bs_coef_se <-
resamplr::bootstrap(pauperism_nonmiss, 1024) %>%
# extract the strap column
`[[`("sample") %>%
# run
map_df(function(dat) {
lm(mod_formula, data = dat) %>%
broom::tidy() %>%
select(term, estimate)
}) %>%
# calculate 2.5%, 97.5% and sd of estimates
group_by(term) %>%
summarise(
std.error_bs = sd(estimate),
    conf.low_bsq = quantile(estimate, 0.025),
    conf.high_bsq = quantile(estimate, 0.975)
)
```
Now compare the standard errors of the original model and the bootstrap for `outratiodiff`:
```{r}
broom::tidy(mod_orig, conf.int = TRUE) %>%
select(term, estimate, std.error) %>%
filter(term == "outratiodiff") %>%
left_join(bs_coef_se, by = "term")
```
The bootstrap standard error is slightly higher.
It is similar to the standard error generated using the heteroskedasticity consistent standard error.
```{r}
sqrt(sandwich::vcovHC(mod_orig)["outratiodiff", "outratiodiff"])
```
It is likely that there is correlation between the error terms of observations.
At the very least, each PLU is included twice; these observations are likely
correlated, so we are effectively overstating the sample size of our data.
One way to account for that is to resample "PLUs", not PLU-years.
This cluster-bootstrap will resample each PLU (and all its observations), rather than resampling the observations themselves.
```{r}
pauperism_nonmiss %>%
group_by(ID) %>%
resamplr::bootstrap(1024) %>%
# extract the strap column
`[[`("sample") %>%
# run
map_df(function(dat) {
lm(mod_formula, data = dat) %>%
broom::tidy() %>%
select(term, estimate)
}) %>%
# calculate 2.5%, 97.5% and sd of estimates
group_by(term) %>%
summarise(
std.error_bs = sd(estimate),
    conf.low_bsq = quantile(estimate, 0.025),
    conf.high_bsq = quantile(estimate, 0.975)
) %>%
filter(term == "outratiodiff")
```
However, this yields a standard error not much different from the robust standard error.
1. Try bootstrapping by "Region" and "BoothGroup". Does either of these make much difference in the standard errors?
## References