# Statistics {#statistics}
```{r setup7, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
prompt = FALSE,
tidy = TRUE,
collapse = TRUE)
library("tidyverse")
library("cowplot")
```
In earlier chapters, we learned to use Excel to construct common
[univariate statistics and charts](#basic-data-analysis-with-excel). We also
learned the basics of [probability theory](#probability-and-random-events), and
working with [simple](#random-variables) or [complex](#more-on-random-variables)
random variables. The next step is to bring these concepts together, and apply
the theoretical tools of probability and random variables to statistics
calculated from data.
This chapter will develop the theory of mathematical statistics, which treats
our data set and each statistic calculated from the data as the outcome of a
random data generating process. We will also explore one of the most important
uses of statistics: to ***estimate***, or guess at the value of, some unknown
feature of the data generating process.
::: {.goals data-latex=""}
***Chapter goals***
In this chapter, we will learn how to:
1. Describe the joint probability distribution of a very simple data set.
2. Identify the key features of a random sample.
3. Classify data sets by sampling types.
4. Find the sampling distribution of a very simple statistic.
5. Find the mean and variance of a statistic from its sampling distribution.
6. Find the mean and variance of a statistic that is linear in the data.
7. Distinguish between parameters, statistics, and estimators.
8. Calculate the sampling error of an estimator.
9. Calculate bias and classify estimators as biased or unbiased.
10. Calculate the mean squared error of an estimator.
11. Apply MVUE and MSE criteria to select an estimator.
12. Calculate the standard error for a sample average.
13. Explain the law of large numbers and what it means for an estimator to be
consistent.
:::
To prepare for this chapter, please review both the
[introductory](#random-variables) and [advanced](#more-on-random-variables)
chapters on random variables, as well as the sections in the data analysis
chapter on [summary statistics](#summary-statistics) and
[frequency tables](#frequency-tables).
## Using statistics
Statistics are just numbers calculated from data. Modern computers make
statistics easy to calculate, and they are easy to interpret as descriptions
of the data.
But that is not the only possible interpretation of a statistic, and it is not
even the most important one. Instead, we regularly use statistics calculated
from data to infer or predict other quantities that are *not* in the data.
1. Statistics Canada may conduct a survey of a few thousand Canadians, and
use statistics based on that survey to ***infer*** how the other
40+ million Canadians would have responded to that survey.
- This is the main application we will consider in this course.
2. Wal-Mart may use historical sales data to ***predict*** how many chocolate
bunnies it will sell this Easter. It will then use this prediction to
determine how many chocolate bunnies to order.
- We will talk a little about this kind of application.
3. Economists and other researchers will often be interested in making
***causal*** or ***counterfactual*** inferences.
- Counterfactual inferences are predictions about how the data would have
been different under other (counterfactual) circumstances.
- Economic fundamentals like supply and demand curves are primarily
counterfactual because they describe how much would have been bought or
sold at *each* price (not just the equilibrium price).
- Causal inferences are inferences about the underlying mechanism that
produced the data.
- For example, labour economists are often interested in whether and how much
the typical individual's earnings would increase if they spent one more
year in school, or obtained a particular educational credential.
- Counterfactual and causal inference are beyond the scope of this course,
but are important in applied economics and may be covered extensively in
later courses.
Anyone can make predictions, and almost anyone can calculate a few statistics
in Excel. The hard part is making *accurate* predictions, and selecting
or constructing statistics that will tend to produce accurate predictions.
In order to do that, we will need to construct a probabilistic model that
describes both the random process that generated the data and the process we
follow to construct predictions from the data.
::: example
**Using data to predict roulette outcomes**
Our probability calculations for roulette have relied on two pieces of
knowledge:
- *We know the game's structure*: there are 37 numbered slots, 18 numbers are
red, 18 are black, and one is green.
- *We know the game is fair*: the ball is equally likely to land in all
37 numbered slots.
In addition, the game is simple enough that we can do all of the calculations.
But what if we do not know the structure of the game, are not sure the game is
fair, or the game is too complicated for us to calculate the probabilities?
If we have access to a data set of past results, we can use that data set to:
1. *estimate* the win probability of various bets.
- This application will be covered in the current chapter.
2. *test* the claim that the game is fair.
- This application will be covered in the chapter on
[statistical inference](#statistical-inference).
This approach will be particularly useful for games like poker or blackjack that
are more complex and/or involve human decision making. The win probability in
blackjack depends on choices made by the player, so the house advantage can vary
depending on who is playing, their state of mind (are they distracted,
intoxicated, or trying to show off?), and various other human factors.
:::
## Data and the data generating process
We will start by assuming for the rest of this chapter that we have a
***data set*** or ***sample*** called $D_n$. In most applications, it will be
a [tidy data set](#tidy-data) with $n$ observations (rows) and $K$ numeric
variables (columns). For this chapter, we will further simplify by assuming that
$K = 1$, i.e., that $D_n = (x_1,x_2,\ldots,x_n)$ contains $n$ observations on a
single numeric variable $x_i$. This case will cover all of the univariate
statistics and methods described in
[Chapter 3: Basic data analysis with Excel](#basic-data-analysis-with-excel).
::: example
**Data from two roulette games**
Suppose we have a data set $D_n$ providing the result of $n = 2$ independent
games of roulette. Let $x_i$ be the result of a bet on red:
\begin{align}
x_i = \begin{cases} 1 & \textrm{if Red wins game } i \\ 0 & \textrm{if Red loses game } i \\ \end{cases}
\end{align}
Then $D_n = (x_1,x_2)$ where $x_1$ is the result from the first game and $x_2$
is the result from the second game.
For example, suppose red wins the first game and loses the second game. Then our
data could be written in a table as:
| Game \# ($i$) | Result of bet on red ($x_i$) |
|:--------------|:----------------------------:|
| 1 | 1 |
| 2 | 0 |
or in a list as $D_n = (1,0)$.
This is the simplest possible example, so we can learn the concepts with the
least possible amount of arithmetic. To make sure you understand the examples
in this chapter, re-do them with the *three*-game data set $D_n = (0,1,0)$.
:::
### Data as random variables
Our data set $D_n$ is a table or list of *numbers*. We can also think of it
as a set of *random variables* with an unknown joint PDF $f_D$. This PDF is
sometimes called the ***data generating process*** or DGP for the data set.
This is the fundamental conceptual step in the entire course, so you should
pause for a moment to make sure you understand it. We are thinking of our data
set as two distinct things:
1. The specific set of numbers in front of us.
2. The outcome of some random process that generated those specific numbers this
time, but could easily have generated other numbers instead.
The goal of statistical analysis is to use the specific set of numbers in front
of us to learn something new about the random process that generated
those specific numbers.
::: example
**The DGP for our roulette data**
The DGP of our two-game roulette data set is just the joint PDF of
$D_n=(x_1,x_2)$:
\begin{align}
f_D(a,b) &= \Pr(x_1 = a \cap x_2 = b)
\end{align}
where $a$ and $b$ are any real numbers.
:::
### The support of a data set
The support of the data set $D_n = (x_1,x_2,\ldots,x_n)$ is just the set of all
length-$n$ sequences of numbers that can be constructed from the support of
$x_i$. There are $|S_x|^n$ such sequences, where $S_x$ is the support of $x_i$.
::: example
**The support for our roulette data**
Our two-game roulette data set has a discrete support that includes four
possible values corresponding to the four possible length-2 sequences that
can be constructed from $S_x = \{0,1\}$:
\begin{align}
S_D = \{(0,0), (0,1), (1,0), (1,1)\}
\end{align}
Note that the order matters here: the outcome $(0,1)$ (red loses game 1 and wins
game 2) is a different outcome from $(1,0)$ (red wins game 1 and loses game 2).
:::
Most real-world data sets have enormous support. For example, our roulette data
set is just about the simplest possible meaningful data set, but the support
for a data set with 100 games would have
$2^{100} = 1,267,650,600,228,229,401,496,703,205,376$ distinct values. Most
data sets we analyze have many more observations and many more variables than
that, so their support would be even larger.
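For readers who want to check this in R (the language used to build this book), here is a small sketch that counts the support and, for small $n$, lists it explicitly; the function name `support_size` is just for illustration:

```{r support-sketch, echo=TRUE}
# Size of the support of a data set of n observations,
# where each observation takes values in S_x.
support_size <- function(S_x, n) length(S_x)^n

support_size(c(0, 1), 2)    # the four outcomes of the two-game example
support_size(c(0, 1), 100)  # 2^100, far too many to enumerate

# For small n we can list the outcomes explicitly:
expand.grid(x1 = c(0, 1), x2 = c(0, 1))
```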
### The DGP
The exact DGP is usually unknown. But in many cases, we know something about
the underlying process and can make some reasonable assumptions based on what
we know. This can simplify the DGP in ways that will be helpful.
::: example
**Simplifying the DGP of our roulette data**
The DGP for our two-game roulette data set involves four[^702] unknown joint
probabilities, one for each element of the support.
[^702]: It might be more accurate to say it involves only three unknown
probabilities. Since we know the probabilities will sum up to one, if we know
three of the four we can calculate the fourth.
Based on what we know about the game of roulette, we can reasonably assume that
results of different games are independent and that red has the same win
probability in each game. Then the DGP can be written:
\begin{align}
f_D(0,0) &= \Pr(x_1 = 0 \cap x_2 = 0) \\
&= \Pr(x_1 = 0) *\Pr(x_2 = 0) \qquad \textrm{(by independence)}\\
&= (1-p)^2 \\
f_D(0,1) &= \Pr(x_1 = 0 \cap x_2 = 1) \\
&= \Pr(x_1 = 0) *\Pr(x_2 = 1) \qquad \textrm{(by independence)}\\
&= (1-p)*p \\
f_D(1,0) &= \Pr(x_1 = 1 \cap x_2 = 0) \\
&= \Pr(x_1 = 1) *\Pr(x_2 = 0) \qquad \textrm{(by independence)}\\
&= p*(1-p) \\
f_D(1,1) &= \Pr(x_1 = 1 \cap x_2 = 1) \\
&= \Pr(x_1 = 1) *\Pr(x_2 = 1) \qquad \textrm{(by independence)}\\
&= p^2 \\
f_D(a,b) &= 0 \qquad \textrm{otherwise}
\end{align}
where $p = \Pr(x_i = 1)$ is the unknown probability that a bet on red wins.
Note that the DGP of $D_n$ is still unknown, but now it can be described in
terms of a single unknown parameter $p$ rather than the full set of four unknown
joint probabilities.
:::
While it is feasible to calculate the DGP for a very small data set, it
quickly becomes impractical to do so as the number of observations increases
and the set of possibilities to consider becomes enormous. Fortunately, we
rarely need to calculate the DGP. We just need to understand that it *could* be
calculated.
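Although we rarely need to compute a DGP, the two-game case is small enough to tabulate. A minimal R sketch, using the fair-game value $p = 18/37$ purely for illustration (in practice $p$ is unknown):

```{r dgp-sketch, echo=TRUE}
# Joint PDF of the two-game data set under independence:
# f_D(a, b) = Pr(x1 = a) * Pr(x2 = b), with each x_i ~ Bernoulli(p).
f_D <- function(a, b, p) dbinom(a, size = 1, prob = p) * dbinom(b, size = 1, prob = p)

p <- 18/37                        # illustrative value only
outer(0:1, 0:1, f_D, p = p)       # rows: x1 = 0, 1; columns: x2 = 0, 1
sum(outer(0:1, 0:1, f_D, p = p))  # the four probabilities sum to one
```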
### Simple random sampling
In most applications, we assume that $D_n$ is
***independent and identically distributed*** (IID) or a
***simple random sample*** from a large ***population***. A simple random sample
has two features:
1. All observations are **independent**: Each $x_i$ is an independent random
variable.
2. All observations are **identically distributed**: Each $x_i$ has the same
(unknown) marginal distribution.
Random sampling dramatically simplifies the DGP. The joint PDF of a simple
random sample can be written:
\begin{align}
\Pr(D_n = (a_1,a_2,\ldots,a_n)) = f_x(a_1)f_x(a_2)\ldots f_x(a_n)
\end{align}
where $f_x(a) = \Pr(x_i = a)$ is just the marginal PDF of a single observation.
Independence allows us to write the joint PDF as the product of the marginal
PDFs for each observation, and identical distribution allows us to use the same
marginal PDF for each observation. This reduces the number of unknown numbers
in the DGP from $|S_x|^n$ (the support of $D_n$) to $|S_x|$
(the support of $x$, which is much smaller).
The reason we call this "independent and identically distributed" is hopefully
obvious, but what does it mean to say we have a "random sample" from a
"population"? Well, one simple way of generating an IID sample is to:
1. Define the population of interest, for example all Canadian residents.
2. Use some purely random mechanism[^602] to choose a small subset of cases
from this population.
- The subset is called our ***sample***.
- "Purely random" here means some mechanism like a computer's random number
generator, which can then be used to dial random telephone numbers or
select cases from a list.
3. Collect data from every case in our sample.
This process will generate a data set that is independent and identically
distributed.
[^602]: As a technical matter, the assumption of independence requires
that we sample *with replacement*. This means we allow
for the possibility that we sample the same case more than once.
In practice this doesn't matter as long as the sample is small
relative to the population.
::: example
**Our roulette data is a random sample**
Each observation $x_i$ in our two-game roulette data set is an independent
random draw from the $Bernoulli(p)$ distribution where
$p = \Pr(\textrm{Red wins})$.
Therefore, this data set satisfies the criteria for a simple random sample.
:::
Random sampling is at the core of basic statistical analysis for two reasons:
1. It is simple to implement.
2. Results shown later in this chapter imply that a moderately-sized random
sample provides surprisingly accurate information on the underlying
population.
However, it is not the only possible sampling process. Alternatives to simple
random sampling will be discussed later in this chapter.
## Statistics and their properties
A ***statistic*** is just a number $s_n =s(D_n)$ that is calculated from the
data. In general, the value of any statistic is:
- Observed/known since the data set $D_n$ is observed/known.
- A random variable with a probability distribution that is *well-defined* but
*unknown*. This is because the data set $D_n$ is a set of random variables
with the same characteristics.
I will use $s_n$ to represent a generic statistic, but we will often use
other letters to talk about specific statistics.
::: example
**Roulette wins**
In our two-game roulette data set, the total number of wins is:
\begin{align}
R = x_1 + x_2
\end{align}
Since this is a number calculated from our data, it is a statistic.
We can think of $R$ as a specific value for our specific data set
$D_n = (1,0)$:
\begin{align}
R = 1 + 0 = 1
\end{align}
We can also think of it as a random variable whose value would have been
different if the data were different. Since $x_1$ and $x_2$ are
independent draws from the $Bernoulli(p)$ distribution, the total number of
wins has a binomial distribution:
\begin{align}
R \sim Binomial(2,p)
\end{align}
This distribution is unknown because the true value of $p$ is unknown.
:::
### Summary statistics {#summary-statistics-theory}
The [univariate summary statistics](#summary-statistics) we previously learned
to calculate in Excel will serve as our main examples.
::: example
**Summary statistics for our roulette data**
We can calculate the usual summary statistics for our two-game roulette data
set:
| Statistic | Formula | In roulette data |
|:---------------|:----------------------------:|:----------------------------:|
| Sample size (count) | $n$ | $2$ |
| Sample average | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | $\frac{1}{2} (1 + 0) = 0.5$ |
| Sample variance | $\hat{sd}_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2$ | $\frac{1}{2-1} (( 1-0.5)^2 + (0-0.5)^2) = 0.5$ |
| Sample std dev. | $\hat{sd}_x = \sqrt{\hat{sd}_x^2}$ | $\sqrt{0.5} \approx 0.71$ |
| Sample median | $\hat{m} =\frac{x_{[n/2]} + x_{[(n/2) + 1]}}{2}$ if $n$ is even | $\frac{x_{[1]} + x_{[2]}}{2} = \frac{0 + 1}{2} = 0.5$ |
| | $\hat{m} = x_{[(n/2) + (1/2)]}$ if $n$ is odd ||
:::
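The same table can be reproduced with R's built-in functions, which use exactly these formulas (including the $n-1$ denominator in `var()`):

```{r summary-stats-sketch, echo=TRUE}
x <- c(1, 0)  # the two-game data set D_n = (1, 0)
length(x)     # sample size: 2
mean(x)       # sample average: 0.5
var(x)        # sample variance: 0.5
sd(x)         # sample standard deviation: about 0.71
median(x)     # sample median: 0.5
```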
We also learned to construct both [simple](#simple-frequency-tables) and
[binned](#binned-frequency-tables) frequency tables. Let
$B \subset \mathbb{R}$ be a bin of values. Each bin would contain a single value for a simple frequency table, or multiple values for a binned frequency table.
Given a particular bin, we can define:
- The ***sample frequency*** or ***relative sample frequency*** of bin $B$ is
the proportion of cases in which $x_i$ is in $B$:
\begin{align}
\hat{f}_B = \frac{1}{n} \sum_{i=1}^n I(x_i \in B)
\end{align}
- The ***absolute sample frequency*** of bin $B$ is the *number* of cases in
which $x_i$ is in $B$:
\begin{align}
n \hat{f}_B = \sum_{i=1}^n I(x_i \in B)
\end{align}
We can then construct each cell in a frequency table by choosing the appropriate
bin.
::: example
**Frequency statistics for our roulette data**
We can calculate the usual frequency statistics for our two-game roulette data:
| Statistic | Formula | In roulette example |
|:---------------|:----------------------------:|:----------------------------:|
| Relative frequency | $\hat{f}_B = \frac{1}{n} \sum_{i=1}^n I(x_i \in B)$ | depends on $B$ |
| $\quad B=\{0\}$ | $\hat{f}_0 = \frac{1}{n} \sum_{i=1}^n I(x_i = 0)$ | $\frac{1}{2}(0 + 1) = 0.5$ |
| $\quad B=\{1\}$ | $\hat{f}_1 = \frac{1}{n} \sum_{i=1}^n I(x_i = 1)$ | $\frac{1}{2}(1 + 0) = 0.5$ |
| Absolute frequency | $n\hat{f}_B = \sum_{i=1}^n I(x_i \in B)$ | depends on $B$ |
| $\quad B=\{0\}$ | $n\hat{f}_0 = \sum_{i=1}^n I(x_i = 0)$ | $(0 + 1) = 1$ |
| $\quad B=\{1\}$ | $n\hat{f}_1 = \sum_{i=1}^n I(x_i = 1)$ | $(1 + 0) = 1$ |
:::
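In R, `table()` gives the absolute frequencies directly, and dividing by $n$ gives the relative frequencies:

```{r freq-sketch, echo=TRUE}
x <- c(1, 0)          # the two-game data set
table(x)              # absolute frequencies: one 0 and one 1
table(x) / length(x)  # relative frequencies: 0.5 each
```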
### The sampling distribution
We call the probability distribution of a statistic its
***sampling distribution***. In principle, the sampling distribution of any
statistic can be directly derived from the DGP of its data. The sampling
distribution is therefore:
- Unknown since the DGP $f_D$ is unknown.
- Fixed (non-random) since the DGP is a function $f_D$ and not a random
variable.
In practice, the sampling distribution is difficult to calculate outside of a
few simple examples. The important part is to understand what a sampling
distribution is, that every statistic has one, and that it depends on the
(usually unknown) DGP.
::: example
**The sampling distribution of the sample average in our roulette data**
In our two-game roulette data set, the sample average is:
\begin{align}
\bar{x} = \frac{1}{2} (x_1 + x_2)
\end{align}
Since there are four possible values of $(x_1,x_2)$, we can determine the
sampling distribution of the sample average by enumeration.
| Data ($D_2$) | Probability ($f_D$) | Sample Average ($\bar{x}$) |
|:-------------|:-------------------:|:--------------------------:|
| $(0,0)$ | $(1-p)^2$ | $0.0$ |
| $(0,1)$ | $p(1-p)$ | $0.5$ |
| $(1,0)$ | $p(1-p)$ | $0.5$ |
| $(1,1)$ | $p^2$ | $1.0$ |
Therefore, the sampling distribution of $\bar{x}$ in this data set can be
described by the PDF:
\begin{align}
f_{\bar{x}}(a) \equiv \Pr(\bar{x}=a)
&= \begin{cases}
(1-p)^2 & \textrm{if $a=0$} \\
2p(1-p) & \textrm{if $a=0.5$} \\
p^2 & \textrm{if $a=1$} \\
0 & \textrm{otherwise} \\
\end{cases}
\end{align}
and the support of $\bar{x}$ is $S_{\bar{x}} = \{0,0.5,1\}$.
:::
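We can reproduce this enumeration in R. The value $p = 18/37$ is used only for illustration; the algebraic PDF above holds for any $p$:

```{r sampling-dist-sketch, echo=TRUE}
p <- 18/37
games <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))  # the support of D_2
games$prob <- dbinom(games$x1, 1, p) * dbinom(games$x2, 1, p)
games$xbar <- (games$x1 + games$x2) / 2           # the statistic in each outcome
aggregate(prob ~ xbar, data = games, FUN = sum)   # the sampling distribution
```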
### The mean {#the-mean-of-a-statistic}
Since a statistic has a probability distribution, it has an expected value[^701]
(mean).
[^701]: All random variables we will see in this course will have an expected
value, but it is possible for a random variable to have a well-defined PDF but
not a well-defined expected value. For example, if $x \sim N(0,1)$, then $y=1/x$
has this property.
::: example
**The mean of the sample average in the roulette data**
We can calculate the expected value of $\bar{x}$ in the two-game roulette
data set directly from its PDF, which we derived in the previous example:
\begin{align}
E(\bar{x}) &= \sum_{a \in S_{\bar{x}}} a f_{\bar{x}}(a) \\
&= 0 \times f_{\bar{x}}(0) + 0.5 \times f_{\bar{x}}(0.5) + 1 \times f_{\bar{x}}(1) \\
&= 0 \times (1-p)^2 + 0.5 \times 2p(1-p) + 1.0 \times p^2 \\
&= p
\end{align}
:::
As mentioned earlier, it is often impractical or impossible to calculate the
complete sampling distribution for a given statistic. Fortunately, we do not
always need the complete sampling distribution to calculate the mean.
::: {.example #mean-of-sample-average}
**Another way to find the mean of the sample average**
The sample average is just a sum, so in our two-game roulette data set:
\begin{align}
E(\bar{x}) &= E\left(\frac{1}{2}(x_1 + x_2)\right) \\
&= \frac{1}{2}\left(E(x_1) + E(x_2)\right) \qquad \textrm{by linearity} \\
&= \frac{1}{2} (p + p) \quad \textrm{since $E(x_i)=p$} \\
&= p
\end{align}
Note that this is the same answer as we derived directly from the PDF.
:::
The results in Example \@ref(exm:mean-of-sample-average) can be generalized to
apply to any sample average in a random sample. More specifically, suppose we
have a simple random sample of size $n$ on the random variable $x_i$ with
unknown mean $E(x_i) = \mu_x$. Then the expected value of the sample average is:
\begin{align}
E(\bar{x}) &= E\left( \frac{1}{n} \sum_{i=1}^n x_i\right) \\
&= \frac{1}{n} \sum_{i=1}^n E\left( x_i\right) \\
&= \frac{1}{n} \sum_{i=1}^n \mu_x \\
&= \mu_x
\end{align}
This is an important result in statistics, so you should follow it step-by-step
to make sure you understand. If you are struggling with it, look at the simple
example first. The key is to recognize that the sample average is a sum, and
so we can apply the linearity of the expected value. We have derived this result
for the specific case of a simple random sample, but it applies for many other
common sampling schemes.
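A quick Monte Carlo check of this result: draw many random samples, compute the average of each, and compare the mean of those sample averages to $\mu_x$. The sample size, number of replications, and the Bernoulli setup are arbitrary choices for illustration:

```{r mean-mc-sketch, echo=TRUE}
set.seed(1)
n <- 10; reps <- 10000
mu_x <- 18/37                                       # E(x_i) for Bernoulli(18/37)
xbars <- replicate(reps, mean(rbinom(n, 1, mu_x)))  # many sample averages
mean(xbars)                                         # close to mu_x (about 0.486)
```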
### The variance {#the-variance-of-a-statistic}
Statistics also have a variance and a standard deviation, and they are often
easy to calculate.
::: {.example #variance-of-sample-average}
**The variance of the sample average in the roulette data**
In our two-game roulette data set, the variance of the sample average is:
\begin{align}
var(\bar{x}) &= var\left(\frac{1}{2}(x_1 + x_2)\right) \\
&= \left(\frac{1}{2}\right)^2 var(x_1 + x_2) \\
&= \frac{1}{4} \left( var(x_1) +
\underbrace{2 \, cov(x_1,x_2)}_{\textrm{$= 0$ (by independence)}} + var(x_2) \right) \\
&= \frac{1}{4} \left( 2 \, var(x_i) \right) \\
&= \frac{var(x_i)}{2}
\end{align}
Notice that $var(\bar{x}) < var(x_i)$. Averages are typically less variable
than the thing they are averaging.
:::
The result in Example \@ref(exm:variance-of-sample-average) can be generalized
to any sample average in a random sample on the random variable $x_i$
with mean $E(x_i)=\mu_x$ and variance $var(x_i)=\sigma_x^2$. Then:
\begin{align}
var(\bar{x}) &= \frac{\sigma_x^2}{n} \\
sd(\bar{x}) &= \frac{\sigma_x}{\sqrt{n}}
\end{align}
I won't ask you to prove this, but the proof is just a longer version of Example
\@ref(exm:variance-of-sample-average) above.
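The same Monte Carlo approach verifies the variance formula (again with illustrative values of $n$ and $p$):

```{r var-mc-sketch, echo=TRUE}
set.seed(1)
n <- 10; reps <- 10000; p <- 18/37
sigma2_x <- p * (1 - p)                          # var(x_i) for a Bernoulli(p)
xbars <- replicate(reps, mean(rbinom(n, 1, p)))  # many sample averages
var(xbars)    # close to the theoretical value...
sigma2_x / n  # ...sigma_x^2 / n, about 0.025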
::: warning
**Random variables, expected values, and statistics**
Before proceeding, be sure you understand the distinction between:
- The sample average $\bar{x}$ and the value of a single observation $x_i$.
- The expected values $\mu_x = E(x_i)$ and $E(\bar{x})$.
- The variances $\sigma_x^2 = var(x_i)$ and $var(\bar{x})$.
One particularly common mistake is to confuse $\bar{x}$ and $\mu_x$.
:::
## Estimation {#estimation}
One of the most important uses of statistics is to estimate, or guess the value of, some unknown feature of the population or DGP.
### Parameters
A ***parameter*** is an unknown number $\theta = \theta(f_D)$ whose value
depends on the DGP. Since a parameter is constructed from the DGP, its
value is:
- Unobserved/unknown since the DGP $f_D$ is unknown.
- Fixed (not random) since the DGP $f_D$ is a function and not a random
variable.
I will use $\theta$ to represent a generic parameter, but we will often use
other letters to talk about specific parameters.
::: example
**Examples of parameters**
Sometimes a single parameter completely describes the DGP:
- In our two-game roulette data set, the joint distribution of the data depends
only on the (known) sample size $n$ and the single (unknown) parameter
$p = \Pr(\textrm{Red wins})$.
Sometimes a group of parameters completely describe the DGP:
- If $x_i$ is a random sample from the $U(L,H)$ distribution, then $L$ and $H$
are both parameters.
And sometimes a parameter only partially describes the DGP:
- If $x_i$ is a random sample from some unknown distribution with unknown mean
$\mu_x = E(x_i)$, then $\mu_x$ is a parameter.
- If $x_i$ is a random sample from some unknown distribution, then
$f_5 = \Pr(x = 5)$ is a parameter.
:::
Typically there will be specific parameters whose value we wish to know. Such
a parameter is called a ***parameter of interest***. The DGP may include other
parameters, which are typically called *auxiliary parameters* or
*nuisance parameters*.
### Estimators
An ***estimator*** is any statistic $\hat{\theta}_n = \hat{\theta}(D_n)$ that is
used to ***estimate*** (guess at the value of) an unknown parameter of interest
$\theta$. Since an estimator is constructed from $D_n$, its value is:
- Observed/known since the data set $D_n$ is observed/known.
- A random variable with a well-defined but unknown probability distribution
since the data set $D_n$ also has those properties.
I will use $\hat{\theta}$ to represent a generic estimator, but we will often
use other notation to talk about specific estimators. The circumflex or "hat"
$\hat{\,}$ notation is commonly used to identify an estimator; for example,
$\hat{\mu}$ may be used to represent an estimator of the parameter $\mu$ and
$\hat{\sigma}$ may be used to represent an estimator of the parameter $\sigma$.
::: example
**Two estimators for the win probability**
Consider our two-game roulette data set $D_n = (x_1,x_2) = (1,0)$, and suppose
our parameter of interest is the win probability $p$. I will propose two
estimators for $p$:
1. The sample average:
\begin{align}
\bar{x} &= \frac{1}{2} (x_1 + x_2) \\
&= \frac{1}{2} (1 + 0) \\
&= 0.5
\end{align}
2. The value of the first observation:
\begin{align}
x_1 &= 1
\end{align}
These are both statistics calculated from the data, so they are both potential
estimators for $p$.
:::
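Both proposed estimators are one-liners in R:

```{r estimators-sketch, echo=TRUE}
x <- c(1, 0)  # the two-game data set
mean(x)       # estimator 1: the sample average, 0.5
x[1]          # estimator 2: the first observation, 1
```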
An estimator is just a rule for making guesses; any statistic can be used as
an estimator of any parameter. But we need to pick a specific guess, and we want
our guess to be an accurate one. So we will need some kind of ***criterion***
that allows us to compare different statistics and choose the statistic that
represents the "best" estimator of a particular parameter.
Intuitively, a good estimator is one that is unlikely to be very different from
the true value of the unknown parameter. We can quantify "unlikely" and
"very different" more precisely by introducing the concepts of sampling error,
bias, and mean squared error.
### Sampling error
Let $\hat{\theta}_n$ be a statistic we are using as an estimator of some
parameter of interest $\theta$. We can define its ***sampling error*** as:
\begin{align}
err(\hat{\theta}_n) = \hat{\theta}_n - \theta
\end{align}
In principle, we want $\hat{\theta}_n$ to be a good estimator of $\theta$, i.e.,
we want the sampling error to be as close to zero as possible.
Since the sampling error depends on *both* the estimator $\hat{\theta}_n$ and
the true parameter value $\theta$, its value is:
1. Unknown/unobservable since $\theta$ is unknown.
2. Random since $\hat{\theta}_n$ is random.
Always remember that $err(\hat{\theta}_n)$ is not an inherent property of the
statistic: it depends on the relationship between the statistic and the
parameter of interest. A given statistic may be a good estimator of one
parameter, and a bad estimator of another parameter.
::: example
**Sampling error for our two estimators**
The sampling error for our two estimators is:
\begin{align}
err(\bar{x}) &= \bar{x} - p \\
&= 0.5 - p \\
err(x_1) &= x_1 - p \\
&= 0 - p = -p
\end{align}
Notice that although we have calculated $\bar{x}$ and $x_1$ from the data,
the sampling errors themselves remain unknown since they depend on the
unknown parameter $p$.
:::
### Bias
The ***bias*** of an estimator is defined as its expected sampling error:
\begin{align}
bias(\hat{\theta}_n) &= E(err(\hat{\theta}_n)) \\
&= E(\hat{\theta}_n - \theta) \\
&= E(\hat{\theta}_n) - \theta
\end{align}
Ideally we would want $bias(\hat{\theta}_n)$ to be zero, in which case we would
say that $\hat{\theta}_n$ is an ***unbiased*** estimator of $\theta$.
Since it depends on $E(\hat{\theta}_n)$ and $\theta$, the bias is generally:
- Unobserved/unknown since $E(\hat{\theta}_n)$ and $\theta$ both depend on
the DGP
- Fixed/nonrandom since $E(\hat{\theta}_n)$ and $\theta$ are numbers and not
random variables
Although the bias is *generally* unknown, there are some important cases in
which we can prove that it is zero.
::: example
**Two unbiased estimators**
In our two-game roulette data, the bias of the sample average as an estimator
of $p$ is:
\begin{align}
bias(\bar{x}) &= E(\bar{x}) - p \\
&= p - p \\
&= 0
\end{align}
and the bias of the first observation is:
\begin{align}
bias(x_1) &= E(x_1) - p \\
&= p - p \\
&= 0
\end{align}
Therefore, both of these estimators are unbiased. This example illustrates a
general principle: there is rarely exactly one unbiased estimator. There are
either none, or many.
:::
More generally, suppose we have a random sample of size $n$ on some
random variable $x_i$ with mean $E(x_i) = \mu_x$. We earlier showed in this
case that:
\begin{align}
E(\bar{x}) &= \mu_x
\end{align}
So the sample average $\bar{x}$ is an unbiased estimator of $\mu_x$:
\begin{align}
bias(\bar{x}) &= E(\bar{x}) - \mu_x \\
&= \mu_x - \mu_x \\
&= 0
\end{align}
This is true for any random sample and any random variable $x_i$.
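We can also check unbiasedness by simulation. The sketch below assumes a known win probability of $p = 0.4$ (chosen purely for illustration) and simulates a large number of two-game data sets:

```{r}
set.seed(25)
p <- 0.4        # illustrative win probability (normally unknown)
reps <- 100000  # number of simulated two-game data sets

# Simulate both games for every data set
x1 <- rbinom(reps, size = 1, prob = p)
x2 <- rbinom(reps, size = 1, prob = p)

# Both estimators average out to approximately p = 0.4,
# consistent with zero bias
mean((x1 + x2) / 2)
mean(x1)
```

Neither average will be exactly $0.4$ in a finite simulation, but both converge to it as `reps` grows.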
If the bias is nonzero, we would say that $\hat{\theta}_n$ is a ***biased***
estimator of $\theta$. The exact amount of bias of a particular biased estimator
is usually hard to know because it depends on the unknown true DGP. But
we can sometimes say something about its direction, or about how it relates
to specific parameters of the DGP.
::: example
**A biased estimator of the median**
Consider our two-game roulette data set, and suppose we wish to estimate the
(population) median of $x_i$. Applying our
[definition of the median](#the-median) from an earlier chapter, the median of
$x_i$ is:
\begin{align}
m &= \begin{cases}
0 & \textrm{if $p \leq 0.5$} \\
1 & \textrm{if $p > 0.5$} \\
\end{cases}
\end{align}
A natural estimator of the median is the sample median. In our two-observation
example, the sample median would be:
\begin{align}
\hat{m} &= \frac{1}{2} (x_1 + x_2)
\end{align}
Its expected value is:
\begin{align}
E(\hat{m}) &= E\left(\frac{1}{2} (x_1 + x_2) \right) \\
&= \frac{1}{2} \left( E(x_1) + E(x_2) \right) \\
&= \frac{1}{2} \left( p + p \right) \\
&= p
\end{align}
So its bias as an estimator of $m$ is:
\begin{align}
bias(\hat{m}) &= E(\hat{m}) - m \\
&= p - m \\
&= \begin{cases}
p & \textrm{if $p \leq 0.5$} \\
p-1 & \textrm{if $p > 0.5$} \\
\end{cases}
\end{align}
Figure \@ref(fig:medex) below shows the true population median (in blue), the
expected value of the sample median (in orange), and the bias (in red). As you
can see, the bias is nonzero for most values of $p$, but its direction and
magnitude vary.
This result applies more generally: the sample median is typically a biased
estimator of the median. It is still the most commonly-used estimator for the
median, for reasons we will discuss soon.
:::
```{r medex, fig.cap = "*The sample median is a biased estimator*"}
medex <- tibble(p=seq(from=0,to=1,length.out=400),
m=as.integer(p > 0.5),
bias=p-m)
ggplot(data=medex,mapping=aes(x=p,y=m)) +
geom_step(col = "navy",linewidth=1) +
geom_line(aes(y=p),col="darkorange",alpha=0.4,linewidth=2) +
geom_line(aes(y=bias),col="red",linetype=2,linewidth=1) +
geom_text(x=0.43,y=0.9,col="navy",label="Median") +
geom_text(x=0.83,y=0.6,col="darkorange",label="E(sample median)") +
geom_text(x=0.6,y=-0.25,col="red",label="Bias") +
xlab("Win probability (p)") +
ylab("") +
labs(title = "The sample median is a biased estimator",
subtitle = "Two-game roulette example",
caption = "",
tag = "")
```
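The bias shown in the figure can also be verified by simulation. The sketch below picks the illustrative value $p = 0.4$, for which the true median is $m = 0$:

```{r}
set.seed(30)
p <- 0.4        # illustrative win probability; the true median here is 0
reps <- 100000

# The sample median of two observations is just their average
mhat <- (rbinom(reps, 1, p) + rbinom(reps, 1, p)) / 2

# The average sample median is approximately p = 0.4,
# far from the true median of 0
mean(mhat)
```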
### Variance and the MVUE
We usually prefer unbiased estimators to biased estimators, but that isn't
enough to pick an estimator. In general, if we can find one unbiased estimator
there are usually many others. So we need to apply at least one more criterion.
A natural second criterion is the ***variance*** of the estimator:
\begin{align}
var(\hat{\theta}_n) = E[(\hat{\theta}_n - E(\hat{\theta}_n))^2]
\end{align}
Why do we care about the variance?
- If $\hat{\theta}_n$ is unbiased, then $E(\hat{\theta}_n)=\theta$.
- Lower variance means that $\hat{\theta}_n$ is typically closer to
$E(\hat{\theta}_n)$.
Therefore, among unbiased estimators, lower-variance estimators are preferable
because they are typically closer to the true parameter value. We can put this
idea into a formal criterion which is called the "MVUE".
The ***minimum variance unbiased estimator*** (MVUE) of a parameter is the
unbiased estimator with the lowest variance, and the ***MVUE criterion*** for
choosing an estimator says to choose the MVUE.
::: example
**The MVUE for roulette**
Returning to our two-game roulette data set and two proposed estimators
($\bar{x}$ and $x_1$), we can find the MVUE by following these steps:
1. *Calculate the bias* of each proposed estimator. We calculated this earlier:
\begin{align}
bias(\bar{x}) &= 0 \\
bias(x_1) &= 0
\end{align}
2. *Calculate the variance* of each proposed estimator. We calculated this
earlier:
\begin{align}
var(\bar{x}) &= \frac{var(x_i)}{2} \\
var(x_1) &= var(x_i)
\end{align}
3. *Choose the unbiased estimator with the lowest variance*, if there is an
unbiased estimator.
- Both estimators are unbiased.
- The sample average $\bar{x}$ has lower variance $var(\bar{x}) < var(x_1)$.
- Therefore $\bar{x}$ is the MVUE.
:::
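A quick simulation can confirm the variance comparison. The sketch below again uses the illustrative value $p = 0.4$, for which $var(x_i) = p(1-p) = 0.24$:

```{r}
set.seed(14)
p <- 0.4
reps <- 100000
x1 <- rbinom(reps, 1, p)
x2 <- rbinom(reps, 1, p)
xbar <- (x1 + x2) / 2

# Simulated variances: roughly 0.24 for x1 and 0.12 for xbar,
# matching var(x_i) and var(x_i)/2
var(x1)
var(xbar)
```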
### Mean squared error
Once we move beyond the simple case of the sample average, we run
into two major complications with the MVUE criterion:
1. *No unbiased estimator*: An unbiased estimator may not exist for a particular
parameter of interest.
- For example, there is no unbiased estimator of the median, or of any other
quantile.
- If there is no unbiased estimator, there is no MVUE.
- So we need some other way of choosing an estimator.
2. *Bias/variance trade-off*: Sometimes we have both an unbiased estimator with
high variance and another estimator with much lower variance but just a
little bit of bias.
- A detailed example of this case is provided below.
- Here, the unbiased estimator is the MVUE.
- But we may not be happy with this choice if the bias is small enough and
the variance of the unbiased estimator is large enough.
::: example
**The relationship between age and earnings**
Labour economists are often interested in the relationship between age and
earnings. Typically, workers earn more as they get older but earnings do not
increase at a constant rate. Instead, earnings rise rapidly in a typical
worker's 20s and 30s, then gradually flatten out. This pattern affects many
economically important decisions like education, savings, household formation,
having children, etc.
Suppose we want to estimate the earnings of the average 35-year-old Canadian,
and have access to a random sample of 810 Canadians with 10 observations for
each age between 0 and 80.

The average earnings of 35-year-olds in our data would be an unbiased estimator
of the average earnings of 35-year-olds in Canada. However, it would be based
on only 10 observations, and its variance would be very high.

We could increase the sample size and reduce the variance by including
observations from people who are *almost* 35 years old. We have many
options, including:

- Average earnings of the 10 35-year-olds in our data.
- Average earnings of the 30 34-36-year-olds in our data.
- Average earnings of the 100 30-39-year-olds in our data.
- Average earnings of all 810 0-80-year-olds in our data.
Widening the age range will reduce the variance of these averages, but will
introduce bias (since they have added people that are not exactly like
our target population of 35-year-olds). It is not clear which age range will
tend to produce the most accurate estimator of the parameter of interest
(average earnings of 35 year olds in Canada).
:::
This set of issues implies that we need a criterion that:
- Can be used to choose between biased estimators.
- Can choose slightly biased estimators with low variance over unbiased
estimators with high variance.
The ***mean squared error*** of an estimator is defined as the expected value
of the squared sampling error:
\begin{align}
MSE(\hat{\theta}_n) &= E[err(\hat{\theta}_n)^2] \\
&= E[(\hat{\theta}_n-\theta)^2]
\end{align}
and the ***MSE criterion*** says to choose the (biased or unbiased) estimator
with the lowest MSE.
While this is the definition of MSE, we can derive a handy formula:
\begin{align}
MSE(\hat{\theta}_n) &= var(\hat{\theta}_n) + [bias(\hat{\theta}_n)]^2
\end{align}
This is the formula we will usually use to calculate MSE. A few things to note
about this formula:
1. Both bias and variance enter into the formula. So all else equal, the MSE
criterion still favors less biased estimators and lower variance estimators.
2. The bias is squared, meaning both positive and negative bias are treated as
equally bad.
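We can confirm the decomposition numerically using the biased sample median from our earlier example, again with the illustrative value $p = 0.4$ (so the true median is 0):

```{r}
set.seed(7)
p <- 0.4        # illustrative win probability; the true median m is 0
reps <- 100000
mhat <- (rbinom(reps, 1, p) + rbinom(reps, 1, p)) / 2

# Direct calculation: average squared sampling error
mean((mhat - 0)^2)

# Decomposition: variance plus squared bias
var(mhat) + (mean(mhat) - 0)^2
```

Both calculations give approximately $0.12 + 0.4^2 = 0.28$, apart from simulation noise.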
::: example
**The MSE for our two estimators**
Returning to our two-game roulette data set, we can apply the MSE criterion to
choose between our proposed estimators by following these steps:
1. *Calculate bias and variance* for each estimator. We have already done this:
\begin{align}
bias(\bar{x}) &= 0 \\
bias(x_1) &= 0 \\
var(\bar{x}) &= \frac{var(x_i)}{2} \\
var(x_1) &= var(x_i)
\end{align}
2. *Calculate MSE* using the variance/bias formula:
\begin{align}
MSE(\bar{x})
&= var(\bar{x}) + [bias(\bar{x})]^2 \\
&= \frac{var(x_i)}{2} + [0]^2 \\
&= \frac{var(x_i)}{2} \\
MSE(x_1)
&= var(x_1) + [bias(x_1)]^2 \\
&= var(x_i) + [0]^2 \\
&= var(x_i)
\end{align}
3. *Choose the estimator with the lowest MSE*. In this case,
$MSE(\bar{x}) < MSE(x_1)$ so $\bar{x}$ is the preferred estimator by the
MSE criterion.
Note that in this example, the sample average is the preferred estimator by both
the MVUE criterion and the MSE criterion. But that will not always be the case.
:::
The MSE criterion allows us to choose a biased estimator with low variance over
an unbiased estimator with high variance, and also allows us to choose between
biased estimators when no unbiased estimator exists.
::: {.fyi data-latex=""}