# Statistics {#statistics}
```{r setup7, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
prompt = FALSE,
tidy = TRUE,
collapse = TRUE)
library("tidyverse")
library("cowplot")
```
In earlier chapters, we learned to use Excel to construct common
[univariate statistics and charts](#basic-data-analysis-with-excel). We also
learned the basics of [probability theory](#probability-and-random-events), and
working with [simple](#random-variables) or [complex](#more-on-random-variables)
random variables. The next step is to bring these concepts together, and apply
the theoretical tools of probability and random variables to statistics
calculated from data.
This chapter will develop the theory of mathematical statistics, which treats
our data set and each statistic calculated from the data as the outcome of a
random data generating process. We will also explore one of the most important
uses of statistics: to ***estimate***, or guess at the value of, some unknown
feature of the data generating process.
::: {.goals data-latex=""}
***Chapter goals***
In this chapter, we will learn how to:
1. Describe the joint probability distribution of a very simple data set.
2. Identify the key features of a random sample.
3. Classify data sets by sampling types.
4. Find the sampling distribution of a very simple statistic.
5. Find the mean and variance of a statistic from its sampling distribution.
6. Find the mean and variance of a statistic that is linear in the data.
7. Distinguish between parameters, statistics, and estimators.
8. Calculate the sampling error of an estimator.
9. Calculate bias and classify estimators as biased or unbiased.
10. Calculate the mean squared error of an estimator.
11. Apply MVUE and MSE criteria to select an estimator.
12. Calculate the standard error for a sample average.
13. Explain the law of large numbers and what it means for an estimator to be
consistent.
:::
To prepare for this chapter, please review both the
[introductory](#random-variables) and [advanced](#more-on-random-variables)
chapters on random variables, as well as the sections in the data analysis
chapter on [summary statistics](#summary-statistics) and
[frequency tables](#frequency-tables).
## Using statistics
Statistics are just numbers calculated from data. Modern computers make
statistics easy to calculate, and they are easy to interpret as descriptions
of the data.
But that is not the only possible interpretation of a statistic, and it is not
even the most important one. Instead, we regularly use statistics calculated
from data to infer or predict other quantities that are *not* in the data.
1. Statistics Canada may conduct a survey of a few thousand Canadians, and
use statistics based on that survey to ***infer*** how the other
40+ million Canadians would have responded to that survey.
- This is the main application we will consider in this course.
2. Wal-Mart may use historical sales data to ***predict*** how many chocolate
bunnies it will sell this Easter. It will then use this prediction to
determine how many chocolate bunnies to order.
- We will talk a little about this kind of application.
3. Economists and other researchers will often be interested in making
***causal*** or ***counterfactual*** inferences.
- Counterfactual inferences are predictions about how the data would have
been different under other (counterfactual) circumstances.
- Economic fundamentals like supply and demand curves are primarily
counterfactual because they describe how much would have been bought or
sold at *each* price (not just the equilibrium price).
- Causal inferences are inferences about the underlying mechanism that
produced the data.
- For example, labour economists are often interested in whether and how much
the typical individual's earnings would increase if they spent one more
year in school, or obtained a particular educational credential.
- Counterfactual and causal inference are beyond the scope of this course,
but are important in applied economics and may be covered extensively in
later courses.
Anyone can make predictions, and almost anyone can calculate a few statistics
in Excel. The hard part is making *accurate* predictions, and selecting
or constructing statistics that will tend to produce accurate predictions.
In order to do that, we will need to construct a probabilistic model that
describes both the random process that generated the data and the process we
follow to construct predictions from the data.
::: example
**Using data to predict roulette outcomes**
Our probability calculations for roulette have relied on two pieces of
knowledge:
- *We know the game's structure*: there are 37 numbered slots, 18 numbers are
red, 18 are black, and one is green.
- *We know the game is fair*: the ball is equally likely to land in all
37 numbered slots.
In addition, the game is simple enough that we can do all of the calculations.
But what if we do not know the structure of the game, are not sure the game is
fair, or the game is too complicated for us to calculate the probabilities?
If we have access to a data set of past results, we can use that data set to:
1. *estimate* the win probability of various bets.
- This application will be covered in the current chapter.
2. *test* the claim that the game is fair.
- This application will be covered in the chapter on
[statistical inference](#statistical-inference).
This approach will be particularly useful for games like poker or blackjack that
are more complex and/or involve human decision making. The win probability in
blackjack depends on choices made by the player, so the house advantage can vary
depending on who is playing, their state of mind (are they distracted,
intoxicated, or trying to show off?), and various other human factors.
:::
## Data and the data generating process
We will start by assuming for the rest of this chapter that we have a
***data set*** or ***sample*** called $D_n$. In most applications, it will be
a [tidy data set](#tidy-data) with $n$ observations (rows) and $K$ numeric
variables (columns). For this chapter, we will further simplify by assuming that
$K = 1$, i.e., that $D_n = (x_1,x_2,\ldots,x_n)$ contains $n$ observations on a
single numeric variable $x_i$. This case will cover all of the univariate
statistics and methods described in
[Chapter 3: Basic data analysis with Excel](#basic-data-analysis-with-excel).
::: example
**Data from two roulette games**
Suppose we have a data set $D_n$ providing the result of $n = 2$ independent
games of roulette. Let $x_i$ be the result of a bet on red:
\begin{align}
x_i = \begin{cases} 1 & \textrm{if Red wins game } i \\ 0 & \textrm{if Red loses game } i \\ \end{cases}
\end{align}
Then $D_n = (x_1,x_2)$ where $x_1$ is the result from the first game and $x_2$
is the result from the second game.
For example, suppose red wins the first game and loses the second game. Then our
data could be written in a table as:
| Game \# ($i$) | Result of bet on red ($x_i$) |
|:--------------|:----------------------------:|
| 1 | 1 |
| 2 | 0 |
or in a list as $D_n = (1,0)$.
This is the simplest possible example, so we can learn the concepts with the
least possible amount of arithmetic. To make sure you understand the examples
in this chapter, re-do them with the *three*-game data set $D_n = (0,1,0)$.
:::
### Data as random variables
Our data set $D_n$ is a table or list of *numbers*. We can also think of it
as a set of *random variables* with an unknown joint PDF $f_D$. This PDF is
sometimes called the ***data generating process*** or DGP for the data set.
This is the fundamental conceptual step in the entire course, so you should
pause for a moment to make sure you understand it. We are thinking of our data
set as two distinct things:
1. The specific set of numbers in front of us.
2. The outcome of some random process that generated those specific numbers this
time, but could easily have generated other numbers instead.
The goal of statistical analysis is to use the specific set of numbers in front
of us to learn something new about the random process that generated
those specific numbers.
::: example
**The DGP for our roulette data**
The DGP of our two-game roulette data set is just the joint PDF of
$D_n=(x_1,x_2)$:
\begin{align}
f_D(a,b) &= \Pr(x_1 = a \cap x_2 = b)
\end{align}
where $a$ and $b$ are any real numbers.
:::
### The support of a data set
The support of the data set $D_n = (x_1,x_2,\ldots,x_n)$ is just the set of all
length-$n$ sequences of numbers that can be constructed from the support of
$x_i$. There are $|S_x|^n$ such sequences, where $S_x$ is the support of $x_i$.
::: example
**The support for our roulette data**
Our two-game roulette data set has a discrete support that includes four
possible values corresponding to the four possible length-2 sequences that
can be constructed from $S_x = \{0,1\}$:
\begin{align}
S_D = \{(0,0), (0,1), (1,0), (1,1)\}
\end{align}
Note that the order matters here: the outcome $(0,1)$ (red loses game 1 and wins
game 2) is a different outcome from $(1,0)$ (red wins game 1 and loses game 2).
:::
Most real-world data sets have enormous support. For example, our roulette data
set is just about the simplest possible meaningful data set, but the support
for a data set with 100 games would have
$2^{100} = 1,267,650,600,228,229,401,496,703,205,376$ distinct values. Most
data sets we analyze have many more observations and many more variables than
that, so their support would be even larger.
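For readers who want to check this in R (the language used to build this book), here is a small sketch that counts the support and, for small $n$, lists it explicitly; the function name `support_size` is just for illustration:

```{r support-sketch, echo=TRUE}
# Size of the support of a data set of n observations,
# where each observation takes values in S_x.
support_size <- function(S_x, n) length(S_x)^n

support_size(c(0, 1), 2)    # the four outcomes of the two-game example
support_size(c(0, 1), 100)  # 2^100, far too many to enumerate

# For small n we can list the outcomes explicitly:
expand.grid(x1 = c(0, 1), x2 = c(0, 1))
```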
### The DGP
The exact DGP is usually unknown. But in many cases, we know something about
the underlying process and can make some reasonable assumptions based on what
we know. This can simplify the DGP in ways that will be helpful.
::: example
**Simplifying the DGP of our roulette data**
The DGP for our two-game roulette data set involves four[^702] unknown joint
probabilities, one for each element of the support.
[^702]: It might be more accurate to say it involves only three unknown
probabilities. Since we know the probabilities will sum up to one, if we know
three of the four we can calculate the fourth.
Based on what we know about the game of roulette, we can reasonably assume that
results of different games are independent and that red has the same win
probability in each game. Then the DGP can be written:
\begin{align}
f_D(0,0) &= \Pr(x_1 = 0 \cap x_2 = 0) \\
&= \Pr(x_1 = 0) *\Pr(x_2 = 0) \qquad \textrm{(by independence)}\\
&= (1-p)^2 \\
f_D(0,1) &= \Pr(x_1 = 0 \cap x_2 = 1) \\
&= \Pr(x_1 = 0) *\Pr(x_2 = 1) \qquad \textrm{(by independence)}\\
&= (1-p)*p \\
f_D(1,0) &= \Pr(x_1 = 1 \cap x_2 = 0) \\
&= \Pr(x_1 = 1) *\Pr(x_2 = 0) \qquad \textrm{(by independence)}\\
&= p*(1-p) \\
f_D(1,1) &= \Pr(x_1 = 1 \cap x_2 = 1) \\
&= \Pr(x_1 = 1) *\Pr(x_2 = 1) \qquad \textrm{(by independence)}\\
&= p^2 \\
f_D(a,b) &= 0 \qquad \textrm{otherwise}
\end{align}
where $p = \Pr(x_i = 1)$ is the unknown probability that a bet on red wins.
Note that the DGP of $D_n$ is still unknown, but now it can be described in
terms of a single unknown parameter $p$ rather than the full set of four unknown
joint probabilities.
:::
While it is feasible to calculate the DGP for a very small data set, it
quickly becomes impractical to do so as the number of observations increases
and the set of possibilities to consider becomes enormous. Fortunately, we
rarely need to calculate the DGP. We just need to understand that it *could* be
calculated.
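Although we rarely need to compute a DGP, the two-game case is small enough to tabulate. A minimal R sketch, using the fair-game value $p = 18/37$ purely for illustration (in practice $p$ is unknown):

```{r dgp-sketch, echo=TRUE}
# Joint PDF of the two-game data set under independence:
# f_D(a, b) = Pr(x1 = a) * Pr(x2 = b), with each x_i ~ Bernoulli(p).
f_D <- function(a, b, p) dbinom(a, size = 1, prob = p) * dbinom(b, size = 1, prob = p)

p <- 18/37                        # illustrative value only
outer(0:1, 0:1, f_D, p = p)       # rows: x1 = 0, 1; columns: x2 = 0, 1
sum(outer(0:1, 0:1, f_D, p = p))  # the four probabilities sum to one
```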
### Simple random sampling
In most applications, we assume that $D_n$ is
***independent and identically distributed*** (IID) or a
***simple random sample*** from a large ***population***. A simple random sample
has two features:
1. All observations are **independent**: Each $x_i$ is an independent random
variable.
2. All observations are **identically distributed**: Each $x_i$ has the same
(unknown) marginal distribution.
Random sampling dramatically simplifies the DGP. The joint PDF of a simple
random sample can be written:
\begin{align}
\Pr(D_n = (a_1,a_2,\ldots,a_n)) = f_x(a_1)f_x(a_2)\ldots f_x(a_n)
\end{align}
where $f_x(a) = \Pr(x_i = a)$ is just the marginal PDF of a single observation.
Independence allows us to write the joint PDF as the product of the marginal
PDFs for each observation, and identical distribution allows us to use the same
marginal PDF for each observation. This reduces the number of unknown numbers
in the DGP from $|S_x|^n$ (the support of $D_n$) to $|S_x|$
(the support of $x$, which is much smaller).
The reason we call this "independent and identically distributed" is hopefully
obvious, but what does it mean to say we have a "random sample" from a
"population"? Well, one simple way of generating an IID sample is to:
1. Define the population of interest, for example all Canadian residents.
2. Use some purely random mechanism[^602] to choose a small subset of cases
from this population.
- The subset is called our ***sample***.
- "Purely random" here means some mechanism like a computer's random number
generator, which can then be used to dial random telephone numbers or
select cases from a list.
3. Collect data from every case in our sample.
This process will generate a data set that is independent and identically
distributed.
[^602]: As a technical matter, the assumption of independence requires
that we sample *with replacement*. This means we allow
for the possibility that we sample the same case more than once.
In practice this doesn't matter as long as the sample is small
relative to the population.
::: example
**Our roulette data is a random sample**
Each observation $x_i$ in our two-game roulette data set is an independent
random draw from the $Bernoulli(p)$ distribution where
$p = \Pr(\textrm{Red wins})$.
Therefore, this data set satisfies the criteria for a simple random sample.
:::
Random sampling is at the core of basic statistical analysis for two reasons:
1. It is simple to implement.
2. Results shown later in this chapter imply that a moderately-sized random
sample provides surprisingly accurate information on the underlying
population.
However, it is not the only possible sampling process. Alternatives to simple
random sampling will be discussed later in this chapter.
## Statistics and their properties
A ***statistic*** is just a number $s_n =s(D_n)$ that is calculated from the
data. In general, the value of any statistic is:
- Observed/known since the data set $D_n$ is observed/known.
- A random variable with a probability distribution that is *well-defined* but
*unknown*. This is because the data set $D_n$ is a set of random variables
with the same characteristics.
I will use $s_n$ to represent a generic statistic, but we will often use
other letters to talk about specific statistics.
::: example
**Roulette wins**
In our two-game roulette data set, the total number of wins is:
\begin{align}
R = x_1 + x_2
\end{align}
Since this is a number calculated from our data, it is a statistic.
We can think of $R$ as a specific value for our specific data set
$D_n = (1,0)$:
\begin{align}
R = 1 + 0 = 1
\end{align}
We can also think of it as a random variable whose value would have been
different if the data were different. Since $x_1$ and $x_2$ are
independent draws from the $Bernoulli(p)$ distribution, the total number of
wins has a binomial distribution:
\begin{align}
R \sim Binomial(2,p)
\end{align}
This distribution is unknown because the true value of $p$ is unknown.
:::
### Summary statistics {#summary-statistics-theory}
The [univariate summary statistics](#summary-statistics) we previously learned
to calculate in Excel will serve as our main examples.
::: example
**Summary statistics for our roulette data**
We can calculate the usual summary statistics for our two-game roulette data
set:
| Statistic | Formula | In roulette data |
|:---------------|:----------------------------:|:----------------------------:|
| Sample size (count) | $n$ | $2$ |
| Sample average | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | $\frac{1}{2} (1 + 0) = 0.5$ |
| Sample variance | $\hat{sd}_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})^2$ | $\frac{1}{2-1} (( 1-0.5)^2 + (0-0.5)^2) = 0.5$ |
| Sample std dev. | $\hat{sd}_x = \sqrt{\hat{sd}_x^2}$ | $\sqrt{0.5} \approx 0.71$ |
| Sample median | $\hat{m} =\frac{x_{[n/2]} + x_{[(n/2) + 1]}}{2}$ if $n$ is even | $\frac{x_{[1]} + x_{[2]}}{2} = \frac{0 + 1}{2} = 0.5$ |
| | $\hat{m} = x_{[(n/2) + (1/2)]}$ if $n$ is odd ||
:::
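The same table can be reproduced with R's built-in functions, which use exactly these formulas (including the $n-1$ denominator in `var()`):

```{r summary-stats-sketch, echo=TRUE}
x <- c(1, 0)  # the two-game data set D_n = (1, 0)
length(x)     # sample size: 2
mean(x)       # sample average: 0.5
var(x)        # sample variance: 0.5
sd(x)         # sample standard deviation: about 0.71
median(x)     # sample median: 0.5
```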
We also learned to construct both [simple](#simple-frequency-tables) and
[binned](#binned-frequency-tables) frequency tables. Let
$B \subset \mathbb{R}$ be a bin of values. Each bin would contain a single value for a simple frequency table, or multiple values for a binned frequency table.
Given a particular bin, we can define:
- The ***sample frequency*** or ***relative sample frequency*** of bin $B$ is
the proportion of cases in which $x_i$ is in $B$:
\begin{align}
\hat{f}_B = \frac{1}{n} \sum_{i=1}^n I(x_i \in B)
\end{align}
- The ***absolute sample frequency*** of bin $B$ is the *number* of cases in
which $x_i$ is in $B$:
\begin{align}
n \hat{f}_B = \sum_{i=1}^n I(x_i \in B)
\end{align}
We can then construct each cell in a frequency table by choosing the appropriate
bin.
::: example
**Frequency statistics for our roulette data**
We can calculate the usual frequency statistics for our two-game roulette data:
| Statistic | Formula | In roulette example |
|:---------------|:----------------------------:|:----------------------------:|
| Relative frequency | $\hat{f}_B = \frac{1}{n} \sum_{i=1}^n I(x_i \in B)$ | depends on $B$ |
| $\quad B=\{0\}$ | $\hat{f}_0 = \frac{1}{n} \sum_{i=1}^n I(x_i = 0)$ | $\frac{1}{2}(0 + 1) = 0.5$ |
| $\quad B=\{1\}$ | $\hat{f}_1 = \frac{1}{n} \sum_{i=1}^n I(x_i = 1)$ | $\frac{1}{2}(1 + 0) = 0.5$ |
| Absolute frequency | $n\hat{f}_B = \sum_{i=1}^n I(x_i \in B)$ | depends on $B$ |
| $\quad B=\{0\}$ | $n\hat{f}_0 = \sum_{i=1}^n I(x_i = 0)$ | $(0 + 1) = 1$ |
| $\quad B=\{1\}$ | $n\hat{f}_1 = \sum_{i=1}^n I(x_i = 1)$ | $(1 + 0) = 1$ |
:::
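In R, `table()` gives the absolute frequencies directly, and dividing by $n$ gives the relative frequencies:

```{r freq-sketch, echo=TRUE}
x <- c(1, 0)          # the two-game data set
table(x)              # absolute frequencies: one 0 and one 1
table(x) / length(x)  # relative frequencies: 0.5 each
```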
### The sampling distribution
We call the probability distribution of a statistic its
***sampling distribution***. In principle, the sampling distribution of any
statistic can be directly derived from the DGP of its data. The sampling
distribution is therefore:
- Unknown since the DGP $f_D$ is unknown.
- Fixed (non-random) since the DGP is a function $f_D$ and not a random
variable.
In practice, the sampling distribution is difficult to calculate outside of a
few simple examples. The important part is to understand what a sampling
distribution is, that every statistic has one, and that it depends on the
(usually unknown) DGP.
::: example
**The sampling distribution of the sample average in our roulette data**
In our two-game roulette data set, the sample average is:
\begin{align}
\bar{x} = \frac{1}{2} (x_1 + x_2)
\end{align}
Since there are four possible values of $(x_1,x_2)$, we can determine the
sampling distribution of the sample average by enumeration.
| Data ($D_2$) | Probability ($f_D$) | Sample Average ($\bar{x}$) |
|:-------------|:-------------------:|:--------------------------:|
| $(0,0)$ | $(1-p)^2$ | $0.0$ |
| $(0,1)$ | $p(1-p)$ | $0.5$ |
| $(1,0)$ | $p(1-p)$ | $0.5$ |
| $(1,1)$ | $p^2$ | $1.0$ |
Therefore, the sampling distribution of $\bar{x}$ in this data set can be
described by the PDF:
\begin{align}
f_{\bar{x}}(a) \equiv \Pr(\bar{x}=a)
&= \begin{cases}
(1-p)^2 & \textrm{if $a=0$} \\
2p(1-p) & \textrm{if $a=0.5$} \\
p^2 & \textrm{if $a=1$} \\
0 & \textrm{otherwise} \\
\end{cases}
\end{align}
and the support of $\bar{x}$ is $S_{\bar{x}} = \{0,0.5,1\}$.
:::
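We can reproduce this enumeration in R. The value $p = 18/37$ is used only for illustration; the algebraic PDF above holds for any $p$:

```{r sampling-dist-sketch, echo=TRUE}
p <- 18/37
games <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))  # the support of D_2
games$prob <- dbinom(games$x1, 1, p) * dbinom(games$x2, 1, p)
games$xbar <- (games$x1 + games$x2) / 2           # the statistic in each outcome
aggregate(prob ~ xbar, data = games, FUN = sum)   # the sampling distribution
```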
### The mean {#the-mean-of-a-statistic}
Since a statistic has a probability distribution, it has an expected value[^701]
(mean).
[^701]: All random variables we will see in this course will have an expected
value, but it is possible for a random variable to have a well-defined PDF but
not a well-defined expected value. For example, if $x \sim N(0,1)$, then $y=1/x$
has this property.
::: example
**The mean of the sample average in the roulette data**
We can calculate the expected value of $\bar{x}$ in the two-game roulette
data set directly from its PDF, which we derived in the previous example:
\begin{align}
E(\bar{x}) &= \sum_{a \in S_{\bar{x}}} a f_{\bar{x}}(a) \\
&= 0 \times f_{\bar{x}}(0) + 0.5 \times f_{\bar{x}}(0.5) + 1 \times f_{\bar{x}}(1) \\
&= 0 \times (1-p)^2 + 0.5 \times 2p(1-p) + 1.0 \times p^2 \\
&= p
\end{align}
:::
As mentioned earlier, it is often impractical or impossible to calculate the
complete sampling distribution for a given statistic. Fortunately, we do not
always need the complete sampling distribution to calculate the mean.
::: {.example #mean-of-sample-average}
**Another way to find the mean of the sample average**
The sample average is just a sum, so in our two-game roulette data set:
\begin{align}
E(\bar{x}) &= E\left(\frac{1}{2}(x_1 + x_2)\right) \\
&= \frac{1}{2}\left(E(x_1) + E(x_2)\right) \qquad \textrm{by linearity} \\
&= \frac{1}{2} (p + p) \quad \textrm{since $E(x_i)=p$} \\
&= p
\end{align}
Note that this is the same answer as we derived directly from the PDF.
:::
The results in Example \@ref(exm:mean-of-sample-average) can be generalized to
apply to any sample average in a random sample. More specifically, suppose we
have a simple random sample of size $n$ on the random variable $x_i$ with
unknown mean $E(x_i) = \mu_x$. Then the expected value of the sample average is:
\begin{align}
E(\bar{x}) &= E\left( \frac{1}{n} \sum_{i=1}^n x_i\right) \\
&= \frac{1}{n} \sum_{i=1}^n E\left( x_i\right) \\
&= \frac{1}{n} \sum_{i=1}^n \mu_x \\
&= \mu_x
\end{align}
This is an important result in statistics, so you should follow it step-by-step
to make sure you understand. If you are struggling with it, look at the simple
example first. The key is to recognize that the sample average is a sum, and
so we can apply the linearity of the expected value. We have derived this result
for the specific case of a simple random sample, but it applies for many other
common sampling schemes.
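A quick Monte Carlo check of this result: draw many random samples, compute the average of each, and compare the mean of those sample averages to $\mu_x$. The sample size, number of replications, and the Bernoulli setup are arbitrary choices for illustration:

```{r mean-mc-sketch, echo=TRUE}
set.seed(1)
n <- 10; reps <- 10000
mu_x <- 18/37                                       # E(x_i) for Bernoulli(18/37)
xbars <- replicate(reps, mean(rbinom(n, 1, mu_x)))  # many sample averages
mean(xbars)                                         # close to mu_x (about 0.486)
```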
### The variance {#the-variance-of-a-statistic}
Statistics also have a variance and a standard deviation, and they are often
easy to calculate.
::: {.example #variance-of-sample-average}
**The variance of the sample average in the roulette data**
In our two-game roulette data set, the variance of the sample average is:
\begin{align}
var(\bar{x}) &= var\left(\frac{1}{2}(x_1 + x_2)\right) \\
&= \left(\frac{1}{2}\right)^2 var(x_1 + x_2) \\
&= \frac{1}{4} \left( var(x_1) +
\underbrace{2 \, cov(x_1,x_2)}_{\textrm{$= 0$ (by independence)}} + var(x_2) \right) \\
&= \frac{1}{4} \left( 2 \, var(x_i) \right) \\
&= \frac{var(x_i)}{2}
\end{align}
Notice that $var(\bar{x}) < var(x_i)$. Averages are typically less variable
than the thing they are averaging.
:::
The result in Example \@ref(exm:variance-of-sample-average) can be generalized
to any sample average in a random sample on the random variable $x_i$
with mean $E(x_i)=\mu_x$ and variance $var(x_i)=\sigma_x^2$. Then:
\begin{align}
var(\bar{x}) &= \frac{\sigma_x^2}{n} \\
sd(\bar{x}) &= \frac{\sigma_x}{\sqrt{n}}
\end{align}
I won't ask you to prove this, but the proof is just a longer version of Example
\@ref(exm:variance-of-sample-average) above.
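The same Monte Carlo approach verifies the variance formula (again with illustrative values of $n$ and $p$):

```{r var-mc-sketch, echo=TRUE}
set.seed(1)
n <- 10; reps <- 10000; p <- 18/37
sigma2_x <- p * (1 - p)                          # var(x_i) for a Bernoulli(p)
xbars <- replicate(reps, mean(rbinom(n, 1, p)))  # many sample averages
var(xbars)    # close to the theoretical value...
sigma2_x / n  # ...sigma_x^2 / n, about 0.025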
::: warning
**Random variables, expected values, and statistics**
Before proceeding, be sure you understand the distinction between:
- The sample average $\bar{x}$ and the value of a single observation $x_i$.
- The expected values $\mu_x = E(x_i)$ and $E(\bar{x})$.
- The variances $\sigma_x^2 = var(x_i)$ and $var(\bar{x})$.
One particularly common mistake is to confuse $\bar{x}$ and $\mu_x$.
:::
## Estimation {#estimation}
One of the most important uses of statistics is to estimate, or guess the value of, some unknown feature of the population or DGP.
### Parameters
A ***parameter*** is an unknown number $\theta = \theta(f_D)$ whose value
depends on the DGP. Since a parameter is constructed from the DGP, its
value is:
- Unobserved/unknown since the DGP $f_D$ is unknown.
- Fixed (not random) since the DGP $f_D$ is a function and not a random
variable.
I will use $\theta$ to represent a generic parameter, but we will often use
other letters to talk about specific parameters.
::: example
**Examples of parameters**
Sometimes a single parameter completely describes the DGP:
- In our two-game roulette data set, the joint distribution of the data depends
only on the (known) sample size $n$ and the single (unknown) parameter
$p = \Pr(\textrm{Red wins})$.
Sometimes a group of parameters completely describe the DGP:
- If $x_i$ is a random sample from the $U(L,H)$ distribution, then $L$ and $H$
are both parameters.
And sometimes a parameter only partially describes the DGP:
- If $x_i$ is a random sample from some unknown distribution with unknown mean
$\mu_x = E(x_i)$, then $\mu_x$ is a parameter.
- If $x_i$ is a random sample from some unknown distribution, then
$f_5 = \Pr(x = 5)$ is a parameter.
:::
Typically there will be specific parameters whose value we wish to know. Such
a parameter is called a ***parameter of interest***. The DGP may include other
parameters, which are typically called *auxiliary parameters* or
*nuisance parameters*.
### Estimators
An ***estimator*** is any statistic $\hat{\theta}_n = \hat{\theta}(D_n)$ that is
used to ***estimate*** (guess at the value of) an unknown parameter of interest
$\theta$. Since an estimator is constructed from $D_n$, its value is:
- Observed/known since the data set $D_n$ is observed/known.
- A random variable with a well-defined but unknown probability distribution
since the data set $D_n$ also has those properties.
I will use $\hat{\theta}$ to represent a generic estimator, but we will often
use other notation to talk about specific estimators. The circumflex or "hat"
$\hat{\,}$ notation is commonly used to identify an estimator; for example,
$\hat{\mu}$ may be used to represent an estimator of the parameter $\mu$ and
$\hat{\sigma}$ may be used to represent an estimator of the parameter $\sigma$.
::: example
**Two estimators for the win probability**
Consider our two-game roulette data set $D_n = (x_1,x_2) = (1,0)$, and suppose
our parameter of interest is the win probability $p$. I will propose two
estimators for $p$:
1. The sample average:
\begin{align}
\bar{x} &= \frac{1}{2} (x_1 + x_2) \\
&= \frac{1}{2} (1 + 0) \\
&= 0.5
\end{align}
2. The value of the first observation:
\begin{align}
x_1 &= 1
\end{align}
These are both statistics calculated from the data, so they are both potential
estimators for $p$.
:::
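Both proposed estimators are one-liners in R:

```{r estimators-sketch, echo=TRUE}
x <- c(1, 0)  # the two-game data set
mean(x)       # estimator 1: the sample average, 0.5
x[1]          # estimator 2: the first observation, 1
```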
An estimator is just a rule for making guesses; any statistic can be used as
an estimator of any parameter. But we need to pick a specific guess, and we want
our guess to be an accurate one. So we will need some kind of ***criterion***
that allows us to compare different statistics and choose the statistic that
represents the "best" estimator of a particular parameter.
Intuitively, a good estimator is one that is unlikely to be very different from
the true value of the unknown parameter. We can quantify "unlikely" and
"very different" more precisely by introducing the concepts of sampling error,
bias, and mean squared error.
### Sampling error
Let $\hat{\theta}_n$ be a statistic we are using as an estimator of some
parameter of interest $\theta$. We can define its ***sampling error*** as:
\begin{align}
err(\hat{\theta}_n) = \hat{\theta}_n - \theta
\end{align}
In principle, we want $\hat{\theta}_n$ to be a good estimator of $\theta$, i.e.,
we want the sampling error to be as close to zero as possible.
Since the sampling error depends on *both* the estimator $\hat{\theta}_n$ and
the true parameter value $\theta$, its value is:
1. Unknown/unobservable since $\theta$ is unknown.
2. Random since $\hat{\theta}_n$ is random.
Always remember that $err(\hat{\theta}_n)$ is not an inherent property of the
statistic: it depends on the relationship between the statistic and the
parameter of interest. A given statistic may be a good estimator of one
parameter, and a bad estimator of another parameter.
::: example
**Sampling error for our two estimators**
The sampling error for our two estimators is:
\begin{align}
err(\bar{x}) &= \bar{x} - p \\
&= 0.5 - p \\
err(x_1) &= x_1 - p \\
&= 0 - p = -p
\end{align}
Notice that although we have calculated $\bar{x}$ and $x_1$ from the data,
the sampling errors themselves remain unknown since they depend on the
unknown parameter $p$.
:::
### Bias
The ***bias*** of an estimator is defined as its expected sampling error:
\begin{align}
bias(\hat{\theta}_n) &= E(err(\hat{\theta}_n)) \\
&= E(\hat{\theta}_n - \theta) \\
&= E(\hat{\theta}_n) - \theta
\end{align}
Ideally we would want $bias(\hat{\theta}_n)$ to be zero, in which case we would
say that $\hat{\theta}_n$ is an ***unbiased*** estimator of $\theta$.
Since it depends on $E(\hat{\theta}_n)$ and $\theta$, the bias is generally:
- Unobserved/unknown since $E(\hat{\theta}_n)$ and $\theta$ both depend on
the DGP
- Fixed/nonrandom since $E(\hat{\theta}_n)$ and $\theta$ are numbers and not
random variables
Although the bias is *generally* unknown, there are some important cases in
which we can prove that it is zero.
::: example
**Two unbiased estimators**
In our two-game roulette data, the bias of the sample average as an estimator
of $p$ is:
\begin{align}
bias(\bar{x}) &= E(\bar{x}) - p \\
&= p - p \\
&= 0
\end{align}
and the bias of the first observation is:
\begin{align}
bias(x_1) &= E(x_1) - p \\
&= p - p \\
&= 0
\end{align}
Therefore, both of these estimators are unbiased. This example illustrates a
general principle: there is rarely exactly one unbiased estimator. There are
either none, or many.
:::
More generally, suppose we have a random sample of size $n$ on some
random variable $x_i$ with mean $E(x_i) = \mu_x$. We earlier showed in this
case that:
\begin{align}
E(\bar{x}) &= \mu_x
\end{align}
So the sample average $\bar{x}$ is an unbiased estimator of $\mu_x$:
\begin{align}
bias(\bar{x}) &= E(\bar{x}) - \mu_x \\
&= \mu_x - \mu_x \\
&= 0
\end{align}
This is true for any random sample and any random variable $x_i$.
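We can also check unbiasedness by simulation. The sketch below assumes a known win probability of $p = 0.4$ (chosen purely for illustration) and simulates a large number of two-game data sets:

```{r}
set.seed(25)
p <- 0.4        # illustrative win probability (normally unknown)
reps <- 100000  # number of simulated two-game data sets

# Simulate both games for every data set
x1 <- rbinom(reps, size = 1, prob = p)
x2 <- rbinom(reps, size = 1, prob = p)

# Both estimators average out to approximately p = 0.4,
# consistent with zero bias
mean((x1 + x2) / 2)
mean(x1)
```

Neither average will be exactly $0.4$ in a finite simulation, but both converge to it as `reps` grows.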
If the bias is nonzero, we would say that $\hat{\theta}_n$ is a ***biased***
estimator of $\theta$. The exact amount of bias of a particular biased estimator
is usually hard to know because it depends on the unknown true DGP. But
we can sometimes say something about its direction, or about how it relates
to specific parameters of the DGP.
::: example
**A biased estimator of the median**
Consider our two-game roulette data set, and suppose we wish to estimate the
(population) median of $x_i$. Applying our
[definition of the median](#the-median) from an earlier chapter, the median of
$x_i$ is:
\begin{align}
m &= \begin{cases}
0 & \textrm{if $p \leq 0.5$} \\
1 & \textrm{if $p > 0.5$} \\
\end{cases}
\end{align}
A natural estimator of the median is the sample median. In our two-observation
example, the sample median would be:
\begin{align}
\hat{m} &= \frac{1}{2} (x_1 + x_2)
\end{align}
Its expected value is:
\begin{align}
E(\hat{m}) &= E\left(\frac{1}{2} (x_1 + x_2) \right) \\
&= \frac{1}{2} \left( E(x_1) + E(x_2) \right) \\
&= \frac{1}{2} \left( p + p \right) \\
&= p
\end{align}
So its bias as an estimator of $m$ is:
\begin{align}
bias(\hat{m}) &= E(\hat{m}) - m \\
&= p - m \\
&= \begin{cases}
p & \textrm{if $p \leq 0.5$} \\
p-1 & \textrm{if $p > 0.5$} \\
\end{cases}
\end{align}
Figure \@ref(fig:medex) below shows the true population median (in blue), the
expected value of the sample median (in orange), and the bias (in red). As you
can see, the bias is nonzero for most values of $p$, but its direction and
magnitude vary.
This result applies more generally: the sample median is typically a biased
estimator of the median. It is still the most commonly-used estimator for the
median, for reasons we will discuss soon.
:::
```{r medex, fig.cap = "*The sample median is a biased estimator*"}
medex <- tibble(p=seq(from=0,to=1,length.out=400),
m=as.integer(p > 0.5),
bias=p-m)
ggplot(data=medex,mapping=aes(x=p,y=m)) +
geom_step(col = "navy",linewidth=1) +
geom_line(aes(y=p),col="darkorange",alpha=0.4,linewidth=2) +
geom_line(aes(y=bias),col="red",linetype=2,linewidth=1) +
geom_text(x=0.43,y=0.9,col="navy",label="Median") +
geom_text(x=0.83,y=0.6,col="darkorange",label="E(sample median)") +
geom_text(x=0.6,y=-0.25,col="red",label="Bias") +
xlab("Win probability (p)") +
ylab("") +
labs(title = "The sample median is a biased estimator",
subtitle = "Two-game roulette example",
caption = "",
tag = "")
```
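The bias shown in the figure can also be verified by simulation. The sketch below picks the illustrative value $p = 0.4$, for which the true median is $m = 0$:

```{r}
set.seed(30)
p <- 0.4        # illustrative win probability; the true median here is 0
reps <- 100000

# The sample median of two observations is just their average
mhat <- (rbinom(reps, 1, p) + rbinom(reps, 1, p)) / 2

# The average sample median is approximately p = 0.4,
# far from the true median of 0
mean(mhat)
```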
### Variance and the MVUE
We usually prefer unbiased estimators to biased estimators, but that isn't
enough to pick an estimator. In general, if we can find one unbiased estimator
there are usually many others. So we need to apply at least one more criterion.
A natural second criterion is the ***variance*** of the estimator:
\begin{align}
var(\hat{\theta}_n) = E[(\hat{\theta}_n - E(\hat{\theta}_n))^2]
\end{align}
Why do we care about the variance?
- If $\hat{\theta}_n$ is unbiased, then $E(\hat{\theta}_n)=\theta$.
- Lower variance means that $\hat{\theta}_n$ is typically closer to
$E(\hat{\theta}_n)$.
Therefore, among unbiased estimators, lower-variance estimators are preferable
because they are typically closer to the true parameter value. We can put this
idea into a formal criterion which is called the "MVUE".
The ***minimum variance unbiased estimator*** (MVUE) of a parameter is the
unbiased estimator with the lowest variance, and the ***MVUE criterion*** for
choosing an estimator says to choose the MVUE.
::: example
**The MVUE for roulette**
Returning to our two-game roulette data set and two proposed estimators
($\bar{x}$ and $x_1$), we can find the MVUE by following these steps:
1. *Calculate the bias* of each proposed estimator. We calculated this earlier:
\begin{align}
bias(\bar{x}) &= 0 \\
bias(x_1) &= 0
\end{align}
2. *Calculate the variance* of each proposed estimator. We calculated this
earlier:
\begin{align}
var(\bar{x}) &= \frac{var(x_i)}{2} \\
var(x_1) &= var(x_i)
\end{align}
3. *Choose the unbiased estimator with the lowest variance*, if there is an
unbiased estimator.
- Both estimators are unbiased.
- The sample average $\bar{x}$ has lower variance $var(\bar{x}) < var(x_1)$.
- Therefore $\bar{x}$ is the MVUE.
:::
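A quick simulation can confirm the variance comparison. The sketch below again uses the illustrative value $p = 0.4$, for which $var(x_i) = p(1-p) = 0.24$:

```{r}
set.seed(14)
p <- 0.4
reps <- 100000
x1 <- rbinom(reps, 1, p)
x2 <- rbinom(reps, 1, p)
xbar <- (x1 + x2) / 2

# Simulated variances: roughly 0.24 for x1 and 0.12 for xbar,
# matching var(x_i) and var(x_i)/2
var(x1)
var(xbar)
```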
### Mean squared error
Once we move beyond the simple case of the sample average, we run
into two major complications with the MVUE criterion:
1. *No unbiased estimator*: An unbiased estimator may not exist for a particular
parameter of interest.
- For example, there is no unbiased estimator of the median, or of any other
quantile.
- If there is no unbiased estimator, there is no MVUE.
- So we need some other way of choosing an estimator.
2. *Bias/variance trade-off*: Sometimes we have both an unbiased estimator with
high variance and another estimator with much lower variance but just a
little bit of bias.
- A detailed example of this case is provided below.
- Here, the unbiased estimator is the MVUE.
- But we may not be happy with this choice if the bias is small enough and
the variance of the unbiased estimator is large enough.
::: example
**The relationship between age and earnings**
Labour economists are often interested in the relationship between age and
earnings. Typically, workers earn more as they get older but earnings do not
increase at a constant rate. Instead, earnings rise rapidly in a typical
worker's 20s and 30s, then gradually flatten out. This pattern affects many
economically important decisions like education, savings, household formation,
having children, etc.
Suppose we want to estimate the earnings of the average 35-year-old Canadian,
and have access to a random sample of 810 Canadians with 10 observations for
each age between 0 and 80.

The average earnings of 35-year-olds in our data would be an unbiased estimator
of the average earnings of 35-year-olds in Canada. However, it would be based
on only 10 observations, and its variance would be very high.

We could increase the sample size and reduce the variance by including
observations from people who are *almost* 35 years old. We have many
options, including:

- Average earnings of the 10 35-year-olds in our data.
- Average earnings of the 30 34-36-year-olds in our data.
- Average earnings of the 100 30-39-year-olds in our data.
- Average earnings of all 810 0-80-year-olds in our data.
Widening the age range will reduce the variance of these averages, but will
introduce bias (since they have added people that are not exactly like
our target population of 35-year-olds). It is not clear which age range will
tend to produce the most accurate estimator of the parameter of interest
(average earnings of 35 year olds in Canada).
:::
This set of issues implies that we need a criterion that:
- Can be used to choose between biased estimators.
- Can choose slightly biased estimators with low variance over unbiased
estimators with high variance.
The ***mean squared error*** of an estimator is defined as the expected value
of the squared sampling error:
\begin{align}
MSE(\hat{\theta}_n) &= E[err(\hat{\theta}_n)^2] \\
&= E[(\hat{\theta}_n-\theta)^2]
\end{align}
and the ***MSE criterion*** says to choose the (biased or unbiased) estimator
with the lowest MSE.
While this is the definition of MSE, we can derive a handy formula:
\begin{align}
MSE(\hat{\theta}_n) &= var(\hat{\theta}_n) + [bias(\hat{\theta}_n)]^2
\end{align}
This is the formula we will usually use to calculate MSE. A few things to note
about this formula:
1. Both bias and variance enter into the formula. So all else equal, the MSE
criterion still favors less biased estimators and lower variance estimators.
2. The bias is squared, meaning both positive and negative bias are treated as
equally bad.
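We can confirm the decomposition numerically using the biased sample median from our earlier example, again with the illustrative value $p = 0.4$ (so the true median is 0):

```{r}
set.seed(7)
p <- 0.4        # illustrative win probability; the true median m is 0
reps <- 100000
mhat <- (rbinom(reps, 1, p) + rbinom(reps, 1, p)) / 2

# Direct calculation: average squared sampling error
mean((mhat - 0)^2)

# Decomposition: variance plus squared bias
var(mhat) + (mean(mhat) - 0)^2
```

Both calculations give approximately $0.12 + 0.4^2 = 0.28$, apart from simulation noise.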
::: example
**The MSE for our two estimators**
Returning to our two-game roulette data set, we can apply the MSE criterion to
choose between our proposed estimators by following these steps:
1. *Calculate bias and variance* for each estimator. We have already done this:
\begin{align}
bias(\bar{x}) &= 0 \\
bias(x_1) &= 0 \\
var(\bar{x}) &= \frac{var(x_i)}{2} \\
var(x_1) &= var(x_i)
\end{align}
2. *Calculate MSE* using the variance/bias formula:
\begin{align}
MSE(\bar{x})
&= var(\bar{x}) + [bias(\bar{x})]^2 \\
&= \frac{var(x_i)}{2} + [0]^2 \\
&= \frac{var(x_i)}{2} \\
MSE(x_1)
&= var(x_1) + [bias(x_1)]^2 \\
&= var(x_i) + [0]^2 \\
&= var(x_i)
\end{align}
3. *Choose the estimator with the lowest MSE*. In this case,
$MSE(\bar{x}) < MSE(x_1)$ so $\bar{x}$ is the preferred estimator by the
MSE criterion.
Note that in this example, the sample average is the preferred estimator by both
the MVUE criterion and the MSE criterion. But that will not always be the case.
:::
The MSE criterion allows us to choose a biased estimator with low variance over
an unbiased estimator with high variance, and also allows us to choose between
biased estimators when no unbiased estimator exists.
::: {.fyi data-latex=""}