forked from bvkrauth/is4e
08-Inference.Rmd
# Statistical inference {#statistical-inference}
```{r setup9, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
prompt = FALSE,
tidy = TRUE,
collapse = TRUE)
library("tidyverse")
```
We can use statistics to [describe a data set](#basic-data-analysis-with-excel)
and to [estimate the value of some unknown parameter](#estimation). But our
data typically provide limited evidence on the true value of any parameter
of interest: estimates are subject to sampling error of an unknown direction and
magnitude. We want a way of accounting for sampling error and assessing
how strong the evidence is for or against a particular claim
about the true state of the world.
This chapter will develop a set of techniques for ***statistical inference***:
instead of providing a single best guess of a parameter's true value, we will
use data to classify particular parameter values as plausible (could be the true
value) or implausible (unlikely to be the true value).
::: {.goals data-latex=""}
**Chapter goals**
In this chapter, we will learn how to:
1. Select a parameter of interest, null hypothesis, and alternative hypothesis
for a hypothesis test.
2. Identify the characteristics of a valid test statistic.
3. Describe the distribution of a simple test statistic under the null and
alternative.
4. Find the size/significance of a simple test.
5. Find critical values for a test of a given size.
6. Implement and interpret a hypothesis test.
7. Construct and interpret a confidence interval.
:::
To prepare for this chapter, please review the chapter on
[statistics](#statistics).
## Questions and evidence
We often analyze data with a specific research question in mind. That is, there
is some statement about the world whose truth we are interested in assessing.
For example, we might want to know:
- Do men earn more than women with similar skills?
- Does increasing the minimum wage reduce employment?
- Do poor economies grow faster or slower than rich ones?
Sometimes the data allow us to answer these questions decisively, sometimes not.
That is, the strength of our evidence can vary. The aim of statistical inference
is to give us a clear and rigorous way of thinking about the strength of
evidence, and a systematic way of setting a standard of evidence for reaching a
particular conclusion.
::: example
**Fair and unfair roulette games**
Suppose you work as a casino regulator for the BCLC (British Columbia Lottery
Corporation, the crown corporation that regulates all commercial gambling in
B.C.). You have been given data with recent roulette results from a particular
casino and are tasked with determining whether the casino is running a fair
game.
Before getting caught up in math, let's think about how we might assess
evidence:
1. A fair game implies a particular win probability for each bet.
- For example, the win probability for a bet on red will be
$18/37 \approx 0.486$ in a fair game.
2. The law of large numbers implies that the win rate over many games will be
close to the win probability, but the win rate and win probability are
unlikely to be identical in a finite sample.
- In 100 games, we would expect red to win about 48 or 49 times in a fair
game.
- But these are games of chance; even in a fair game, red may win a little
less than 48 times or a little more than 49 times.
3. In a given data set:
- We might have results from many games, or only a few games.
- Our results may have a win rate close to the expected rate for a fair
game, or far from that rate.
We can put those possibilities into a table, and make an assessment of what we
might conclude from a given data set:
| Win rate | Many games | Few games |
|:------------------------|:---------------:|:-----------------------:|
| Close to fair game rate | Probably fair | Could be fair or unfair |
| Far from fair game rate | Probably unfair | Could be fair or unfair |
That is, we can make a fairly confident conclusion if we have a lot of evidence,
and our conclusion depends on what the evidence shows. But if we do not have a
lot of evidence, we cannot make a confident conclusion either way.
This chapter will formalize these basic ideas about evidence.
:::
## Hypothesis tests
We will start with hypothesis tests. The idea of a hypothesis test is to
determine whether the data rule out or ***reject*** a specific value of the
unknown parameter of interest $\theta$.
A hypothesis test consists of the following components:
1. A null hypothesis $H_0$ and alternative hypothesis $H_1$ about the parameter
of interest $\theta$.
2. A test statistic $t_n$ that can be calculated from the data.
3. A pair of critical values $c_L$ and $c_H$, such that the null hypothesis will
be rejected if $t_n$ is not between $c_L$ and $c_H$.
We will go through each of these components in detail.
### Data and DGP
For the remainder of this chapter, suppose we have a data set $D_n$ of size
$n$. The data comes from an unknown data generating process $f_D$.
::: example
**Data and DGP for roulette**
Let $D_n = (x_1,\ldots,x_n)$ be a data set of results from $n = 100$ games of
roulette at a local casino. More specifically, let:
$$x_i = I(\textrm{Red wins})$$
We will consider two cases:
| Case number | Wins by red (out of 100) | $\bar{x}$ | $s_x$ |
|:------------|:------------------------:|:---------:|:-------:|
| $1$ | $35$ | $0.35$ | $0.479$ |
| $2$ | $40$ | $0.40$ | $0.492$ |
:::
### The null and alternative hypotheses
The first step in a hypothesis test is to identify the parameter of interest
and define the ***null hypothesis***. The null hypothesis is a statement about
the parameter of interest $\theta$ that takes the form:
$$H_0: \theta = \theta_0$$
where $\theta = \theta(f_D)$ is the parameter of interest and $\theta_0$ is a
specific value we are interested in ruling out.
The next step is to define the ***alternative hypothesis***, which is every
other value of $\theta$ we are willing to consider. In this course, the
alternative hypothesis will always be:
$$H_1: \theta \neq \theta_0$$
where $\theta_0$ is the same number as used in the null.
::: example
**Null and alternative for roulette**
In our roulette example, the parameter of interest is the win probability for
red:
$$p_{red} = \Pr(x_i = 1)$$
The null hypothesis is that the game is fair:
$$H_0: p_{red} = 18/37$$
and the alternative hypothesis is that it is not fair:
$$H_1: p_{red} \neq 18/37$$
I am expressing the fair win probability as a fraction to minimize rounding
error in subsequent calculations.
:::
::: {.fyi data-latex=""}
**What null hypothesis to choose?**
Our framework here assumes that you already know what null hypothesis you wish
to test, but we might briefly consider how we might choose a null hypothesis to
test.
In some applications, the research question leads to a natural null hypothesis:
- The natural null to test in our roulette example is whether the win
probability matches that of a fair game ($p = 18/37$).
- When measuring the effect of one variable on another, the natural null to test
is "no effect at all" ($\theta = 0$).
- In epidemiology, a contagious disease will tend to spread if its reproduction
rate $R$ is greater than one, and decline if it is less than one, so the
natural null to test is $R = 1$.
If there is no obvious null hypothesis, it may make sense to test many null
hypotheses and report all of the results.
:::
### The test statistic
Our next step is to construct a ***test statistic*** that can be calculated from
our data. A valid test statistic for a given null hypothesis is a statistic
$t_n$ that has the following two properties:
1. The probability distribution of $t_n$ ***under the null*** (i.e., when $H_0$
is true) is *known*.
2. The probability distribution of $t_n$ ***under the alternative*** (i.e., when
$H_1$ is true) is *different* from its probability distribution under the
null.
It is not easy to come up with a valid test statistic, so that is typically a
job for a professional statistician. But I want you to understand the basic idea
of what a test statistic is, and to be able to tell whether a proposed test
statistic is valid or not.
::: example
***A test statistic for roulette***
Since a fair game has win probability $18/37 \approx 0.486$, we would expect
about 48 or 49 wins in 100 fair games. So a natural test statistic for
determining whether the game is fair is the number of wins:
$$t_n = n\hat{f}_{red} = n\bar{x}_n =\sum_{i=1}^n x_i$$
Next we need to find the probability distribution of $t_n$ under the null, and
under the alternative.
We earlier learned about the binomial distribution, which is the distribution
of the number of times an event with probability $p$ happens in $n$ independent
trials. Since each $x_i$ in our data is an independent $Bernoulli(p_{red})$
random variable, the number of wins is binomial:
$$t_n \sim Binomial(100,p_{red})$$
Under the null (when $H_0$ is true), $p_{red} = 18/37$ and so:
$$H_0 \quad \implies \qquad t_n \sim Binomial(100,18/37)$$
Since this distribution does not involve any unknown parameters, our test
statistic satisfies the requirement of having a *known* distribution under the
null.
Under the alternative (when $H_1$ is true), $p_{red}$ can take on any value
*other* than $18/37$. The sample size is still $n=100$, so the distribution of
the test statistic is:
$$H_1 \quad \implies \qquad t_n \sim Binomial(100,p_{red}) \textrm{ where $p_{red} \neq 18/37$ }$$
Notice that the distribution of our test statistic under the alternative is not
known, since $p_{red}$ is not known. But the distribution is *different* under
the alternative, and that is what we require from our test statistic.
:::
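To see why this test statistic can distinguish the two hypotheses, it helps to plot its pmf under the null and under one particular alternative. The alternative value $p_{red} = 0.40$ below is an arbitrary illustration, not part of the original example:

```{r NullAltDistributions}
# pmf of t_n under the null (p = 18/37) and one illustrative alternative (p = 0.40)
t_values <- 0:100
null_pmf <- dbinom(t_values, 100, 18/37)
alt_pmf <- dbinom(t_values, 100, 0.40)
plot(t_values, null_pmf, type = "h", col = "#002D62",
     xlab = "number of wins by red", ylab = "probability")
points(t_values + 0.3, alt_pmf, type = "h", col = "#EB6E1F")
legend("topright", legend = c("null: p = 18/37", "alternative: p = 0.40"),
       col = c("#002D62", "#EB6E1F"), lty = 1)
```

The two distributions put most of their probability in different regions, which is exactly the property a valid test statistic needs.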
### Critical values
After choosing a test statistic $t_n$ and determining its distribution under the
null, the next step is to choose ***critical values***. The critical values of
a test are two numbers $c_L$ and $c_H$ (where $c_L < c_H$) such that:
1. $t_n$ has a *high* probability of being between $c_L$ and $c_H$ when the null
is true.
2. $t_n$ has a *lower* probability of being between $c_L$ and $c_H$ when the
alternative is true.
The range of values from $c_L$ to $c_H$ is called the ***critical range*** of
our test.
Given the test statistic and critical values:
- We ***reject the null*** if $t_n$ is outside of the critical range.
- This means we have clear evidence that $H_0$ is false.
- The reason we reject here is that we know we would be unlikely to observe
such a value of $t_n$ if $H_0$ were true.
- We ***fail to reject the null*** or ***accept the null*** if $t_n$ is inside
of the critical range.
- This means we do not have clear evidence that $H_0$ is false.
- This does not mean we have clear evidence that $H_0$ is true. We may just
not have enough evidence to tell whether it is true or false.
I usually avoid saying "accept the null" because it can be misleading.

How do we choose critical values? You can think of critical values as setting a
standard of evidence, so we need to balance two considerations:
- The probability of rejecting a false null is called the ***power*** of the
test.
- We want to reject false nulls, so power is good.
- The probability of rejecting a true null is called the ***size*** or
***significance*** of a test.
- We do not want to reject true nulls, so size is bad.
- There is always a trade-off between power and size:
- A *narrower* critical range (higher $c_L$ or lower $c_H$) produces more
rejections, increasing both power (good) and size (bad).
- A *wider* critical range (lower $c_L$ or higher $c_H$) produces fewer
rejections, reducing both power (bad) and size (good).
Given this trade-off between power and size, we could construct some criterion
that accounts for both (just like MSE includes both variance and bias) and
choose critical values to maximize that criterion. But we don't do that.
Instead, we follow a simple convention:
1. Set the size to a fixed value $\alpha$.
- The convention in economics and most other social sciences is to use a size
of 5\% ($\alpha = 0.05$).
- Economists may use 1\% ($\alpha = 0.01$) when working with larger data sets
or 10\% ($\alpha = 0.10$) when working with smaller data sets.
- The data sets in physics or genetics are much larger, and they use a much
lower conventional size.
2. Calculate critical values that imply the desired size.
- With a size of 5\% $(\alpha = 0.05)$, we would:
- Set $c_L$ to the 2.5 percentile (0.025 quantile) of the null
distribution.
- Set $c_H$ to the 97.5 percentile (0.975 quantile) of the null
distribution.
- With a size of 10\% $(\alpha = 0.10)$, we would:
- Set $c_L$ to the 5 percentile (0.05 quantile) of the null distribution.
- Set $c_H$ to the 95 percentile (0.95 quantile) of the null distribution.
- More generally, with a size of $\alpha$, we would:
- Set $c_L$ to the $\alpha/2$ quantile of the null distribution.
- Set $c_H$ to the $1-\alpha/2$ quantile of the null distribution.
Note that we are dividing the size by two so we can put half of it in the lower
tail of the null distribution and half in the upper tail.
::: example
**Critical values for roulette**
We earlier showed that the distribution of $t_n$ under the null is:
$$t_n \sim Binomial(100,18/37)$$
We can get a size of 5\% by choosing:
$$c_L = 2.5 \textrm{ percentile of } Binomial(100,18/37)$$
$$c_H = 97.5 \textrm{ percentile of } Binomial(100,18/37)$$
We can then use Excel or R to calculate these critical values. In Excel, the
function you would use is `BINOM.INV()`
- The formula to calculate $c_L$ is `=BINOM.INV(100,18/37,0.025)`
- The formula to calculate $c_H$ is `=BINOM.INV(100,18/37,0.975)`
The calculations below were done in R:
```{r BinomialCriticalValues}
cat("2.5 percentile of binomial(100,18/37) =",
qbinom(0.025,100,18/37),
"\n")
cat("97.5 percentile of binomial(100,18/37) =",
qbinom(0.975,100,18/37),
"\n")
```
In other words, we reject the null hypothesis (at 5\% significance) that the
roulette wheel is fair if red wins fewer than 39 games or more than 58 games.
:::
::: {.fyi data-latex=""}
**A general test for a single probability**
We can generalize the test we have constructed so far to the case of the
probability of any event:
| Test component | Roulette example |General case |
|:-----------------------|:------------------------------------:|:-------------------------:|
| Parameter | $p_{red} = \Pr(\textrm{Red wins})$ | $p = \Pr(\textrm{event})$ |
| Null hypothesis | $H_0:p_{red} = 18/37$ | $H_0:p = p_0$ |
| Alternative hypothesis | $H_1: p_{red} \neq 18/37$ |$H_1: p \neq p_0$ |
| Test statistic | $t = n\hat{f}_{RED}$ | $t = n\hat{f}_{\textrm{event}}$ |
| Null distribution | $Binomial(100,18/37)$ |$Binomial(n,p_0)$ |
| Critical value $c_L$ | 39 | 2.5 percentile of $Binomial(n,p_0)$ |
| Critical value $c_H$ | 58 | 97.5 percentile of $Binomial(n,p_0)$ |
| Decision | Reject if $t \notin [39,58]$ | Reject if $t \notin [c_L,c_H]$ |
:::
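As an illustration, the general case in this table can be packaged as a small R function. This is a sketch: `binomial_prob_test` is an illustrative name of my own, not a standard R function.

```{r BinomialProbTest}
# Sketch of the general test for a single probability
# (the function name and interface are illustrative, not standard R)
binomial_prob_test <- function(wins, n, p0, alpha = 0.05) {
  c_L <- qbinom(alpha / 2, n, p0)      # alpha/2 quantile of the null distribution
  c_H <- qbinom(1 - alpha / 2, n, p0)  # 1 - alpha/2 quantile of the null distribution
  list(c_L = c_L, c_H = c_H,
       reject = (wins < c_L) | (wins > c_H))
}
binomial_prob_test(35, 100, 18/37)  # case 1: rejects the null of a fair game
binomial_prob_test(40, 100, 18/37)  # case 2: fails to reject
```

Base R's built-in `binom.test()` implements a more polished version of the same idea.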
### Size and power
As mentioned above, the size of a test is the probability of rejecting a true
null. It is a single number, since the distribution of the test statistic
is known when the null is true. It is determined by our choice of critical
values; more precisely, we choose critical values to achieve a particular
size.
The power of a test is defined as the probability of rejecting the null when it
is false, and is also determined by our choice of critical values. However,
it is a *function* of the true parameter value $\theta$ rather than a single
number:
$$power(\theta) = \Pr(\textrm{reject $H_0$})$$
The reason for this is that there is only one $\theta$ value that is consistent
with the null, but there are many that are consistent with the alternative.
In some cases we can actually calculate the power function and plot it as a
***power curve***. The details of power calculations are beyond the scope of
this course, but we can at least view and interpret a power curve.
::: example
**The power curve for roulette**
Power curves can be tricky to calculate, and I will not ask you to calculate
them for this course. But they can be calculated, and it is useful to see what
they look like.
Figure \@ref(fig:PowerCurves) below depicts the power curve for the roulette
test we have just constructed; that is, we are testing the null that
$p_{red} = 18/37$ at a 5\% size. The blue line depicts the power curve for
$n=100$ as in our example, while the orange line depicts the power curve for
$n=20$.
```{r PowerCurves, fig.cap = "*Power curves for the roulette example*"}
# Power of the two-sided test: reject when t is outside the critical range
# [c_L, c_H], i.e., when t < c_L or t > c_H
test_power <- function(n, theta, p0 = 18/37, alpha = 0.05) {
  c_L <- qbinom(alpha / 2, n, p0)
  c_H <- qbinom(1 - alpha / 2, n, p0)
  pbinom(c_L - 1, n, theta) + (1 - pbinom(c_H, n, theta))
}
PowerCurveData <- tibble(theta = seq(0, 1, length.out = 380),
                         power20 = test_power(20, theta),
                         power100 = test_power(100, theta))
ggplot(data = PowerCurveData,
       mapping = aes(x = theta, y = power20)) +
  geom_line(col = "#EB6E1F") +
  geom_line(aes(y = power100), col = "#002D62") +
  xlab("true probability of winning") +
  ylab("power") +
  geom_text(label = "n=100", x = 0.7, y = 0.8, col = "#002D62") +
  geom_text(label = "n=20", x = 0.9, y = 0.8, col = "#EB6E1F") +
  geom_hline(yintercept = 0.05, col = "gray") +
  geom_vline(xintercept = 18/37, col = "gray") +
  labs(title = "Power curve for fair roulette wheel",
       caption = "H0: Pr(red wins) = 18/37, significance = 0.05")
```
There are a few features I would like you to notice, all of which are common to
most regularly used tests:
- Power reaches its lowest value near the point $(18/37,0.05)$.
Note that $18/37$ is the parameter value under the null, and $0.05$ is the
size of the test. In other words:
- The power of this test is typically greater than its size.
- We are more likely to reject the null when it is false than when it is true.
- A test with this desirable property is called an ***unbiased*** test.
- Power increases as the true $p_{red}$ gets further from the null.
- We are more likely to detect unfairness in a game that is *very* unfair than
in one that is *a little* unfair.
- Power also increases with the sample size:
- The blue line ($n = 100$) is above the orange line ($n = 20$).
- As $n \rightarrow \infty$, power goes to one for every value in the
alternative. A test with this desirable property is called a
***consistent*** test.
Power analysis is often used by researchers to determine how much data to
collect. Each additional observation collected increases power but costs money.
With limited resources, it is important to spend enough to get clear results,
but not much more than that.
:::
::: {.fyi data-latex=""}
**P values**
The convention of always using a 5\% significance level for hypothesis tests is
somewhat arbitrary and has some negative unintended consequences:
1. Sometimes a test statistic falls just below or just above the critical
value, and small changes in the analysis can change a result from reject
to cannot-reject.
2. In many fields, unsophisticated researchers and journal editors
misinterpret "cannot reject the null" as "the null is true."
One common response to these issues is to report what is called the
***p-value*** of a test. The p-value of a test is defined as the significance
level at which one would switch from rejecting to not-rejecting the null. For
example:
- If the p-value is 0.43 (43\%) we would not reject the null at 10\%, 5\%, or
1\%.
- If the p-value is 0.06 (6\%) we would reject the null at 10\% but not at 5\%
or 1\%.
- If the p-value is 0.02 (2\%) we would reject the null at 10\% and 5\% but not
at 1\%.
- If the p-value is 0.001 (0.1\%) we would reject the null at 10\%, 5\%, and
1\%.
The p-value of a test is simple to calculate from the test statistic and its
distribution under the null. I won't go through that calculation here.
:::
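Although we skip the hand calculation, base R's `binom.test()` reports an exact two-sided p-value, so we can see these decision rules in action on our roulette data:

```{r RoulettePvalues}
# Exact two-sided p-values for the two roulette cases
p1 <- binom.test(35, 100, p = 18/37)$p.value  # case 1: 35 wins
p2 <- binom.test(40, 100, p = 18/37)$p.value  # case 2: 40 wins
c(case1 = p1, case2 = p2)
# p1 is below 0.05 (reject at 5%); p2 is above 0.05 (cannot reject at 5%)
```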
### Implementing and interpreting
So far, we have discussed how to construct a hypothesis test from scratch. But
most of the time statisticians use off-the-shelf test statistics and critical
values, so the main task of a person working with data is implementing the test.
Here are the steps:
1. Choose the null and alternative hypothesis.
- These depend on your research question, so you must choose them yourself.
2. Choose the size of your test.
- In economics, it is usually 5\%.
3. Construct or look up an appropriate test. A test consists of:
- A test statistic
- Critical values (for the chosen size)
4. Calculate the test statistic.
5. Compare the test statistic to the critical values and make an accept/reject
decision.
Always remember that failing to reject the null does not mean the null is true.
::: example
***Implementing our roulette test***
To review the roulette example, the null hypothesis is that $p_{red} = 18/37$.
The test statistic is the absolute win frequency $t_n = n\bar{x}$. We want
the test to have 5\% significance, which implies critical values of
$c_L = 39$ and $c_H = 58$.
Suppose that red wins in 35 of the 100 games. Do we have a fair game?
- The test statistic is $t_n = 35$, which is *outside* of the critical range
of $[39,58]$.
- We therefore *reject* the null hypothesis of a fair game.
- That means we have clear evidence that the game is unfair.
Alternatively, suppose that red wins in 40 of the 100 games. Do we have a
fair game?
- The test statistic is $t_n = 40$, which is *inside* the critical range of
$[39,58]$.
- We therefore *fail to reject* the null hypothesis of a fair game.
- That means we do not have clear evidence that the game is unfair.
Remember that failing to reject the null does not mean the null is true. It
is still possible that the game is unfair; we just don't have clear evidence
that it is.
:::
## The central limit theorem
In order for a test statistic to work, its exact probability distribution must
be known under the null hypothesis. The example test in the previous section
worked because it was based on a sample frequency, a statistic whose
probability distribution is relatively easy to calculate. Unfortunately, most
statistics do not have a probability distribution that is easy to calculate.
Fortunately, we have a very powerful asymptotic result called the
***Central Limit Theorem (CLT)***. The CLT roughly says that we can approximate
the entire probability distribution of the sample average $\bar{x}_n$ by a
normal distribution if the sample size is sufficiently large.
::: {.fyi data-latex=""}
**The Central Limit Theorem**
As with the LLN, we need to invest in some terminology before we can state the
CLT.
Let $s_n$ be a statistic calculated from $D_n$ and let $F_n(\cdot)$ be its CDF.
We say that $s_n$ ***converges in distribution*** to a random variable $s$ with
CDF $F(\cdot)$, or:
$$s_n \rightarrow^D s$$
if:
$$\lim_{n \rightarrow \infty} |F_n(a) - F(a)| = 0$$
for every $a \in \mathbb{R}$.
Convergence in distribution means we can approximate the actual CDF $F_n(\cdot)$
of $s_n$ with its limit $F(\cdot)$. As with most approximations, this is useful
whenever $F_n(\cdot)$ is difficult to calculate and $F(\cdot)$ is easy to
calculate.
We can now state the theorem:
**CENTRAL LIMIT THEOREM**: Let $\bar{x}_n$ be the sample average from a random
sample of size $n$ on the random variable $x_i$ with mean $E(x_i) = \mu_x$ and
variance $var(x_i) = \sigma_x^2$. Let $z_n$ be a standardization of $\bar{x}_n$:
\begin{align}
z_n = \sqrt{n} \frac{\bar{x}_n - \mu_x}{\sigma_x}
\end{align}
Then $z_n \rightarrow^D z \sim N(0,1)$.
:::
What does the central limit theorem mean?
- Fundamentally, it means that if $n$ is big enough then the probability
distribution of $\bar{x}_n$ is approximately normal
*no matter what the original distribution of $x_i$ looks like*.
- In order for the CLT to apply, we need to re-scale $\bar{x}_n$ so that it
has zero mean (by subtracting $E(\bar{x}_n) = \mu_x$) and constant variance
as $n$ increases (by dividing by $sd(\bar{x}_n) = \sigma_x/\sqrt{n}$). That
re-scaled sample average is $z_n$.
- In practice, we don't usually know $\mu_x$ or $\sigma_x$ so we can't
calculate $z_n$ from data. Fortunately, there are some tricks for getting
around this problem that we will talk about later.
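A simulation makes the theorem concrete. This is a sketch: the exponential distribution, sample size, and replication count are arbitrary choices, and in a simulation we know $\mu_x$ and $\sigma_x$, so $z_n$ can be computed directly.

```{r CLTsimulation}
# Simulate the standardized sample average z_n when x_i is skewed (exponential)
set.seed(100)
n <- 100        # sample size
reps <- 10000   # number of simulated samples
mu_x <- 1       # E(x_i) for Exponential(rate = 1)
sigma_x <- 1    # sd(x_i) for Exponential(rate = 1)
z_n <- replicate(reps, {
  x <- rexp(n, rate = 1)
  sqrt(n) * (mean(x) - mu_x) / sigma_x
})
# z_n should be approximately N(0,1) even though x_i is far from normal
c(mean = mean(z_n), sd = sd(z_n))
hist(z_n, breaks = 50, freq = FALSE,
     main = "Simulated distribution of z_n (n = 100)")
curve(dnorm(x), add = TRUE)
```

The histogram is close to the standard normal density even though each $x_i$ is strongly right-skewed.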
What about statistics other than the sample average? It turns out that
Slutsky's theorem also extends to convergence in distribution, which means
that the central limit theorem applies more broadly: most statistics are
asymptotically normal, just like the sample average.
::: {.fyi data-latex=""}
**Slutsky's theorem for probability distributions**
We earlier stated Slutsky's theorem for convergence in probability, which
allowed us to extend the law of large numbers to most statistics. There is also
a version of Slutsky's theorem for convergence in distribution:
**SLUTSKY THEOREM**: Let $g(\cdot)$ be a continuous function. Then:
$$s_n \rightarrow^D s \implies g(s_n) \rightarrow^D g(s)$$
This version of Slutsky's theorem will allow us to extend the central limit
theorem to most statistics.
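For example, since the central limit theorem gives $z_n \rightarrow^D z \sim N(0,1)$ and $g(u) = u^2$ is continuous, this version of Slutsky's theorem implies:
$$z_n^2 \rightarrow^D z^2 \sim \chi^2_1$$
a fact used to build chi-squared tests.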
:::
## Inference on the mean
Having described the general framework and a single example, we now move on to
the most common application: constructing hypothesis tests and confidence
intervals on the mean in a random sample.
Let $D_n = (x_1,\ldots,x_n)$ be a random sample of size $n$ on some random
variable $x_i$ with unknown mean $E(x_i) = \mu_x$ and variance
$var(x_i) = \sigma_x^2$. Let the sample average be
$\bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i$, let the sample variance be
$sd_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$ and let the sample
standard deviation be $sd_x = \sqrt{sd_x^2}$.
::: example
**The mean and sample average in the roulette data**
Previously, we developed an exact frequency-based test for the fairness of a
roulette table. We can also fit that research question into the mean-based
framework of this section.
In our roulette data, the expected value of $x_i$ is also the win probability:
\begin{align}
\mu_x &= E(x_i) = p_{red}
\end{align}
so any hypothesis about $p_{red}$ can also be expressed equivalently in terms of
$\mu_x$. Similarly, the sample average is also the win frequency.
:::
### The null and alternative hypotheses
Suppose that you want to test the null hypothesis:
$$H_0: \mu_x = \mu_0$$
against the alternative hypothesis:
$$H_1: \mu_x \neq \mu_0$$
where $\mu_0$ is a number that has been chosen to reflect the research question.
::: example
**Null and alternative hypotheses for the mean in roulette**
The null hypothesis of a fair table can be expressed in terms of $\mu_x$:
$$H_0: \mu_x = 18/37$$
against the alternative hypothesis:
$$H_1: \mu_x \neq 18/37$$
i.e., $\mu_0 = 18/37$.
:::
### The T statistic
Having stated our null and alternative hypotheses, we need to construct a test
statistic.
The typical test statistic we use in this setting is called the
***T statistic***, and takes the form:
$$t_n = \frac{\bar{x}_n - \mu_0}{sd_x/\sqrt{n}}$$
The idea here is that we take our estimate of the parameter ($\bar{x}_n$),
subtract its expected value under the null ($\mu_0$), and divide by an estimate
of its standard deviation ($sd_x/\sqrt{n}$).
::: example
***The T statistic in roulette***
If red wins in 35 of the 100 games, then $\bar{x} = 0.35$ and
$sd_x \approx 0.479$. So the T statistic for our test is:
\begin{align}
t_n &= \frac{\bar{x}-\mu_0}{sd_x/\sqrt{n}} \\
&\approx \frac{0.35 - 18/37}{0.479/\sqrt{100}} \\
&\approx -2.85
\end{align}
If red wins in 40 of the 100 games, then $\bar{x} = 0.40$ and
$sd_x \approx 0.492$. So the T statistic for our test is:
\begin{align}
t_n &= \frac{\bar{x}-\mu_0}{sd_x/\sqrt{n}} \\
&\approx \frac{0.40 - 18/37}{0.492/\sqrt{100}} \\
&\approx -1.76
\end{align}
Note that the value of $\mu_x$ under the null is $\mu_0=18/37$.
:::
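These calculations are easy to reproduce in R. The helper function `t_stat` below is an illustrative name of my own, not a standard function:

```{r RouletteTstat}
# T statistic from summary statistics:
# (estimate minus null value) over estimated standard deviation of the estimate
t_stat <- function(xbar, mu0, sd_x, n) {
  (xbar - mu0) / (sd_x / sqrt(n))
}
t_stat(0.35, 18/37, 0.479, 100)  # case 1: 35 wins out of 100
t_stat(0.40, 18/37, 0.492, 100)  # case 2: 40 wins out of 100
```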
### Exact and approximate tests
Next we need to show that this test statistic has a known distribution under
the null and a different distribution under the alternative. We can do some
algebra to get:
\begin{align}
t_n &= \frac{\bar{x}_n + (\mu_x - \mu_x) - \mu_0}{sd_x/\sqrt{n}} \\
&= \frac{\bar{x}_n - \mu_x}{sd_x/\sqrt{n}} + \frac{\mu_x - \mu_0}{sd_x/\sqrt{n}} \\
&= \frac{\bar{x}_n - \mu_x}{sd_x/\sqrt{n}} \frac{\sigma_x}{\sigma_x}
+ \frac{\mu_x - \mu_0}{sd_x/\sqrt{n}} \\
&= \underbrace{\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}}}_{z_n}
\underbrace{\frac{\sigma_x}{sd_x}}_{\textrm{?}}
+ \underbrace{\sqrt{n} \frac{\mu_x - \mu_0}{sd_x}}_{\textrm{$=0$ if $H_0$ is true}}
\end{align}
Let's take a look at the components of this expression:
1. The first term $z_n = \frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}}$ is a standardization
of $\bar{x}_n$. By construction it has the following properties:
- Mean zero: $E(z_n) = 0$.
- Unit variance: $var(z_n) = sd(z_n) = 1$.
- The central limit theorem applies: $z_n \rightarrow^D N(0,1)$.
2. The second term $\frac{\sigma_x}{sd_x}$ features the standard deviation
   ($\sigma_x$) divided by a consistent estimator of the standard deviation
   ($sd_x$).
    - In a large sample, this ratio will be close to (but not exactly) one.
3. The third term $\sqrt{n} \frac{\mu_x - \mu_0}{sd_x}$ features a positive
   number that is growing to infinity as the sample size increases
   ($\sqrt{n}$) times a number that is zero if the null is true and nonzero
   if the null is false ($\mu_x - \mu_0$), divided by a positive random
   variable ($sd_x$).
- When the null is true, this term is zero.
- When the null is false, this term is nonzero and will be large if the
sample is large.
Recall that we need the probability distribution of $t_n$ to be known when $H_0$
is true, and different when it is false. The second criterion is clearly met,
and the first criterion is met if we can find the probability distribution of
$\frac{\bar{x}_n - \mu_x}{sd_x/\sqrt{n}}$.
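A quick simulation can illustrate the second point above. The chunk below is a sketch under assumed parameter values (a fair table, so the null is true); it shows the ratio $\sigma_x/sd_x$ settling near one as $n$ grows, because $sd_x$ is a consistent estimator of $\sigma_x$:

```{r SdRatioSketch}
# Sketch: the ratio sigma_x/sd_x from the decomposition above gets
# close to 1 as n grows.
set.seed(1)                    # arbitrary seed, for reproducibility
p <- 18/37                     # win probability under a fair table (H0 true)
sigma_x <- sqrt(p * (1 - p))   # true standard deviation of a Bernoulli(p)
for (n in c(10, 100, 10000)) {
  x <- rbinom(n, size = 1, prob = p)   # one simulated sample of n games
  cat("n =", n, ": sigma_x/sd_x =", round(sigma_x / sd(x), 3), "\n")
}
```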
The frequency-based test we derived in Section \@ref(hypothesis-tests) is what
statisticians call an ***exact test***: critical values are based on the actual
distribution of the test statistic under the null. An exact test was possible
in this case because the structure of the problem implied that the win count
must have a binomial distribution.
Unfortunately, an exact test based on the T statistic is only possible if we
know the exact probability distribution of $x_i$, and can then use that
probability distribution to derive the exact probability distribution of $t_n$.
There are two standard solutions to this problem, both of which approximate the
exact distribution of the test statistic:
1. ***Parametric test***: Assume a specific probability distribution (usually
a normal distribution) for $x_i$. We can (or at least a professional
statistician can) then mathematically derive the distribution of any test
statistic from this distribution.
2. ***Asymptotic test***: Use the central limit theorem to get an approximate
probability distribution for the test statistic.
We will explore both of those options.
### Asymptotic critical values
We will start with the asymptotic solution to the problem. The Central Limit
Theorem tells us that:
$$\frac{\bar{x}_n - \mu_x}{\sigma_x/\sqrt{n}} \rightarrow^D N(0,1)$$
Under the null our test statistic looks just like this, but with the sample
standard deviation $sd_x$ in place of the population standard deviation
$\sigma_x$. It turns out that Slutsky's theorem allows us to make this
substitution, and it can be proved that:
$$\frac{\bar{x}_n - \mu_x}{sd_x/\sqrt{n}} \rightarrow^D N(0,1)$$
Therefore, the null implies that $t_n$ is asymptotically normal:
$$(H_0:\mu_x = \mu_0) \qquad \implies \qquad t_n \rightarrow^D N(0,1)$$
In other words, we do not know the exact (finite-sample) distribution of $t_n$
under the null, but we know that $N(0,1)$ provides a useful asymptotic
approximation to that distribution.
```{r NormalDistribution, fig.cap = "*Asymptotic distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
tinf = dnorm(x))
ggplot(data=simdata,
mapping = aes(x= x,
y= tinf)) +
geom_line() +
xlab("value") +
ylab("PDF of t_n") +
labs(title = "Asymptotic distribution of t_n under null",
subtitle = "",
caption = "",
tag = "")
```
Therefore, if we want a test that has the ***asymptotic size*** of 5\%, we can
use Excel or R to calculate critical values based on the standard normal
distribution. In Excel, the function would be `NORM.INV()` or `NORM.S.INV()`,
and the formulas would be:
- $c_L$: `=NORM.S.INV(0.025)` or `=NORM.INV(0.025,0,1)`.
- $c_H$: `=NORM.S.INV(0.975)` or `=NORM.INV(0.975,0,1)`.
The calculations below were done in R:
```{r AsymptoticCriticalValues}
cat("cL = 2.5 percentile of N(0,1) = ",
round(qnorm(0.025),3),
"\n")
cat("cH = 97.5 percentile of N(0,1) = ",
round(qnorm(0.975),3),
"\n")
```
These particular critical values are so commonly used that I want you to
remember them.
::: example
**The asymptotic test for roulette**
We have calculated above that the 5\% asymptotic critical values for our
roulette test are $c_L = -1.96$ and $c_H = 1.96$.
If red wins in 35 of the 100 games, the test statistic is $t_n = -2.84$. This is
*outside* of the critical range, so we reject the null of a fair game.
If red wins in 40 of the 100 games, the test statistic is $t_n = -1.75$. This is
*inside* of the critical range, so we fail to reject the null of a fair game.
:::
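Putting the pieces together, the asymptotic test can be implemented in a few lines of R. This is a sketch: the `z_test` function below is our own illustrative helper, not a standard function:

```{r ZTestSketch}
# Sketch: reject H0 when the T statistic falls outside the asymptotic
# critical values from the N(0,1) distribution.
z_test <- function(x, mu0, alpha = 0.05) {
  t_n <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
  c_H <- qnorm(1 - alpha / 2)            # upper critical value; c_L = -c_H
  ifelse(abs(t_n) > c_H, "reject H0", "fail to reject H0")
}
red35 <- c(rep(1, 35), rep(0, 65))   # red wins 35 of 100 games
red40 <- c(rep(1, 40), rep(0, 60))   # red wins 40 of 100 games
z_test(red35, mu0 = 18/37)   # "reject H0"
z_test(red40, mu0 = 18/37)   # "fail to reject H0"
```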
### Parametric critical values
Most economic data comes in sufficiently large samples that the asymptotic
distribution of $t_n$ is a reasonable approximation and the asymptotic test
works well. But occasionally we have samples that are small enough that it
doesn't.
Another option is to assume that the $x_i$ variables are normally distributed:
$$x_i \sim N(\mu_x,\sigma_x^2)$$
where $\mu_x$ and $\sigma_x^2$ are unknown parameters. Keep in mind that many
interesting variables are *not* normally distributed, so the assumption that
$x_i$ is normally distributed is not necessarily appropriate in every setting.
::: example
**Normality in the roulette data?**
In our roulette data, $x_i$ has a Bernoulli distribution and could not possibly
be normally distributed.
:::
The null distribution of the test statistic
$t_n = \frac{\bar{x}-\mu_0}{sd_x/\sqrt{n}}$ under these particular assumptions
was derived in 1908 by William Sealy Gosset, a statistician working at the
Guinness brewery. To avoid getting in trouble at work (Guinness did not want to
give away trade secrets) Gosset published under the pseudonym "Student". As a
result, the family of distributions he derived is called
"Student's T distribution". Gosset's calculations are beyond the scope of this
course. But you should understand that the distribution of this particular
test statistic *can* be derived once we assume normality of the $x_i$, and you
should know how to calculate its quantiles or critical values using Excel.
When the null is true, the test statistic
$t_n = \frac{\bar{x}-\mu_0}{sd_x/\sqrt{n}}$ has the Student's T distribution with
$n-1$ degrees of freedom:
$$t_n \sim T_{n-1}$$
and when the null is false, it has a different distribution which is sometimes
called the "noncentral T distribution."
The $T_{n-1}$ distribution looks a lot like the $N(0,1)$ distribution, but has
slightly higher probability of extreme positive or negative values (a
statistician would say the distribution has "fatter tails"). As $n$ increases,
the extreme values become less common and the $T_{n-1}$ distribution converges
to the $N(0,1)$ distribution as predicted by the central limit theorem.
```{r TDistribution, fig.cap = "*Exact distribution of t_n under the null*"}
simdata <- tibble(x = seq(-3,3,length.out = 100),
t5 = dt(x,df=4),
t10 = dt(x,df=9),
t30 = dt(x,df=29),
tinf = dnorm(x))
ggplot(data = simdata,
mapping = aes(x = x)) +
geom_line(aes(y = t5), col = "orange") +
geom_line(aes(y = t10), col = "blue") +
geom_line(aes(y = t30), col = "maroon") +
geom_line(aes(y = tinf), col = "black") +
geom_text(label="n = infinity (N(0,1) distribution)",x=1.2,y=0.4,col="black") +
geom_text(label="n = 5 (T_4 distribution)",x=2.5,y=0.1,col="orange") +
geom_text(label="n = 10 (T_9 distribution)",x=2.1,y=0.2,col="blue") +
geom_text(label="n = 30 (T_29 distribution)",x=1.7,y=0.3,col="maroon") +
xlab("value") +
ylab("PDF") +
labs(title = "Exact distribution of t_n under null",
subtitle = "",
caption = "(x assumed to be normally distributed)",
tag = "")
```
Having found our test statistic and its distribution under the null, we can
calculate our critical values:
$$c_L = 2.5 \textrm{ percentile of } T_{n-1}$$
$$c_H = 97.5 \textrm{ percentile of } T_{n-1}$$
We can obtain these percentiles using Excel or R. In Excel, the relevant
function is `T.INV`.
::: example
**Calculating critical values for the $T$ distribution**
If we have $n = 5$ observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,5-1)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,5-1)`.
The results (calculated below using R) would be:
```{r T4CriticalValues}
cat("cL = 2.5 percentile of T_4 = ",
round(qt(0.025,df=4),3),
"\n")
cat("cH = 97.5 percentile of T_4 = ",
round(qt(0.975,df=4),3),
"\n")
```
In contrast, if we have 30 observations, then:
- We would calculate $c_L$ by the formula `=T.INV(0.025,30-1)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,30-1)`.
The results (calculated below using R) would be:
```{r T29CriticalValues}
cat("cL = 2.5 percentile of T_29 = ",
round(qt(0.025,df=29),3),
"\n")
cat("cH = 97.5 percentile of T_29 = ",
round(qt(0.975,df=29),3),
"\n")
```
and if we have 1,000 observations:
- We would calculate $c_L$ by the formula `=T.INV(0.025,1000-1)`.
- We would calculate $c_H$ by the formula `=T.INV(0.975,1000-1)`.
The results (calculated below using R) would be:
```{r T999CriticalValues}
cat("cL = 2.5 percentile of T_999 = ",
round(qt(0.025,df=999),3),
"\n")
cat("cH = 97.5 percentile of T_999 = ",
round(qt(0.975,df=999),3),
"\n")
```
Notice that with 1,000 observations the finite-sample critical values are nearly
identical to the asymptotic critical values.
:::
Once we have calculated critical values, all that remains is to implement the
test.
::: example
**A parametric test for roulette**
As mentioned earlier, our roulette data are definitely *not* normally
distributed. But suppose we do not realize this, and assume normality anyway.
Since we have 100 observations, this normality assumption implies that our test
statistic $t_n = \frac{\bar{x}-\mu_0}{sd_x/\sqrt{n}}$ has a Student's T
distribution with 99 degrees of freedom:
\begin{equation}
H_0 \qquad \implies \qquad t_n \sim T_{99}
\end{equation}
We can then calculate critical values for a 5\% test:
```{r T99CriticalValues}
cat("cL = 2.5 percentile of T_99 = ",
round(qt(0.025,df=99),3),
"\n")
cat("cH = 97.5 percentile of T_99 = ",
round(qt(0.975,df=99),3),
"\n")
```
If red wins in 35 of the 100 games, the test statistic is $t_n = -2.84$. This is
*outside* of the critical range, so we reject the null of a fair game.
If red wins in 40 of the 100 games, the test statistic is $t_n = -1.75$. This is
*inside* of the critical range, so we fail to reject the null of a fair game.
:::
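In practice we rarely run this test by hand: R's built-in `t.test()` function implements the parametric T test directly. Applied to the roulette data (reconstructed as a 0/1 sample):

```{r TTestBuiltin}
# R's built-in t.test() implements the parametric T test directly.
red35 <- c(rep(1, 35), rep(0, 65))    # red wins 35 of 100 games
result <- t.test(red35, mu = 18/37)   # tests H0: mu_x = 18/37
unname(result$statistic)              # the T statistic
result$p.value < 0.05                 # TRUE: reject at the 5% level
```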
### Choosing a test
Statisticians often call the parametric test for the mean the ***T test***
and the asymptotic test the ***Z test***, as a result of the notation typically
used to represent the test statistic. The two tests have the same underlying
test statistic, but different critical values. So which test should we use in
practice?
- For any finite value of $n$, the T test is the more ***conservative*** test.
- It has larger critical values than the Z test.
- It is less likely to reject the null.
- It has lower power and lower size.
- At some point (around $n = 30$) the difference between the two tests becomes
  too small to matter.
- In the limit (as $n \rightarrow \infty$) the two tests are equivalent.
As a result, statisticians typically recommend using the T test for smaller
samples (less than 30 or so), and then using whichever test is more convenient
with larger samples. Most data sets in economics have well over 30
observations, so economists tend to use asymptotic tests unless they have a very
small sample.
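The convergence described above is easy to see by tabulating the T test's critical values against the asymptotic one:

```{r CriticalValueComparison}
# How the T test's upper 5% critical value approaches the Z test's value
# as the sample size grows.
for (n in c(5, 10, 30, 100, 1000)) {
  cat("n =", format(n, width = 4),
      ": T critical value =", round(qt(0.975, df = n - 1), 3), "\n")
}
round(qnorm(0.975), 3)   # Z critical value: 1.96
```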
::: example
**Choosing a test for roulette**
We have developed three tests for the fairness of a roulette game:
1. An exact test based on the win count and the binomial distribution.
2. A parametric test based on the T statistic and the Student's T distribution.
3. An asymptotic test based on the T statistic and the standard normal
distribution.
In a purely technical sense, the exact test is preferable: it is based on the
true distribution of the test statistic under the null, while the other two
tests are based on approximations. But it is more difficult to implement.
In the end, all three tests produced the same results: we reject the null of a
fair game if red wins 35 times out of 100, and fail to reject that null if red
wins 40 times out of 100. This should make sense, as the three tests are just
slightly different ways of assessing the same evidence. If all three tests are