-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path05-statistical-inference.Rmd
204 lines (163 loc) · 8.13 KB
/
05-statistical-inference.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
knit: (function(input, ...) {
input <- "index.Rmd";
thesis_formats <- "bs4";
source("scripts_and_filters/knit-functions.R");
knit_thesis(input, thesis_formats, ...)
})
---
# Statistical Inference
::: {.rmdcaution}
**This section is still in development.**
:::
::: {.definition name="Population"}
A _population_ is a set of similar items or events which is of interest for some question or experiment.
:::
::: {.definition name="Sample"}
A _sample_ is a set of individuals or objects collected or selected from a statistical population by a defined procedure.
:::
## Finite Sample Distributions
::: {.definition name="Convergence in Distribution"}
A sequence $\displaystyle X_n$ of random variables is said to _converge in distribution_ to a random variable $X$ if
$$
\lim_{n\to\infty} F_{X_n}(x) = F_X(x)
$$
for every $x \in \R$.
For random vectors $\left\{ X_1, X_2, \dots \right\} \subset \R^k$, we say that this sequence _converges in distribution_ to a random $k$-vector $X$ if
$$
\lim_{n\to\infty} P(X_n\in A) = P(X\in A)
$$
for every $A \subset \R^k$ which is a continuity set of $X$.
:::
::: {.definition name="Convergence in Probability"}
A sequence $\displaystyle X_n$ of random variables _converges in probability_ towards the random variable $X$ if
$$
\lim_{n\to\infty} P\left( \left\lvert X_n - X \right\rvert > \varepsilon \right) = 0
$$
for all $\varepsilon > 0$.
:::
::: {.definition name="Convergence in Mean"}
Given a real number $r \geq 1$, we say that the sequence $X_n$ _converges in the $r$th mean_ or _in the $L^r$ norm_ towards the random variable $X$ if the $r$th absolute moments of $X_n$ and $X$ exist, and
$$
\lim_{n\to\infty} E\left( \left\lvert X_n - X \right\rvert^r \right) = 0.
$$
:::
::: {.theorem name="Law of Large Numbers"}
Let $X_1, X_2, \dots$ be independent and identically distributed random variables with finite mean $\mu$. Let $\bar{X}_n$ be the average of the first $n$ variables, then the _law of large numbers_ establishes that:
1. _Weak_
$$
\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \underset{n\to\infty}{\longrightarrow} \mu
$$
2. _Weak_
$$
P\left[ \left\lvert \bar{X}_n - \mu \right\rvert > \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 0, \quad \forall \varepsilon > 0
$$
3. _Weak_
$$
P\left[ \left\lvert \bar{X}_n - \mu \right\rvert < \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 1, \quad \forall \varepsilon > 0
$$
4. _Strong_
$$
P\left[ \left\{ \omega \in \Omega : \bar{X}_n(\omega) \underset{n\to\infty}{\longrightarrow} \mu \right\} \right] = 1
$$
:::
::: {.theorem name="Central Limit Theorem"}
Suppose $\left\{X_1, \dots, X_n\right\}$ is a sequence of independent and identically distributed random variables with $E[X_i] = \mu$ and $\Var[X_i] = \sigma^2 < \infty$. Then, as $n$ approaches infinity, the random variables $\sqrt{n}\left( \bar{X}_n - \mu \right)$ converge in distribution to a normal $\Norm(0, \sigma^2)$.
$$
\ie \quad \sqrt{n}\left( \bar{X}_n - \mu \right) \dlim \Norm(0, \sigma^2)
$$
:::
> Let $X_1, \dots, X_n$ be independent and identically distributed random variables drawn according to some distribution $P_\theta$ parametrized by $\theta = (\theta_1, \dots, \theta_m) \in \Theta$, where $\Theta$ is the set of all possible parameters for the selected distribution. The goal is to find the best estimator $\hat{\theta} \in \Theta$ such that $\hat{\theta} \approx \theta$ since the real $\theta$ cannot be known exactly from a finite sample.
::: {.definition name="Estimator"}
An _estimator_ $\hat{\theta}_j$ from a parameter $\theta_j$ is a random variable $\hat{\theta}_j(X_1, \dots, X_n)$ that is symbolized as a function of the observed data.
:::
::: {.definition name="Estimate"}
An _estimate_ $\hat{\theta}_j(x_1, \dots, x_n)$ is a realization of the estimator. It is a real value for the estimated parameter.
:::
::: {.definition name="Bias"}
The _bias_ of an estimator $\hat{\theta}$ is defined as
$$
\Bias (\hat{\theta}, \theta) := E_\theta [\hat{\theta} - \theta].
$$
We say that an estimator is _unbiased_ if $\Bias_\theta [\hat{\theta}] = 0$ or $E_\theta [\hat{\theta}] = \theta$.
:::
::: {.definition name="Mean Squared Error"}
The _mean squared error_ of an estimator $\hat{\theta}$ is defined as
$$
\MSE_\theta[\hat{\theta}] := E\left[ \left( \hat{\theta} - \theta \right)^2 \right] = \Var_\theta[\hat{\theta}] + \Bias^2 (\hat{\theta}, \theta)
$$
:::
::: {.definition name="Consistency"}
A sequence of estimators $\hat{\theta}^{(n)}$ of the parameter $\theta$ is called _consistent_ if, for any $\varepsilon > 0$,
$$
P_\theta \left[ \left\lvert \hat{\theta}^{(n)} - \theta \right\rvert > \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 0.
$$
> An estimator is consistent only if, as the sample data increases, the estimator approaches the real parameter.
:::
::: {.definition name="Relative Efficiency"}
The _relative efficiency_ of two estimators is defined as
$$
e\left(\hat{\theta}_1, \hat{\theta}_2\right) = \frac{\Var\left[\hat{\theta}_2\right]}{\Var\left[\hat{\theta}_1\right]}.
$$
We say that $\hat{\theta}$ is preferable if $\Var\left[\hat{\theta}_1\right] < \Var\left[\hat{\theta}_2\right]$.
:::
## Point Estimation
::: {.definition name="Likelihood Function"}
The _likelihood function_ $\L$ is defined as
$$
\L(\theta;\ x_1, \dots, x_n) = f(x_1, \dots, x_n;\ \theta).
$$
Assuming $x_i \perp x_j$, $\forall i \neq j$,
$$
\L(\theta;\ x_1, \dots, x_n) = \prod_{i=1}^n f(x_i;\ \theta).
$$
> For practical purposes, we often use the _log-likelihood function_ $\ell(\theta;\ x_1, \dots, x_n) = \log\L(\theta;\ x_1, \dots, x_n)$ since it is much easier to differentiate afterwards, and the maximum of $\L$ is preserved for all $\theta_j$.
:::
::: {.definition name="Maximum Likelihood Estimator"}
The _maximum likelihood estimator_ $\hat{\theta}$ for $\theta$ is defined as
$$
\hat{\theta} \in \left\{ \argmax_{\theta \in \Theta} \L(\theta;\ X_1, \dots, X_n) \right\}
$$
:::
::: {.definition name="Score"}
The _score_ is the gradient the natural logarithm of the likelihood function with respect to an $m$-dimensional parameter vector $\theta$.
$$
s(\theta) := \frac{\partial \log \L(\theta)}{\partial\theta}.
$$
> The score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values.
:::
::: {.definition name="Fisher Information"}
Let $f(X;\ \theta)$ be the probability density function or probability mass function for $X$ conditioned on the value of $\theta$. We define the _Fisher information_ as
$$
\mathscr{I}(\theta) := E\left[\left(\frac{\partial}{\partial\theta}\log f(X;\ \theta)\right)^2\right] = - E\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\ \theta)\right].
$$
> The Fisher information is a way of measuring the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ upon which the probability of $X$ depends.
:::
::: {.definition name="Cramér–Rao Bound"}
Suppose $\theta$ is an unknown deterministic parameter which is to be estimated from $n$ independent observations of $x$, each from a distribution according to some probability function $f(X;\ \theta)$. The variance of any unbiased estimator $\hat{\theta}$ of $\theta$ is then bounded by the reciprocal of the Fisher information $\mathscr{I}(\theta)$. Namely,
$$
\Var\left[\hat{\theta}\right] \geq \frac{1}{n\mathscr{I}(\theta)}.
$$
:::
::: {.definition name="Efficiency"}
The _efficiency_ of an unbiased estimator $\hat{\theta}$ of a parameter $\theta$ is defined as
$$
e\left( \hat{\theta} \right) = \frac{1 / \mathscr{I}(\theta)}{\Var\left[ \hat{\theta} \right]}
$$
where $\mathscr{I}(\theta)$ is the Fisher information of the sample.
:::
::: {.proposition}
$$
e\left( \hat{\theta} \right) \leq 1
$$
:::
::: {.remark name="Maximum Likelihood Estimator Properties"}
<br/>
* Asymptotically unbiased. Namely, $\displaystyle \lim_{n\to\infty} \Bias\left( \hat{\theta}_n, \theta \right) = 0$.
* Asymptotically efficient. Namely, $\displaystyle \lim_{n\to\infty} \Var\left[ \hat{\theta} \right] = \frac{1}{n\mathscr{I}(\theta)}$.
* Consistency.
* $\displaystyle \hat{\theta}_n \dlim \Norm\left( \theta, \frac{1}{n\mathscr{I}(\theta)} \right)$.
:::
## Interval Estimation
## Parametric Hypothesis Testing
## Normality Tests