---
knit: (function(input, ...) {
    input <- "index.Rmd";
    thesis_formats <- "bs4";

    source("scripts_and_filters/knit-functions.R");
    knit_thesis(input, thesis_formats, ...)
  })
---

# Probability

::: {.rmdnote}
For a basic course in probability (such as [EST-11101](http://estadistica.itam.mx/sites/default/files/u450/probabilidad.pdf)), you may want to skip some notes on Measure Theory.
:::

## Basics and Combinatorics

::: {.definition name="Sample Space"}
The _sample space_ $\Omega \neq \O$ is the set of all possible outcomes of an experiment. It can be finite or infinite.
:::

::: {.definition name="Event"}
An _event_ $A$ is a subset of the sample space $A \subseteq \Omega$, or an element of the power set of the sample space $\displaystyle A \in 2^\Omega$.
:::

::: {.definition name="Observable Event Set"}
The set of all _observable events_ is denoted by $\F$, where $\displaystyle \F \subseteq 2^\Omega$.
:::

> Usually, if $\Omega$ is countable, $\F = 2^\Omega$. However, some subsets may be excluded from $\F$ because they cannot be observed or, when $\Omega$ is uncountable, cannot be assigned a probability consistently.

::: {.definition name="σ-Algebra"}
The set $\F$ is called a _$\sigma$-algebra_ if:

1. $\Omega \in \F$;
2. if $A \in \F$, then $A^C \in \F$; and
3. if $A_n \in \F$ for all $n \in \N$, then $\displaystyle \bigcup_{n = 1}^\infty A_n \in \F$.
:::

::: {.definition name="Probability Measure"}
$P : \F \to [0, 1]$ is a _probability measure_ if it satisfies the following three axioms:

1. $\forall A \in \F : P(A) \geq 0$,
2. $P(\Omega) = 1$, and
3. $\displaystyle P\left( \bigcup_{n=1}^\infty A_n \right) = \sum_{n = 1}^\infty P\left(A_n\right)$,

where the events $A_n \in \F$ are pairwise disjoint.
:::

::: {.remark}
<br/>

* $\displaystyle P\left(A^C\right) = 1 - P(A)$,
* $P(\O) = 0$,
* if $A \subseteq B$, then $P(A) \leq P(B)$, and
* $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
:::

::: {.proposition name="De Morgan's Laws"}
Let $A_1, \dots, A_n$ be a set of events.
$$
\left( \bigcup_{i=1}^n A_i \right)^C = \bigcap_{i=1}^n A_i^C \qquad \left( \bigcap_{i=1}^n A_i \right)^C = \bigcup_{i=1}^n A_i^C
$$
:::

::: {.theorem name="Inclusion-Exclusion Principle"}
Let $A_1, \dots, A_n$ be a set of events, then
$$
P\left( \bigcup_{i=1}^n A_i \right) = \sum_{k = 1}^n (-1)^{k-1} S_k,
$$
where
$$
S_k = \sum_{\substack{I \subseteq \{1, \dots, n\} \\ \lvert I \rvert = k}} P\left( \bigcap_{i \in I} A_i \right).
$$
:::
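
The principle can be checked by brute force on a small example. The sketch below assumes a finite sample space with equally likely outcomes (two fair dice, a hypothetical choice) and three illustrative events; the names `prob` and `inclusion_exclusion` are introduced here and are not part of the notes.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical example: two fair dice, all 36 outcomes equally likely.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = |A| / |Omega| under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Three illustrative events.
events = [
    {w for w in omega if w[0] == 6},          # first die shows 6
    {w for w in omega if w[1] == 6},          # second die shows 6
    {w for w in omega if w[0] + w[1] >= 10},  # sum is at least 10
]

def inclusion_exclusion(events):
    """Sum over k of (-1)^(k-1) * S_k, with S_k as in the theorem."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        # S_k: sum of P(intersection) over all index sets of size k.
        s_k = sum(
            prob(set.intersection(*subset))
            for subset in combinations(events, k)
        )
        total += (-1) ** (k - 1) * s_k
    return total

direct = prob(set.union(*events))          # probability of the union, counted directly
via_formula = inclusion_exclusion(events)  # the same probability via the alternating sum
print(direct, via_formula)
```

Exact rational arithmetic (`Fraction`) is used so that the two results can be compared with `==` rather than up to rounding.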

::: {.definition name="Laplace Space"}
If $\Omega = \{\omega_1, \dots, \omega_N\}$ with $\lvert \Omega \rvert = N$ and all $\omega_i$ have the same probability $p_i = \frac{1}{N}$, then $\Omega$ is called a _Laplace space_ and $P$ is the discrete uniform distribution. For any event $A$, we have
$$
P(A) = \frac{\lvert A\rvert}{\lvert\Omega\rvert}.
$$
:::

> The discrete uniform distribution exists only if $\Omega$ is finite.

::: {.definition name="Conditional Probability"}
Given two events $A$ and $B$ with $P(A) > 0$, the probability of $B$ given $A$ is defined as
$$
P(B \mid A) := \frac{P(B \cap A)}{P(A)}.
$$
:::

::: {.theorem name="Total Probability"}
Let $A_1, \dots, A_n$ be pairwise disjoint events, i.e., $A_i \cap A_j = \O$ for all $i \neq j$, with $\displaystyle \bigcup_{i=1}^n A_i = \Omega$ and $P(A_i) > 0$ for all $i = 1, \dots, n$. Then, for any event $B \subseteq \Omega$,
$$
P(B) = \sum_{i = 1}^n P(B \mid A_i) P(A_i).
$$
:::

::: {.definition name="Bayes' Rule"}
Let $A_1, \dots, A_n$ be pairwise disjoint events, i.e., $A_i \cap A_j = \O$ for all $i \neq j$, with $\displaystyle \bigcup_{i=1}^n A_i = \Omega$ and $P(A_i) > 0$ for all $i = 1, \dots, n$. Then, for an event $B \subseteq \Omega$ with $P(B) > 0$, we have
$$
P(A_k \mid B) = \frac{P(B \mid A_k) P(A_k)}{\displaystyle\sum_{i=1}^n P(B \mid A_i) P(A_i)}.
$$
:::
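
A small numerical illustration of total probability and Bayes' rule follows. All numbers are hypothetical: a two-set partition $A_1, A_2$ with priors $0.01$ and $0.99$, and conditional probabilities $P(B \mid A_1) = 0.95$ and $P(B \mid A_2) = 0.05$.

```python
from fractions import Fraction

# Hypothetical numbers: a partition A_1, A_2 of the sample space and an event B.
prior = [Fraction(1, 100), Fraction(99, 100)]       # P(A_i)
likelihood = [Fraction(95, 100), Fraction(5, 100)]  # P(B | A_i)

# Total probability: P(B) = sum_i P(B | A_i) P(A_i).
p_b = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule: P(A_k | B) = P(B | A_k) P(A_k) / P(B).
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]
print(p_b, posterior)
```

With these numbers $P(B) = 59/1000$ and $P(A_1 \mid B) = 19/118 \approx 0.16$, so a small prior keeps the posterior modest even when $P(B \mid A_1)$ is large.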

::: {.definition name="Independence"}
Events $A_1, \dots, A_n$ are _independent_ if, for every $m \in \{1, \dots, n\}$ and every choice of distinct indices $\{k_1, \dots, k_m\} \subseteq \{1, \dots, n\}$, we have
$$
P\left( \bigcap_{i=1}^m A_{k_i} \right) = \prod_{i=1}^m P\left(A_{k_i}\right).
$$
:::

::: {.definition name="Factorial"}
The _factorial_ function is defined by the product
$$
n! = \prod_{i=1}^n i = n \cdot (n - 1)!
$$
for integer $n \geq 1$.
:::

::: {.definition name="Gamma Function"}
Let $z \in \C$ with $\Re(z) > 0$, the _gamma function_ is defined via the following convergent improper integral:
$$
\Gamma(z) = \int_0^\infty t^{z - 1} e^{-t} dt.
$$
:::

::: {.remark name="Gamma Function Properties"}
<br/>

* $\displaystyle \Gamma(1/2) = \sqrt{\pi}$.
* $\displaystyle \Gamma(1) = \Gamma(2) = 1$.
* $\displaystyle \Gamma(z) = (z - 1)\Gamma(z - 1)$.
* $\displaystyle \Gamma(n) = (n - 1)!$, $\forall n \in \N$.
:::

::: {.definition name="Permutation"}
Let $n$ be the number of total objects and $k$ be the number of objects we want to select. A _permutation_ is an arrangement of elements where we care about ordering.

1. Repetition not allowed:
$$
P_n(k) = \frac{n!}{(n - k)!}.
$$
2. Repetition allowed:
$$
P_n(k) = n^k.
$$
:::

::: {.definition name="Combination"}
Let $n$ be the number of total objects and $k$ be the number of objects we want to select. A _combination_ is a selection of elements where we do not care about ordering.

1. Repetition not allowed:
$$
C_n(k) = \binom{n}{k} = \frac{P_n(k)}{k!} = \frac{n!}{k!(n - k)!}.
$$
2. Repetition allowed:
$$
C_n(k) = \binom{n + k -1}{k}.
$$
:::

> Repetition is the same as replacement.

::: {.remark name="Binomial Coefficient Properties"}
<br/>

* $0! = 1$,
* $\displaystyle \binom{n}{0} = \binom{n}{n} = 1$,
* $\displaystyle \binom{n}{1} = \binom{n}{n - 1} = n$,
* $\displaystyle \binom{n}{k} = \binom{n}{n - k}$,
* $\displaystyle \binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n - 1}{k}$, and
* $\displaystyle \sum_{k = 0}^n \binom{n}{k} = 2^n$.
:::
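
The counting formulas and the binomial coefficient properties can be spot-checked with Python's standard `math` module (`math.perm` and `math.comb` require Python 3.8 or later). The values $n = 5$ and $k = 3$ are illustrative only.

```python
import math

n, k = 5, 3  # illustrative values, not taken from the text

perm_no_rep = math.perm(n, k)       # n! / (n - k)!
perm_rep = n ** k                   # n^k
comb_no_rep = math.comb(n, k)       # n! / (k! (n - k)!)
comb_rep = math.comb(n + k - 1, k)  # binom(n + k - 1, k)

# Binomial coefficient properties from the remark above.
symmetry = math.comb(n, k) == math.comb(n, n - k)
pascal = math.comb(n, k) == math.comb(n - 1, k - 1) + math.comb(n - 1, k)
row_sum = sum(math.comb(n, j) for j in range(n + 1)) == 2 ** n

print(perm_no_rep, perm_rep, comb_no_rep, comb_rep)
print(symmetry, pascal, row_sum)
```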

::: {.remark name="Sum Properties"}
Let $x, y \in \R^n$, $c \in \R$ and $k \neq 1$.

* In general, $\displaystyle \sum_i x_i y_i \neq \sum_i x_i \sum_i y_i$.
* In general, $\displaystyle \sum_i x_i^k \neq \left( \sum_i x_i \right)^k$.
* $\displaystyle \sum_{i=1}^n c = nc$.
* $\displaystyle \sum_{i=1}^n cx_i = c\sum_{i=1}^n x_i$.
* $\displaystyle \sum_i (x_i + y_i) = \sum_i x_i + \sum_i y_i$.
:::

::: {.theorem name="Binomial Expansion"}
$$
(x+y)^n = \sum_{k=0}^n {n \choose k}x^{n-k}y^k = \sum_{k=0}^n {n \choose k}  x^k y^{n-k}
$$
:::

## Random Variables

::: {.definition name="Random Variable"}
Let $(\Omega, \F, P)$ be a probability space. A _random variable_ on $\Omega$ is a function
$$
X : \Omega \to \mathscr{W}(X) \subseteq \R.
$$
If the image $\mathscr{W}(X)$ is countable, $X$ is called a _discrete random variable_, otherwise it is called a _continuous random variable_.
:::

::: {.definition name="Probability Density Function / Probability Mass Function"}
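For a discrete random variable $X$, the _probability mass function_ (PMF) $f_X : \R \to [0, 1]$ is defined as
$$
f_X(x) := P(X = x) := P\left(\{\omega \mid X(\omega) = x\}\right).
$$
For a continuous random variable $X$, the _probability density function_ (PDF) is instead a function $f_X : \R \to [0, \infty)$ such that $\displaystyle P(a \leq X \leq b) = \int_a^b f_X(t)dt$ for all $a \leq b$; in this case $P(X = x) = 0$ for every $x$.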
:::

::: {.remark}
<br/>

* If $X$ is a discrete random variable with values $u_1, u_2, \dots$, then $\displaystyle \sum_i f_X(u_i) = 1$.
* If $X$ is a continuous random variable, then $\displaystyle \int_{-\infty}^{\infty} f_X(t)dt = 1$.
:::

::: {.definition name="Cumulative Distribution"}
The _cumulative distribution function_ (CDF) $F_X : \R \to [0,1]$ of a random variable $X$ is a function defined as
$$
F_X(x) := P(X \leq x) := P\left(\{\omega \mid X(\omega) \leq x\}\right).
$$
If the PDF is given, the CDF can be expressed with
$$
F_X(x) =
\begin{cases}
\displaystyle \sum_{x_i \leq x} f_X(x_i), & \text{discrete} \\
\displaystyle \int_{-\infty}^x f_X(t)dt,  & \text{continuous}
\end{cases}
$$
:::

::: {.remark name="Cumulative Distribution Properties"}
<br/>

* If $t \leq s$, then $F_X(t) \leq F_X(s)$ (*monotonicity*).
* $\displaystyle \lim_{t \downarrow s} F_X(t) = F_X(s)$ (*right-continuity*).
* $\displaystyle \lim_{t \to -\infty} F_X(t) = 0$ and $\displaystyle \lim_{t \to \infty} F_X(t) = 1$.
* $\displaystyle P(a < X \leq b) = F_X(b) - F_X(a)$, which equals $\displaystyle \int_{a}^{b} f_X(t)dt$ if $X$ is continuous.
* $P(X > t) = 1 - P(X \leq t) = 1 - F_X(t)$.
* If $X$ is continuous, $\displaystyle \frac{d}{dx}F_X(x) = f_X(x)$ at every point $x$ where $f_X$ is continuous.
:::

::: {.definition name="Expected Value"}
Let $X$ be a random variable. The _expected value_ is defined as
$$
E[X] :=
\begin{cases}
\displaystyle \sum_{x_k \in \mathscr{W}(X)} x_k \cdot f_X(x_k), & \text{discrete} \\
\displaystyle \int_{-\infty}^\infty x \cdot f_X(x)dx,  & \text{continuous}
\end{cases}
$$
:::

::: {.remark name="Expected Value Properties"}
<br/>

* $E[X] \leq E[Y]$ if $\forall \omega : X(\omega) \leq Y(\omega)$,
* $\displaystyle E\left[ \sum_{i=0}^n a_i X_i \right] = \sum_{i=0}^n a_i E\left[ X_i \right]$,
* $\displaystyle E[X] = \sum_{j=1}^\infty P[X \geq j]$ if $\mathscr{W}(X) \subseteq \N_0$,
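* in general, $\displaystyle E\left[ \sum_{i=0}^\infty X_i \right] \neq \sum_{i=0}^\infty E\left[ X_i \right]$ (interchanging an infinite sum and the expectation requires extra conditions, e.g., $X_i \geq 0$ for all $i$),
* $E[E[X]] = E[X]$,
* $\left(E[XY]\right)^2 \leq E[X^2]E[Y^2]$ (*Cauchy-Schwarz inequality*), and
* $\displaystyle E\left[ \prod_{i=1}^n X_i \right] = \prod_{i=1}^n E[X_i]$ for independent $X_1, \dots, X_n$.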
:::

> The expected value is a linear operator.

::: {.definition name="Raw Moment / Central Moment"}
Let $n \in \N$. The _$n$th (raw) moment_ is defined as
$$
\mu_n' = E\left[X^n\right].
$$
The _$n$th central moment_ is defined as
$$
\mu_n = E\left[\left( X - E[X] \right)^n \right].
$$
:::

::: {.definition name="Expected Value of Functions"}
Let $X$ be a random variable and $Y = g(X)$ with $g : \R \to \R$, then
$$
E[Y] :=
\begin{cases}
\displaystyle \sum_{x_k \in \mathscr{W}(X)} g(x_k) \cdot f_X(x_k), & \text{discrete} \\
\displaystyle \int_{-\infty}^\infty g(x) \cdot f_X(x)dx,  & \text{continuous}
\end{cases}
$$
:::
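
The discrete case of both definitions can be sketched in a few lines. The example assumes a fair six-sided die (a hypothetical choice), and `expectation` is a helper name introduced here.

```python
from fractions import Fraction

# Hypothetical discrete random variable: one roll of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x_k of g(x_k) * f_X(x_k)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                             # E[X]
second_moment = expectation(pmf, lambda x: x ** 2)  # E[X^2], i.e. g(x) = x^2
print(mean, second_moment)
```

Here $E[X] = 7/2$ and $E[X^2] = 91/6$, which differs from $(E[X])^2 = 49/4$: in general $E[g(X)] \neq g(E[X])$.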

::: {.definition name="Moment-Generating Function"}
Let $X$ be a random variable. The _moment-generating function_ of $X$ is defined as
$$
M_X(t) := E \left[ e^{tX} \right].
$$
:::

::: {.definition name="Characteristic Function"}
Let $X$ be a random variable. The _characteristic function_ of $X$ is defined as
$$
\varphi_X(t) := E \left[ e^{itX} \right]
$$
where $i = \sqrt{-1} \in \C$.
:::

::: {.definition name="Variance"}
Let $X$ be a random variable with $E[X^2] < \infty$. The _variance_ of $X$ is defined as
$$
\Var[X] := E\left[ (X - E[X])^2 \right].
$$
:::

::: {.remark name="Variance Properties"}
<br/>

* $0 \leq \Var[X] \leq E[X^2]$,
* $\Var[X] = E[X^2] - E^2[X]$,
* $\Var[aX + b] = a^2\Var[X]$,
* $\Var[X] = \Cov(X,X)$,
* $\displaystyle \Var \left[ \sum_{i=1}^n a_i X_i \right] = \sum_{i = 1}^n a_i^2 \Var[X_i] + 2 \sum_{1 \leq i < j \leq n} a_i a_j \Cov(X_i, X_j)$, and
* $\displaystyle \Var \left[ \sum_{i=1}^n X_i \right] = \sum_{i = 1}^n \Var[X_i]$ if $\Cov(X_i, X_j) = 0$, $\forall i \neq j$.
:::
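
The first three properties can be verified exactly on a small example. The sketch again assumes a fair six-sided die; the constants `a` and `b` and the helper names are illustrative.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete PMF."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf, g=lambda x: x):
    """Var[g(X)] from the definition E[(g(X) - E[g(X)])^2]."""
    mu = expectation(pmf, g)
    return expectation(pmf, lambda x: (g(x) - mu) ** 2)

var_def = variance(pmf)
var_shortcut = expectation(pmf, lambda x: x ** 2) - expectation(pmf) ** 2

a, b = 3, 10  # illustrative constants
var_affine = variance(pmf, lambda x: a * x + b)
print(var_def, var_shortcut, var_affine)
```

Both routes give $\Var[X] = 35/12$, and the shift $b$ drops out of $\Var[aX + b] = a^2 \Var[X] = 105/4$.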

::: {.definition name="Standard Deviation"}
Let $X$ be a random variable with $E[X^2] < \infty$. The _standard deviation_ of $X$ is defined as
$$
\sigma(X) = \sd(X) := \sqrt{\Var[X]}.
$$
:::

::: {.definition name="Covariance"}
Let $X$ and $Y$ be random variables with finite second moments. The _covariance_ of $X$ and $Y$ is defined as
$$
\begin{align*}
\Cov(X, Y) :&=\ E\left[ (X - E[X]) (Y - E[Y]) \right] \\
&=\ E[XY] - E[X]E[Y]
\end{align*}
$$
:::

> The covariance is a measure of correlation between two random variables. $\Cov(X,Y)>0$ if $Y$ tends to increase as $X$ increases. $\Cov(X,Y)<0$ if $Y$ tends to decrease as $X$ increases. If $\Cov(X,Y)=0$, then $X$ and $Y$ are uncorrelated.

::: {.remark name="Covariance Properties"}
<br/>

* $\Cov(aX,bY) = ab\Cov(X,Y)$,
* $\Cov(X + a, Y + b) = \Cov(X,Y)$, and
* $\Cov(aX_1 + bX_2, cY_1 + dY_2)$
$= ac\Cov(X_1,Y_1) + ad\Cov(X_1,Y_2) + bc\Cov(X_2,Y_1) + bd\Cov(X_2,Y_2)$.
:::
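
As a sanity check, the two forms of the covariance and the scaling and shifting properties can be evaluated on a small joint distribution. The joint PMF below is hypothetical, as are the helper names `expect` and `cov`.

```python
from fractions import Fraction

# Hypothetical joint PMF of two binary random variables (X, Y).
joint = {
    (0, 0): Fraction(2, 5), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(2, 5),
}

def expect(g):
    """E[g(X, Y)] under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def cov(g, h):
    """Cov(g(X, Y), h(X, Y)) = E[gh] - E[g] E[h]."""
    return expect(lambda x, y: g(x, y) * h(x, y)) - expect(g) * expect(h)

X = lambda x, y: x
Y = lambda x, y: y

cov_xy = cov(X, Y)

# Definition form: E[(X - E[X]) (Y - E[Y])].
ex, ey = expect(X), expect(Y)
cov_def = expect(lambda x, y: (x - ex) * (y - ey))

# Scaling and shifting combined: Cov(aX + c, bY + d) = ab Cov(X, Y).
a, b, c, d = 2, -3, 5, 7
cov_affine = cov(lambda x, y: a * x + c, lambda x, y: b * y + d)
print(cov_xy, cov_def, cov_affine)
```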

::: {.definition name="Correlation"}
Let $X$ and $Y$ be random variables with finite, nonzero variances. The _correlation_ of $X$ and $Y$ is defined as
$$
\Corr(X,Y) := \frac{\Cov(X,Y)}{\sqrt{\Var[X] \cdot \Var[Y]}}.
$$
:::

> Correlation is the same as covariance but normalized with values between $-1$ and $1$.

::: {.definition name="Coefficient of Variation"}
The _coefficient of variation_ is defined as the ratio of the standard deviation to the mean.
$$
\ie \quad c_V = \frac{\sigma}{\mu}.
$$
:::

::: {.definition name="Indicator Function"}
The _indicator function_ $\indicator_A : \Omega \to \{0, 1\}$ for a set (event) $A$ is defined as
$$
\indicator_A (\omega) :=
\begin{cases}
1, & \omega \in A \\
0, & \omega \in A^C
\end{cases}
$$
:::

::: {.definition name="Survival Function"}
Let $X$ be a continuous random variable with cumulative distribution function $F(x)$ on the interval $[0, \infty)$. Its _survival function_ or _reliability function_ is defined as
$$
S(x) = P[X > x] = \int_x^\infty f(t)dt = 1 - F(x).
$$
:::

::: {.definition name="Memorylessness"}
Suppose $X$ is a non-negative random variable. The probability distribution of $X$ is _memoryless_ if for any $s, t \geq 0$, we have
$$
P[X > s + t \mid X > t] = P[X > s].
$$
:::
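
The definition can be illustrated with the exponential distribution, whose survival function is $S(x) = e^{-\lambda x}$ for $x \geq 0$. The rate $\lambda = 0.5$ and the test points are assumptions made only for this sketch.

```python
import math

rate = 0.5  # hypothetical rate parameter of an exponential distribution

def survival(x):
    """S(x) = P[X > x] = exp(-rate * x) for x >= 0."""
    return math.exp(-rate * x)

def conditional_tail(s, t):
    """P[X > s + t | X > t] = S(s + t) / S(t)."""
    return survival(s + t) / survival(t)

# The conditional tail should match the unconditional tail P[X > s].
for s, t in [(1.0, 2.0), (0.3, 5.0), (4.0, 0.0)]:
    print(s, t, conditional_tail(s, t), survival(s))
```

The last two columns agree up to floating-point rounding, since $S(s + t) / S(t) = e^{-\lambda s}$ regardless of $t$.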

::: {.theorem name="Markov's Inequality"}
Let $X$ be a random variable and $g : \mathscr{W}(X) \to [0, \infty)$ be an increasing function, then, for all $c$ with $g(c) > 0$, we have
$$
P[X \geq c] \leq \frac{E[g(X)]}{g(c)}.
$$

> In practice, one usually takes $g(x) = x$ for a non-negative $X$ and $c > 0$.

:::

::: {.theorem name="Chebyshev's Inequality"}
Let $X$ be a random variable with $\Var[X] < \infty$, then, if $k > 0$,
$$
P\left[ \lvert X - E[X] \rvert \geq k \right] \leq \frac{\Var[X]}{k^2}.
$$

|         |Lower Bound|  |Upper Bound|
|:--------|:---------:|--|:---------:|
|$k$ units|$\displaystyle P\left[ \left\lvert X - E[X] \right\rvert < k \right] \geq 1 - \frac{\Var[X]}{k^2}$| |$\displaystyle P\left[ \left\lvert X - E[X] \right\rvert \geq k \right] \leq \frac{\Var[X]}{k^2}$|
|$r$ standard deviations|$\displaystyle P\left[ \left\lvert X - E[X] \right\rvert < r\  \sd(X) \right] \geq 1 - \frac{1}{r^2}$| |$\displaystyle P\left[ \left\lvert X - E[X] \right\rvert \geq r\ \sd(X) \right] \leq \frac{1}{r^2}$|
:::
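
To see how loose these bounds can be, the sketch below compares exact tail probabilities with the Markov bound (taking $g(x) = x$) and the Chebyshev bound. It assumes a fair six-sided die; the thresholds $c = 5$ and $k = 2$ are illustrative.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die (non-negative, so g(x) = x applies).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

def prob(condition):
    """P[condition(X)] computed directly from the PMF."""
    return sum(p for x, p in pmf.items() if condition(x))

c = 5
markov_exact = prob(lambda x: x >= c)  # P[X >= c]
markov_bound = mean / c                # E[X] / c

k = 2
chebyshev_exact = prob(lambda x: abs(x - mean) >= k)  # P[|X - E[X]| >= k]
chebyshev_bound = var / k ** 2                        # Var[X] / k^2
print(markov_exact, markov_bound, chebyshev_exact, chebyshev_bound)
```

Both exact probabilities equal $1/3$, while the bounds are $7/10$ and $35/48$: the inequalities hold but are far from tight here.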

::: {.theorem name="Jensen's Inequality"}
If $X$ is a random variable and $\varphi$ is a convex function, then
$$
\varphi\left(E[X]\right) \leq E\left[\varphi(X)\right].
$$
:::

## Multivariate Distributions

::: {.definition name="Joint Probability Density Function"}
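For a discrete random vector $X = (X_1, \dots, X_n)$, the _joint probability density function_ (joint PMF) $f_X : \R^n \to [0, 1]$ is a function defined as
$$
f_X(x_1, \dots, x_n) := P[X_1 = x_1, \dots, X_n = x_n].
$$
For a continuous random vector, the joint PDF is instead a function $f_X : \R^n \to [0, \infty)$ whose integral over a region gives the probability that $X$ lies in that region.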
:::

::: {.definition name="Joint Cumulative Distribution Function"}
The _joint cumulative distribution function_ $F_X : \R^n \to [0, 1]$ with $X = (X_1, \dots, X_n)$ is a function defined as
$$
F_X(x_1, \dots, x_n) := P[X_1 \leq x_1, \dots, X_n \leq x_n].
$$
If the joint PDF is given, it can be expressed with
$$
F_X(x) =
\begin{cases}
\displaystyle \sum_{t_1 \leq x_1} \cdots \sum_{t_n \leq x_n} f_X(t), & \text{discrete} \\
\displaystyle \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_X(t)dt, & \text{continuous}
\end{cases}
$$
where $t = (t_1, \dots, t_n)$ and $x = (x_1, \dots, x_n)$.
:::

::: {.remark}
$$
\frac{\partial^n F_X(x_1, \dots, x_n)}{\partial x_1 \cdots \partial x_n} = f_X(x_1, \dots, x_n)
$$
:::

::: {.definition name="Marginal Probability Density Function"}
The _marginal probability density function_ $f_{X_i} : \R \to [0,1]$ of $X_i$ given a joint PDF $f_X(x_1, \dots, x_n)$ is defined as
$$
f_{X_i}(t_i) =
\begin{cases}
\displaystyle \sum_{t_1} \cdots \sum_{t_{i-1}} \sum_{t_{i+1}} \cdots \sum_{t_n} f_X(t), & \text{discrete} \\
\displaystyle \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(t)d\tilde{t}, & \text{continuous}
\end{cases}
$$
where $\tilde{t} = (t_1, \dots, t_{i-1}, t_{i+1}, \dots, t_n)$, and in the discrete case $t_k \in \mathscr{W}(X_k)$.
:::

> The idea of the marginal probability is to ignore all other random variables and consider only the one we’re interested in.

::: {.definition name="Marginal Cumulative Distribution Function"}
The _marginal cumulative distribution function_ $F_{X_i} : \R \to [0,1]$ of $X_i$ given a joint CDF $F_X(x_1, \dots, x_n)$ is defined as
$$
F_{X_i}(x_i) := \lim_{x_{j\neq i} \to \infty} F_X(x_1, \dots, x_n).
$$
:::

::: {.definition name="Conditional Distribution"}
The _conditional distribution_ $f_{X \mid Y} : \R \to [0,1]$ is defined as
$$
\begin{align*}
f_{X\mid Y} (x \mid y) :&=\ P[X = x \mid Y = y] \\
&=\ \frac{P[X = x, Y = y]}{P[Y = y]} \\
&=\ \frac{\text{Joint PDF}}{\text{Marginal PDF}}
\end{align*}
$$
:::
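
For a discrete pair $(X, Y)$, marginal and conditional distributions reduce to sums and ratios over a table. The joint PMF below is hypothetical, and `marginal` and `conditional_x_given_y` are names introduced for this sketch.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF f_{X,Y}(x, y) on a small grid.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

def marginal(joint, axis):
    """Sum the joint PMF over every coordinate except `axis`."""
    out = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for point, p in joint.items():
        out[point[axis]] += p
    return dict(out)

f_x = marginal(joint, 0)
f_y = marginal(joint, 1)

def conditional_x_given_y(y):
    """f_{X|Y}(x | y) = joint / marginal, defined when f_Y(y) > 0."""
    return {x: p / f_y[y] for (x, yy), p in joint.items() if yy == y}

print(f_x, f_y, conditional_x_given_y(1))
```

Each conditional PMF sums to $1$, as it must, because dividing by the marginal renormalizes one slice of the joint table.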

::: {.definition name="Independence"}
The random variables $X_1, \dots, X_n$ are _independent_ if
$$
F_{X_1, \dots, X_n} (x_1, \dots, x_n) = \prod_{i=1}^n F_{X_i}(x_i).
$$
Similarly, if they have a joint PDF, they are _independent_ if and only if
$$
f_{X_1, \dots, X_n} (x_1, \dots, x_n) = \prod_{i=1}^n f_{X_i}(x_i).
$$
:::

::: {.theorem name="Function Independence"}
If the random variables $X_1, \dots, X_n$ are independent and $f_i : \R \to \R$ is a function with $Y_i := f_i(X_i)$, then also $Y_1, \dots, Y_n$ are independent.
:::

::: {.theorem}
The random variables $X_1, \dots, X_n$ are independent if and only if, $\forall B_i \subseteq \mathscr{W}(X_i)$, we have
$$
P[X_1 \in B_1, \dots, X_n \in B_n] = \prod_{i=1}^n P[X_i \in B_i].
$$
:::

::: {.definition name="Joint Expected Value"}
The _joint expected value_ of a random variable $Y = g(X_1, \dots, X_n) = g(X)$ is defined as
$$
E[Y] =
\begin{cases}
\displaystyle \sum_{t_1} \cdots \sum_{t_n} g(t) f_X(t), & \text{discrete} \\
\displaystyle \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(t)f_X(t)dt, & \text{continuous}
\end{cases}
$$
where $t = (t_1, \dots, t_n)$, and in the discrete case $t_k \in \mathscr{W}(X_k)$.
:::

::: {.definition name="Conditional Expected Value"}
The _conditional expected value_ of $X$ given $Y = y$ is defined as
$$
E[X\mid Y = y] :=
\begin{cases}
\displaystyle \sum_{x \in \mathscr{W}(X)} x \cdot f_{X\mid Y}(x\mid y), & \text{discrete} \\
\displaystyle \int_{-\infty}^\infty x \cdot f_{X\mid Y}(x\mid y)dx,  & \text{continuous}
\end{cases}
$$
The random variable $E[X \mid Y]$ is obtained by evaluating this function of $y$ at $Y$.
:::

::: {.remark}
<br/>

* $E[X] = E[E[X\mid Y]]$,
* $E[X\mid Y] = E[X]$ if $X$ and $Y$ are independent, and
* $\Var[X] = E\left[\Var[X \mid Y]\right] + \Var\left[E[X \mid Y]\right]$.
:::
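
The first and third identities can be confirmed exactly on a small discrete example. The joint PMF is hypothetical and `cond_moment` is a helper introduced here.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y).
joint = {
    (0, 0): Fraction(1, 8), (2, 0): Fraction(1, 4),
    (1, 1): Fraction(3, 8), (4, 1): Fraction(1, 4),
}

# Marginal PMF of Y.
f_y = {}
for (x, y), p in joint.items():
    f_y[y] = f_y.get(y, Fraction(0)) + p

def cond_moment(y, power):
    """E[X^power | Y = y] from the conditional PMF joint / marginal."""
    return sum(x ** power * p / f_y[y] for (x, yy), p in joint.items() if yy == y)

e_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x ** 2 * p for (x, _), p in joint.items()) - e_x ** 2

cond_mean = {y: cond_moment(y, 1) for y in f_y}
cond_var = {y: cond_moment(y, 2) - cond_mean[y] ** 2 for y in f_y}

tower = sum(cond_mean[y] * f_y[y] for y in f_y)                     # E[E[X | Y]]
mean_of_var = sum(cond_var[y] * f_y[y] for y in f_y)                # E[Var[X | Y]]
var_of_mean = sum((cond_mean[y] - e_x) ** 2 * f_y[y] for y in f_y)  # Var[E[X | Y]]
print(e_x, tower, var_x, mean_of_var + var_of_mean)
```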

::: {.definition}
Let $Y = g(X_1, \dots, X_n) = g(X)$.
$$
P[Y \in C] = \int_{A_C} f_X(t)dt
$$
where $A_C = \{ x = (x_1, \dots, x_n) \in \R^n : g(x) \in C \}$ and $t = (t_1, \dots, t_n)$.
:::

::: {.theorem name="Transformation"}
Let $X$ be a continuous random variable and $Y = g(X)$. When $g(\cdot)$ is a strictly increasing, differentiable function, then
$$
\begin{align}
F_Y(y) &=\ \int_{-\infty}^{g^{-1}(y)} f_X(x)dx \\
&=\ F_X(g^{-1}(y)) \\
f_Y(y) &=\ f_X\left( g^{-1}(y) \right) \frac{\partial}{\partial y} g^{-1}(y)
\end{align}
$$
When $g(\cdot)$ is a strictly decreasing function, then
$$
\begin{align}
F_Y(y) &=\ \int_{g^{-1}(y)}^{\infty} f_X(x)dx \\
&=\ 1 - F_X(g^{-1}(y)) \\
f_Y(y) &= - f_X\left( g^{-1}(y) \right) \frac{\partial}{\partial y} g^{-1}(y)
\end{align}
$$
Equivalently, $\displaystyle f_X(x) = f_Y(g(x)) \left\lvert \frac{\partial g(x)}{\partial x} \right\rvert$.

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with entries $\displaystyle \mathscr{J}_{ij} = \frac{\partial g_i(x)}{\partial x_j}$. Thus, for real-valued vectors $x$ and $y = g(x)$ with $g$ invertible,
$$
f_X(x) = f_Y(g(x)) \left\lvert\det \left( \frac{\partial g(x)}{\partial x} \right) \right\rvert.
$$
:::
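
As a worked instance of the one-dimensional formulas, take the affine map $g(x) = ax + b$ with $a > 0$ (a choice made here purely for illustration). Then $\displaystyle g^{-1}(y) = \frac{y - b}{a}$ and $\displaystyle \frac{\partial}{\partial y} g^{-1}(y) = \frac{1}{a}$, so
$$
\begin{align*}
F_Y(y) &=\ F_X\left( \frac{y - b}{a} \right) \\
f_Y(y) &=\ \frac{1}{a} f_X\left( \frac{y - b}{a} \right)
\end{align*}
$$
For $a < 0$ the map is strictly decreasing, and the second set of formulas gives $\displaystyle f_Y(y) = -\frac{1}{a} f_X\left( \frac{y - b}{a} \right)$. Both cases combine into
$$
f_Y(y) = \frac{1}{\lvert a \rvert} f_X\left( \frac{y - b}{a} \right).
$$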

::: {.theorem name="Probability Integral Transform"}
Let $X$ have a continuous CDF $F_X(\cdot)$ and define $Y = F_X(X)$. Then $Y \sim \Unif[0,1]$, i.e., $F_Y(y) = y$ for $y \in [0,1]$.
:::

::: {.theorem}
For $X \sim F_X$ and $Y \sim F_Y$, if $M_X$ and $M_Y$ exist, and $M_X(t) = M_Y(t)$ for all $t$ in some neighbourhood of zero, then $F_X(u) = F_Y(u)$ for all $u$.
:::

::: {.remark name="Monte Carlo Integration"}
Let $\displaystyle I = \int_a^b g(x)dx$ be the integral of a function that is hard to evaluate, then
$$
\begin{align*}
I &=\ \int_a^b g(x)dx \\
&=\ (b - a) \int_a^b g(x) \frac{1}{b - a} dx \\
&=\ (b - a) \int_{-\infty}^{\infty} g(x) f_{\Unif}(x) dx \\
&=\ (b - a) \cdot E[g(\Unif)]
\end{align*}
$$
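where $\Unif$ denotes a random variable uniformly distributed on $(a,b)$, with density $f_{\Unif}(x) = \frac{1}{b - a}$ on $[a, b]$ and $0$ elsewhere. By the **Law of Large Numbers**, we can approximate $E[g(\Unif)]$ by averaging $g$ over independent samples $u_1, u_2, \dots$ from $\Unif(a,b)$: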
$$
\frac{b - a}{n} \sum_{i=1}^n g(u_i) \underset{n\to\infty}{\longrightarrow} (b - a) \cdot E[g(\Unif)].
$$
:::
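
The estimator above can be sketched with Python's standard `random` module. The integrand $g(x) = x^2$ on $[0, 1]$ is a hypothetical test case chosen because its integral, $1/3$, is known; `mc_integrate` is a name introduced here.

```python
import random

def mc_integrate(g, a, b, n, seed=0):
    """Estimate the integral of g over [a, b] as (b - a) / n * sum of g(u_i)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    total = sum(g(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# Hypothetical integrand with a known answer: the integral of x^2 over [0, 1] is 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, n=100_000)
print(estimate)
```

The error of such an estimate typically shrinks like $1/\sqrt{n}$, so each additional decimal digit of accuracy costs roughly a hundred times more samples.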

::: {.remark name="Sum"}
Let $X_1, \dots, X_n$ be independent random variables, then the sum $Z = X_1 + \cdots + X_n$ has a PDF $f_Z$ given by the convolution of the individual PDFs
$$
f_Z(z) = (f_{X_1} * \cdots * f_{X_n})(z).
$$
In the special case where $Z = X + Y$, we have
$$
f_Z(z) =
\begin{cases}
\displaystyle \sum_{x_k \in \mathscr{W}(X)} f_X(x_k) f_Y(z - x_k), & \text{discrete} \\
\displaystyle \int_{-\infty}^\infty f_X(t) f_Y(z - t)dt, & \text{continuous}
\end{cases}
$$
:::
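
The discrete case of the two-variable formula is a double loop over the supports. The sketch below assumes two independent fair dice (a hypothetical example); `convolve` is a name introduced here.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(f_x, f_y):
    """PMF of Z = X + Y for independent discrete X and Y."""
    f_z = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in f_x.items():
        for y, py in f_y.items():
            f_z[x + y] += px * py  # adds f_X(x) f_Y(z - x) at z = x + y
    return dict(f_z)

# Hypothetical example: the sum of two fair six-sided dice.
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)
print(two_dice[7], sum(two_dice.values()))
```

Sums of more than two variables follow by repeated application, e.g. `convolve(two_dice, die)` for three dice.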

> Often it is much easier to use properties of the random variables to find the sum instead of evaluating the convolution.

::: {.remark name="Product"}
Let $X$ and $Y$ be independent random variables. To evaluate the PDF and the CDF of $Z = XY$, we proceed as
$$
\begin{align}
F_Z(z) &=\ P[XY \leq z] \\
&=\ P\left[ X \geq \frac{z}{Y}, Y < 0 \right] \\
&+\ P\left[ X \leq \frac{z}{Y}, Y > 0 \right] \\
&=\ \int_{-\infty}^0 \left[ \int_{\frac{z}{y}}^\infty f_X(x)dx \right] f_Y(y)dy \\
&+\ \int_0^\infty \left[ \int_{-\infty}^{\frac{z}{y}} f_X(x)dx \right] f_Y(y)dy
\end{align}
$$
where the PDF  is
$$
f_Z(z) = \frac{dF_Z}{dz}(z) = \int_{-\infty}^\infty f_Y(y)f_X\left(\frac{z}{y}\right) \frac{1}{\lvert y \rvert} dy.
$$
:::

::: {.remark name="Quotient"}
Let $X$ and $Y$ be independent random variables. To evaluate the PDF and the CDF of $\displaystyle Z = \frac{X}{Y}$, we proceed as
$$
\begin{align}
F_Z(z) &=\ P\left[\frac{X}{Y} \leq z\right] \\
&=\ P\left[ X \geq zY, Y < 0 \right] \\
&+\ P\left[ X \leq zY, Y > 0 \right] \\
&=\ \int_{-\infty}^0 \left[ \int_{yz}^\infty f_X(x)dx \right] f_Y(y)dy \\
&+\ \int_0^\infty \left[ \int_{-\infty}^{yz} f_X(x)dx \right] f_Y(y)dy
\end{align}
$$
where the PDF  is
$$
f_Z(z) = \frac{dF_Z}{dz}(z) = \int_{-\infty}^\infty \lvert y\rvert\ f_X\left(yz\right) f_Y(y) dy.
$$
:::

::: {.definition name="Covariance Matrix / Correlation Matrix"}
Let $X_1, \dots, X_n$ be random variables. Let $\sigma_{ij} = \Cov(X_i, X_j)$ and $\rho_{ij} = \Corr(X_i, X_j)$ for every $i, j = 1, 2, \dots, n$. The _covariance_ and _correlation matrices_ are defined respectively as
$$
\Sigma =
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\
\vdots      & \vdots      & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
\end{bmatrix}_{n\times n} \\ \\
P =
\begin{bmatrix}
1         & \rho_{12} & \cdots & \rho_{1n} \\
\rho_{21} & 1         & \cdots & \rho_{2n} \\
\vdots    & \vdots    & \ddots & \vdots \\
\rho_{n1} & \rho_{n2} & \cdots & 1
\end{bmatrix}_{n\times n}
$$
:::

::: {.theorem}
Let $X$ and $Y$ be random variables with finite, nonzero variances. $\lvert\rho_{X,Y}\rvert = 1$ if and only if there exist $\alpha, \beta \in \R$ with $\beta \neq 0$ such that $Y = \alpha + \beta X$ with probability $1$.
:::