Last time we introduced basic distributions for discrete random variables — our first models! But a model of a single discrete random variable isn't all that interesting... Contingency tables allow us to model and reason about the joint distribution of two categorical random variables. Two might not sound like a lot — we'll get to more complex models soon enough! — but it turns out plenty of important questions boil down to understanding the relationship between two variables.
We used the College Football National Championship to motivate our analyses in the last lecture, but I have to admit, I have a love-hate relationship with football. While it's fun to watch, it's increasingly clear that repetitive head injuries sustained in football can have devastating consequences, including an increased risk of chronic traumatic encephalopathy (CTE). A recent study from {cite:t}`mckee2023neuropathologic` in JAMA Neurology showed that CTE can be found even in amateur high school and college athletes, and the New York Times highlighted their research in a very sad article last fall.
The only way to definitively diagnose CTE is via autopsy. {cite:t}`mckee2023neuropathologic` studied the brains of 152 people who had played contact sports and died under the age of 30 from various causes including injury, overdose, suicide, and others (but not from neurodegenerative disease). Of those 152 people, 92 had played football and the rest had played other sports like soccer, hockey, wrestling, rugby, etc. Of the 152 people, 63 were found to have CTE upon neuropathologic evaluation. Of the 92 football players, 48 had CTE.
We can summarize that result in a 2x2 table:
| | No CTE | CTE | Total |
|---|---|---|---|
| No Football | 45 | 15 | 60 |
| Football | 44 | 48 | 92 |
| Total | 89 | 63 | 152 |
:::{admonition} Questions
With this data, can we say that playing football is associated with CTE? If so, how strong is the association? Can we say whether this association is causal? What are some caveats to consider when interpreting this data?
:::
The table above is an example of a contingency table. It represents a sample from a joint distribution of two random variables,
More generally, let
One of the key questions in the analysis of contingency tables is whether
Equivalently, the variables are independent if the conditionals are homogeneous,
\begin{align*}
X \perp Y \iff \pi_{j|i} = \frac{\pi_{ij}}{\pi_{i \bullet}} = \frac{\pi_{i \bullet} \pi_{\bullet j}}{\pi_{i \bullet}} = \pi_{\bullet j} \; \forall i,j.
\end{align*}
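As a rough illustration of homogeneous conditionals, the sketch below plugs the football/CTE counts from the table above into these definitions, using empirical proportions in place of the unknown probabilities. The variable names and the ordering (row 0 = no football, row 1 = football; column 0 = no CTE, column 1 = CTE) are choices made for this example.

```python
# Counts from the table above: rows = (no football, football), cols = (no CTE, CTE).
counts = [[45, 15],
          [44, 48]]

n = sum(sum(row) for row in counts)             # grand total
row_sums = [sum(row) for row in counts]         # row totals
col_sums = [sum(col) for col in zip(*counts)]   # column totals

# Empirical column marginal and empirical row conditionals.
col_marginal = [c / n for c in col_sums]
row_conditionals = [[x / r for x in row] for row, r in zip(counts, row_sums)]

for i, cond in enumerate(row_conditionals):
    print(f"row {i}: conditional = {[round(p, 3) for p in cond]}")
print(f"column marginal = {[round(p, 3) for p in col_marginal]}")
```

If the conditionals were homogeneous, each row's conditional would match the column marginal up to sampling noise. Here the sample proportions with CTE are 0.25 without football and about 0.52 with football; whether a gap of that size could plausibly arise by chance is a question for a formal test.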
We don't usually observe the probabilities
Under a Poisson sampling model,
\begin{align*}
X_{ij} &\sim \mathrm{Po}(\lambda_{ij})
\end{align*}
where
If we condition on the total count, we obtain a multinomial sampling model,
\begin{align*}
\mathrm{vec}(\mbX) \mid X_{\bullet \bullet}= x_{\bullet \bullet} &\sim \mathrm{Mult}(x_{\bullet \bullet}, \mathrm{vec}(\mbPi)),
\end{align*}
where
When the row variables are explanatory variables, we often model each row of counts as conditionally independent given the row-sums,
\begin{align*}
\mbX_{i} \mid X_{i \bullet} = x_{i \bullet} &\sim \mathrm{Mult}(x_{i \bullet}, \mbpi_{\cdot \mid i})
\end{align*}
with pmf
\begin{align*}
\Pr(\mbX=\mbx \mid X_{1 \bullet} = x_{1 \bullet}, \ldots, X_{I \bullet} = x_{I \bullet}) &= \prod_{i=1}^I \mathrm{Mult}(\mbx_i \mid x_{i \bullet}, \mbpi_{\cdot \mid i}) \\
&= \prod_{i=1}^I \left[ {x_{i \bullet} \choose x_{i1} \; \cdots \; x_{iJ}} \prod_{j=1}^J \pi_{j \mid i}^{x_{ij}} \right]
\end{align*}
Sometimes we condition on both the row and column sums. For 2x2 tables, under the null hypothesis that the row and column variables are independent (i.e., assuming homogeneous conditionals), the resulting sampling distribution is the hypergeometric,
\begin{align*}
X_{11} \mid X_{\bullet \bullet} = x_{\bullet \bullet}, X_{1 \bullet} = x_{1 \bullet}, X_{\bullet 1} = x_{\bullet 1} &\sim \mathrm{HyperGeom}(x_{\bullet \bullet}, x_{1 \bullet}, x_{\bullet 1})
\end{align*}
with pmf
\begin{align*}
\mathrm{HyperGeom}(x_{11}; x_{\bullet \bullet}, x_{1 \bullet}, x_{\bullet 1}) &= \frac{{x_{1 \bullet} \choose x_{11}} {x_{\bullet \bullet} - x_{1 \bullet} \choose x_{\bullet 1} - x_{11}}}{{x_{\bullet \bullet} \choose x_{\bullet 1}}}
\end{align*}
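To make the pmf concrete, here is a minimal sketch that evaluates it with Python's `math.comb` for the margins of the football/CTE table (152 people, 92 football players, 63 with CTE). The helper name `hypergeom_pmf` and its argument names are made up for this example. Summing the pmf over values at least as large as the observed count gives a one-sided tail probability under the null, which is the kind of quantity an exact test conditioning on both margins would report.

```python
from math import comb

def hypergeom_pmf(x11, n, row1, col1):
    """Pr(X11 = x11 | grand total n, row-1 total row1, column-1 total col1)."""
    return comb(row1, x11) * comb(n - row1, col1 - x11) / comb(n, col1)

# Margins of the football/CTE table.
n, row1, col1 = 152, 92, 63

# Support: x11 cannot exceed either margin, and col1 - x11 cannot exceed n - row1.
lo = max(0, col1 - (n - row1))
hi = min(row1, col1)

pmf = {k: hypergeom_pmf(k, n, row1, col1) for k in range(lo, hi + 1)}
total = sum(pmf.values())

# Probability of seeing 48 or more football players with CTE under the null.
upper_tail = sum(p for k, p in pmf.items() if k >= 48)
print(f"pmf sums to {total:.6f}; Pr(X11 >= 48) = {upper_tail:.2e}")
```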
:::{admonition} Deriving the hypergeometric distribution by Bayes' rule
:class: dropdown

We can arrive at this conditional distribution using Bayes' rule. The following is adapted from {cite:t}`blitzstein2019introduction` (Ch 3.9). We will abbreviate some of the probability notation so that it's not so cumbersome. Also, we'll index our rows and columns starting with 0, to be consistent with our notation below. Under the independent Poisson sampling model,
\begin{align*}
\Pr(x_{11} \mid x_{\bullet \bullet}, x_{1 \bullet}, x_{\bullet 1})
&=
\frac{\Pr(x_{11} \mid x_{\bullet \bullet}, x_{1 \bullet}) \Pr(x_{\bullet 1} \mid x_{11}, x_{\bullet \bullet}, x_{1 \bullet})}{\Pr(x_{\bullet 1} \mid x_{\bullet \bullet}, x_{1 \bullet})} \\
&=
\frac{\mathrm{Bin}(x_{11}; x_{1 \bullet}, \pi_{1|1}) \mathrm{Bin}(x_{01}; x_{0 \bullet}, \pi_{1|0})}{\Pr(x_{\bullet 1} \mid x_{\bullet \bullet}, x_{1 \bullet})},
\end{align*}
noting that
Under the null hypothesis of independence,
Substituting in the binomial pmf yields,
\begin{align*}
\Pr(x_{11} \mid x_{\bullet \bullet}, x_{1 \bullet}, x_{\bullet 1})
&=
\frac
{
\left({x_{1 \bullet} \choose x_{11}} p^{x_{11}} (1-p)^{x_{1 \bullet} - x_{11}} \right)
\left({x_{\bullet \bullet} - x_{1 \bullet} \choose x_{\bullet 1} - x_{11}} p^{x_{\bullet 1} - x_{11}} (1-p)^{x_{\bullet \bullet} - x_{1 \bullet} - x_{\bullet 1} + x_{11}} \right)
}
{
{x_{\bullet \bullet} \choose x_{\bullet 1}} p^{x_{\bullet 1}} (1 - p)^{x_{\bullet \bullet} - x_{\bullet 1}}
} \\
&=
\frac
{
{x_{1 \bullet} \choose x_{11}}
{x_{\bullet \bullet} - x_{1 \bullet} \choose x_{\bullet 1} - x_{11}}
}
{
{x_{\bullet \bullet} \choose x_{\bullet 1}}
} \\
&= \mathrm{HyperGeom}(x_{11}; x_{\bullet \bullet}, x_{1 \bullet}, x_{\bullet 1}).
\end{align*}
Interestingly, the probability
:::
Contingency tables are often used to compare two groups
For a Bernoulli random variable with probability
For a 2x2 table, each row defines a Bernoulli conditional,
\begin{align*}
Y \mid X=i &\sim \mathrm{Bern}(\pi_{1|i}) & \text{for } i &\in \{0,1\},
\end{align*}
where, recall,
\begin{align*}
\pi_{1|i} = \frac{\pi_{i1}}{\pi_{i0} + \pi_{i1}}.
\end{align*}
The odds for row
The odds ratio
The odds ratio is non-negative,
For inference it is often more convenient to work with the log odds ratio,
\begin{align*}
\log \theta &= \log \pi_{11} + \log \pi_{00} - \log \pi_{10} - \log \pi_{01}.
\end{align*}
Under independence, the log odds ratio is 0. The magnitude of the log odds ratio represents the strength of association.
We often need to control for confounding variables
In this setting, controlling for
We say that
For 2x2xK tables, we define the conditional log odds ratios as,
\begin{align*}
\log \theta_{k} = \log \frac{\pi_{11|k} \pi_{00|k}}{\pi_{10|k} \pi_{01|k}}.
\end{align*}
Conditional independence corresponds to
Conditional independence does not imply marginal independence. Indeed, measures of marginal association and conditional association can even differ in sign. This is called Simpson's paradox.
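A small numerical sketch can make the sign reversal concrete. The counts below are made up for illustration (they are not from the CTE study), and the third variable is called Z here purely as a label. Within each stratum the log odds ratio is positive, yet the table obtained by summing over strata has a negative log odds ratio.

```python
from math import log

def log_odds_ratio(t):
    """Log odds ratio of a 2x2 table t[i][j], with rows i and columns j in {0, 1}."""
    return log(t[1][1]) + log(t[0][0]) - log(t[1][0]) - log(t[0][1])

# Hypothetical counts for two strata of a third variable Z (made up for illustration).
strata = [
    [[30, 70], [2, 8]],    # Z = 0
    [[8, 2], [70, 30]],    # Z = 1
]

# Collapse over Z to get the marginal 2x2 table.
marginal = [[sum(s[i][j] for s in strata) for j in range(2)] for i in range(2)]

conditional_lors = [log_odds_ratio(s) for s in strata]
marginal_lor = log_odds_ratio(marginal)
print("conditional log odds ratios:", [round(v, 3) for v in conditional_lors])
print("marginal log odds ratio:", round(marginal_lor, 3))
```

The reversal arises because the row variable is strongly associated with the stratum: most row-1 observations fall in the stratum where the outcome is rare, and most row-0 observations fall in the stratum where it is common.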
Given a sample of counts
We can construct 95% Wald confidence intervals using the asymptotic normality of the estimator,
\begin{align*}
\log \hat{\theta} \pm 1.96 \, \hat{\sigma}(\log \hat{\theta})
\end{align*}
where
\begin{align*}
\hat{\sigma}(\log \hat{\theta}) &= \left(\frac{1}{x_{11}} + \frac{1}{x_{00}} + \frac{1}{x_{10}} + \frac{1}{x_{01}} \right)^{\frac{1}{2}}
\end{align*}
is an estimate of the standard error using the delta method.
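Applied to the football/CTE table, the interval can be computed in a few lines. This is a sketch that assumes the plug-in estimate of the log odds ratio (the log odds ratio formula with counts in place of probabilities) and the indexing convention row 1 = football, column 1 = CTE; the variable names are chosen for this example.

```python
from math import exp, log, sqrt

# Football/CTE counts, indexed x[i][j] with i = football (0/1), j = CTE (0/1).
x = [[45, 15],
     [44, 48]]

# Plug-in estimate of the log odds ratio and its delta-method standard error.
log_theta_hat = log(x[1][1]) + log(x[0][0]) - log(x[1][0]) - log(x[0][1])
se = sqrt(1 / x[1][1] + 1 / x[0][0] + 1 / x[1][0] + 1 / x[0][1])

lower, upper = log_theta_hat - 1.96 * se, log_theta_hat + 1.96 * se
print(f"log odds ratio: {log_theta_hat:.3f} (SE {se:.3f})")
print(f"95% Wald CI for log theta: ({lower:.3f}, {upper:.3f})")
print(f"95% Wald CI for theta:     ({exp(lower):.2f}, {exp(upper):.2f})")
```

For these counts the interval for the log odds ratio lies above zero, which is consistent with an association in this sample. It says nothing about causation, and the sample of donated brains is not a random sample of athletes.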
The sample log odds ratio is a nonlinear function of the maximum likelihood estimates $\hat{\pi}_{ij}$,
\begin{align*}
\hat{\pi}_{ij} &= \frac{x_{ij}}{n},
\end{align*}
where
Let $\hat{\mbpi} = \mathrm{vec}(\hat{\mbPi}) = (\hat{\pi}_{11}, \hat{\pi}_{10}, \hat{\pi}_{01}, \hat{\pi}_{00})$ denote the vector of probability estimates.
The MLE is asymptotically normal with variance given by the inverse Fisher information,
\begin{align*}
\sqrt{n}(\hat{\mbpi} - \mbpi) \to \mathrm{N}(0, \cI(\mbpi)^{-1})
\end{align*}
where
\begin{align*}
\cI(\mbpi)^{-1}
&=
\begin{bmatrix}
\pi_{11}(1-\pi_{11}) & -\pi_{11} \pi_{10} & - \pi_{11} \pi_{01} & -\pi_{11} \pi_{00} \\
-\pi_{10} \pi_{11} & \pi_{10} (1 - \pi_{10}) & - \pi_{10} \pi_{01} & -\pi_{10} \pi_{00} \\
-\pi_{01} \pi_{11} & -\pi_{01} \pi_{10} & \pi_{01} (1 - \pi_{01}) & -\pi_{01} \pi_{00} \\
-\pi_{00} \pi_{11} & -\pi_{00} \pi_{10} & -\pi_{00} \pi_{01} & \pi_{00} (1 - \pi_{00})
\end{bmatrix}
\end{align*}
The (multivariate) delta method is a way of estimating the variance of a scalar function of the estimator,
For the log odds ratio,
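As a numerical sanity check, the sketch below carries out the delta-method calculation for the football/CTE counts and compares it with the closed-form standard error based on the sum of reciprocal counts. It assumes the usual form of the multivariate delta method (gradient, then covariance, then gradient, divided by $n$) with the covariance matrix shown above evaluated at the estimated probabilities; the variable names are chosen for this example.

```python
from math import sqrt

# Football/CTE counts flattened in the order (x11, x10, x01, x00).
x = [48, 44, 15, 45]
n = sum(x)
pi_hat = [c / n for c in x]

# Plug-in covariance from the matrix above: diag(pi) - pi pi^T.
cov = [[(pi_hat[a] if a == b else 0.0) - pi_hat[a] * pi_hat[b] for b in range(4)]
       for a in range(4)]

# Gradient of log(pi11) - log(pi10) - log(pi01) + log(pi00) with respect to pi.
signs = [1, -1, -1, 1]
grad = [s / p for s, p in zip(signs, pi_hat)]

# Delta method: Var[g(pi_hat)] is approximately grad^T cov grad / n.
quad = sum(grad[a] * cov[a][b] * grad[b] for a in range(4) for b in range(4))
se_delta = sqrt(quad / n)

# Closed form: square root of the sum of reciprocal counts.
se_formula = sqrt(sum(1 / c for c in x))
print(f"delta-method SE: {se_delta:.6f}, closed form: {se_formula:.6f}")
```

The two agree because the gradient is orthogonal to the probability vector (its signed entries sum to zero after multiplying by the probabilities), so only the diagonal part of the covariance contributes.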
Last time, we derived Wald confidence intervals from the acceptance region of a Wald hypothesis test. We could do the reverse here to test independence in
Let
The likelihood ratio test compares the maximum likelihood under the constrained set to the maximum likelihood under the larger space of all probabilities,
\begin{align*}
\lambda &=
-2 \log \frac
{
\sup_{\mbpi_{i \bullet}, \mbpi_{\bullet j} \in \Delta_{I-1} \times \Delta_{J-1}} p(\mbx; \mbpi_{i \bullet} \mbpi_{\bullet j}^\top)
}
{
\sup_{\mbPi \in \Delta_{IJ-1}} p(\mbx; \mbPi)
}
\end{align*}
The maximum likelihood estimates of the constrained model are $\hat{\pi}_{i \bullet} = x_{i \bullet} / x_{\bullet \bullet}$ and $\hat{\pi}_{\bullet j} = x_{\bullet j} / x_{\bullet \bullet}$; under the unconstrained model they are $\hat{\pi}_{ij} = x_{ij} / x_{\bullet \bullet}$. Plugging these estimates in yields,
\begin{align*}
\lambda &=
-2 \log \frac
{
\prod_{ij} \left( \frac{x_{i \bullet} x_{\bullet j}}{x_{\bullet \bullet}^2} \right)^{x_{ij}}
}
{
\prod_{ij} \left( \frac{x_{i j}}{x_{\bullet \bullet}} \right)^{x_{ij}}
} \\
&= -2 \sum_{ij} x_{ij} \log \frac{\hat{\mu}_{ij}}{x_{ij}}
\end{align*}
where $\hat{\mu}_{ij} = x_{\bullet \bullet} \hat{\pi}_{i \bullet} \hat{\pi}_{\bullet j} = x_{i \bullet} x_{\bullet j} / x_{\bullet \bullet}$ is the expected value of
Under the null hypothesis,
The p-value for the likelihood ratio test is based on an asymptotic chi-squared distribution, which only holds as
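For the football/CTE table, the statistic and its asymptotic p-value can be sketched as follows. The code assumes the standard reference distribution for a test of independence in an $I \times J$ table, a chi-squared with $(I-1)(J-1)$ degrees of freedom, which is 1 degree of freedom for a 2x2 table. With one degree of freedom the tail probability can be written with the complementary error function, so only the standard library is needed.

```python
from math import erfc, log, sqrt

# Football/CTE counts, indexed x[i][j] with i = football (0/1), j = CTE (0/1).
x = [[45, 15],
     [44, 48]]

n = sum(sum(row) for row in x)
row_sums = [sum(row) for row in x]
col_sums = [sum(col) for col in zip(*x)]

# Expected counts under independence: mu_ij = (row total) * (column total) / n.
mu = [[r * c / n for c in col_sums] for r in row_sums]

# Likelihood ratio statistic: -2 * sum_ij x_ij * log(mu_ij / x_ij).
# (A cell with x_ij = 0 would contribute 0; none occur in this table.)
lam = -2 * sum(x[i][j] * log(mu[i][j] / x[i][j])
               for i in range(2) for j in range(2))

# Tail probability of a chi-squared with 1 degree of freedom: P(Z^2 > lam).
p_value = erfc(sqrt(lam / 2))
print(f"lambda = {lam:.2f}, asymptotic p-value = {p_value:.1e}")
```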
Consider testing the null hypothesis
Contingency tables are fundamental tools for studying the relationship between two categorical random variables. We discussed sampling models for contingency tables, which differ in the marginals they condition on, as well as measures of association between the random variables. Then we presented methods for inferring associations and testing hypotheses of independence. However, these methods were ultimately limited to just two (often binary) variables. Next, we'll consider models for capturing relationships between a response and several covariates.