Logistic regression was a special case of a more general class of models called generalized linear models (GLMs). In a GLM, the conditional distribution
To construct a generalized linear model with exponential family observations, we set \begin{align*} \E[y_i \mid \mbx_i] &= f(\mbbeta^\top \mbx_i). \end{align*}
From above, this implies,
\begin{align*}
\nabla A(\eta_i) &= f(\mbbeta^\top \mbx_i) \
\Rightarrow \eta_i &= [\nabla A]^{-1} \big( f(\mbbeta^\top \mbx_i) \big),
\end{align*}
when
The canonical mean function is
The (canonical) link function is the inverse of the (canonical) mean function.
Consider the Bernoulli distribution once more. The gradient of the log normalizer is, \begin{align*} \nabla A(\eta) &= \nabla \log (1 + e^\eta) = \frac{e^\eta}{1+ e^\eta} \end{align*} This is the logistic function!
Thus, logistic regression is a Bernoulli GLM with the canonical mean function.
Canonical mean functions lead to nice math. Consider the log joint probability,
\begin{align*}
\cL(\mbbeta)
&= \sum_{i=1}^n \langle t(y_i), \eta_i \rangle - A(\eta_i) + c \
&= \sum_{i=1}^n \langle t(y_i), \mbbeta^\top \mbx_i \rangle - A(\mbbeta^\top \mbx_i) + c,
\end{align*}
where we have assumed a canonical mean function so
The gradient is,
\begin{align*}
\nabla \cL(\mbbeta)
&= \sum_{i=1}^n \langle t(y_i), \mbx_i \rangle - \langle \nabla A(\mbbeta^\top \mbx_i), , \mbx_i \rangle\
&= \sum_{i=1}^n \langle t(y_i) - \E[t(Y); \eta_i], , \mbx_i \rangle
\end{align*}
where
In many cases,
And in that case the Hessian is \begin{align*} \nabla^2_{\mbbeta} \cL(\mbbeta) &= - \sum_{i=1}^n \nabla^2 A(\mbbeta^\top \mbx_i) , \mbx_i \mbx_i^\top \ &= -\sum_{i=1}^n \Var[t(Y); \eta_i] , \mbx_i \mbx_i^\top \end{align*}
Now recall the Newton's method updates, written here in terms of the change in weights, \begin{align*} \Delta \mbbeta &= - [\nabla^2 \cL(\mbbeta)]^{-1} \nabla \cL(\mbbeta) \ &= \left[\sum_{i=1}^n \Var[t(Y); \eta_i] , \mbx_i \mbx_i^\top \right]^{-1} \left[\sum_{i=1}^n (y_i - \hat{y}_i) \mbx_i \right] \end{align*}
Letting
This is iteratively reweighted least squares (IRLS) with weights
When we choose an arbitrary covariance matrix, the expressions are a bit more complex. Let's focus on the case where
The gradient is, \begin{align*} \nabla \cL(\mbbeta) &= \sum_{i=1}^n \left(y_i - \hat{y}i\right) \frac{\partial \eta_i}{\partial \mbbeta}. \end{align*} Applying the inverse function theorem, as above, yields, \begin{align*} \frac{\partial \eta_i}{\partial \mbbeta} &= \mathrm{Var}[Y]^{-1} \mbx_i = \mbx_i / w_i, \end{align*} and \begin{align*} \nabla \cL(\mbbeta) &= \sum{i=1}^n \left(\frac{y_i - \hat{y}_i}{w_i} \right) f'(\mbbeta^\top \mbx_i) \mbx_i. \end{align*}
We can perform maximum likelihood estimation via Newton's method, but do the resulting parameters
For binomial GLMs (including logistic regression) and Poisson GLMs, the saturated model conditions on the mean equaling the observed response. For example, in a Poisson GLM the saturated model's log probability is,
\begin{align*}
\log p_{\mathsf{sat}}(\mby)
&= \sum_{i=1}^n \log \mathrm{Po}(y_i; y_i) \
&= \sum_{i=1}^n -\log y_i! + y_i \log y_i - y_i.
\end{align*}
The likelihood ratio statistic in this case is
\begin{align*}
-2 \log \frac{p(\mby \mid \mbX; \hat{\mbbeta})}{p_{\mathsf{sat}}(\mby)}
&= 2 \sum_{i=1}^n \log \mathrm{Po}(y_i; y_i) - \log \mathrm{Po}(y_i; \hat{\mu}i)\
&= 2 \sum{i=1}^n y_i \log \frac{y_i}{\hat{\mu}i} + \hat{\mu}i - y_i \
&= \sum{i=1}^n r{\mathsf{D}}(y_i, \hat{\mu}_i)^2
\end{align*}
where
Moreover, this statistic is just the deviance (twice the KL divergence) between the two models, \begin{align*} 2 \KL{p_{\mathsf{sat}}(\mbY)}{p(\mbY \mid \mbX=\mbx; \hat{\mbbeta})} &= 2 \E_{p_{\mathsf{sat}}} \left[\log \frac{p_{\mathsf{sat}}(\mby)}{p(\mby \mid \mbX; \hat{\mbbeta})} \right] \ &= 2 \sum_{i=1}^n \KL{\mathrm{Po}(y_i)}{\mathrm{Po}(\hat{\mu}_i)} \ &\triangleq D(\mby, \hat{\mbmu}). \end{align*} We've shown the deviance for the case of a Poisson GLM, but the same idea holds for GLMs with other exponential family distributions. Larger deviance implies a poorer fit.
The baseline model is typically a GLM with only an intercept term, in which case the MLE is
:::{admonition} Note on intercepts
In models with an (unregularized) intercept term, the MLE should be such that
As with linear models where we use the fraction of variance explaiend (
The difference in deviance between two models with predicted means
Just like in standard linear models, we should inspect the residuals in generalized linear models for evidence of model misspecification. For example, we can plot the residuals as a function of
Finally, another common approach to model selection and comparison is to use cross-validation. The idea is to approximate out-of-sample predictive accuracy by randomly splitting the training data.
Leave-one-out cross validation (LOOCV) withholds one datapoint out at a time, estimates parameters using the other
This approximates,
\begin{align}
\E_{p^\star(x, y)}[\log p(y \mid x, {x_i, y_i}{i=1}^n)] &\approx
\frac{1}{n} \sum{i=1}^n \log p(y_i \mid x_i, {x_j, y_j}_{j \neq i}).
\end{align}
where
For small
Cross-validated predictive log likelihood estimates are similar to the \emph{jackknife} resampling method in the sense that it is estimating an expectation wrt the unknown data-generating distribution
Generalized linear models allow us to build flexible regression models that respect the domain of the response variable. Logistic regression is a special case of a Bernoulli GLM with the canonical link function. For categorical data, we could use a categorical distribution with the softmax link function, and for count data, we could use a Poisson GLM with exponential, softplus, or other link functions. Leveraging the deviance and deviance residuals for exponential family distribiutions, we can derive analogs of familiar terms from linear modeling, like the fraction of variance explained and the residual analysis.
Next, we'll consider techniques for Bayesian analysis in GLMs, which allow for more expressive moodels. That will give us an excuse to finally dig into Bayesian inference methods.