One reason we like linear and generalized linear models is that the parameters are readily interpretable. The parameter
The Lasso yields sparse solutions for linear models by solving an
It's tempting to just use vanilla gradient descent to find the minimizer,
\begin{align*}
\mbbeta^{(t+1)} &\leftarrow \mbbeta^{(t)} - \alpha_t \nabla \cL(\mbbeta^{(t)}),
\end{align*}
where
One way to think about this update is as the solution to a quadratic minimization problem,
\begin{align*}
\mbbeta^{(t+1)} &\leftarrow
\arg \min_{\mbz} \cL(\mbbeta^{(t)}) + \nabla \cL(\mbbeta^{(t)})^\top (\mbz - \mbbeta^{(t)}) + \frac{1}{2 \alpha_t} | \mbz - \mbbeta^{(t)}|_2^2
\end{align*}
We can think of the surrogate problem as a second order approximation of the objective in which the Hessian is replaced with
The
Unfortunately, the Lasso objective it is not continuously differentiable: the gradient at
Proximal gradient descent is an optimization algorithm for convex objectives that decompose into a differentiable part and a non-differentiable part,
\begin{align*}
\cL(\mbbeta) &= \cL_{\mathsf{d}}(\mbbeta) + \cL_{\mathsf{nd}}(\mbbeta)
\end{align*}
where
The idea is to stick as close to vanilla gradient descent as possible, while correcting for the non-differentiable part of the objective. To that end, let's apply the quadratic approximation logic to just the differentiable part so that our update becomes, \begin{align*} \mbbeta^{(t+1)} &\leftarrow \arg \min_{\mbz} \cL_{\mathsf{d}}(\mbbeta^{(t)}) + \nabla \cL_{\mathsf{d}}(\mbbeta^{(t)})^\top (\mbz - \mbbeta^{(t)}) + \frac{1}{2 \alpha_t} | \mbz - \mbbeta^{(t)}|2^2 + \cL{\mathsf{nd}}(\mbbeta^{(t)}) \ &= \arg \min_{\mbz} \frac{1}{2 \alpha_t} | \mbz - (\mbbeta^{(t)} - \alpha_t \nabla \cL_{\mathsf{d}}(\mbbeta^{(t)})) |2^2 + \cL{\mathsf{nd}}(\mbz) \ \end{align*} The resulting update balances two parts:
- Stay close to the vanilla gradient descent update,
$\mbbeta^{(t)} - \alpha_t \nabla \cL_{\mathsf{d}}(\mbbeta^{(t)})$ . - Also minimize the non-differentiable part of the objective,
$\cL_{\mathsf{nd}}(\mbbeta^{(t)})$ .
As a sanity check, note that we recover vanilla gradient descent with
We call the function, \begin{align} \mathrm{prox}(\mbu; \alpha_t) &= \arg \min_{\mbz} \frac{1}{2 \alpha_t} | \mbz - \mbu |2^2 + \cL{\mathsf{nd}}(\mbz) \end{align} the proximal mapping.
:::{admonition} Notes :class: tip
- The proximal mapping depends on the form of the non-differentiable part of the objective, even though we have suppressed that in the notation.
- However, it does not depend on the form of the continuous part of the objective. :::
With this definition, the proximal gradient descent algorithm is,
:label: prox_grad
**Input:** Initial parameters $\mbbeta^{(0)}$, proximal mapping $\mathrm{prox}(\cdot; \cdot)$.
- **For** $t=1,\ldots, T$
- Set $\mbbeta^{(t)} \leftarrow \mathrm{prox}(\mbbeta^{(t-1)} - \alpha_t \nabla \cL_{\mathsf{d}}(\mbbeta^{(t-1)}); \alpha_t)$.
**Return** $\mbbeta^{(T)}$.
So far, it's not obvious that this framing is helpful. We still have a potentially challenging optimization problem to solve in computing the proximal mapping. However, for many problems of interest, the proximal mapping has simpled closed solutions.
Consider the Lasso problem. The objective decomposes into convex differentiable and non-differentiable parts, \begin{align*} \cL_{\mathsf{d}}(\mbbeta) &= \frac{1}{2} |\mby - \mbX \mbbeta|2^2 \ \cL{\mathsf{nd}}(\mbbeta) &= \lambda |\mbbeta|_1. \end{align*}
The proximal mapping is, \begin{align*} \mathrm{prox}(\mbu; \alpha_t) &= \arg \min_{\mbz} \frac{1}{2 \alpha_t} | \mbz - \mbu |2^2 + \lambda |\mbz|1 \ &= \arg \min{\mbz} \sum{j=1}^p \frac{1}{2 \alpha_t} (z_j - u_j)^2 + \lambda |z_j| \end{align*} It separates into optimization problems for each coordinate!
Isolate one coordinate and consider the case where
Now let's plug in the gradient of the differentiable part, \begin{align*} \nabla \cL_{\mathsf{d}}(\mbbeta) &= \mbX^\top (\mby - \mbX \mbbeta). \end{align*} Substituting this into the proximal gradient descent algorithm yields what is sometimes called the iterative soft-thresholding algorithm (ISTA),
:label: ista
**Input:** Initial parameters $\mbbeta^{(0)}$, covariates $\mbX \in \reals^{n \times p}$, responses $\mby \in \reals^n$
- **For** $t=1,\ldots, T$
- Set $\mbbeta^{(t)} \leftarrow S_{\alpha_t \lambda}(\mbbeta^{(t-1)} - \alpha_t \mbX^\top (\mby - \mbX \mbbeta^{(t-1)}))$.
**Return** $\mbbeta^{(T)}$.
If
One great thing about proximal gradient descent is its generality. We could easily apply it to
However, we saw that Newton's method yielded significantly faster convergence rates of
To obtain a proximal Newton method, we proceed in the same fashion as above, but rather than approximating the second order term with
:::{admonition} Note
:class: tip
Note that proximal mapping for proximal gradient descent corresponds to the special case in which
Let
As with Newton's method, however, we often need to use damped updates,
\begin{align*}
\mbbeta^{(t+1)} &= \mbbeta^{(t)} + \alpha_t (\hat{\mbbeta}^{(t+1)} - \mbbeta^{(t)}),
\end{align*}
with step size
The challenge, as we will see below, is that solving the proximal Newton mapping can be more challenging.
Let's consider the proximal Newton mapping for
However, if we fix all coordinates but one, the problem is tractable. First, expand the quadratic form in the proximal mapping,
\begin{align*}
\mathrm{prox}(\mbu; \mbH_t)
&= \arg \min_{\mbz} \sum_{j=1}^p \left[ \frac{1}{2} H_{t,j,j} (z_j - u_j)^2 + \sum_{k=j+1}^p H_{t,j,k} (z_j - u_j) (z_k - u_k) + \lambda |z_j| \right].
\end{align*}
As a function of a single coordinate
To implement the proximal Newton step, we perform coordinate descent, iteratively minimizing with respect to one coordinate at a time until convergence.
:label: prox-newton
**Input:** initial guess $\mbz^{(0)}$, target $\mbu \in \reals^p$, Hessian $\mbH_t \in \reals^{p \times p}_{\succeq 0}$, num iterations $M$
- **For** $m=1,\ldots, M$
- **For** $j=1, \ldots, p$
- Set
\begin{align*}
\mu_j^{(m)} = H_{t,j,j}^{-1} \left(H_{t,j,j} u_j - \sum_{k < j} H_{t,j,k} (z_k^{(m)} - u_k) - \sum_{k > j} H_{t,j,k} (z_k^{(m-1)} - u_k) \right)
\end{align*}
- Set $z_j^{(m)} \leftarrow S_{\lambda/H_{t,j,j}}(\mu_j^{(m)})$.
**Return** $\mbz^{(M)}$.
As with regular Newton's method, proximal Newton exhibits local quadratic convergence to obtain error
:::{admonition} Note
:class: warning
In practice, you may need to also implement a backtracking line search to choose the step size
Another way to arrive at the same answer is to leverage the relationship between Newton's method and iteratively reweighted least squares. We cast each Newton update as the solution to a weighted least squares problem. In the
There are also lots of speed-ups to be had by exploiting problem-specific structure. See {cite:t}friedman2010regularization for details.
The proximal methods disussed today are what run behind the scenes of modern packages for sparse linear and logistic regression. In particular, sklearn.linear_model.Lasso uses a fast coordinate descent algorithm like discussed above, and GLMNet {cite:p}friedman2010regularization uses a proximal Newton algorithm with coordinate descent for the proximal step.