Let's turn our attention back to models for sequential data for the next few lectures. With all the excitement around large language modeling, this is a timely topic.
Consider a sequence of observations
Autoregressive models are well-suited to sequential modeling since they make forward generation or forecasting easy. As long as we have access to the conditionals, we can sample forward indefinitely.
The question is, how should we parameterize these conditional distributions? It looks challenging since each one takes in a variable-length history of previous observations.
Recurrent Neural Networks (RNNs) are autoregressive models in which the conditional distributions are functions of a finite-dimensional hidden state,
:::{admonition} Defining Property of an RNN :class: tip The conditional distribution over the next observation is a function of hidden state. The hidden state is updated recursively, and its size is fixed regardless of the sequence length. :::
The standard, "vanilla" RNN consists of a linear-nonlinear state update. For example, \begin{align*} f(\mbh_t, \mbx_t; \mbtheta) &= \tanh \left(\mbW \mbh_t + \mbB \mbx_t \right), \end{align*} where
-
$\mbW \in \reals^{K \times K}$ are the dynamics weights, -
$\mbB \in \reals^{K \times D}$ are the input weights, -
$\tanh(\cdot)$ is the hyperbolic tangent function.
:::{admonition} Hyperbolic Tangent and the Logistic Function :class: dropdown The hyperbolic tangent function equivalent is typically written as, \begin{align*} \tanh(a) &= \frac{e^a - e^{-a}}{e^a + e^{-a}}. \end{align*} We can rewrite it as, \begin{align*} \tanh(a) &= 1 - \frac{2e^{-a}}{e^a + e^{-a}} \ &= 1 - \frac{2}{e^{2a} + 1} \ &= 1 - 2 \sigma(-2a) \ &= 1 - 2 (1 - \sigma(2a)) \ &= 2 \sigma(2a) - 1. \end{align*} Thus, we see that the the hyperbolic tangent is simply a scaled and shifted logistic function. :::
The "read-out" of a vanilla RNN is typically a simple linear or generalized linear model, depending on the type of observations. For example, \begin{align*} g(\mbh_t, \mbx_t; \mbtheta) &= \mbC \mbh_t + \mbd, \end{align*} where
-
$\mbC \in \reals^{D \times K}$ are the read-out weights and -
$\mbd \in \reals^D$ is the bias.
Let
From a machine learning perspective, RNNs are useful function approximators for sequential data. From a neuroscience perspective, they have a long history as theoretical models of neural computation.
In such models, the hidden state
The dynamics weights correspond to synaptic connections, with positive weights as excitatory synapses and negative weights as inhibitory. When a presynaptic neuron spikes, it induces an electrical current in postsynaptic neurons. Under this interpretation, the activation
As a neuron receives input current, its voltage steadily increases until it reaches a threshold, at which point the voltage spikes and the neuron fires an action potential. These spikes induce currents on downstream neurons, as described above. After a cell fires, there is a short refractory period before the neuron can spike again. Thus, there is an upper bound on firing rates, which the hyperbolic tangent is meant to capture.
Artificial RNNs are trained using stochastic gradient descent (SGD) to minimize the negative log likelihood,
\begin{align*}
\cL(\mbtheta)
&= - \sum_{t=1}^T \log p(\mbx_t \mid \mbx_{1:t-1}) \
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(\mbh_t; \mbtheta)) \
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(f(\mbh_{t-1}, \mbx_{t-1}; \mbtheta); \mbtheta)) \
&= - \sum_{t=1}^T \log p(\mbx_t \mid g(f(\cdots f(\mbh_{1}, \mbx_{1}; \mbtheta) \cdots, \mbx_{t-1}; \mbtheta); \mbtheta)).
\end{align*}
Now, you would simply use automatic differentiation to compute the necessary gradients to minimize this loss, but we can gain some insight by working them out manually. With some algebra, we can show that the Jacobian of the loss with respect to the dynamics weights (other parameters are similar) is,
\begin{align*}
\frac{\partial \cL(\mbtheta)}{\partial \mbW} &= \sum_{t=1}^T \frac{\partial \cL(\mbtheta)}{\partial \mbh_t} \frac{ \partial \mbh_t}{\partial \mbW}
\end{align*}
We need the Jacobian of the loss with respect to the hidden states. These can be computed recursively by backpropagation through time (BPTT),
\begin{align*}
\frac{\partial \cL(\mbtheta)}{\partial \mbh_t}
&= \frac{\partial \cL(\mbtheta)}{\partial \mbh_{t+1}} \frac{\partial \mbh_{t+1}}{\partial \mbh_t} - \frac{\partial \log p(\mbx_t \mid g(\mbh_t; \mbtheta))}{\partial \mbh_t}
\end{align*}
For a vanilla RNN, the Jacobian of the next state with respect to the current state is,
\begin{align*}
\frac{\partial \mbh_{t+1}}{\partial \mbh_t}
&= \diag \left(1 - \mbh_{t+1}^2 \right) \mbW
\end{align*}
since
:::{admonition} BPTT is a linear dynamical system
The "state" of the BPTT recursions is the Jacobian, or equivalently its transpose, the gradient
:::{admonition} Biological Plausibility
:class: tip, dropdown
Backpropagation is the most effective way we know of to train artificial RNNs, so it's reasonable to think that the brain might be using a similar learning algorithm. Unfortunately, it's not clear how the backpropagation through time algorithm could be implemented by a neural circuit. The multiplication by
When the Jacobians
Vanishing gradients are especially problematic when
One way to combat the vanishing gradient problem is by modifying the RNN architecture. Architectures like long short-term memory (LSTM) networks achieve this via gated units.
An LSTM has internal (aka "cell") states
The affine term is determined by,
\begin{align*}
\mbb_t &= \mbg_t \odot \mbi_t \
\mbg_t &= \sigma(\mbW^{(g)} \mbh_{t-1} + \mbB^{(g)} \mbx_{t-1}) \
\mbi_t &= \sigma(\mbW^{(i)} \mbh_{t-1} + \mbB^{(i)} \mbx_{t-1})
\end{align*}
The vector
Finally, the hidden states
We can think of an LSTM as an RNN that operates on an extended state
There are many variants of gated RNNs. Besides the LSTM, the most commonly used in the gated recurrent unit (GRU), which has a slightly simplified architecture. See {cite:t}goodfellow2016deep (ch. 10) for more detail.
We motivated RNNs from an autoregressive modeling persepective, but they are useful in other sequential data settings as well. For example, suppose we want to predict the sentiment of a review
Sometimes we want to map one sequence
In the example above, one challenge is that the hidden state
As with deep neural networks that stack layer upon layer, we can stack RNN upon RNN to construct a deeper model. In such models, the outputs of one layer,
It turns out, we've already seen an RNN in this class!
We presented Hidden Markov Models (HMMs) as latent variable models with hidden states
Consider an HMM with categorical emissions,
\begin{align*}
p(\mbx_t \mid z_t) &= \mathrm{Cat}(\mbx_t \mid \mbc_{z_t}),
\end{align*}
where
For a categorical HMM, we can write the likelihoods as
This formulation suggests that we could estimate the parameters of an HMM by directly maximizing the log likelihood, \begin{align*} \cL(\mbtheta) &= \log p(\mbx_{1:T}; \mbtheta) = \sum_{t=1}^T \log p(\mbx_t \mid \mbx_{1:t-1}; \mbtheta). \end{align*} With automatic differentiation at our disposal, this sounds like it might be a lot easier!
Let's pursue this idea a little further. First, we'd prefer to do unconstrained optimization, so let's parameterize the model in terms of
:::{admonition} Softmax is translation invariant
Note that the softmax operation is translation invariant,
\begin{align*}
\mathrm{softmax}(\log \mbc_k) &=
\mathrm{softmax}(\log \mbc_k + a)
\end{align*}
for any constant
To maximize the likelihood with gradient ascent, we need the Jacobians,
First, note that the posterior distribution in an HMM can be written as an exponential family,
\begin{align*}
p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)
&= \exp \left{ \log p(\mbz_{1:T}, \mbx_{1:T}; \mbtheta) - \log p(\mbx_{1:T}; \mbtheta) \right} \
&= \exp \left{
\log p(z_1; \mbpi_0)
+ \sum_{t=2}^T \log p(z_t \mid z_{t-1}; \mbP)
+ \sum_{t=1}^T \log p(\mbx_t \mid z_{t}; \mbC)
- \log p(\mbx_{1:T}; \mbtheta)
\right} \
&= \exp \bigg{
\sum_{k=1}^K \langle \bbI[z_1 =k], \log \pi_{0,k} \rangle \
&\hspace{6em}
+ \sum_{t=2}^T \sum_{i=1}^K \sum_{j=1}^K \langle \bbI[z_{t-1}=i \wedge z_{t} = j], \log P_{i,j} \rangle \
&\hspace{12em}
+ \sum_{t=1}^T \sum_{v=1}^V \sum_{k=1}^K \langle \bbI[x_t = v \wedge z_t = k], \log C_{v,k} \rangle \
&\hspace{18em} - A(\mbtheta)
\bigg}
\end{align*}
where the log marginal likelihood
Recall that for exponential family distributions, gradients of the log normalizer yield expected sufficient statistics. Thus, \begin{align*} \frac{\partial A(\mbtheta)}{\partial \log C_{v,k}} &= \sum_{t=1}^T \bbI[x_{t} = v] \cdot \bbE_{p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)} \left[\bbI[z_{t}=k] \right] \end{align*} and \begin{align*} \frac{\partial A(\mbtheta)}{\partial \log P_{i,j}} &= \sum_{t=2}^T \bbE_{p(\mbz_{1:T} \mid \mbx_{1:T}; \mbtheta)} \left[ \bbI[z_{t-1}=i \wedge z_{t} = j] \right] \end{align*} The gradients are essentially the posterior marginals and pairwise marginals we computed in the EM algorithm!
In the M-step of EM, we solve for the parameters that satisfy the constrained optimality conditions, whereas in SGD we just take a step in the direction of the gradient. EM tends to converge must faster in practice for this reason.
What is the cost of evaluating an RNN and computing its gradients? Both the forward computation of the states and the backward computation of the gradients are
These costs are comparble to those of inference in an HMM, as we expect from the discussion above. However,