Recurrent Neural Networks
- Why we need RNNs (a minimal recurrence sketch follows this list):
  - Variable-length sequences
  - Long-term dependencies
  - Stateful representation
  - Memory
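As a minimal sketch of the recurrence behind these points, here is the vanilla update h_t = tanh(W·x_t + U·h_{t-1} + b) in NumPy; all names and shapes are illustrative and not tied to any particular library:

```python
import numpy as np

def rnn_forward(x_seq, W, U, b):
    """Vanilla RNN: the hidden state h is a running summary (memory) of the inputs seen so far."""
    h = np.zeros(U.shape[0])              # initial state
    states = []
    for x_t in x_seq:                     # loop handles any sequence length
        h = np.tanh(W @ x_t + U @ h + b)  # same weights reused at every step
        states.append(h)
    return np.stack(states)               # (T, hidden_size)

# toy usage: a length-5 sequence of 3-dim inputs, 4-dim hidden state
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W, U, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
print(rnn_forward(x_seq, W, U, b).shape)  # -> (5, 4)
```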
Types of RNNs:
Bidirectional RNNs: capable of capturing information from left to right and from right to left, using two RNNs that read the sequence in opposite directions. For example, in speech recognition, to understand a phoneme (a distinct unit of sound) at input step "i" we need to gather information from steps "i-1" and "i+1", i.e. we need past as well as future steps.
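A rough sketch of the bidirectional idea, reusing the illustrative `rnn_forward` helper from the sketch above (names and shapes are made up, not any library's API):

```python
import numpy as np

def bidirectional_forward(x_seq, fwd_params, bwd_params):
    """Two independent RNNs: one reads left-to-right, the other right-to-left.
    Concatenating their states gives step i access to both past and future context."""
    h_fwd = rnn_forward(x_seq, *fwd_params)               # left -> right pass
    h_bwd = rnn_forward(x_seq[::-1], *bwd_params)[::-1]   # right -> left pass, re-aligned to time order
    return np.concatenate([h_fwd, h_bwd], axis=-1)        # (T, 2 * hidden_size)
```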
One-to-many (image captioning): a single fixed-length vector is turned into a series of outputs. From an image, use a CNN or MLP to extract a feature vector, then use an RNN to generate the caption word by word: word_next = captioning(current_word, image_feature_vector).
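A toy NumPy sketch of this word-by-word decoding loop; the CNN/MLP feature extractor is replaced by a random vector, and every size, weight, and token index here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, feat = 10, 8, 16           # toy sizes

# Stand-in for a CNN/MLP feature extractor: here just a random "image" vector.
image_features = rng.normal(size=feat)

# Decoder RNN parameters (illustrative shapes only).
W_img = rng.normal(size=(hidden, feat))   # projects image features into the initial state
W_in  = rng.normal(size=(hidden, vocab))  # embeds the previous word (one-hot here)
U     = rng.normal(size=(hidden, hidden))
W_out = rng.normal(size=(vocab, hidden))  # maps the state to scores over words

h = np.tanh(W_img @ image_features)       # initial state conditioned on the image
word = 0                                  # index of a <start> token
caption = []
for _ in range(5):                        # generate up to 5 words
    x = np.eye(vocab)[word]               # one-hot of the previous word
    h = np.tanh(W_in @ x + U @ h)         # next word depends on (current word, image via h)
    word = int(np.argmax(W_out @ h))      # greedy decoding
    caption.append(word)
print(caption)
```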
Encoder-Decoder / Seq2Seq models: used when we need to map a sequence to another sequence of a different length, and built from two RNNs. The encoder processes the input sequence one word at a time and does not emit an output at each step; instead it tries to capture the task-relevant information from the sequence in its internal state. The final hidden state of the encoder is a task-relevant summary of the input sequence, called the context or thought vector. The context acts as the only input to the decoder: the decoder's initial state can be a function of the context, or the context can be connected to all of the decoder's hidden states. The encoder and decoder can have different hyperparameters.
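A minimal NumPy sketch of this wiring, assuming the option where the context becomes the decoder's initial state; names and shapes are illustrative:

```python
import numpy as np

def encode(x_seq, W, U, b):
    """Encoder RNN: consumes the input one step at a time, emitting no outputs;
    its final hidden state is the context ("thought") vector."""
    h = np.zeros(U.shape[0])
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h + b)
    return h                                   # task-relevant summary of the whole input

def decode(context, steps, W_in, U, W_out):
    """Decoder RNN: the context is used here as the initial state; each output is fed
    back as the next input, so the output length can differ from the input length."""
    h = context
    y = np.zeros(W_out.shape[0])               # stand-in for a <start> token
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_in @ y + U @ h)
        y = W_out @ h                          # scores/features for the next output symbol
        outputs.append(y)
    return np.stack(outputs)                   # (steps, output_size)

# toy usage: a 7-step input mapped to a 3-step output
rng = np.random.default_rng(0)
enc_W, enc_U, enc_b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
dec_Win, dec_U, dec_Wout = rng.normal(size=(4, 5)), rng.normal(size=(4, 4)), rng.normal(size=(5, 4))
ctx = encode(rng.normal(size=(7, 3)), enc_W, enc_U, enc_b)
print(decode(ctx, 3, dec_Win, dec_U, dec_Wout).shape)   # -> (3, 5)
```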
Deep RNNs: when unrolled, the depth of an RNN equals the number of time steps. With more hidden layers, we can stack RNNs to get deep RNNs over an input sequence. Between the hidden-to-hidden connections (each with its own weight matrix) we typically add non-linear transformations, e.g. a CNN or MLP, to learn higher-level information. Deeper RNNs take longer to train.
Stacked RNNs: increasing the number of layers at each time step, not the number of time steps. This allows the network to capture higher-level information in the sequence and maintain it in its state. At each time step we now keep one state per layer.
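A small NumPy sketch of stacking, where each layer's state sequence becomes the next layer's input sequence (illustrative names and shapes):

```python
import numpy as np

def stacked_rnn(x_seq, layers):
    """Stacked RNN: the sequence passes through several recurrent layers,
    so the network keeps one hidden state per layer at every time step."""
    seq = x_seq
    for (W, U, b) in layers:                  # each tuple holds one layer's parameters
        h, outputs = np.zeros(U.shape[0]), []
        for x_t in seq:
            h = np.tanh(W @ x_t + U @ h + b)
            outputs.append(h)
        seq = np.stack(outputs)               # layer l's states become layer l+1's inputs
    return seq                                # (T, top_layer_hidden)

# toy usage: two layers, input dim 3 -> hidden 4 -> hidden 6
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)),
          (rng.normal(size=(6, 4)), rng.normal(size=(6, 6)), np.zeros(6))]
print(stacked_rnn(rng.normal(size=(5, 3)), layers).shape)   # -> (5, 6)
```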
Gated cells: RNN variants also differ by the type of cell. Each cell has its own gating mechanism, i.e. how it controls the flow of information from the input to the current state, from the previous step to the current state, and from the current state to the output. The weight matrices W, U, and V now come in three stacked sets, with shapes chosen to satisfy the matrix-multiplication dimensions.
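As one example of such a gated cell, here is a sketch of a single GRU step in NumPy, with three sets of (W, U) weights (update gate, reset gate, candidate state); the gating convention shown is one common variant, not the only one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step: three (W, U) weight pairs, one per gate/candidate,
    controlling how much of the previous state and the new input flow through."""
    (Wz, Uz), (Wr, Ur), (Wh, Uh) = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)           # update gate: old state vs. new candidate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate: how much past enters the candidate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))
    return (1 - z) * h_prev + z * h_cand          # gated blend of old state and candidate
```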
Attention: in the seq2seq framework the entire input is encoded into a fixed-length vector, the context. As the sequence gets longer, we lose information. The attention mechanism allows the decoder of the seq2seq architecture to look at the input sequence while decoding, so the encoder does not have to squeeze every piece of useful information into one vector. At each decoder time step, a distinct context vector "Ci" is generated for word "Yi". "Ci" is a weighted sum of the encoder's hidden states, and the contribution of each encoder hidden state is determined by an alignment model whose parameters are trained along with the rest of the model. Since each output word is aligned with different parts of the input sequence, the alignment model measures how well the output at position "i" matches the inputs around position "j". Based on the alignment model we take a weighted sum of the encoder hidden states (the input contexts) to generate each word of the output sequence.
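A NumPy sketch of this weighted sum, using a Bahdanau-style additive score as one concrete choice of alignment model; the parameter names Wa, Ua, va and all sizes are illustrative:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, Wa, Ua, va):
    """Score every encoder hidden state against the current decoder state,
    softmax the scores into alignment weights, and return the weighted sum C_i."""
    # additive score e_ij = va . tanh(Wa s_i + Ua h_j), one per encoder position j
    scores = np.array([va @ np.tanh(Wa @ decoder_state + Ua @ h_j)
                       for h_j in encoder_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax -> alignment weights alpha_ij
    return weights @ encoder_states, weights      # context C_i and the weights for inspection

# toy usage: 6 encoder states of dim 4, one decoder state of dim 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))
s = rng.normal(size=4)
Wa, Ua, va = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=5)
context, alpha = attention_context(s, enc, Wa, Ua, va)
print(context.shape, alpha.round(2))              # -> (4,) and 6 weights summing to 1
```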