- Models with a strong biological inspiration.
- Composed of a set of connected units (neurons). Each connection has an associated weight.
- Each unit has an activation level as well as means to update this level.
- Some units are connected to the outside world. We have input and output neurons.
- Learning within ANNs consists of updating the weights of the network connections
- Each unit has a very simple function: receive the input impulses and calculate its output as a function of these impulses.
- This calculation is divided into two parts:
- a linear combination of the inputs
- a (typically) non-linear activation function
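The two-part calculation of a unit can be sketched in a few lines. This is a minimal illustration only: the name `neuron_output`, the bias term, and the choice of the logistic sigmoid as activation are assumptions, not something fixed by these notes.

```python
import math

def neuron_output(inputs, weights, bias):
    """Output of a single unit (hypothetical helper for illustration)."""
    # Part 1: linear combination of the inputs (weighted sum plus a bias).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Part 2: non-linear activation (the logistic sigmoid is assumed here).
    return 1.0 / (1.0 + math.exp(-z))

# The weighted sum is 0.5*1.0 + (-0.5)*0.0 - 0.5 = 0, and sigmoid(0) = 0.5.
print(neuron_output([1.0, 0.0], [0.5, -0.5], -0.5))  # 0.5
```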
- network with an input layer and an output layer
- It learns by updating the weights through the delta rule with learning rate η
- Perceptrons are limited to linearly separable functions.
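A small sketch of a perceptron trained with the delta rule, w_i ← w_i + η (target − output) x_i. The step activation, the bias term, the value η = 0.5 and the number of epochs are assumptions chosen for this toy example. It learns AND (linearly separable) but cannot learn XOR (not linearly separable).

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_perceptron(examples, eta=0.5, epochs=20):
    """Train a single-layer perceptron with the delta rule (illustrative sketch)."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, bias, x)
            # Delta rule: each weight moves by eta * error * its input.
            weights = [w + eta * error * xi for w, xi in zip(weights, x)]
            bias += eta * error
    return weights, bias

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w, b = train_perceptron(AND)
and_correct = all(predict(w, b, x) == t for x, t in AND)

w, b = train_perceptron(XOR)
xor_correct = all(predict(w, b, x) == t for x, t in XOR)

print(and_correct, xor_correct)  # True False
```

No choice of weights classifies all four XOR cases correctly, so more epochs would not help there.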
- used to determine the output of each node of the neural network
- Linear
- Non-linear: most commonly used, as it allows the model to generalize or adapt to a variety of data
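For illustration, here are one linear and two non-linear activation functions side by side. The selection (identity, sigmoid, ReLU) is an assumption made for this sketch; other choices exist.

```python
import math

def linear(z):
    # Identity: the output is just the weighted sum.
    return z

def sigmoid(z):
    # Squashes any real value into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Passes positive values through and clips negative values to zero.
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, linear(z), round(sigmoid(z), 3), relu(z))
```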
- most popular algorithm for learning ANNs
- It has similarities with the learning algorithm used in perceptron networks
- Intuition:
- each unit is responsible for a certain fraction of the error in the output nodes to which it is connected
- the error is therefore divided according to the weights of the connections between the respective hidden and output units, propagating the errors backwards
- Backpropagation computes the gradient in weight space of a feedforward neural network, with respect to a loss function
- Algorithm:
- Initialize network weights (often small random values)
- Do
- For each example in training set
- predict the output
- calculate the prediction error by a loss function
- compute δh for all the weights from hidden layer to output layer
- compute δi for all the weights from input layer to hidden layer
- update network weights
- Until it converges
- Return the Network
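The algorithm above can be sketched for a network with one hidden layer and one output unit. Everything specific here is an assumption made to keep the example small: sigmoid units, a squared-error loss, the initialization range [−1, 1], η = 0.5, four hidden units, and a fixed number of epochs in place of a convergence test.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(net, x):
    """Return the hidden activations and the single output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_ih"], net["b_h"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_ho"], hidden)) + net["b_o"])
    return hidden, out

def total_loss(net, data):
    # Squared-error loss summed over the training set.
    return sum(0.5 * (t - forward(net, x)[1]) ** 2 for x, t in data)

def train(data, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initialize network weights with random values (range is an assumption).
    net = {
        "w_ih": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b_h": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w_ho": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_o": rng.uniform(-1, 1),
    }
    for _ in range(epochs):
        for x, t in data:
            hidden, out = forward(net, x)          # predict the output
            # Error term of the output unit: loss derivative times sigmoid'.
            delta_o = (out - t) * out * (1 - out)
            # Error terms of the hidden units: the output error is shared
            # out according to the connecting weights (propagated backwards).
            delta_h = [delta_o * w * h * (1 - h)
                       for w, h in zip(net["w_ho"], hidden)]
            # Update network weights by a gradient descent step.
            net["w_ho"] = [w - eta * delta_o * h
                           for w, h in zip(net["w_ho"], hidden)]
            net["b_o"] -= eta * delta_o
            for j in range(n_hidden):
                net["w_ih"][j] = [w - eta * delta_h[j] * xi
                                  for w, xi in zip(net["w_ih"][j], x)]
                net["b_h"][j] -= eta * delta_h[j]
    return net

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
untrained = train(XOR, epochs=0)   # same seed, so the same initial weights
trained = train(XOR)
print(total_loss(untrained, XOR), total_loss(trained, XOR))
```

The second printed loss should be lower than the first. How close it gets to zero depends on the seed and the settings assumed above.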
- Stopping Criteria:
- maximum number of iterations
- error based on the training set (stop when the error on the training set falls below a certain limit)
- error based on a validation set, independent of the training set (stop when the error on the validation set reaches a minimum)
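The three criteria can be combined in a single check. The sketch below is hypothetical: the function name, the `patience` parameter (how many epochs without improvement count as "reached a minimum"), the thresholds, and the error curves in the usage example are all made up for illustration.

```python
def stopping_epoch(train_errors, val_errors, max_epochs=100,
                   train_limit=0.01, patience=2):
    """Return (epoch, reason) for the first stopping criterion that fires."""
    best_val = float("inf")
    epochs_since_best = 0
    curves = zip(train_errors, val_errors)
    for epoch, (train_err, val_err) in enumerate(curves, start=1):
        # Criterion: training error below a certain limit.
        if train_err < train_limit:
            return epoch, "training error below limit"
        # Criterion: validation error has passed its minimum.
        if val_err < best_val:
            best_val, epochs_since_best = val_err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch, "validation error stopped improving"
        # Criterion: maximum number of iterations.
        if epoch >= max_epochs:
            return epoch, "maximum number of iterations"
    return len(train_errors), "error curves exhausted"

# Made-up error curves: training error keeps falling while validation
# error reaches its minimum at epoch 3 and then rises.
train_curve = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
val_curve = [0.9, 0.7, 0.5, 0.55, 0.6, 0.7]
print(stopping_epoch(train_curve, val_curve))
# (5, 'validation error stopped improving')
```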
- The number of nodes in the hidden layer
- few nodes: underfitting
- many nodes: overfitting
- there is no general criterion for choosing the number of nodes in the hidden layer
- Effect of learning rate (sets the size of the steps taken in the direction of steepest descent)
- a small learning rate leads to longer training times
- a high learning rate may lead to non-convergence
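The effect of the learning rate shows up even on a one-dimensional toy problem. The sketch below minimizes f(w) = w² by gradient descent. The function, the starting point, and the three learning rates are assumptions picked for illustration.

```python
def gradient_descent(eta, steps=50, w=1.0):
    """Minimize f(w) = w**2 starting from w with a fixed learning rate eta."""
    for _ in range(steps):
        grad = 2.0 * w      # derivative of f(w) = w**2
        w -= eta * grad     # step against the gradient
    return w

for eta in (0.01, 0.4, 1.1):
    print(eta, gradient_descent(eta))
# eta = 0.01: still far from the minimum at 0 after 50 steps (slow learning)
# eta = 0.4:  very close to 0
# eta = 1.1:  each step overshoots, so |w| grows (no convergence)
```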
- Optimal number of hidden neurons
- too many hidden neurons: the network overfits; the training set is memorized, making the network useless on new data sets
- not enough hidden neurons: the network is unable to learn the underlying concept
- Overtraining: after training for too long on the same examples, the ANN memorizes the examples instead of the general idea
- Network Structure
- number of layers
- number of neurons in each layer
- weights initialization
- activation function
- Training Algorithm
- learning rate
- number of epochs
- early stopping criterion
- weight decay (a regularization on the network weights)
- Data should be standardized
- Missing values in input features may be represented as zeros: a zero input contributes nothing to the weighted sum or to the weight update, so it does not influence the neural net training process.
- Use one-hot encoding for the class: there are M output neurons (1 per class)
- For each case, predict the class with the highest probability value
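The data preparation steps above can be sketched with the standard library. The helper names are hypothetical, missing values are assumed to be stored as `None`, and the population standard deviation is an arbitrary choice.

```python
import statistics

def standardize(column):
    """Rescale a feature to zero mean and unit standard deviation."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    std = statistics.pstdev(observed)
    # Missing values become 0.0, which is the feature mean after rescaling.
    return [0.0 if v is None else (v - mean) / std for v in column]

def one_hot(label, classes):
    """Target vector with one position per class (M output neurons)."""
    return [1 if c == label else 0 for c in classes]

def predicted_class(probabilities, classes):
    """Class whose output neuron has the highest value."""
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best]

classes = ["a", "b", "c"]
print(standardize([2.0, None, 6.0]))               # [-1.0, 0.0, 1.0]
print(one_hot("b", classes))                       # [0, 1, 0]
print(predicted_class([0.1, 0.7, 0.2], classes))   # b
```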
- Initialize the weights with small random values, e.g. in [−0.05, 0.05]
- Shuffle the training set between epochs, i.e. change the sequence of the examples
- The learning rate should start at a high value and decrease progressively
- Train the network several times using different initialization of the weights
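These practical tips translate into a few small helpers. This is a sketch under assumptions: the function names, the inverse-time decay schedule with its constants, and the `train_once` callable are hypothetical and only illustrate the idea.

```python
import random

def init_weights(n, rng, limit=0.05):
    # Small random values in [-limit, limit].
    return [rng.uniform(-limit, limit) for _ in range(n)]

def epoch_order(n_examples, rng):
    # A fresh random order of the example indices for each epoch.
    order = list(range(n_examples))
    rng.shuffle(order)
    return order

def learning_rate(epoch, eta0=0.5, decay=0.1):
    # Starts at eta0 and decreases progressively (inverse-time decay assumed).
    return eta0 / (1.0 + decay * epoch)

def best_of_restarts(train_once, seeds):
    """Train once per seed and keep the run with the lowest validation error.

    train_once is a hypothetical callable taking a random.Random instance
    and returning a (model, validation_error) pair.
    """
    runs = [train_once(random.Random(seed)) for seed in seeds]
    return min(runs, key=lambda run: run[1])

rng = random.Random(0)
print(init_weights(3, rng))
print(epoch_order(5, rng))
print([round(learning_rate(e), 3) for e in (0, 10, 100)])
```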
- Tolerance of noisy data
- Ability to classify patterns on which they have not been trained
- Successful on a wide range of real-world problems
- Algorithms are inherently parallel
- Long training times
- Resulting models are essentially black boxes
- Deep learning = Deep neural networks
- Feedforward neural networks
- Neurons typically use the ReLU or sigmoid activation functions
- Weight multiplications are replaced by convolutions (filters)
- Change of paradigm: can be applied directly to the raw signal, without first computing ad hoc features
- Features are learnt automatically
- Convolution: mathematical operation between 2 matrices
- Reduced number of parameters to learn (local features)
- More efficient than dense multiplication
- Designed specifically for images or data with grid-like topology
- Convolutional layers are equivariant to translation
- Currently state-of-the-art in several tasks
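A minimal sketch of a 2-D convolution on nested lists, illustrating the small parameter count and the translation equivariance. It assumes no padding and stride 1, and it computes the cross-correlation form (no kernel flip), which is what deep learning libraries usually call convolution. The image and kernel values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

kernel = [[1, 2],
          [3, 4]]                 # only 4 parameters to learn

image = [[0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

shifted = [[0, 0, 0, 0, 0],       # same image moved one pixel to the right
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0]]

out = conv2d(image, kernel)
out_shifted = conv2d(shifted, kernel)
print(out)          # [[4, 3, 0, 0], [2, 1, 0, 0], [0, 0, 0, 0]]
print(out_shifted)  # [[0, 4, 3, 0], [0, 2, 1, 0], [0, 0, 0, 0]]
```

Shifting the input by one pixel shifts the output by one pixel, which is the equivariance property. For comparison, a dense layer mapping the same 20 inputs to the same 12 outputs would need 240 weights instead of 4.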