
Convolutional Neural Networks


Convolutional Neural Network Intuitions:

Insights:

  1. Using a combination of a weight matrix and a non-linear activation function, a network can learn any arbitrary function.
  2. Convolution is a special case of multiplying the input by a weight matrix, so a convolutional layer can also learn any arbitrary function. Convolutions are also particularly well suited to learning image features. Great!!!
  3. Convolutions are position invariant, but at the same time they capture the relative position (location) of features.
  4. Max pooling discards information (it generally halves each spatial dimension), so the number of filters is typically increased after a max-pooling layer.
  5. Weight Initialization in CNN
  6. Softmax intuition: we want to learn a one-hot-encoded output, so softmax pushes the output probabilities toward very high and very low values (they may or may not be close to 1), which makes learning faster. The log-loss (cross-entropy) layer is log based, so having the exponential in softmax (exp(i) / sum of exp over all i) makes sense, as the two are inverses of each other (see the softmax/cross-entropy sketch after this list).
  7. No two filters in a CNN end up identical, as learning the same filter twice is not optimal - SGD won't converge to it.
  8. Dealing with large images: foveation (inspired by the human eye) can be used to divide the image into low-resolution and high-resolution regions. We can also build something like an attention model (using an RNN? active research) to refocus the network's attention during training or inference.
  9. Retraining: how much we retrain depends on how much the new features differ from the current model. For a simple cats-vs-dogs model built on ImageNet, we can retrain just the last dense layer; for State Farm, we retrain all the dense layers. The convolutional layers mainly learn spatial features, so retraining them is not required. When retraining, we do not start from random weights - we start from the ImageNet-optimal weights (see the fine-tuning sketch after this list).
  10. Computational complexity vs. memory utilization: in a convolutional layer, e.g. a 14x14 feature map with 512 filters of 3x3, there is a lot of computation but the number of weights (memory) is small. In a 4096x4096 dense layer the total number of weights is huge, so memory utilization is high but the computation is comparatively small (see the arithmetic after this list).
  11. Underfitting
  12. Overfitting: the model fits the training data too closely, so training accuracy keeps improving while validation accuracy goes down. Ways to reduce it:
    1. Add More data
    2. data augmentation: retrain the network on rotated, zoomed and shifted copies of the images so that it learns these variations and validates better. Channel augmentation handles colour changes, since images may have different colour casts.
    3. use architecture that generalize well
    4. use regularization, e.g. L2 regularization: loss += small_number x (squared sum of weights), which penalizes large weights (the sketch after this list shows this together with dropout and batch normalization)
    5. use dropout
    6. reduce architecture complexity
  13. Dropout: normally added to the dense (fully connected) layers, but dropout can also be applied to convolutional layers. We don't want dropout in the early layers, as the information lost there is lost for all the following layers.
  14. Batch normalization: the intuition comes from input normalization in classical machine learning. If any weight is very large, one layer produces big numbers, some activations blow up, and the whole model becomes skewed. So we normalize the activations of intermediate layers as well - but a plain mean-subtraction and division by the standard deviation alone won't stick, because SGD can simply undo it by rescaling the weights if it decides to. The batch-norm trick: why? Training is roughly 10 times faster (the learning rate can be about 10 times higher) and it reduces overfitting.
    1. Normalize the intermediate layers similar to input layers
    2. Add two more parameters per layer: one that multiplies all the activations (to set an arbitrary standard deviation) and one that is added to all the activations (to set an arbitrary mean)
    3. with both parameters in place, the model knows it can rescale the activations through SGD instead of blowing up the weights
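
A minimal NumPy sketch of the softmax / cross-entropy pairing described in point 6. The logits and target class below are made-up illustration values, not taken from the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, target_idx):
    return -np.log(probs[target_idx])  # log loss on the true class only

logits = np.array([2.0, 0.5, -1.0])  # raw scores from the last dense layer
probs = softmax(logits)              # exp() pushes scores apart, toward one-hot
loss = cross_entropy(probs, 0)       # log() undoes the exp(), keeping the
                                     # gradient w.r.t. the true-class logit simple
print(probs, loss)
```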
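
A minimal Keras sketch of the retraining idea in point 9: keep the ImageNet convolutional weights frozen and retrain only new dense layers on top. The 256-unit layer and the 2-class output are illustrative assumptions (e.g. cats vs dogs), not the exact setup from the course.

```python
from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False              # conv layers keep their ImageNet weights

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)     # new dense layer, trained from scratch
out = Dense(2, activation='softmax')(x)  # e.g. cats vs dogs

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(...) would then update only the dense-layer weights.
```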
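
A back-of-the-envelope check of point 10, using the VGG-style layer sizes quoted above (illustrative numbers only).

```python
# 14x14x512 feature map, 512 filters of 3x3x512: few weights, many multiplies
conv_weights = 3 * 3 * 512 * 512              # ~2.4M parameters to store
conv_mults   = 14 * 14 * 3 * 3 * 512 * 512    # ~462M multiplies: compute-heavy

# 4096x4096 dense layer: many weights, comparatively few multiplies
dense_weights = 4096 * 4096                   # ~16.8M parameters: memory-heavy
dense_mults   = 4096 * 4096                   # ~16.8M multiplies

print(conv_weights, conv_mults)
print(dense_weights, dense_mults)
```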
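
A minimal Keras sketch tying together points 12.4, 13 and 14: L2 weight decay on a dense layer, dropout after the fully connected block, and batch normalization (with its learned scale and shift parameters) after a conv layer. All layer sizes and the 1e-4 coefficient are illustrative assumptions.

```python
from keras.models import Sequential
from keras.layers import (Conv2D, BatchNormalization, Activation,
                          MaxPooling2D, Flatten, Dense, Dropout)
from keras.regularizers import l2

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(64, 64, 3)),
    BatchNormalization(),                # normalizes activations; learns scale/shift
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu',
          kernel_regularizer=l2(1e-4)),  # adds small_number * sum(W**2) to the loss
    Dropout(0.5),                        # dropout on the dense block, not early convs
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```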

1 x 1 Convolution

1x1 convolutions are mainly used for dimensionality reduction - a feature-pooling technique that pools features across channels/feature maps. E.g. for a 200x200x50 tensor, applying twenty 1x1x50 kernels takes a weighted sum over the 50 channels at each position and gives a tensor of reduced dimension, i.e. 200x200x20. Although a 1x1 convolution is linear, it is followed by a non-linear activation such as ReLU, and the transformation is learned through SGD. It also suffers less from overfitting due to the small 1x1 kernel. In modern architectures it can be used to make a network wider rather than deeper. In the GoogLeNet paper, 1x1 convolutions were placed before the 3x3 and 5x5 convolutions to reduce dimensions; this is not a simple stacking of two convolutions, but a convolution followed by a non-linear layer. A 1x1 convolution cannot be the initial convolution, as the first layer needs a larger kernel with a large receptive field to capture local spatial information. In summary, a 1x1 convolution is a form of cross-channel parametric pooling.
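
A minimal Keras sketch of this cross-channel pooling idea, using the shapes from the example above: twenty 1x1x50 filters plus a ReLU reduce a 200x200x50 feature map to 200x200x20.

```python
from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential([
    Conv2D(20, (1, 1), activation='relu', input_shape=(200, 200, 50)),
])
model.summary()  # output shape: (None, 200, 200, 20); params: 50*20 + 20 = 1020
```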

Convolution Intuition and Basics:

  1. Convolution Arithmetic Guide
  2. Convolution from Different Perspective
