The goals / steps of this project are the following:
- Load the data set (see below for links to the project data set)
- Explore, summarize and visualize the data set
- Design, train and test a model architecture
- Use the model to make predictions on new images
- Analyze the softmax probabilities of the new images
- Summarize the results with a written report
Note: All the code is in the 'Traffic_Sign_Classifier.ipynb' notebook; an HTML export of the notebook is available as 'Traffic_Sign_Classifier.html'.
The data is not committed to GitHub; it was downloaded from here: https://s3-us-west-1.amazonaws.com/udacity-selfdrivingcar/traffic-signs-data.zip
I used numpy and basic Python functionality to calculate summary statistics of the traffic signs data set (see the sketch after this list):
- The size of the training set is 34,799
- The size of the validation set is 4,410
- The size of the test set is 12,630
- The shape of a traffic sign image is 32x32x3
- The number of unique classes/labels in the data set is 43
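A minimal sketch of how these statistics can be computed; the file names and the 'features'/'labels' dictionary keys are assumptions based on the layout of the dataset archive above:

```python
import pickle
import numpy as np

def load(path):
    """Load one pickled split; keys assumed from the dataset archive."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    return data['features'], data['labels']

X_train, y_train = load('train.p')
X_valid, y_valid = load('valid.p')
X_test, y_test = load('test.p')

print('Training set size:', len(X_train))          # 34,799
print('Validation set size:', len(X_valid))        # 4,410
print('Test set size:', len(X_test))               # 12,630
print('Image shape:', X_train[0].shape)            # (32, 32, 3)
print('Unique classes:', len(np.unique(y_train)))  # 43
```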
Here is an exploratory visualization of the data set. It is a plot showing the number of samples of each traffic sign type in the training, validation and test datasets:
Some classes clearly have more samples than others; however, the overall distributions look similar across the datasets (whenever one class has more samples than another in the training data, the same relationship holds in the validation and test sets).
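The per-class counts in such a plot can be reproduced with numpy's `bincount`, roughly as follows (variable names follow the loading sketch above):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, sharex=True, figsize=(10, 8))
for ax, (name, labels) in zip(axes, [('training', y_train),
                                     ('validation', y_valid),
                                     ('test', y_test)]):
    # Number of samples per class id (43 classes total)
    ax.bar(np.arange(43), np.bincount(labels, minlength=43))
    ax.set_ylabel(name)
axes[-1].set_xlabel('class id')
plt.show()
```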
As a first step, each image is converted to the YCrCb color space. As a second step, the Y channel is extracted, so the image is represented as a grayscale image with a single color channel.
This decision was made experimentally. Originally I used all three channels of the RGB color space, then tried the Y channel of YCrCb as was done in this paper: http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf, and the network's performance with the Y channel was satisfactory. I could potentially try other color spaces and other channels, but the performance using the Y channel is good enough.
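A sketch of this conversion with OpenCV (in `cv2.COLOR_RGB2YCrCb` output, the Y channel is channel 0); the notebook's exact helper may differ:

```python
import cv2

def to_y_channel(rgb_image):
    """Extract the Y (luma) channel, returning shape (32, 32, 1)."""
    ycrcb = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2YCrCb)
    return ycrcb[:, :, :1]  # keep an explicit single-channel axis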
Here is an example of a traffic sign image before and after grayscaling.
As a last step, I normalized each image by subtracting the mean pixel value and dividing by the standard deviation, bringing all pixel values to the same scale with zero mean and unit standard deviation.
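The write-up does not say whether the statistics are computed per image or over the whole training set; a per-image version looks like this:

```python
import numpy as np

def normalize(image):
    image = image.astype(np.float32)
    return (image - image.mean()) / image.std()  # zero mean, unit std
```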
Additional data was generated by shifting, rotating and zooming the images.
Here is an example of an original image and transforms applied to it.
Original image:
Two scale transforms (1.1 and 0.9), two rotation transforms (15 and -15 degrees) and two shift transforms ((2, 2) and (2, -2)) were applied. As a result, the training dataset grew by a factor of 7, to 243,593 training examples after augmentation.
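Illustrative versions of the three transforms using OpenCV affine warps (the notebook's actual implementation may differ):

```python
import cv2
import numpy as np

def shift(image, dx, dy):
    m = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(image, m, (image.shape[1], image.shape[0]))

def rotate(image, angle_deg, scale=1.0):
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, scale)
    return cv2.warpAffine(image, m, (w, h))

def augment(image):
    """The six extra copies generated per training image."""
    return [rotate(image, 0, 1.1), rotate(image, 0, 0.9),  # scale
            rotate(image, 15), rotate(image, -15),         # rotate
            shift(image, 2, 2), shift(image, 2, -2)]       # shift
```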
The final model consists of the following layers (a Keras-style sketch follows the table):
Layer | Description |
---|---|
Input | 32x32x1 Grayscale image |
Convolution 5x5 | 1x1 stride, valid padding, outputs 28x28x100 |
RELU | |
DROPOUT | 0.8 keep probability |
Max pooling | 2x2 stride, outputs 14x14x100 |
Convolution 5x5 | 1x1 stride, valid padding, outputs 10x10x150 |
RELU | |
DROPOUT | 0.8 keep probability |
Convolution 3x3 | 1x1 stride, valid padding, outputs 8x8x200 |
RELU | |
DROPOUT | 0.8 keep probability |
Max pooling | 2x2 stride, outputs 4x4x200 |
Fully connected | Input 3200 after flattening. Output 200 |
RELU | |
DROPOUT | 0.8 keep probability |
Fully connected | Input 200. Output 84 |
RELU | |
DROPOUT | 0.8 keep probability |
Fully connected | Input 84. Output 43 |
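A compact Keras sketch of the table above (the notebook itself builds the network in TensorFlow directly; a Dropout rate of 0.2 corresponds to the 0.8 keep probability, and Keras disables dropout at inference automatically):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(100, 5, padding='valid', activation='relu'),  # 28x28x100
    layers.Dropout(0.2),
    layers.MaxPooling2D(2),                                     # 14x14x100
    layers.Conv2D(150, 5, padding='valid', activation='relu'),  # 10x10x150
    layers.Dropout(0.2),
    layers.Conv2D(200, 3, padding='valid', activation='relu'),  # 8x8x200
    layers.Dropout(0.2),
    layers.MaxPooling2D(2),                                     # 4x4x200
    layers.Flatten(),                                           # 3200
    layers.Dense(200, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(84, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(43),                                           # logits
])
```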
The Adam optimizer is used for weight updates.
Batch Size: 100
Learning Rate: 0.001
Number of Epochs: 20 normally; 5 for the model trained on the augmented dataset.
Dropout keep probability: 0.8 for training, 1.0 for validation and testing.
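A hypothetical Keras equivalent of these settings (the notebook's own training loop may look different; `X_train` etc. are assumed to hold the preprocessed single-channel images):

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=100, epochs=20,
          validation_data=(X_valid, y_valid))
```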
My final model results were:
- training set accuracy of 99.8%
- validation set accuracy of 99.0%
- test set accuracy of 96.9%
The final model was obtained after a series of experiments. Below is a brief description of each experiment's setup, its accuracy, and the transition to the next experiment:
- I started with the LeNet model. The reasoning was that it classifies digits well, so there was a chance it would also be able to classify German traffic signs, which are 32x32 images as well. The only difference was that LeNet's inputs were grayscale while the German traffic sign images have 3 channels by default, so the model was updated to accept 32x32x3 images.
This model did not give good results. I lost the exact accuracy numbers, but validation and test accuracy were around 70%. I also tried running LeNet on grayscale images (converting RGB to grayscale first), which did not produce good results either. It is worth noting that originally I used a very simple normalization: subtracting 128 and dividing by 255.
- After reading this paper, http://yann.lecun.com/exdb/publis/pdf/sermanet-ijcnn-11.pdf, I realized it might be worth increasing the number of filters in the convolutional layers. This model was exactly the same as the previous one, but the first conv layer now had 100 filters and the second 200. Validation set accuracy was 97.4%; test set accuracy was 84.5%. It is worth noting that at this point I had downloaded only the training and test datasets and was splitting the training dataset with the sklearn module to obtain a validation set. It is hard to explain why, but I observed much better performance on the validation data than on the test data; as a result, I was not satisfied with this model.
- Converting images to YCrCb first and applying the same model as above produced a validation accuracy of 90.1% and a test set accuracy of 69.2%.
- Same model as above, but using only the Y channel of YCrCb.
Validation accuracy: 95.8%
Test accuracy: 84.8%
- Added a third convolutional layer and updated the number of filters to 100, 150 and 200 for the first, second and third conv layers respectively, as in the final model.
Validation accuracy: 97.5%
Test accuracy: 87.7%
- Added dropout layers as in the final model.
Validation accuracy: 98.5%
Test accuracy: 90.9%
- Improved the image normalization to subtract the mean and divide by the standard deviation. I also started using the dataset provided by the project description. This is the final model architecture.
Validation accuracy: 97.9%
Test accuracy: 95.3%
- Augmented the training dataset with image scaling, rotation and shifting.
Validation accuracy: 99.0%
Test accuracy: 96.9%
Some thoughts on the final model:
- A convolutional layer needs more filters to capture the different properties of an image. Since there are 43 classes, the network must capture a diverse set of features, and the 6 and 16 filters used in LeNet are probably not enough.
- Dropout layers make training more robust by creating redundant connections that activate under similar circumstances; this is achieved by randomly turning off a percentage of connections during training. Dropout layers should also reduce overfitting.
- Ideally, if I were building a production-ready model, I would experiment much more with the model parameters and potentially find a simplified model that still produces great results, with the goal of reducing both training and prediction time.
Here are five German traffic signs that I found on the web:
The images are resized to 32x32x3 and preprocessed with the same logic as the training/validation/test images.
Such images should generally be simple to classify; however, they have interesting properties, such as watermarks from the websites that distribute them and background objects.
The accuracy on these five images is 100%.
Image | Prediction |
---|---|
No passing | No passing |
Road work | Road work |
Children crossing | Children crossing |
End of no passing | End of no passing |
Wild animals crossing | Wild animals crossing |
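The top-5 listings below can be produced with a softmax followed by a top-k query, roughly as follows (`new_images` and `sign_names`, a class-id-to-name mapping such as the one in the dataset's signnames.csv, are assumed names):

```python
# Softmax over the logits, then the five most likely classes per image
probs = tf.nn.softmax(model(new_images), axis=-1)
top5 = tf.math.top_k(probs, k=5)
for i, (values, ids) in enumerate(zip(top5.values.numpy(),
                                      top5.indices.numpy())):
    print(f'For image # {i + 1} the top 5 answers are:')
    for p, class_id in zip(values, ids):
        print((sign_names[class_id], p))
```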
In general, the model outputs very high probabilities for the correct class, which is very good. Here is the output of the top 5 probabilities for each test image:
For image # 1 the top 5 answers are:
('No passing', 1.0)
('End of no passing', 6.2027712e-09)
('Speed limit (120km/h)', 9.8718078e-10)
('Slippery road', 4.6131909e-10)
('End of all speed and passing limits', 3.2826991e-10)
Correct answer is: No passing
For image # 2 the top 5 answers are:
('Road work', 1.0)
('Dangerous curve to the right', 7.0088174e-10)
('Slippery road', 2.6380976e-12)
('Beware of ice/snow', 5.7867475e-13)
('Double curve', 3.8999544e-13)
Correct answer is: Road work
For image # 3 the top 5 answers are:
('Children crossing', 0.99859339)
('Beware of ice/snow', 0.00048915728)
('Dangerous curve to the right', 0.00031584653)
('Slippery road', 0.00028684994)
('Traffic signals', 8.5214539e-05)
Correct answer is: Children crossing
For image # 4 the top 5 answers are:
('End of no passing', 0.65861064)
('End of all speed and passing limits', 0.30596185)
('Priority road', 0.020915516)
('End of no passing by vehicles over 3.5 metric tons', 0.0036699278)
('End of speed limit (80km/h)', 0.0032210965)
Correct answer is: End of no passing
For image # 5 the top 5 answers are:
('Wild animals crossing', 1.0)
('Double curve', 1.9002666e-08)
('Road work', 3.1246654e-09)
('Speed limit (50km/h)', 3.2554637e-10)
('No passing for vehicles over 3.5 metric tons', 1.7079377e-10)
Correct answer is: Wild animals crossing
All 5 images downloaded from the internet were predicted correctly. Remarkably, 4 of the 5 predictions have a softmax probability of 1.0, while the fifth, 'End of no passing', has a probability of about 66%, with the second-highest probability of about 31% for 'End of all speed and passing limits'.
Some layers of the trained neural network were visualized based on the activations for one test image, to understand what kind of features the network captures. Only layers before the first pooling layer were visualized, since after the first pooling layer the activations are not very representative: they encode relationships between activations of the first convolutional layer (and the subsequent ReLU, dropout and pooling layers).
- Visualization of the first convolutional layer (only 49 out of 100 filters are shown because the plotting library cannot visualize more):
As we can see, the network seems to have learned filters that transform the input image with various kinds of 'edge detection', each with different properties.
- Visualization of the ReLU activation layer after the first convolutional layer:
The ReLU layer removes a lot of pixel values (negative ones become 0), which contributes to filtering that looks like real edge detection with varying gradient direction and magnitude.
- Visualization of the max pooling layer after the first activation layer:
The pooling layer downsamples the activations of the previous layer, preserving the detected edges and features while reducing the number of pixels. This reduces the amount of information passed to subsequent layers, essentially letting them focus on the important 'features' while discarding redundant data.
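A sketch of how such feature maps can be plotted with a Keras-style model; `layer_name` and the 7x7 grid (mirroring the 49-of-100 limitation mentioned above) are illustrative:

```python
import matplotlib.pyplot as plt

def show_feature_maps(model, layer_name, image, grid=7):
    """Tile one layer's activations for a single input image."""
    probe = tf.keras.Model(model.inputs,
                           model.get_layer(layer_name).output)
    maps = probe(image[None, ...]).numpy()[0]  # shape (H, W, n_filters)
    fig, axes = plt.subplots(grid, grid, figsize=(10, 10))
    for i, ax in enumerate(axes.flat):
        if i < maps.shape[-1]:
            ax.imshow(maps[:, :, i], cmap='gray')
        ax.axis('off')
    plt.show()
```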