Basic submissions
The benchmark provided by the competition uses a flat prior - every class is as likely as any other. However, some critters are more common than others, so the number of training images differs by class.
Hopefully the training and test data are stratified, so the proportion of images in each class is the same in both. If so, we should do best when our prior probabilities for each class match the proportion of training images in that class.
Using a test submission, we can see that using the training distribution as our prior gives a better score than the uniform distribution.
Uniform Probability Benchmark 4.795791 (training=4.795791)
Training Distribution Prior 4.168923 (training=4.164090)
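The gap between those two scores can be reproduced with a small calculation: a constant-prediction submission scores the average of -log(prior[true class]) over the samples. This is a minimal sketch with made-up class counts (the real labels and class counts are not shown in this post); it shows the uniform prior scoring log(K) and the training-distribution prior scoring the entropy of the class distribution, which is always at least as good.

```python
import numpy as np

def prior_logloss(labels, prior):
    """Log loss when every image gets the same predicted probability vector."""
    # Each sample contributes -log(prior[true_class]); average over samples.
    return -np.mean(np.log(prior[labels]))

# Hypothetical imbalanced training set with 3 classes (illustration only).
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
n_classes = 3

uniform = np.full(n_classes, 1.0 / n_classes)
training = np.bincount(labels, minlength=n_classes) / len(labels)

print(prior_logloss(labels, uniform))   # log(3), about 1.0986
print(prior_logloss(labels, training))  # entropy of the class distribution, about 0.8979
```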
We can't tell if this is the best prior, or whether the test distribution differs from the training distribution. However, if we compare Kaggle's public test logloss scores with what the same priors would score on the training data, we find the uniform distribution scores exactly the same and the stratified prior is very close.
This suggests the training and test datasets have very similar class distributions.
I trained with LogisticRegression (LR) and SupportVectorMachine (SVM) on width and height image attributes.
LR took about 10 minutes to 10-fold cross-validate (CV) and train the full model, whereas SVM took a couple of hours - roughly an order of magnitude slower. SVM is slow here because we need to submit probabilities for each class, and SVM is ill-suited to producing them. See here: http://scikit-learn.org/stable/modules/svm.html#scores-probabilities
SVM width, height 3.621177 (CV=3.611137)
LR width, height 3.732008 (CV=3.715337)
LR numpixels, aspectratio 3.796726 (CV=3.775329)
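The speed difference comes down to how the two models produce probabilities. This is a sketch on synthetic stand-in features (the real width/height data isn't reproduced here): LogisticRegression outputs probabilities natively, while scikit-learn's SVC needs `probability=True`, which fits an extra Platt-scaling model via internal cross-validation on top of the SVM itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
# Hypothetical stand-in for the two-column width/height feature matrix.
X = rng.normal(size=(200, 2))
y = rng.integers(0, 3, size=200)

# LR models class probabilities directly.
lr = LogisticRegression(max_iter=1000).fit(X, y)

# probability=True makes SVC fit an additional calibration model using
# internal cross-validation - a large part of why it is so much slower.
svm = SVC(probability=True).fit(X, y)

print(log_loss(y, lr.predict_proba(X)))
print(log_loss(y, svm.predict_proba(X)))
```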
I thought we might do better to use numpixels and aspectratio, but this was worse (with LogisticRegression anyway).
I threw a bunch of the basic attributes I'd written in alongside width and height and ran LogisticRegression again. (This beat our previous top score, which used a random forest on resized, flattened images.)
LR width height mean stderr propwhite propbool propblack 3.180800 (CV=3.161987)
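For reference, here is a sketch of how such per-image attributes might be computed. The thresholds for propwhite and propblack are assumptions, the post does not define propbool so it is omitted, and std is included in line with the suggestion below that standard deviation beats standard error.

```python
import numpy as np

def basic_attributes(img):
    """Basic per-image features (a sketch; thresholds are assumptions).

    `img` is a 2-D uint8 grayscale array in [0, 255].
    """
    h, w = img.shape
    flat = img.astype(float) / 255.0
    return {
        "width": w,
        "height": h,
        "mean": flat.mean(),
        "std": flat.std(),                  # standard deviation
        "propwhite": (flat > 0.9).mean(),   # assumed near-white threshold
        "propblack": (flat < 0.1).mean(),   # assumed near-black threshold
    }

# Toy image: top half white, bottom half black.
img = np.zeros((40, 60), dtype=np.uint8)
img[:20, :] = 255
print(basic_attributes(img))
```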
We can obviously do better than this with a better choice of image attributes. I'm thinking median and interquartile range would be good. Standard error is a poor choice - standard deviation is probably better.
CV seems to be pretty reliable as a slight under-estimator of logloss score. We could investigate these basic attribute models thoroughly without needing to submit each model.
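A convenient way to get that CV logloss estimate without submitting is to score out-of-fold probabilities. This is a sketch on synthetic data (the real features and labels aren't shown here); cross_val_predict with method="predict_proba" yields a prediction for each sample made by a model that never trained on it, giving a leakage-free estimate of the submission score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
# Hypothetical stand-in for the basic-attribute feature matrix and labels.
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)

# Out-of-fold probabilities: each row comes from a fold that excluded it.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=10, method="predict_proba")
print(log_loss(y, proba))
```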