train.py
- trains on a set of training logs using various algorithms
- saves the trained models as joblib pickle files
- predicts the accuracy of the trained models
- takes the following parameters:
  - `--train_data_dir`: sets the location of the training logs (default: `data/train/laptop`)
  - `--test_data_dir`: sets the location of the testing logs (default: `data/test/laptop`)
  - `--save-dir`: sets the location where the joblib pickle files are saved (default: `save`)
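For reference, flags like these are typically defined with argparse. The following is a minimal sketch that mirrors the documented defaults; train.py's actual argument handling may differ:

```python
# Sketch of the documented command-line flags using argparse.
# The defaults come from the list above; the rest is illustrative.
import argparse

parser = argparse.ArgumentParser(description='Train classifiers on log data.')
parser.add_argument('--train_data_dir', default='data/train/laptop',
                    help='location of the training logs')
parser.add_argument('--test_data_dir', default='data/test/laptop',
                    help='location of the testing logs')
parser.add_argument('--save-dir', dest='save_dir', default='save',
                    help='where the joblib pickle files are saved')
args = parser.parse_args()
```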
Make sure you have a recent version of Python 2.7 and pip, then install the required libraries:

```sh
pip install numpy scikit-learn
```
Create the data directories:

```sh
mkdir -p data/{train,test}/laptop
```
Create the save directory:

```sh
mkdir -p save
```
Collect logs, using the first ~90% of each file's lines for training and the last ~10% for testing:

```sh
find /var/log -type f -size +10k -name "*.log" 2>/dev/null | while read log; do
    rows=$(wc -l "$log" | awk '{ print $1 }')
    head -n $((rows - rows / 10)) "$log" > data/train/laptop/"${log##*/}"
    tail -n $((rows / 10)) "$log" > data/test/laptop/"${log##*/}"
done
```
Run the script:

```sh
python2.7 train.py
```
This should give something like the following:

```
Training log collection => 250587 data entries
Testing log collection => 27843 data entries
SGDClassifier Success rate: 97.38%
MultinomialNB Success rate: 98.64%
BernoulliNB Success rate: 96.36%
DecisionTreeClassifier Success rate: 95.26%
ExtraTreeClassifier Success rate: 94.52%
ExtraTreesClassifier Success rate: 99.21%
LinearSVC Success rate: 99.17%
NearestCentroid Success rate: 92.29%
RandomForestClassifier Success rate: 99.06%
RidgeClassifier Success rate: 99.16%
```
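Under the hood, persisting a trained model as a joblib pickle file can look like the sketch below. The toy data, variable names, and `save/` file names are assumptions for illustration, not necessarily train.py's exact code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.externals import joblib  # bundled with scikit-learn in the Python 2.7 era

# Toy stand-ins for the collected log lines and their source-file labels.
lines = ['error: disk full', 'session opened for user root', 'error: timeout']
labels = ['syslog', 'auth.log', 'syslog']

# Turn raw log lines into token-count feature vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lines)

# Fit one classifier and persist both it and the vectorizer as pickle files.
model = MultinomialNB().fit(X, labels)
joblib.dump(model, 'save/MultinomialNB.pkl')
joblib.dump(vectorizer, 'save/vectorizer.pkl')
```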
predict.py
- loads the trained models from joblib pickle files
- predicts the accuracy of the trained models
- takes the following parameters:
  - `--test_data_dir`: sets the location of the testing logs (default: `data/test/laptop`)
  - `--save-dir`: sets the location where the joblib pickle files are saved (default: `save`)
```
$ python2.7 predict.py
Testing log collection => 27843 data entries
SGDClassifier Success rate: 97.38%
MultinomialNB Success rate: 98.64%
BernoulliNB Success rate: 96.36%
DecisionTreeClassifier Success rate: 95.26%
ExtraTreeClassifier Success rate: 94.52%
ExtraTreesClassifier Success rate: 99.21%
LinearSVC Success rate: 99.17%
NearestCentroid Success rate: 92.29%
RandomForestClassifier Success rate: 99.06%
RidgeClassifier Success rate: 99.16%
```
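Loading a saved model back and scoring it on held-out test lines can look like this sketch (the paths and toy data are again illustrative assumptions, matching the save-side sketch above):

```python
from sklearn.externals import joblib

# Toy held-out data; in practice this comes from data/test/laptop.
test_lines = ['error: disk full', 'session closed for user root']
test_labels = ['syslog', 'auth.log']

# Restore the persisted vectorizer and classifier.
vectorizer = joblib.load('save/vectorizer.pkl')
model = joblib.load('save/MultinomialNB.pkl')

# Success rate: fraction of test lines whose source file is predicted correctly.
X_test = vectorizer.transform(test_lines)
accuracy = model.score(X_test, test_labels)
print('MultinomialNB Success rate: %.2f%%' % (accuracy * 100))
```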
Adjust the `algorithms` array to include any number of scikit-learn algorithms you want to run:
```python
algorithms = [
    # svm.SVC(kernel='linear', C=1.0),  # QUITE SLOW
    linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                               random_state=42, max_iter=5, tol=None),
    naive_bayes.MultinomialNB(),
    naive_bayes.BernoulliNB(),
    tree.DecisionTreeClassifier(max_depth=1000),
    tree.ExtraTreeClassifier(),
    ensemble.ExtraTreesClassifier(),
    svm.LinearSVC(),
    # linear_model.LogisticRegressionCV(multi_class='multinomial'),  # A BIT SLOW
    # neural_network.MLPClassifier(),  # VERY SLOW
    neighbors.NearestCentroid(),
    ensemble.RandomForestClassifier(),
    linear_model.RidgeClassifier(),
]
```
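These entries assume the corresponding scikit-learn submodules are imported at the top of the script, for example:

```python
from sklearn import (linear_model, naive_bayes, tree, ensemble,
                     svm, neighbors, neural_network)
```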