Algorithm selection benchmarks experimental setup
Disclaimer: While we have structured the experimental setup and the evaluation procedure in a clean and reasonable manner - both from an algorithm selection and a machine learning perspective - they are meant as a guide. The results are not meant to demonstrate the state of the art or the best possible results. Much can be varied and probably improved, both in the preprocessing and in the modelling itself. The repository now offers a structured scientific basis for evaluating exactly these kinds of questions.
The experiments use the LLAMA algorithm selection system and the mlr machine learning toolbox.
The data in each algorithm selection task is preprocessed as follows.
- For repeated feature calculations, the mean value is taken across all repetitions.
- Completely constant features are removed, as nothing can be learned from them.
- NA feature values are imputed by the mean value across all non-NA values for the feature. Note that this is a naive approach that can likely be improved upon (see the sketch below this list).
- Any tasks that have repeated performance measurements are discarded. This currently applies to no tasks. (This will soon be handled in a better manner, and example tasks where this occurs will be included.)
- If the performance measure is runtime and the feature computation presolves the instance, the cost of all algorithms is set to the cost of all feature steps up to the presolving one (note that feature steps are specified in their order of execution). That means that the individual costs of all algorithms are discarded for this instance and they all incur the same cost.
- If the performance measure is runtime, the cost of feature computation is added to the algorithms' performances, unless we are considering only the baselines which do not incur feature costs.
- The success for each algorithm on each problem instance is computed by checking whether the run status was OK and the performance value is not NA. If the performance measure is runtime and a cutoff value has been specified, any algorithm runs that exceed the cutoff when taking the cost of feature computation into account are marked as not successful.
- The performance values for all not successful algorithm runs are imputed: by the cutoff value if one is present and the performance metric is runtime, and by ten times the maximum performance otherwise.
- The algorithm selection task data structure is converted to a LLAMA data frame.
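As an illustration of the first three preprocessing steps, here is a minimal base-R sketch. All names (`raw`, `feats`, the feature columns) are hypothetical and only serve to show the idea; the actual preprocessing code in the repository may be organised differently.

```r
# Hypothetical raw feature table: one row per (instance, repetition);
# feature columns may contain NAs.
raw <- data.frame(
  instance_id = c("i1", "i1", "i2", "i2", "i3", "i3"),
  repetition  = c(1, 2, 1, 2, 1, 2),
  f1 = c(1.0, 1.2, 1.8, 2.0, 0.4, 0.6),
  f2 = c(5, 5, 5, 5, 5, 5),           # completely constant feature
  f3 = c(0.3, 0.5, 0.7, 0.9, NA, NA)  # missing for all repetitions of i3
)

# 1. Take the mean over repetitions for each instance.
feats <- aggregate(cbind(f1, f2, f3) ~ instance_id, data = raw,
                   FUN = mean, na.rm = TRUE, na.action = na.pass)

# 2. Remove completely constant features.
feat.cols <- setdiff(names(feats), "instance_id")
constant <- sapply(feats[feat.cols],
                   function(col) length(unique(na.omit(col))) <= 1)
feats <- feats[, c("instance_id", feat.cols[!constant]), drop = FALSE]

# 3. Impute remaining missing values by the mean of the non-NA values of that feature.
for (col in setdiff(names(feats), "instance_id")) {
  v <- feats[[col]]
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  feats[[col]] <- v
}
```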
The cross-validation splits defined in the respective tasks are preserved in the LLAMA data frame. That is, the model building functions will use these splits for training and testing.
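The conversion itself can be sketched with LLAMA's `input()` function. This is an assumption about the setup, not the repository's actual code: the toy tables below are made up, the exact arguments of `input()` may differ between LLAMA versions, and in the repository the predefined splits of the task are attached rather than freshly generated folds.

```r
library(llama)

set.seed(1)
n <- 10  # number of toy instances
ids <- paste0("i", 1:n)

# Hypothetical, already preprocessed feature and performance tables; feature and
# performance data are linked through their common column ("instance_id").
feats <- data.frame(instance_id = ids, f1 = runif(n), f3 = runif(n))
perf  <- data.frame(instance_id = ids,
                    algoA = runif(n, 1, 3600), algoB = runif(n, 1, 3600))
succ  <- data.frame(instance_id = ids,
                    algoA = perf$algoA < 3000, algoB = perf$algoB < 3000)

# Convert to a LLAMA data frame; minimize = TRUE because runtime is to be minimized.
ldf <- input(features = feats, performances = perf,
             successes = succ, minimize = TRUE)

# The repository attaches the cross-validation splits defined in the task;
# cvFolds() is used here only because this toy example has no predefined splits.
folds <- cvFolds(ldf, nfolds = 10L)
```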
We use the following classification, regression and clustering approaches in our experiments.
- classification (from Weka)
  - meta/AdaBoostM1
  - bayes/BayesNet
  - lazy/IBk
  - rules/OneR
  - trees/RandomTree
  - trees/J48
  - rules/JRip
- classification (from mlr)
  - classif.ctree
  - classif.ksvm
  - classif.naiveBayes
  - classif.randomForest
  - classif.rpart
- regression (from mlr)
  - regr.earth
  - regr.lm
  - regr.randomForest
  - regr.rpart
- clustering (from Weka)
  - EM
  - FarthestFirst
  - SimpleKMeans
The parameters of all algorithms were left at their default values. (Note that this is a somewhat dubious approach for models like SVMs. Our experimental setup will later include the necessary automatic hyperparameter tuning for such models.) For the clustering algorithms, the (maximum) number of clusters was set to the number of algorithms in the respective task.
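For concreteness, here is a hedged sketch of how two of the mlr-based models might be trained through LLAMA, using the `folds` object from the sketch above. The argument names reflect recent LLAMA/mlr versions and may differ from the ones used for the experiments; how the Weka classifiers and clusterers are invoked is not shown here.

```r
library(llama)
library(mlr)

# 'folds' is assumed to be a LLAMA data frame with attached cross-validation splits.

# Classification: predict the best algorithm for each instance directly.
model.rf <- classify(classifier = makeLearner("classif.randomForest"), data = folds)

# Regression: predict each algorithm's performance and select the best predicted one.
model.lm <- regression(regressor = makeLearner("regr.lm"), data = folds)
```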
For comparison purposes, we use the following baseline algorithms.
- virtual best solver
- single best solver by number of times it had the best performance
- single best solver by PAR10 score
- single best solver by number of successful solver runs
The baseline algorithms are not cross-validated, which gives an optimistic performance estimate for the single best solvers.
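These baselines correspond to functions provided by LLAMA. A minimal sketch, assuming the function names of recent LLAMA versions and the `folds` object from above:

```r
library(llama)

vbs.pred     <- vbs(folds)                    # virtual best solver
sb.count     <- singleBestByCount(folds)      # best performance most often
sb.par       <- singleBestByPar(folds)        # best PAR10 score
sb.successes <- singleBestBySuccesses(folds)  # most successful runs
```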
The model building functions in LLAMA were also used with their default parameters. For example, the regression model builder allows stacking multiple models; this was not used.
In all cases, LLAMA's normalize() function was used to normalize the feature vectors before passing them to the machine learning models.
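In LLAMA this normalization is done by passing `normalize` as the `pre` argument of the model-building functions. A minimal sketch, assuming the `pre` argument behaves as in current LLAMA versions:

```r
library(llama)
library(mlr)

# Normalize the feature vectors before they are passed to the learner.
model <- classify(classifier = makeLearner("classif.rpart"),
                  data = folds, pre = normalize)
```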
For each trained model, we record the mean number of successful solver runs in the predictions, the mean PAR10 score of the predictions, and the mean misclassification penalty of the predictions. The averaging is done over all predictions of a given test set in the cross-validation and then across all test sets (usually 10, as we normally evaluate with 10-fold cross-validation).
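These measures correspond to LLAMA's evaluation functions. A simplified sketch that averages over all instances at once rather than per test fold first, assuming a trained `model` and the `folds` data frame from the sketches above:

```r
library(llama)

mean(successes(folds, model))                  # fraction of instances solved by the selected solvers
mean(parscores(folds, model))                  # mean PAR10 score of the predictions
mean(misclassificationPenalties(folds, model)) # mean misclassification penalty
```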