Algorithm selection benchmarks experimental setup
Disclaimer: While we have structured the experimental setup and the evaluation procedure in a clean and reasonable manner - both from an algorithm selection and a machine learning perspective - they are meant as a guide. The results are not meant to demonstrate the state of the art or the best possible results. Much can be varied and probably improved, both in the preprocessing and in the modelling itself. The repository now offers a structured scientific basis for evaluating exactly these kinds of questions.
The experiments use the LLAMA algorithm selection system and the mlr machine learning toolbox.
The data in each algorithm selection task is preprocessed as follows.
- For repeated feature calculations, the mean value is taken across all repetitions.
- Completely constant features are removed, as nothing can be learned from them.
- NA feature values are imputed by the mean value across all non-NA values for the feature. Note that this is a naive approach that can likely be improved upon (see the sketch below this list).
- Any tasks that have repeated performance measurements are discarded. This currently applies to no tasks. (This will soon be handled in a better manner, and example tasks where this occurs will be included.)
- If the performance measure is runtime and the feature computation presolves the instance, the cost of all algorithms is set to the cost of all feature steps up to the presolving one (note that feature steps are specified in their order of execution). That means that the individual costs of all algorithms are discarded for this instance and they all incur the same cost.
- If the performance measure is runtime, the cost of feature computation is added to the algorithms' performances, unless we are considering only the baselines which do not incur feature costs.
- The success for each algorithm on each problem instance is computed by checking whether the run status was OK and the performance value is not NA. If the performance measure is runtime and a cutoff value has been specified, any algorithm runs that exceed the cutoff when taking the cost of feature computation into account are marked as not successful.
- The performance values for all not successful algorithm runs are imputed: by the cutoff value if one is present and the performance metric is runtime, and by ten times the maximum performance otherwise.
- The algorithm selection task data structure is converted to a LLAMA data frame.
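As an illustration of the first three preprocessing steps, here is a minimal base-R sketch. All names (`raw`, `feats`, the feature columns) are hypothetical and only serve to show the idea; the actual preprocessing code in the repository may be organised differently.

```r
# Hypothetical raw feature table: one row per (instance, repetition);
# feature columns may contain NAs.
raw <- data.frame(
  instance_id = c("i1", "i1", "i2", "i2", "i3", "i3"),
  repetition  = c(1, 2, 1, 2, 1, 2),
  f1 = c(1.0, 1.2, 1.8, 2.0, 0.4, 0.6),
  f2 = c(5, 5, 5, 5, 5, 5),           # completely constant feature
  f3 = c(0.3, 0.5, 0.7, 0.9, NA, NA)  # missing for all repetitions of i3
)

# 1. Take the mean over repetitions for each instance.
feats <- aggregate(cbind(f1, f2, f3) ~ instance_id, data = raw,
                   FUN = mean, na.rm = TRUE, na.action = na.pass)

# 2. Remove completely constant features.
feat.cols <- setdiff(names(feats), "instance_id")
constant <- sapply(feats[feat.cols],
                   function(col) length(unique(na.omit(col))) <= 1)
feats <- feats[, c("instance_id", feat.cols[!constant]), drop = FALSE]

# 3. Impute remaining missing values by the mean of the non-NA values of that feature.
for (col in setdiff(names(feats), "instance_id")) {
  v <- feats[[col]]
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  feats[[col]] <- v
}
```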
The cross-validation splits defined in the respective tasks are preserved in the LLAMA data frame. That is, the model building functions will use these splits for training and testing.
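The conversion itself can be sketched with LLAMA's `input()` function. This is an assumption about the setup, not the repository's actual code: the toy tables below are made up, the exact arguments of `input()` may differ between LLAMA versions, and in the repository the predefined splits of the task are attached rather than freshly generated folds.

```r
library(llama)

set.seed(1)
n <- 10  # number of toy instances
ids <- paste0("i", 1:n)

# Hypothetical, already preprocessed feature and performance tables; feature and
# performance data are linked through their common column ("instance_id").
feats <- data.frame(instance_id = ids, f1 = runif(n), f3 = runif(n))
perf  <- data.frame(instance_id = ids,
                    algoA = runif(n, 1, 3600), algoB = runif(n, 1, 3600))
succ  <- data.frame(instance_id = ids,
                    algoA = perf$algoA < 3000, algoB = perf$algoB < 3000)

# Convert to a LLAMA data frame; minimize = TRUE because runtime is to be minimized.
ldf <- input(features = feats, performances = perf,
             successes = succ, minimize = TRUE)

# The repository attaches the cross-validation splits defined in the task;
# cvFolds() is used here only because this toy example has no predefined splits.
folds <- cvFolds(ldf, nfolds = 10L)
```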
We use the following classification, regression and clustering approaches in our experiments.
- classification (from Weka)
  - meta/AdaBoostM1
  - bayes/BayesNet
  - lazy/IBk
  - rules/OneR
  - trees/RandomTree
  - trees/J48
  - rules/JRip
- classification (from mlr)
  - classif.ctree
  - classif.ksvm
  - classif.naiveBayes
  - classif.randomForest
  - classif.rpart
- regression (from mlr)
  - regr.earth
  - regr.lm
  - regr.randomForest
  - regr.rpart
- clustering (from Weka)
  - EM
  - FarthestFirst
  - SimpleKMeans
The parameters of all algorithms were left at their default values. (Note that this is a somewhat dubious approach for models like SVMs. Our experimental setup will later include the necessary automatic hyperparameter tuning for such models.) For the clustering algorithms, the (maximum) number of clusters was set to the number of algorithms in the respective task.
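For concreteness, here is a hedged sketch of how two of the mlr-based models might be trained through LLAMA, using the `folds` object from the sketch above. The argument names reflect recent LLAMA/mlr versions and may differ from the ones used for the experiments; how the Weka classifiers and clusterers are invoked is not shown here.

```r
library(llama)
library(mlr)

# 'folds' is assumed to be a LLAMA data frame with attached cross-validation splits.

# Classification: predict the best algorithm for each instance directly.
model.rf <- classify(classifier = makeLearner("classif.randomForest"), data = folds)

# Regression: predict each algorithm's performance and select the best predicted one.
model.lm <- regression(regressor = makeLearner("regr.lm"), data = folds)
```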
For comparison purposes, we use the following baseline algorithms.
- virtual best solver
- single best solver by number of times it had the best performance
- single best solver by PAR10 score
- single best solver by number of successful solver runs
The baseline algorithms are not cross-validated, which gives an optimistic performance estimate for the single best solvers.
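These baselines correspond to functions provided by LLAMA. A minimal sketch, assuming the function names of recent LLAMA versions and the `folds` object from above:

```r
library(llama)

vbs.pred     <- vbs(folds)                    # virtual best solver
sb.count     <- singleBestByCount(folds)      # best performance most often
sb.par       <- singleBestByPar(folds)        # best PAR10 score
sb.successes <- singleBestBySuccesses(folds)  # most successful runs
```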
The model building functions in LLAMA were also used with their default parameters. For example, the regression model builder allows stacking multiple models; this was not used.
In all cases, LLAMA's normalize() function was used to normalize the feature vectors before passing them to the machine learning models.
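In LLAMA this normalization is done by passing `normalize` as the `pre` argument of the model-building functions. A minimal sketch, assuming the `pre` argument behaves as in current LLAMA versions:

```r
library(llama)
library(mlr)

# Normalize the feature vectors before they are passed to the learner.
model <- classify(classifier = makeLearner("classif.rpart"),
                  data = folds, pre = normalize)
```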
For each trained model, we record the mean number of successful solver runs in the predictions, the mean PAR10 score of the predictions, and the mean misclassification penalty of the predictions. The averaging is done over all predictions of a given test set in the cross-validation and then across all test sets (usually 10, as we normally evaluate with 10-fold cross-validation).
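These measures correspond to LLAMA's evaluation functions. A simplified sketch that averages over all instances at once rather than per test fold first, assuming a trained `model` and the `folds` data frame from the sketches above:

```r
library(llama)

mean(successes(folds, model))                  # fraction of instances solved by the selected solvers
mean(parscores(folds, model))                  # mean PAR10 score of the predictions
mean(misclassificationPenalties(folds, model)) # mean misclassification penalty
```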