I noticed that you're reporting logloss as the evaluation metric, but you're not passing this information to any of the AutoML systems. Both auto-sklearn and H2O AutoML (maybe MLJar too?) can optimize for, and choose a leader model based on, the metric you want to evaluate, so the metric should be specified explicitly in a benchmark.
- H2O AutoML has two parameters that should be set when evaluating on a non-default metric: `stopping_metric` and `sort_metric`. Both should be set to "logloss". More info here. By default, on binary classification problems H2O AutoML optimizes for AUC unless you change it to logloss.
- Auto-sklearn also has a `metric` argument, which should be set to log loss. More info here.
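
For reference, here is a rough sketch of what I mean, not a drop-in patch for the benchmark. The helper names (`h2o_automl_kwargs`, `build_h2o_automl`, `build_autosklearn`) and the one-hour time budget are my own placeholders. I'm also assuming a recent auto-sklearn, where `metric` is a constructor argument that takes a scorer object such as `autosklearn.metrics.log_loss`; older releases took `metric` in `fit()` instead, so adjust for the version you benchmark.

```python
# Metric the benchmark reports; each framework is told to optimize for it.
BENCHMARK_METRIC = "logloss"


def h2o_automl_kwargs(max_runtime_secs):
    """Keyword arguments for H2OAutoML (hypothetical helper)."""
    return {
        "max_runtime_secs": max_runtime_secs,
        # Metric used for early stopping of individual models and the run.
        "stopping_metric": BENCHMARK_METRIC,
        # Metric used to rank the leaderboard and therefore pick the leader.
        "sort_metric": BENCHMARK_METRIC,
    }


def build_h2o_automl(max_runtime_secs=3600):
    """Create an H2OAutoML run that stops and ranks on logloss."""
    # Imported lazily so this sketch can be loaded without h2o installed.
    from h2o.automl import H2OAutoML

    return H2OAutoML(**h2o_automl_kwargs(max_runtime_secs))


def build_autosklearn(time_left_for_this_task=3600):
    """Create an auto-sklearn classifier that optimizes log loss.

    Assumes a recent auto-sklearn where `metric` is a constructor argument.
    """
    # Imported lazily so this sketch can be loaded without auto-sklearn.
    import autosklearn.classification
    import autosklearn.metrics

    return autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_left_for_this_task,
        metric=autosklearn.metrics.log_loss,
    )
```

Keeping the metric in one constant makes it harder for the reported metric and the optimized metric to drift apart if the benchmark later switches to something else.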