.. h2o-docs/src/product/automl.rst

H2O AutoML: Automatic machine learning
======================================

In recent years, the demand for machine learning experts has outpaced supply, despite a surge of people entering the field. To address this gap, significant progress has been made in developing user-friendly machine learning software that non-experts can use. The initial steps toward simplifying machine learning involved creating simple, unified interfaces for a variety of machine learning algorithms, such as H2O.

Although H2O has made it easier for non-experts to experiment with machine learning, a fair bit of knowledge and background in data science is still required to produce high-performing models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly. To make machine learning software truly accessible to non-experts, we have designed an easy-to-use interface that automates the process of training a large selection of candidate models. H2O's AutoML is also a helpful tool for advanced users: it provides a simple wrapper function that performs many modeling-related tasks that would typically require many lines of code, freeing up time to focus on other data science tasks such as data preprocessing, feature engineering, and model deployment.

H2O's AutoML can be used to automate the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.

H2O offers a number of `model explainability <http://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html>`__ methods that apply to AutoML objects (groups of models), as well as to individual models (e.g. the leader model). Explanations can be generated automatically with a single function call, providing a simple interface for exploring and explaining the AutoML models.

AutoML interface
----------------

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column, and optionally specify a time constraint or limit on the number of total models trained. Below are the parameters that can be set by the user in the R and Python interfaces. See the `Web UI via H2O Wave <#web-ui-via-h2o-wave>`__ section below for information on how to use the H2O Wave web interface for AutoML.

In both the R and Python APIs, AutoML uses the same data-related arguments (``x``, ``y``, ``training_frame``, ``validation_frame``) as the other H2O algorithms. Most of the time, all you'll need to do is specify the data arguments. You can then configure values for ``max_runtime_secs`` and/or ``max_models`` to set explicit time or number-of-model limits on your run.

Required parameters
~~~~~~~~~~~~~~~~~~~

Required data parameters
''''''''''''''''''''''''

- `y <data-science/algo-params/y.html>`__: This argument is the name (or index) of the response column.
- `training_frame <data-science/algo-params/training_frame.html>`__: Specifies the training set.
Required stopping parameters
''''''''''''''''''''''''''''

One of the following stopping strategies (time or number-of-model based) must be specified. When both options are set, the AutoML run will stop as soon as it reaches either of these limits.

- `max_runtime_secs <data-science/algo-params/max_runtime_secs.html>`__: This argument specifies the maximum time that the AutoML process will run for. The default is 0 (no limit); however, if neither ``max_runtime_secs`` nor ``max_models`` is specified by the user, this value dynamically defaults to 1 hour.

- `max_models <data-science/algo-params/max_models.html>`__: Specify the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models. Defaults to ``NULL/None``. Always set this parameter to ensure AutoML reproducibility: all models are then trained until convergence and none is constrained by a time budget.

Optional parameters
~~~~~~~~~~~~~~~~~~~

Optional data parameters
''''''''''''''''''''''''

- `x <data-science/algo-params/x.html>`__: A list/vector of predictor column names or indexes. This argument only needs to be specified if the user wants to exclude columns from the set of predictors. If all columns (other than the response) should be used in prediction, then this does not need to be set.
- `weights_column <data-science/algo-params/weights_column.html>`__: Specifies a column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.

Optional miscellaneous parameters
'''''''''''''''''''''''''''''''''

- `nfolds <data-science/algo-params/nfolds.html>`__: Specify a value >= 2 for the number of folds for k-fold cross-validation of the models in the AutoML run, or specify "-1" to let AutoML choose whether k-fold cross-validation or blending mode should be used. Blending mode will use part of ``training_frame`` (if no ``blending_frame`` is provided) to train Stacked Ensembles. Use 0 to disable cross-validation; this will also disable Stacked Ensembles (thus decreasing the overall best model performance). This value defaults to "-1".

If the user turns off cross-validation by setting ``nfolds == 0``, then cross-validation metrics will not be available to populate the leaderboard. In this case, we need to make sure there is a holdout frame (i.e. the "leaderboard frame") to score the models on so that we can generate model performance metrics for the leaderboard. Without cross-validation, we will also require a validation frame to be used for early stopping on the models. Therefore, if either of these frames is not provided by the user, it will be automatically partitioned from the training data. If either frame is missing, 10% of the training data will be used to create the missing frame (if both are missing, then a total of 20% of the training data will be used: a 10% validation frame and a 10% leaderboard frame).

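The automatic partitioning described above can be sketched as follows (a hypothetical helper for illustration; H2O performs this split internally):

.. code-block:: python

    def auto_partition(has_validation, has_leaderboard):
        """Fractions of the training data carved off when nfolds == 0.

        Each missing frame (validation, leaderboard) takes 10% of the
        training data; whatever remains is used for training.
        """
        validation = 0.0 if has_validation else 0.1
        leaderboard = 0.0 if has_leaderboard else 0.1
        return {"train": 1.0 - validation - leaderboard,
                "validation": validation,
                "leaderboard": leaderboard}

    # Both frames missing: 10% validation + 10% leaderboard = 20% carved off.
    auto_partition(has_validation=False, has_leaderboard=False)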
XGBoost memory requirements
'''''''''''''''''''''''''''

XGBoost, which is included in H2O as a third-party library, requires its own memory outside the H2O (Java) cluster. When running AutoML with XGBoost (it is included by default), make sure you allow H2O no more than 2/3 of the total available RAM. Example: If you have 60G RAM, use ``h2o.init(max_mem_size = "40G")``, leaving 20G for XGBoost.

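The two-thirds rule of thumb can be expressed as a small helper (the function is our own illustration, not part of the H2O API):

.. code-block:: python

    def h2o_max_mem_gb(total_ram_gb):
        """Cap the H2O (Java) cluster at two-thirds of total RAM,
        leaving the remaining third for XGBoost's off-heap memory."""
        return int(total_ram_gb * 2 // 3)

    # For a 60G machine this suggests h2o.init(max_mem_size="40G").
    print(f'h2o.init(max_mem_size="{h2o_max_mem_gb(60)}G")')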
Scikit-learn compatibility
''''''''''''''''''''''''''

``H2OAutoML`` can interact with the ``h2o.sklearn`` module. The ``h2o.sklearn`` module exposes two wrappers for ``H2OAutoML`` (``H2OAutoMLClassifier`` and ``H2OAutoMLRegressor``), which expose the standard API familiar to ``sklearn`` users: ``fit``, ``predict``, ``fit_predict``, ``score``, ``get_params``, and ``set_params``. These wrappers accept various input formats (H2OFrame, ``numpy`` array, ``pandas`` DataFrame), which allows them to be combined with pure ``sklearn`` components in pipelines. For an example using ``H2OAutoML`` with the ``h2o.sklearn`` module, click `here <https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/sklearn-integration/H2OAutoML_as_sklearn_estimator.ipynb>`__.

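To illustrate the estimator protocol these wrappers follow, here is a toy estimator with the same method surface (a stand-in for illustration only, not the actual H2O wrapper):

.. code-block:: python

    class ToyMajorityClassifier:
        """Minimal estimator following the sklearn-style protocol the H2O
        wrappers expose: fit / predict / score / get_params / set_params."""

        def __init__(self, default_label=0):
            self.default_label = default_label

        def fit(self, X, y):
            # "Train" by memorizing the most frequent training label.
            self.label_ = max(set(y), key=list(y).count) if len(y) else self.default_label
            return self

        def predict(self, X):
            return [self.label_] * len(X)

        def score(self, X, y):
            preds = self.predict(X)
            return sum(p == t for p, t in zip(preds, y)) / len(y)

        def get_params(self, deep=True):
            return {"default_label": self.default_label}

        def set_params(self, **params):
            for key, value in params.items():
                setattr(self, key, value)
            return self

Because the H2O wrappers follow this same protocol, they can be dropped into ``sklearn`` pipelines alongside pure ``sklearn`` transformers.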
Explainability
--------------

AutoML objects are fully supported through the `H2O Model Explainability <http://docs.h2o.ai/h2o/latest-stable/h2o-docs/explain.html>`__ interface. A large number of multi-model comparison and single-model (AutoML leader) plots can be generated automatically with a single call to ``h2o.explain()``. We invite you to learn more at the page linked above.

Code examples
-------------

Training
~~~~~~~~

Using the previous code example, you can generate test set predictions as follows:

.. code-block:: python

    preds = aml.leader.predict(test)

AutoML output
-------------

Leaderboard
~~~~~~~~~~~

Here is an example of a leaderboard (with all columns) for a binary classification task:

To create a leaderboard with metrics from a new ``leaderboard_frame``, `h2o.make_leaderboard <performance-and-prediction.html#leaderboard>`__ can be used.

Examine models
~~~~~~~~~~~~~~

To examine the trained models more closely, you can interact with them either by model ID or through a convenience function that grabs the best model of each model type (ranked by the default metric, or by a metric of your choosing).

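The "best model of each type" idea can be sketched over leaderboard-like rows (hypothetical data and helper, pure Python):

.. code-block:: python

    def best_per_algo(rows, metric="auc"):
        """Given leaderboard-like rows, keep the highest-scoring model per algorithm."""
        best = {}
        for row in rows:
            algo = row["algo"]
            if algo not in best or row[metric] > best[algo][metric]:
                best[algo] = row
        return best

    leaderboard = [
        {"model_id": "GBM_1", "algo": "GBM", "auc": 0.78},
        {"model_id": "GBM_2", "algo": "GBM", "auc": 0.81},
        {"model_id": "XGBoost_1", "algo": "XGBoost", "auc": 0.80},
    ]
    best_per_algo(leaderboard)["GBM"]["model_id"]  # "GBM_2"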
Once you have retrieved the model in R or Python, you can inspect the model parameters. For example, in Python:

.. code-block:: python

    xgb.params['ntrees']

AutoML log
~~~~~~~~~~

When using Python or R clients, you can also access meta information with the following AutoML object properties:
Below are a few screenshots of the app, though more visualizations are available.

Experimental features
---------------------

Preprocessing
~~~~~~~~~~~~~

Information about how to cite the H2O software in general is covered in the H2O documentation.

We would love to hear how you've used H2O AutoML, so if you have a paper that references it, please let us know by opening an issue or submitting a PR to the `Awesome H2O repo <https://github.com/h2oai/awesome-h2o#research-papers>`__ on GitHub. This is where we keep track of papers that use H2O AutoML, and H2O generally.

Random grid search parameters
-----------------------------

AutoML performs a hyperparameter search over a variety of H2O algorithms in order to deliver the best model. In the table below, we list the hyperparameters, along with all potential values that can be randomly chosen in the search. If these models also have a non-default value set for a hyperparameter, we identify it in the list as well. Random Forest and Extremely Randomized Trees are not grid searched (in the current version of AutoML), so they are not included in the list below.
**Note**: AutoML does not run a standard grid search for GLM (returning all the possible models). Instead, AutoML builds a single model with ``lambda_search`` enabled and passes a list of ``alpha`` values. It returns only the model with the best alpha-lambda combination, rather than one model for each combination.

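The GLM selection behavior can be sketched as choosing the single best (alpha, lambda) pair from the scored combinations (illustrative scores, not real AutoML output):

.. code-block:: python

    def best_glm_config(scores):
        """scores: {(alpha, lambda): validation_error}; lower is better.
        Returns the single best combination, mirroring how AutoML keeps
        only one GLM rather than one model per combination."""
        return min(scores, key=scores.get)

    scores = {
        (0.0, 0.01): 0.31,
        (0.5, 0.001): 0.27,
        (1.0, 0.01): 0.29,
    }
    best_glm_config(scores)  # (0.5, 0.001)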
GLM hyperparameters
~~~~~~~~~~~~~~~~~~~

This table shows the GLM values that are searched over when performing AutoML grid search. Additional information is available `here <https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/modeling/GLMStepsProvider.java>`__.
XGBoost hyperparameters
~~~~~~~~~~~~~~~~~~~~~~~

This table shows the XGBoost values that are searched over when performing AutoML grid search. Additional information is available `here <https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/modeling/XGBoostSteps.java>`__.
GBM hyperparameters
~~~~~~~~~~~~~~~~~~~

This table shows the GBM values that are searched over when performing AutoML grid search. Additional information is available `here <https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/modeling/GBMStepsProvider.java>`__.

Deep Learning hyperparameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This table shows the Deep Learning values that are searched over when performing AutoML grid search. Additional information is available `here <https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/modeling/DeepLearningStepsProvider.java>`__.