docs/choosing-an-estimator.md (2 additions & 2 deletions)
@@ -44,7 +44,7 @@ Regressors are a type of supervised learner that predict a continuous-valued out
|[SVR](regressors/svr.md)| High |||| Continuous |
## Clusterers
-
Clusterers are unsupervised learners that predict an integer-valued cluster number such as `0`, `1`, `...`, `n`. They are similar to classifiers, however since they lack a supervised training signal, they cannot be used to recognize or describe samples. Instead, clusterers differentiate and group samples using only the samples in a dataset. Clusterers that implement the [Probabilistic](probabilistic.md) interface can also output the probabilities that a sample belongs to a particular cluster.
+
Clusterers are unsupervised learners that predict an integer-valued cluster number such as `0`, `1`, `...`, `n`. They are similar to classifiers, however since they lack a supervised training signal, they cannot be used to recognize or describe samples. Instead, clusterers differentiate and group samples using only the samples in a dataset.
| Clusterer | Flexibility |[Proba](probabilistic.md)|[Online](online.md)|[Verbose](verbose.md)| Data Compatibility |
|---|---|---|---|---|---|
@@ -67,7 +67,7 @@ Anomaly Detectors are unsupervised learners that predict whether a sample should
|[Robust Z-Score](anomaly-detectors/robust-z-score.md)| Global | ● ||| Continuous |
## Model Flexibility Tradeoff
-
A characteristic of most estimator types is the notion of *flexibility*. Flexibility can be expressed in different ways but greater flexibility usually comes with the capacity to handle more complex tasks. The tradeoff for flexibility is increased computational complexity, reduced model interpretability, and greater susceptibility to [overfitting](cross-validation.md#overfitting). In contrast, inflexible models tend to be easier to interpret and quicker to train but are more prone to [underfitting](cross-validation.md#underfitting). In general, we recommend choosing the simplest model for your project that does not underfit the training data.
+
A characteristic of most estimator types is the notion of *flexibility*. Flexibility can be expressed in different ways but greater flexibility usually comes with the capacity to handle more complex tasks. The tradeoff for flexibility is increased computational complexity, reduced model interpretability, and greater susceptibility to [overfitting](cross-validation.md#overfitting). In contrast, low flexibility models tend to be easier to interpret and quicker to train but are more prone to [underfitting](cross-validation.md#underfitting). In general, we recommend choosing the simplest model that does not underfit the training data for your project.
## Meta-estimator Ensembles
Ensemble learning is when multiple estimators are used together to make the final prediction on a sample. Meta-estimator ensembles can consist of multiple variations of the same estimator or a heterogeneous mix of estimators of the same type. They generally work by the principle of averaging and can often achieve greater accuracy than a single estimator.
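
The averaging principle can be sketched in plain PHP. This is only an illustration and not the library's ensemble implementation: the function `averagePrediction()` and the three closures standing in for trained regressors are hypothetical names introduced here.

```php
/**
 * Average the predictions of several base regressors for one sample.
 * Each "estimator" is a plain callable standing in for a trained model
 * (an assumption made for this sketch).
 *
 * @param callable[] $estimators
 * @param float[] $sample
 */
function averagePrediction(array $estimators, array $sample): float
{
    // Collect one prediction per base estimator.
    $predictions = array_map(
        fn (callable $estimator): float => $estimator($sample),
        $estimators
    );

    // The ensemble's output is the arithmetic mean of the predictions.
    return array_sum($predictions) / count($predictions);
}

// Three hypothetical base models that disagree slightly.
$estimators = [
    fn (array $x): float => 2.0 * $x[0] + 1.0,
    fn (array $x): float => 2.5 * $x[0],
    fn (array $x): float => 1.5 * $x[0] + 2.0,
];

$prediction = averagePrediction($estimators, [4.0]);
```

Because the individual errors of the base models partly cancel out when averaged, the combined prediction tends to vary less than that of any single model.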
Sometimes, we might just want to transform a single column of the dataset. In the example below we use the `transformColumn()` method on the dataset to log transform a specified column.
+
Sometimes, we just want to transform a single column of the dataset. In the example below, we use the `transformColumn()` method on the dataset object to perform a log transformation to a specified column offset by passing it a callback function to apply to each value in the column.
```php
$dataset->transformColumn(6, 'log1p');
```
+
In the next example, we'll convert the `null` values of another column to a special placeholder class `?`.

```php
$dataset->transformColumn(9, function ($value) {
    return $value === null ? '?' : $value;
});
```

## Standardization and Normalization
Oftentimes, the continuous features of a dataset will be on different scales because they were measured by different methods. For example, age (0 - 100) and income (0 - 9,999,999) are on two widely different scales. Standardization is the process of transforming a dataset such that the features are all on one common scale. Normalization is the special case where the transformed features have a range between 0 and 1. Depending on the transformer, it may operate on the columns or the rows of the dataset.
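
To make the distinction concrete, here is a minimal plain-PHP sketch of both ideas applied to a single column. It does not use the library's transformers. The helper names `minMaxNormalize()` and `zScoreStandardize()` are hypothetical, and z-scoring is just one common way to standardize.

```php
/**
 * Min-max normalization: rescale a column to the range [0, 1].
 */
function minMaxNormalize(array $column): array
{
    $min = min($column);
    $range = max($column) - $min;

    return array_map(
        // A constant column has no spread, so map every value to 0.0.
        fn ($value): float => $range > 0 ? ($value - $min) / $range : 0.0,
        $column
    );
}

/**
 * Z-score standardization: center the column on 0 and scale it
 * by its (population) standard deviation.
 */
function zScoreStandardize(array $column): array
{
    $n = count($column);
    $mean = array_sum($column) / $n;

    $variance = array_sum(array_map(
        fn ($value): float => ($value - $mean) ** 2,
        $column
    )) / $n;

    $stdDev = sqrt($variance);

    return array_map(
        fn ($value): float => $stdDev > 0 ? ($value - $mean) / $stdDev : 0.0,
        $column
    );
}

$ages = [20, 40, 60, 100];

$normalized = minMaxNormalize($ages);     // values now lie between 0 and 1
$standardized = zScoreStandardize($ages); // values now have mean 0
```

After either transformation, a column such as age becomes comparable in magnitude to a column such as income that has been transformed the same way.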
|[One Hot Encoder](transformers/one-hot-encoder.md)| Categorical | Continuous | ● ||
@@ -78,7 +86,7 @@ Feature converters are transformers that convert feature columns of one data typ
## Dimensionality Reduction
Dimensionality reduction is a preprocessing technique for embedding a dataset into a lower dimensional vector space. It allows a learner to train and infer quicker by producing a dataset with fewer but more informative features.
@@ -88,24 +96,24 @@ Dimensionality reduction is a preprocessing technique for embedding a dataset in
## Feature Selection
Like dimensionality reduction, feature selection aims to reduce the number of features in a dataset. However, feature selection seeks to keep the best features as-is and drop the less informative ones entirely. Adding feature selection can help speed up training and inference by creating a more parsimonious model. It can also improve the performance of the model by removing *noise* features and features that are uncorrelated with the outcome.
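
As a rough illustration of keeping the best features as-is, the plain-PHP sketch below ranks columns by variance and keeps the top `$k`. Variance is used here only as a simple stand-in for "informativeness", and `selectTopVarianceFeatures()` is a hypothetical helper, not one of the library's feature selectors.

```php
/**
 * Keep the $k columns with the highest variance and drop the rest.
 * Assumes every sample is a list of continuous values of equal length.
 */
function selectTopVarianceFeatures(array $samples, int $k): array
{
    $numColumns = count($samples[0]);
    $variances = [];

    for ($column = 0; $column < $numColumns; $column++) {
        $values = array_column($samples, $column);
        $mean = array_sum($values) / count($values);

        $variances[$column] = array_sum(array_map(
            fn ($value): float => ($value - $mean) ** 2,
            $values
        )) / count($values);
    }

    // Rank column offsets by variance (descending), keep the best k,
    // then restore the original column order.
    arsort($variances);
    $keep = array_slice(array_keys($variances), 0, $k);
    sort($keep);

    // Selected columns are copied unchanged; the others are dropped.
    return array_map(
        fn (array $sample): array => array_map(fn (int $column) => $sample[$column], $keep),
        $samples
    );
}

$samples = [
    [1.0, 5.0, 10.0],
    [1.0, 6.0, 20.0],
    [1.0, 7.0, 30.0],
];

// The first column is constant, so it carries no information and is dropped.
$selected = selectTopVarianceFeatures($samples, 2);
```

Unlike dimensionality reduction, the surviving columns keep their original values and meaning, which helps preserve interpretability.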
A technique for handling missing values in your dataset is a preprocessing step called *imputation*. Imputation is the process of replacing missing values with a pretty good guess.
|[KNN Imputer](transformers/knn-imputer.md)|Depends on distance kernel| ● ||
+
|[Missing Data Imputer](transformers/missing-data-imputer.md)|Categorical, Continuous| ● ||
+
|[Random Hot Deck Imputer](transformers/random-hot-deck-imputer.md)|Depends on distance kernel| ● ||
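
The simplest form of imputation, replacing a missing continuous value with the column mean, can be sketched in plain PHP as follows. This is an illustration only and not how the imputers listed above are implemented. It assumes missing continuous values are encoded as `NAN`, and `imputeColumnWithMean()` is a hypothetical helper.

```php
/**
 * Replace NaN values in one continuous column with the mean of the
 * observed (non-missing) values. Assumes at least one value is observed.
 */
function imputeColumnWithMean(array $samples, int $column): array
{
    // Gather the values that are actually present in the column.
    $observed = array_filter(
        array_column($samples, $column),
        fn ($value): bool => !is_nan((float) $value)
    );

    $mean = array_sum($observed) / count($observed);

    // Fill each gap with the guess. $samples is a local copy,
    // so the caller's array is left unchanged.
    foreach ($samples as &$sample) {
        if (is_nan((float) $sample[$column])) {
            $sample[$column] = $mean;
        }
    }
    unset($sample);

    return $samples;
}

$samples = [
    [25.0, 50000.0],
    [NAN, 62000.0],
    [35.0, 58000.0],
];

$imputed = imputeColumnWithMean($samples, 0);
```

Mean imputation is a global guess. Approaches that borrow values from similar samples can produce guesses that better reflect local structure in the data.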
## Text Transformers
-
The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) such as those for text cleaning, normalization, and feature extraction from raw text blobs.
+
The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) tasks such as those for text cleaning, normalization, and feature extraction from raw text blobs.
[Pipeline](pipeline.md) meta-estimators help you automate a series of transformations. In addition, Pipeline objects are [Persistable](persistable.md) allowing you to save and load transformer fittings between processes. Whenever a dataset object is passed to a learner wrapped in a Pipeline, it will automatically be fitted and/or transformed before it arrives in the learner's context.
+
[Pipeline](pipeline.md) meta-estimators help you automate a series of transformations applied to the input dataset of an estimator. In addition, Pipeline objects are [Persistable](persistable.md) allowing you to save and load the transformer fittings between processes. Whenever a dataset object is passed to a learner wrapped in a Pipeline, depending on the operation, it will automatically be fitted and/or transformed before it arrives in the estimator's context.
Let's apply the same 3 transformers as in the example above by passing the transformer instances in the order we want them applied along with a base estimator to the constructor of Pipeline like in the example below.
@@ -242,4 +250,4 @@ If you ever want to preprocess a dataset and then save it for later you can do s