Commit 67c6222

Improve the docs
1 parent 3442129 commit 67c6222

2 files changed

Lines changed: 32 additions & 24 deletions

docs/choosing-an-estimator.md

Lines changed: 2 additions & 2 deletions
@@ -44,7 +44,7 @@ Regressors are a type of supervised learner that predict a continuous-valued out
4444
| [SVR](regressors/svr.md) | High | | | | Continuous |
4545

4646
## Clusterers
47-
Clusterers are unsupervised learners that predict an integer-valued cluster number such as `0`, `1`, `...`, `n`. They are similar to classifiers, however since they lack a supervised training signal, they cannot be used to recognize or describe samples. Instead, clusterers differentiate and group samples using only the samples in a dataset. Clusterers that implement the [Probabilistic](probabilistic.md) interface can also output the probabilities that a sample belongs to a particular cluster.
47+
Clusterers are unsupervised learners that predict an integer-valued cluster number such as `0`, `1`, `...`, `n`. They are similar to classifiers, however since they lack a supervised training signal, they cannot be used to recognize or describe samples. Instead, clusterers differentiate and group samples using only the samples in a dataset.
4848

4949
| Clusterer | Flexibility | [Proba](probabilistic.md) | [Online](online.md) | [Verbose](verbose.md) | Data Compatibility |
5050
|---|---|---|---|---|---|
@@ -67,7 +67,7 @@ Anomaly Detectors are unsupervised learners that predict whether a sample should
6767
| [Robust Z-Score](anomaly-detectors/robust-z-score.md) | Global || | | Continuous |
6868

6969
## Model Flexibility Tradeoff
70-
A characteristic of most estimator types is the notion of *flexibility*. Flexibility can be expressed in different ways but greater flexibility usually comes with the capacity to handle more complex tasks. The tradeoff for flexibility is increased computational complexity, reduced model interpretability, and greater susceptibility to [overfitting](cross-validation.md#overfitting). In contrast, inflexible models tend to be easier to interpret and quicker to train but are more prone to [underfitting](cross-validation.md#underfitting). In general, we recommend choosing the simplest model for your project that does not underfit the training data.
70+
A characteristic of most estimator types is the notion of *flexibility*. Flexibility can be expressed in different ways but greater flexibility usually comes with the capacity to handle more complex tasks. The tradeoff for flexibility is increased computational complexity, reduced model interpretability, and greater susceptibility to [overfitting](cross-validation.md#overfitting). In contrast, low flexibility models tend to be easier to interpret and quicker to train but are more prone to [underfitting](cross-validation.md#underfitting). In general, we recommend choosing the simplest model that does not underfit the training data for your project.
7171

7272
## Meta-estimator Ensembles
7373
Ensemble learning is when multiple estimators are used together to make the final prediction on a sample. Meta-estimator ensembles can consist of multiple variations of the same estimator or a heterogeneous mix of estimators of the same type. They generally work by the principle of averaging and can often achieve greater accuracy than a single estimator.
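
The averaging principle mentioned in the ensembles paragraph can be sketched in plain PHP. This is a minimal illustration under stated assumptions, not the library's implementation: the `averagePrediction()` helper and the three closures standing in for trained regressors are hypothetical names introduced here and are not part of the documented API.

```php
/**
 * Average the predictions of several regressors for a single sample.
 * Each "estimator" is a plain callable standing in for a trained model
 * (hypothetical, for illustration only).
 *
 * @param callable[] $estimators
 * @param array<int, int|float> $sample
 */
function averagePrediction(array $estimators, array $sample): float
{
    if (count($estimators) === 0) {
        throw new InvalidArgumentException('At least one estimator is required.');
    }

    // Ask every ensemble member for its own prediction.
    $predictions = array_map(
        fn (callable $estimator): float => (float) $estimator($sample),
        $estimators
    );

    // The ensemble output is the arithmetic mean of the member predictions.
    return array_sum($predictions) / count($predictions);
}

// Three hypothetical regressors that each err in a different direction.
$estimators = [
    fn (array $x): float => 2.0 * $x[0] + 1.0,
    fn (array $x): float => 2.0 * $x[0] - 1.0,
    fn (array $x): float => 2.0 * $x[0],
];

$prediction = averagePrediction($estimators, [3.0]);
```

In this toy setup, the individual errors of the first two members cancel out in the mean, which is the intuition behind why an ensemble can be more accurate than any single member.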

docs/preprocessing.md

Lines changed: 30 additions & 22 deletions
@@ -48,28 +48,36 @@ $transformer->update($dataset);
4848
```
4949

5050
## Transform a Single Column
51-
Sometimes, we might just want to transform a single column of the dataset. In the example below we use the `transformColumn()` method on the dataset to log transform a specified column.
51+
Sometimes, we just want to transform a single column of the dataset. In the example below, we use the `transformColumn()` method on the dataset object to perform a log transformation to a specified column offset by passing it a callback function to apply to each value in the column.
5252

5353
```php
5454
$dataset->transformColumn(6, 'log1p');
5555
```
5656

57+
In the next example, we'll convert the `null` values of another column to a special placeholder class `?`.
58+
59+
```php
60+
$dataset->transformColumn(9, function ($value) {
61+
return $value === null ? '?' : $value;
62+
});
63+
```
64+
5765
## Standardization and Normalization
5866
Oftentimes, the continuous features of a dataset will be on different scales because they were measured by different methods. For example, age (0 - 100) and income (0 - 9,999,999) are on two widely different scales. Standardization is the process of transforming a dataset such that the features are all on one common scale. Normalization is the special case where the transformed features have a range between 0 and 1. Depending on the transformer, it may operate on the columns or the rows of the dataset.
5967

60-
| Transformer | Operates On | Range | Stateful | Elastic |
68+
| Transformer | Operates | Output Range | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
6169
|---|---|---|---|---|
62-
| [L1 Normalizer](transformers/l1-normalizer.md) | Rows | [0, 1] | | |
63-
| [L2 Normalizer](transformers/l2-normalizer.md) | Rows | [0, 1] | | |
64-
| [Max Absolute Scaler](transformers/max-absolute-scaler.md) | Columns | [-1, 1] |||
65-
| [Min Max Normalizer](transformers/min-max-normalizer.md) | Columns | [min, max] |||
66-
| [Robust Standardizer](transformers/robust-standardizer.md) | Columns | [-∞, ∞] || |
67-
| [Z Scale Standardizer](transformers/z-scale-standardizer.md) | Columns | [-∞, ∞] |||
70+
| [L1 Normalizer](transformers/l1-normalizer.md) | Row-wise | [0, 1] | | |
71+
| [L2 Normalizer](transformers/l2-normalizer.md) | Row-wise | [0, 1] | | |
72+
| [Max Absolute Scaler](transformers/max-absolute-scaler.md) | Column-wise | [-1, 1] |||
73+
| [Min Max Normalizer](transformers/min-max-normalizer.md) | Column-wise | [min, max] |||
74+
| [Robust Standardizer](transformers/robust-standardizer.md) | Column-wise | [-∞, ∞] || |
75+
| [Z Scale Standardizer](transformers/z-scale-standardizer.md) | Column-wise | [-∞, ∞] |||
6876

6977
## Feature Conversion
7078
Feature converters are transformers that convert feature columns of one data type to another by changing their representation.
7179

72-
| Transformer | From | To | Stateful | Elastic |
80+
| Transformer | From | To | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
7381
|---|---|---|---|---|
7482
| [Interval Discretizer](transformers/interval-discretizer.md) | Continuous | Categorical || |
7583
| [One Hot Encoder](transformers/one-hot-encoder.md) | Categorical | Continuous || |
@@ -78,7 +86,7 @@ Feature converters are transformers that convert feature columns of one data typ
7886
## Dimensionality Reduction
7987
Dimensionality reduction is a preprocessing technique for embedding a dataset into a lower dimensional vector space. It allows a learner to train and infer quicker by producing a dataset with fewer but more informative features.
8088

81-
| Transformer | Supervised | Stateful | Elastic |
89+
| Transformer | Supervised | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
8290
|---|---|---|---|
8391
| [Gaussian Random Projector](transformers/gaussian-random-projector.md) | || |
8492
| [Linear Discriminant Analysis](transformers/linear-discriminant-analysis.md) ||| |
@@ -88,24 +96,24 @@ Dimensionality reduction is a preprocessing technique for embedding a dataset in
8896
## Feature Selection
8997
Similarly to dimensionality reduction, feature selection aims to reduce the number of features in a dataset, however, feature selection seeks to keep the best features as-is and drop the less informative ones entirely. Adding feature selection can help speed up training and inference by creating a more parsimonious model. It can also improve the performance of the model by removing *noise* features and features that are uncorrelated with the outcome.
9098

91-
| Transformer | Supervised | Stateful | Elastic |
99+
| Transformer | Supervised | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
92100
|---|---|---|---|
93101
| [K Best Feature Selector](transformers/k-best-feature-selector.md) ||| |
94102
| [Recursive Feature Eliminator](transformers/recursive-feature-eliminator.md) ||| |
95103

96104
## Imputation
97105
A technique for handling missing values in your dataset is a preprocessing step called *imputation*. Imputation is the process of replacing missing values with a pretty good guess.
98106

99-
| Transformer | Continuous | Categorical | Stateful | Elastic |
100-
|---|---|---|---|---|
101-
| [KNN Imputer](transformers/knn-imputer.md) | | || |
102-
| [Missing Data Imputer](transformers/missing-data-imputer.md) | | || |
103-
| [Random Hot Deck Imputer](transformers/random-hot-deck-imputer.md) | | || |
107+
| Transformer | Data Compatibility | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
108+
|---|---|---|---|
109+
| [KNN Imputer](transformers/knn-imputer.md) | Depends on distance kernel || |
110+
| [Missing Data Imputer](transformers/missing-data-imputer.md) | Categorical, Continuous || |
111+
| [Random Hot Deck Imputer](transformers/random-hot-deck-imputer.md) | Depends on distance kernel || |
104112

105113
## Text Transformers
106-
The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) such as those for text cleaning, normalization, and feature extraction from raw text blobs.
114+
The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) tasks such as those for text cleaning, normalization, and feature extraction from raw text blobs.
107115

108-
| Transformer | Stateful | Elastic |
116+
| Transformer | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
109117
|---|---|---|
110118
| [HTML Stripper](transformers/html-stripper.md) | | |
111119
| [Regex Filter](transformers/regex-filter.md) | | |
@@ -117,15 +125,15 @@ The library provides a number of transformers for natural language processing (N
117125
| [Word Count Vectorizer](transformers/word-count-vectorizer.md) || |
118126

119127
## Image Transformers
120-
Since image have their own high-level data type, they can be preprocessed in a dataset by applying any number of image transformers.
128+
These transformers operate on the high-level image data type.
121129

122-
| Transformer | Stateful | Elastic |
130+
| Transformer | [Stateful](transformers/api.md#stateful) | [Elastic](transformers/api.md#elastic) |
123131
|---|---|---|
124132
| [Image Resizer](transformers/image-resizer.md) | | |
125133
| [Image Vectorizer](transformers/image-vectorizer.md) || |
126134

127135
## Transformer Pipelines
128-
[Pipeline](pipeline.md) meta-estimators help you automate a series of transformations. In addition, Pipeline objects are [Persistable](persistable.md) allowing you to save and load transformer fittings between processes. Whenever a dataset object is passed to a learner wrapped in a Pipeline, it will automatically be fitted and/or transformed before it arrives in the learner's context.
136+
[Pipeline](pipeline.md) meta-estimators help you automate a series of transformations applied to the input dataset of an estimator. In addition, Pipeline objects are [Persistable](persistable.md) allowing you to save and load the transformer fittings between processes. Whenever a dataset object is passed to a learner wrapped in a Pipeline, depending on the operation, it will automatically be fitted and/or transformed before it arrives in the estimator's context.
129137

130138
Let's apply the same 3 transformers as in the example above by passing the transformer instances in the order we want them applied along with a base estimator to the constructor of Pipeline like in the example below.
131139

@@ -242,4 +250,4 @@ If you ever want to preprocess a dataset and then save it for later you can do s
242250
use Rubix\ML\Transformers\MissingDataImputer;
243251

244252
$dataset->apply(new MissingDataImputer())->toCSV()->write('dataset.csv');
245-
```
253+
```
