Description
I have a scikit-learn pipeline defined in the code below.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, Binarizer
from sklearn.impute import SimpleImputer
from skrub import SelectCols
from sklearn.ensemble import HistGradientBoostingClassifier

feat_pipe = make_union(
    make_pipeline(
        SelectCols(["pclass", "sex"]),
        OneHotEncoder(sparse_output=False)
    ),
    SelectCols(["fare", "age"])
)

pipe = make_pipeline(
    feat_pipe,
    HistGradientBoostingClassifier()
)
When I ask for the params of said pipeline I can see a long list of names that I can refer to when I do hyperparameter tuning.
pipe.get_params()
The list is long, but that is because it is nice and elaborate.
{'memory': None,
'steps': [('featureunion',
FeatureUnion(transformer_list=[('pipeline',
Pipeline(steps=[('selectcols',
SelectCols(cols=['pclass',
'sex'])),
('onehotencoder',
OneHotEncoder(sparse_output=False))])),
('selectcols',
SelectCols(cols=['fare', 'age']))])),
('histgradientboostingclassifier', HistGradientBoostingClassifier())],
'verbose': False,
'featureunion': FeatureUnion(transformer_list=[('pipeline',
Pipeline(steps=[('selectcols',
SelectCols(cols=['pclass',
'sex'])),
('onehotencoder',
OneHotEncoder(sparse_output=False))])),
('selectcols',
SelectCols(cols=['fare', 'age']))]),
'histgradientboostingclassifier': HistGradientBoostingClassifier(),
'featureunion__n_jobs': None,
'featureunion__transformer_list': [('pipeline',
Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
('onehotencoder', OneHotEncoder(sparse_output=False))])),
('selectcols', SelectCols(cols=['fare', 'age']))],
'featureunion__transformer_weights': None,
'featureunion__verbose': False,
'featureunion__verbose_feature_names_out': True,
'featureunion__pipeline': Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
('onehotencoder', OneHotEncoder(sparse_output=False))]),
'featureunion__selectcols': SelectCols(cols=['fare', 'age']),
'featureunion__pipeline__memory': None,
'featureunion__pipeline__steps': [('selectcols',
SelectCols(cols=['pclass', 'sex'])),
('onehotencoder', OneHotEncoder(sparse_output=False))],
'featureunion__pipeline__verbose': False,
'featureunion__pipeline__selectcols': SelectCols(cols=['pclass', 'sex']),
'featureunion__pipeline__onehotencoder': OneHotEncoder(sparse_output=False),
'featureunion__pipeline__selectcols__cols': ['pclass', 'sex'],
'featureunion__pipeline__onehotencoder__categories': 'auto',
'featureunion__pipeline__onehotencoder__drop': None,
'featureunion__pipeline__onehotencoder__dtype': numpy.float64,
'featureunion__pipeline__onehotencoder__feature_name_combiner': 'concat',
'featureunion__pipeline__onehotencoder__handle_unknown': 'error',
'featureunion__pipeline__onehotencoder__max_categories': None,
'featureunion__pipeline__onehotencoder__min_frequency': None,
'featureunion__pipeline__onehotencoder__sparse_output': False,
'featureunion__selectcols__cols': ['fare', 'age'],
'histgradientboostingclassifier__categorical_features': 'warn',
'histgradientboostingclassifier__class_weight': None,
'histgradientboostingclassifier__early_stopping': 'auto',
'histgradientboostingclassifier__interaction_cst': None,
'histgradientboostingclassifier__l2_regularization': 0.0,
'histgradientboostingclassifier__learning_rate': 0.1,
'histgradientboostingclassifier__loss': 'log_loss',
'histgradientboostingclassifier__max_bins': 255,
'histgradientboostingclassifier__max_depth': None,
'histgradientboostingclassifier__max_features': 1.0,
'histgradientboostingclassifier__max_iter': 100,
'histgradientboostingclassifier__max_leaf_nodes': 31,
'histgradientboostingclassifier__min_samples_leaf': 20,
'histgradientboostingclassifier__monotonic_cst': None,
'histgradientboostingclassifier__n_iter_no_change': 10,
'histgradientboostingclassifier__random_state': None,
'histgradientboostingclassifier__scoring': 'loss',
'histgradientboostingclassifier__tol': 1e-07,
'histgradientboostingclassifier__validation_fraction': 0.1,
'histgradientboostingclassifier__verbose': 0,
'histgradientboostingclassifier__warm_start': False}
The reason why this is nice is that it allows me to be very specific. I can tune each input argument of every component, like featureunion__pipeline__selectcols__cols or featureunion__pipeline__onehotencoder__sparse_output. This is very nice for grid search!
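For example, the same double-underscore names also work with set_params, so a single nested argument can be changed in place (a quick sketch, not part of the original snippet):

pipe.set_params(featureunion__pipeline__onehotencoder__min_frequency=5)

# get_params confirms the change without rebuilding the pipeline
assert pipe.get_params()["featureunion__pipeline__onehotencoder__min_frequency"] == 5

In a grid search, these names become the keys of param_grid: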
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe,
    param_grid={
        "featureunion__pipeline__onehotencoder__min_frequency": [None, 1, 5, 10]
    }
)
# X, y: the feature dataframe (with the pclass, sex, fare and age columns
# selected above) and the target labels
grid.fit(X, y)
The cool thing about this is that I am able to get a nice table as output too.
import pandas as pd
pd.DataFrame(grid.cv_results_).to_markdown()
|    | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_featureunion__pipeline__onehotencoder__min_frequency | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.557284 | 0.0364319 | 0.0053968 | 0.00091813 | nan | {'featureunion__pipeline__onehotencoder__min_frequency': None} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 1 | 0.567849 | 0.0222483 | 0.00532556 | 0.000495336 | 1 | {'featureunion__pipeline__onehotencoder__min_frequency': 1} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 2 | 0.567496 | 0.00920872 | 0.00557318 | 0.000404766 | 5 | {'featureunion__pipeline__onehotencoder__min_frequency': 5} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 3 | 0.553523 | 0.023475 | 0.0052145 | 0.000855578 | 10 | {'featureunion__pipeline__onehotencoder__min_frequency': 10} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
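As a small aside (not in the original snippet), the same double-underscore names also show up in the fitted grid object, which is handy for a quick check alongside the table:

# best parameter combination and its mean cross-validated score
print(grid.best_params_)
print(grid.best_score_)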
But when I look at IbisML I wonder if I am able to do the same thing.
import ibis_ml as iml

tfm = iml.Recipe(
    iml.ExpandDateTime(iml.date())
)
In IbisML it is the Recipe object that is scikit-learn compatible, not the ExpandDateTime object. So let's inspect.
tfm.get_params()
This yields the following.
{'steps': (ExpandDateTime(date(),
components=['dow', 'month', 'year', 'hour', 'minute', 'second']),),
'expanddatetime': ExpandDateTime(date(),
components=['dow', 'month', 'year', 'hour', 'minute', 'second'])}
In fairness, this is not completely unlike what scikit-learn does natively. In a scikit-learn pipeline you also have access to the steps argument, and you could theoretically make all the changes there directly by passing in new sub-pipelines. But there is a reason why scikit-learn does not stop there! It can go deeper into all the input arguments of all the estimators in the pipeline, which makes the final cv_results_ output a lot nicer. And this is where I worry whether IbisML can do the same thing. It seems that I need to pass full objects, instead of being able to pluck out the individual attributes that I care about.
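For contrast, "passing in a new sub-pipeline" in scikit-learn would look roughly like the sketch below, reusing the pipe and imports defined earlier; the double-underscore name addresses the named step inside the FeatureUnion:

# swap out the entire inner pipeline in one go (illustrative sketch)
pipe.set_params(
    featureunion__pipeline=make_pipeline(
        SelectCols(["pclass"]),
        OneHotEncoder(sparse_output=False)
    )
)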
In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I have an underscore-syntax-like string just like in scikit-learn to configure that? Or do I need to overwrite the steps object?
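To make the last option concrete, overwriting the steps object would presumably look something like the sketch below. This assumes that Recipe.set_params accepts the steps / expanddatetime entries shown in get_params() above, and that ExpandDateTime takes the components argument from its repr; I have not verified either against IbisML.

# untested sketch: since only whole objects are exposed, one would have to
# swap the entire ExpandDateTime step rather than a single attribute
tfm.set_params(
    steps=(iml.ExpandDateTime(iml.date(), components=["month", "year"]),)
)

# and in a grid search over a pipeline containing the recipe, the grid values
# would likewise have to be full step objects instead of scalars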