
How to use get_params() #135

Open

@koaning

I have a scikit-learn pipeline defined in the code below.

from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, Binarizer
from sklearn.impute import SimpleImputer
from skrub import SelectCols
from sklearn.ensemble import HistGradientBoostingClassifier

feat_pipe = make_union(
    make_pipeline(
        SelectCols(["pclass", "sex"]),
        OneHotEncoder(sparse_output=False)
    ),
    SelectCols(["fare", "age"])
)

pipe = make_pipeline(
    feat_pipe, 
    HistGradientBoostingClassifier()
)

When I ask for the params of said pipeline I can see a long list of names that I can refer to when I do hyperparameter tuning.

pipe.get_params()

The list is long, but that is because it is nice and elaborate.

{'memory': None,
 'steps': [('featureunion',
   FeatureUnion(transformer_list=[('pipeline',
                                   Pipeline(steps=[('selectcols',
                                                    SelectCols(cols=['pclass',
                                                                     'sex'])),
                                                   ('onehotencoder',
                                                    OneHotEncoder(sparse_output=False))])),
                                  ('selectcols',
                                   SelectCols(cols=['fare', 'age']))])),
  ('histgradientboostingclassifier', HistGradientBoostingClassifier())],
 'verbose': False,
 'featureunion': FeatureUnion(transformer_list=[('pipeline',
                                 Pipeline(steps=[('selectcols',
                                                  SelectCols(cols=['pclass',
                                                                   'sex'])),
                                                 ('onehotencoder',
                                                  OneHotEncoder(sparse_output=False))])),
                                ('selectcols',
                                 SelectCols(cols=['fare', 'age']))]),
 'histgradientboostingclassifier': HistGradientBoostingClassifier(),
 'featureunion__n_jobs': None,
 'featureunion__transformer_list': [('pipeline',
   Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                   ('onehotencoder', OneHotEncoder(sparse_output=False))])),
  ('selectcols', SelectCols(cols=['fare', 'age']))],
 'featureunion__transformer_weights': None,
 'featureunion__verbose': False,
 'featureunion__verbose_feature_names_out': True,
 'featureunion__pipeline': Pipeline(steps=[('selectcols', SelectCols(cols=['pclass', 'sex'])),
                 ('onehotencoder', OneHotEncoder(sparse_output=False))]),
 'featureunion__selectcols': SelectCols(cols=['fare', 'age']),
 'featureunion__pipeline__memory': None,
 'featureunion__pipeline__steps': [('selectcols',
   SelectCols(cols=['pclass', 'sex'])),
  ('onehotencoder', OneHotEncoder(sparse_output=False))],
 'featureunion__pipeline__verbose': False,
 'featureunion__pipeline__selectcols': SelectCols(cols=['pclass', 'sex']),
 'featureunion__pipeline__onehotencoder': OneHotEncoder(sparse_output=False),
 'featureunion__pipeline__selectcols__cols': ['pclass', 'sex'],
 'featureunion__pipeline__onehotencoder__categories': 'auto',
 'featureunion__pipeline__onehotencoder__drop': None,
 'featureunion__pipeline__onehotencoder__dtype': numpy.float64,
 'featureunion__pipeline__onehotencoder__feature_name_combiner': 'concat',
 'featureunion__pipeline__onehotencoder__handle_unknown': 'error',
 'featureunion__pipeline__onehotencoder__max_categories': None,
 'featureunion__pipeline__onehotencoder__min_frequency': None,
 'featureunion__pipeline__onehotencoder__sparse_output': False,
 'featureunion__selectcols__cols': ['fare', 'age'],
 'histgradientboostingclassifier__categorical_features': 'warn',
 'histgradientboostingclassifier__class_weight': None,
 'histgradientboostingclassifier__early_stopping': 'auto',
 'histgradientboostingclassifier__interaction_cst': None,
 'histgradientboostingclassifier__l2_regularization': 0.0,
 'histgradientboostingclassifier__learning_rate': 0.1,
 'histgradientboostingclassifier__loss': 'log_loss',
 'histgradientboostingclassifier__max_bins': 255,
 'histgradientboostingclassifier__max_depth': None,
 'histgradientboostingclassifier__max_features': 1.0,
 'histgradientboostingclassifier__max_iter': 100,
 'histgradientboostingclassifier__max_leaf_nodes': 31,
 'histgradientboostingclassifier__min_samples_leaf': 20,
 'histgradientboostingclassifier__monotonic_cst': None,
 'histgradientboostingclassifier__n_iter_no_change': 10,
 'histgradientboostingclassifier__random_state': None,
 'histgradientboostingclassifier__scoring': 'loss',
 'histgradientboostingclassifier__tol': 1e-07,
 'histgradientboostingclassifier__validation_fraction': 0.1,
 'histgradientboostingclassifier__verbose': 0,
 'histgradientboostingclassifier__warm_start': False}

The reason why this is nice is that it allows me to be very specific: I can tune each input argument of every component, like featureunion__pipeline__selectcols__cols or featureunion__pipeline__onehotencoder__sparse_output.
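For example, the same double-underscore string works directly with set_params (a minimal sketch; the column subset below is only illustrative):

# Illustrative only: point the nested SelectCols step at a different column subset
pipe.set_params(featureunion__pipeline__selectcols__cols=["pclass"])

And this is very nice for grid search too!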

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    pipe, 
    param_grid={
        "featureunion__pipeline__onehotencoder__min_frequency": [None, 1, 5, 10]
    }
)

grid.fit(X, y)

The cool thing about this is that I am able to get a nice table as output too.

import pandas as pd

pd.DataFrame(grid.cv_results_).to_markdown()

|    | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_featureunion__pipeline__onehotencoder__min_frequency | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|----|---------------|--------------|-----------------|----------------|-------------------------------------------------------------|--------|-------------------|-------------------|-------------------|-------------------|-------------------|-----------------|----------------|-----------------|
| 0  | 0.557284 | 0.0364319 | 0.0053968 | 0.00091813 | nan | {'featureunion__pipeline__onehotencoder__min_frequency': None} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 1  | 0.567849 | 0.0222483 | 0.00532556 | 0.000495336 | 1 | {'featureunion__pipeline__onehotencoder__min_frequency': 1} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 2  | 0.567496 | 0.00920872 | 0.00557318 | 0.000404766 | 5 | {'featureunion__pipeline__onehotencoder__min_frequency': 5} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |
| 3  | 0.553523 | 0.023475 | 0.0052145 | 0.000855578 | 10 | {'featureunion__pipeline__onehotencoder__min_frequency': 10} | 0.515267 | 0.774809 | 0.637405 | 0.709924 | 0.636015 | 0.654684 | 0.0866783 | 1 |

But when I look at IbisML I wonder if I am able to do the same thing.

import ibis_ml as iml

tfm = iml.Recipe(
    iml.ExpandDateTime(iml.date())
)

In IbisML it is the Recipe object that is scikit-learn compatible, not the ExpandDateTime object. So let's inspect.

tfm.get_params()

This yields the following.

{'steps': (ExpandDateTime(date(),
                 components=['dow', 'month', 'year', 'hour', 'minute', 'second']),),
 'expanddatetime': ExpandDateTime(date(),
                components=['dow', 'month', 'year', 'hour', 'minute', 'second'])}

In fairness, this is not completely unlike what scikit-learn does natively. In a scikit-learn pipeline you also have access to the steps argument, and you could theoretically make all your changes there directly by passing in new sub-pipelines. But there is a reason why scikit-learn does not stop there! It goes deeper into all the input arguments of all the estimators in the pipeline, which makes the final cv_results_ output a lot nicer. And this is where I worry whether IbisML can do the same thing. It seems that I need to pass in full objects instead of being able to pluck out the individual attributes that I care about.
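As far as I can tell, the closest thing right now is to swap in whole step objects. A minimal sketch of that workaround (this assumes Recipe's set_params accepts the expanddatetime key that its get_params exposes, and that components is a constructor argument of ExpandDateTime, as the repr above suggests):

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
import ibis_ml as iml

# Put the Recipe in front of an estimator so GridSearchCV has something to score.
ibis_pipe = make_pipeline(
    iml.Recipe(iml.ExpandDateTime(iml.date())),
    HistGradientBoostingClassifier(),
)

grid = GridSearchCV(
    ibis_pipe,
    param_grid={
        # Swap in entire ExpandDateTime objects instead of addressing their
        # individual arguments with a double-underscore string.
        "recipe__expanddatetime": [
            iml.ExpandDateTime(iml.date(), components=["dow", "month", "year"]),
            iml.ExpandDateTime(iml.date(), components=["month", "year"]),
        ]
    },
)

Even if that works, the params column in cv_results_ would contain whole estimator reprs rather than readable values, which is exactly what the deep get_params() keys avoid in scikit-learn.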

In this particular case, what if I want to measure the effect of including/excluding dow or hour? Is that possible? Can I use an underscore-syntax string, just like in scikit-learn, to configure that? Or do I need to overwrite the steps object?
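To make the question concrete, this is the kind of hypothetical param_grid I am after, mirroring the scikit-learn convention (as far as I can tell, this does not exist in IbisML today):

# Hypothetical, not an existing IbisML API: address the components argument directly
grid = GridSearchCV(
    ibis_pipe,
    param_grid={
        "recipe__expanddatetime__components": [
            ["dow", "month", "year", "hour"],
            ["dow", "month", "year"],
        ]
    },
)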
