-
Notifications
You must be signed in to change notification settings - Fork 171
Pipelines can't be nested as any other learner #199
Description
Code sample
This script is complete, it should run "as is"
from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer
# This example will try to create a linear regressor...
def load_sample_data():
import pandas as pd
from sklearn import datasets
x,y = datasets.load_diabetes(as_frame=True, return_X_y=True)
return pd.concat([x, y], axis=1)
# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
build_pipeline(
imputer(columns_to_impute=["age"], impute_strategy='median'),
),
linear_regression_learner(features=["age"], target=["target"])
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''Problem description
This is a problem because (taken from the docs) "pipelines (should) behave exactly as individual learner functions". That is, pipelines should be consistent with L in SOLID, but that is not happening.
The main benefit of supporting nested pipelines is that you can produce more maintainable code as packing complex operations in one big step like:
def custom_learner(feature: str, target: str):
return build_pipeline(
imputer(columns_to_impute=[feature], impute_strategy='median'),
linear_regression_learner(features=[feature], target=[target])
)Is cleaner and more readable.
Expected behavior
The "Code sample" should produce the same output as if nested_learner was defined as:
nested_learner = build_pipeline(
imputer(columns_to_impute=["age"], impute_strategy='median'),
linear_regression_learner(features=["age"], target=["target"])
)That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.
Possible solutions
A workaround is proposed in #145 , but it only works if you already have a DataFrame (which doesn't happen in this scenario). Would like to hear if someone has investigated this or knows what we need to change to support this, but in the meantime I propose adding a note to the docs to prevent someone else from trying to do this.