Skip to content

Pipelines can't be nested as any other learner #199

@peguerosdc

Description

@peguerosdc

Code sample

This script is complete, it should run "as is"

from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer

# This example will try to create a linear regressor...
def load_sample_data():
    import pandas as pd
    from sklearn import datasets
    x,y = datasets.load_diabetes(as_frame=True, return_X_y=True)
    return pd.concat([x, y], axis=1)

# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
    build_pipeline(
        imputer(columns_to_impute=["age"], impute_strategy='median'),
    ),
    linear_regression_learner(features=["age"], target=["target"])
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''

Problem description

This is a problem because (taken from the docs) "pipelines (should) behave exactly as individual learner functions". That is, pipelines should be consistent with L in SOLID, but that is not happening.

The main benefit of supporting nested pipelines is that you can produce more maintainable code as packing complex operations in one big step like:

def custom_learner(feature: str, target: str):
    return build_pipeline(
        imputer(columns_to_impute=[feature], impute_strategy='median'),
        linear_regression_learner(features=[feature], target=[target])
    )

Is cleaner and more readable.

Expected behavior

The "Code sample" should produce the same output as if nested_learner was defined as:

nested_learner = build_pipeline(
    imputer(columns_to_impute=["age"], impute_strategy='median'),
    linear_regression_learner(features=["age"], target=["target"])
)

That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.

Possible solutions

A workaround is proposed in #145 , but it only works if you already have a DataFrame (which doesn't happen in this scenario). Would like to hear if someone has investigated this or knows what we need to change to support this, but in the meantime I propose adding a note to the docs to prevent someone else from trying to do this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions