
Dealing with variable length inputs #160

Closed
@eliorc

Description

Let's assume we are working with variable-length inputs. One of the strongest features of tf.data.Dataset is the ability to pad batches as they come.
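For context, padding "as batches come" means each batch is padded only to the longest sequence within that batch, not to the global maximum (this is what tf.data's Dataset.padded_batch does). A dependency-free sketch of that behavior, in plain Python with no TensorFlow, looks like:

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch only to the
    longest sequence *within that batch* (per-batch padding)."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (max_len - len(seq)) for seq in batch]

# The first batch pads only to length 3, the second to length 5 --
# no sample is ever padded to the global maximum length.
batches = list(padded_batches([[1], [1, 2, 3], [1, 2], [1, 2, 3, 4, 5]],
                              batch_size=2))
```

The memory saving over padding the whole dataset up front grows with the spread of sequence lengths.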

But since scikit-learn's API is mainly built around dataframes and arrays, incorporating this is hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the sklearn.pipeline.Pipeline object, and I thought to myself, "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a tf.data.Dataset object to later plug into my model." But this is not possible, since the .transform signature only accepts X and not y, while you need both to build a tf.data.Dataset.
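To make the constraint concrete, here is a minimal sketch of the kind of transformer I had in mind (the class name is hypothetical). The scikit-learn contract gives transform only X, so there is no slot for the labels that tf.data.Dataset.from_tensor_slices((X, y)) would need:

```python
class DatasetTransformer:
    """Hypothetical final pipeline step that would hand a
    tf.data.Dataset to the model -- blocked by the API contract."""

    def fit(self, X, y=None):
        # fit does receive y, but nothing fitted here can carry it
        # forward to transform in a scikit-learn-compliant way.
        return self

    def transform(self, X):
        # Only X arrives here; y is unavailable, so we cannot build
        # tf.data.Dataset.from_tensor_slices((X, y)) at this step.
        return X
```

Stashing y on the transformer during fit breaks as soon as transform is called on data it was not fitted on, which is exactly the prediction path.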

So assume we have 4 features for each data point, each with its own sequence length; a data point might look like this:

sample_features = {'a': [1,2,3], 'b': [1,2,3,4,5], 'c': 1, 'd': [1,2]}
sample_label = 0
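For illustration, per-batch padding of such dict-valued samples would pad each list feature only to that batch's own maximum for that feature, with scalars passing through untouched. A plain-Python sketch (the helper name is mine):

```python
def pad_feature_batch(samples, pad_value=0):
    """Pad each list-valued feature in a batch of dict samples to the
    longest length of that feature within the batch; scalar features
    (like 'c' above) are collected as-is."""
    padded = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        if isinstance(values[0], list):
            max_len = max(len(v) for v in values)
            padded[key] = [v + [pad_value] * (max_len - len(v))
                           for v in values]
        else:
            padded[key] = values
    return padded

batch = pad_feature_batch([
    {'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]},
    {'a': [1],       'b': [1, 2],          'c': 2, 'd': [1, 2, 3]},
])
```

Each feature ends up padded independently ('a' to length 3, 'b' to 5, 'd' to 3), which is the shape a model with multiple sequence inputs would consume.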

How can I manage this kind of dataset with scikit-learn + SciKeras?
