Description
Let's assume we are working with variable-length inputs. One of the strongest parts of using `tf.data.Dataset` is the ability to pad batches as they come. But since scikit-learn's API is mainly built around dataframes and arrays, incorporating this is hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the `sklearn.pipeline.Pipeline` object, and I thought to myself: "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a `tf.data.Dataset` object to later plug into my model." But this is not possible, since the `.transform` signature only accepts `X` and not `y`, while you need both to work with `tf.data.Dataset`.
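To make the limitation concrete, here is a rough sketch of what such a transformer would have to look like (the `ToTFDataset` class and the feature specs are purely illustrative, not part of scikit-learn or scikeras):

```python
import tensorflow as tf
from sklearn.base import BaseEstimator, TransformerMixin

class ToTFDataset(BaseEstimator, TransformerMixin):
    """Hypothetical last pipeline step: turn per-sample feature dicts into a tf.data.Dataset."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is assumed to be a list of feature dicts like sample_features below.
        # The problem: there is no y argument here, so the labels can't be
        # bundled into the dataset at this point.
        return tf.data.Dataset.from_generator(
            lambda: iter(X),
            output_signature={
                'a': tf.TensorSpec(shape=(None,), dtype=tf.int32),
                'b': tf.TensorSpec(shape=(None,), dtype=tf.int32),
                'c': tf.TensorSpec(shape=(), dtype=tf.int32),
                'd': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            },
        )
```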
So assume we have 4 features for each data point, each with its own sequence length. For example, a data point might look like this:
```python
sample_features = {'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]}
sample_label = 0
```
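For reference, this is the kind of per-batch padding `tf.data` gives you when the dataset is built directly. The second sample here is made up just to show the ragged shapes; `padded_batch` pads each batch only to its own longest sequences instead of padding the whole dataset up front:

```python
import tensorflow as tf

samples = [
    ({'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]}, 0),
    ({'a': [4, 5], 'b': [6], 'c': 2, 'd': [3, 4, 5]}, 1),  # made-up second sample
]

def gen():
    for features, label in samples:
        yield features, label

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        {
            'a': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'b': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'c': tf.TensorSpec(shape=(), dtype=tf.int32),
            'd': tf.TensorSpec(shape=(None,), dtype=tf.int32),
        },
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Each batch is padded to the longest sequence within that batch only.
batched = ds.padded_batch(2)
```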
How can I manage this kind of dataset with scikit-learn + scikeras?