Description
Let's assume we are working with variable-length inputs. One of the strongest parts of using `tf.data.Dataset` is the ability to pad batches as they come. But since scikit-learn's API is mainly built around dataframes and arrays, incorporating this is hard. Obviously, you can pad everything up front, but that can be a huge waste of memory. I'm trying to work with the `sklearn.pipeline.Pipeline` object, and I thought to myself: "alright, I'll just create a custom transformer at the end of my pipeline, just before the model, and make it return a `tf.data.Dataset` object to later plug into my model." But this is not possible, since the `.transform` signature only accepts `X` and not `y`, while you need both to work with `tf.data.Dataset`.
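To make the limitation concrete, here is a rough sketch of what such a transformer would have to look like (the `ToTFDataset` class and the feature specs are purely illustrative, not part of scikit-learn or scikeras):

```python
import tensorflow as tf
from sklearn.base import BaseEstimator, TransformerMixin

class ToTFDataset(BaseEstimator, TransformerMixin):
    """Hypothetical last pipeline step: turn per-sample feature dicts into a tf.data.Dataset."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X is assumed to be a list of feature dicts like sample_features below.
        # The problem: there is no y argument here, so the labels can't be
        # bundled into the dataset at this point.
        return tf.data.Dataset.from_generator(
            lambda: iter(X),
            output_signature={
                'a': tf.TensorSpec(shape=(None,), dtype=tf.int32),
                'b': tf.TensorSpec(shape=(None,), dtype=tf.int32),
                'c': tf.TensorSpec(shape=(), dtype=tf.int32),
                'd': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            },
        )
```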
So assume we have 4 features for each data point, each with its own sequence length. For example, a data point might look like this:
```python
sample_features = {'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]}
sample_label = 0
```
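For reference, this is the kind of per-batch padding `tf.data` gives you when the dataset is built directly. The second sample here is made up just to show the ragged shapes; `padded_batch` pads each batch only to its own longest sequences instead of padding the whole dataset up front:

```python
import tensorflow as tf

samples = [
    ({'a': [1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': 1, 'd': [1, 2]}, 0),
    ({'a': [4, 5], 'b': [6], 'c': 2, 'd': [3, 4, 5]}, 1),  # made-up second sample
]

def gen():
    for features, label in samples:
        yield features, label

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        {
            'a': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'b': tf.TensorSpec(shape=(None,), dtype=tf.int32),
            'c': tf.TensorSpec(shape=(), dtype=tf.int32),
            'd': tf.TensorSpec(shape=(None,), dtype=tf.int32),
        },
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Each batch is padded to the longest sequence within that batch only.
batched = ds.padded_batch(2)
```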
How can I manage this kind of dataset with scikit-learn + scikeras?